Designing better frog call recognition models

Abstract Advances in bioacoustic technology, such as the use of automatic recording devices, allow wildlife monitoring at large spatial scales. However, such technology can produce enormous amounts of audio data that must be processed and analyzed. One potential solution to this problem is the use of automated sound recognition tools, but we lack a general framework for developing and validating these tools. Recognizers are computer models of an animal sound assembled from "training data" (i.e., actual samples of vocalizations). The settings of variables used to create recognizers can impact performance, and the use of different settings can result in large differences in error rates that can be exploited for different monitoring objectives. We used Song Scope (Wildlife Acoustics Inc.) and vocalizations of the wood frog (Lithobates sylvaticus) to build recognizers and test how different settings and amounts of training data influence recognizer performance. Performance was evaluated using precision (the probability of a recognizer match being a true match) and sensitivity (the proportion of vocalizations detected) based on a receiver operating characteristic (ROC) curve-determined score threshold. Evaluations were conducted using recordings not used to build the recognizer. Wood frog recognizer performance was sensitive to setting changes in four out of nine variables, and small improvements were achieved by using additional training data from different sites and from the same recording, but not from different recordings from the same site. Overall, the effect of changes to variable settings was much greater than the effect of increasing training data. Additionally, by testing the performance of the recognizer on vocalizations not used to build the recognizer, we discovered that Type I error rates appear idiosyncratic, so we do not recommend extrapolating them from training data to new data, whereas Type II errors showed more consistency and extrapolation can be justified. Optimizing variable settings on independent recordings led to a better match between recognizer performance and monitoring objectives. We provide general recommendations for application of this methodology with other species and make some suggestions for improvements.


| INTRODUCTION
Acoustic surveys are commonly used to monitor the status or activity of animals that vocalize. Several groups of organisms, such as anuran amphibians, bats, birds, and marine mammals, are particularly suited to acoustic monitoring because of their dependence on vocalizations for major components of their life history including attracting mates, defending territories, and locating prey (Capp & Searcy, 1991; Kalko, 1995; Wells, 1977; Winn & Winn, 1978). Depending on the species and habitat, acoustic surveys can be more efficient at identifying vocalizing individuals to species, rather than attempting to observe the organism directly (Clark, Brown, & Corkeron, 2010; Heyer, Donnelly, McDiarmid, Hayek, & Foster, 1994). Knowledge about the vocal repertoire of a species can help us understand where the organisms occur (Weir, Fiske, & Royle, 2009), the conditions under which they perform certain behaviors (Klaus & Lougheed, 2013; Villanueva-Rivera, Pijanowski, Doucette, & Pekin, 2011), as well as provide estimates of abundance (Borker et al., 2014; Buxton & Jones, 2012).
Traditionally, acoustic surveys have been conducted by humans present at the field site listening for vocalizations. Over the last few decades, however, the use of automated recording devices (ARDs) to assist or replace manual acoustic surveys has become more common (Digby, Towsey, Bell, & Teal, 2013; Hutto & Stutzman, 2009; Peterson & Dorcas, 1992; Venier, Holmes, Holborn, McIlwrick, & Brown, 2012).
Whereas manual surveys are limited by the amount of time a human can be present at a field site, ARDs can be deployed and automatically record sound at remote locations for long periods of time on user-defined schedules (Acevedo & Villanueva-Rivera, 2006). The main advantage in the use of ARDs over manual surveys is the increase in the amount and scope of environmental recordings and, therefore, an increase in the likelihood of detecting a species if it is present at the site (i.e., the detection probability). The probability of detecting a species has been shown to vary by, among other things, the time of year, time of day, temperature, humidity, and abundance (Jackson, Weckerly, Swannack, & Forstner, 2006; Tanadini & Schmidt, 2011; Weir, Royle, Nanjappa, & Jung, 2005). If surveys are conducted when detection probabilities are low, the species could be missed when it is actually present.
Concern about the consequences of incorrectly concluding that a species is absent from a site (i.e., a false negative) is a topic of considerable interest in ecology (MacKenzie et al., 2006). It has been documented that estimating site occupancy without controlling for detection probability can result in a negative bias in the estimate of the occupied area (MacKenzie et al., 2002) as well as biased extinction and colonization rates (MacKenzie, Nichols, Hines, Knutson, & Franklin, 2003), and species distribution models (Comte & Grenouillet, 2013). Automated recording devices can help alleviate the problem of low detection probabilities, and therefore increase the usefulness of survey data, by rapidly increasing the cumulative detection probability because of the additional listening time. For example, if the probability of detecting a rare frog during a five-minute acoustic survey is 0.2, then a single manual survey at any site will detect the species when it is present about 20% of the time. With an ARD deployed at the site with a recording schedule of five minutes every 30 min from 7 p.m. to 7 a.m., the 25 recordings will yield a cumulative detection probability of 0.996 (using the equation 1 − (1 − p)^N, where p is the detection probability and N is the number of surveys). However, this only means there is a good chance that if the species is present it was recorded; it must still be detected on the recording.
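As a minimal illustration of this calculation (using the values from the example above, not new data), the cumulative detection probability can be computed directly:

```r
# Cumulative detection probability across repeated surveys:
# P(detected at least once) = 1 - (1 - p)^N
p <- 0.2   # per-survey detection probability (example value from the text)
N <- 25    # number of five-minute recordings in one night
1 - (1 - p)^N
#> [1] 0.9962221
```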
The quantity of samples generated by ARDs is often overwhelming. With the deployment of just five ARDs on a standard recording schedule, the recordings generated during a week of deployment easily exceed the hours of a typical work week. Several methods have been suggested to extract the required information from the large quantity of recordings. One approach to processing large amounts of recorded data is to use automated sound recognition algorithms that allow researchers to search all the recordings with a custom-built model of the vocalization of interest (Acevedo, Corrada-Bravo, Corrada-Bravo, Villanueva-Rivera, & Aide, 2009; Brandes, 2008). The goal of automated sound recognition is to identify the vocalization of a target species within the recordings among the other animal and environmental noises. Software programs can batch process hundreds of digital recording files, saving tremendous amounts of time in extracting information from the recordings (Waddle, Thigpen, & Glorioso, 2009; Willacy, Mahony, & Newell, 2015).
The vast quantity of acoustic samples obtainable from the deployment of ARDs coupled with the automated analysis of the recordings is a powerful tool for developing robust estimates of occupancy, extinction and colonization rates, and activity/phenology patterns.
Several off-the-shelf software programs are available for researchers to conduct automated analysis of sound files for vocalizations of interest. However, the lack of published research utilizing these tools to answer questions at large scales hints at the difficulty in extracting information from the acoustic samples (Swiston & Mennill, 2009), and/or the reluctance of ecologists and wildlife managers to trust results from a fully automated process.
We used the software program Song Scope V 4.1.3A (Wildlife Acoustics, Concord, MA, USA) in our study. Song Scope is a commercially available, multi-purpose sound analysis software program that has been used by ecologists to develop automated vocalization recognition models, or "recognizers" (Buxton & Jones, 2012; Holmes, McIlwrick, & Venier, 2014; Waddle et al., 2009; Willacy et al., 2015).
Developing recognizers in Song Scope involves two steps. The first step is locating vocalizations from existing recordings (i.e., "annotating" the recordings) to be used as training data upon which the model is to be based. The second step is selecting the settings of the variables used to create the recognizer model. At the first step, we need to answer questions about how much, and what kinds of training data provide the best recognizers. At the second step, we need to identify the variable settings that build the best model (i.e., low false-positive rates, low false-negative rates, and good discriminatory ability). The manufacturers of Song Scope provide a general overview of and recommendations for the creation of recognizer models, but deciding on the quantity of training data and settings of the variables for model creation is a largely trial-and-error procedure, and we have found no published guidance. Their process emphasizes model performance on the training data (rather than new data where it will invariably be used).
The primary purpose of this article is to provide guidance for designing and validating recognizers. More specifically, we asked (1) how does increasing training data influence recognizer performance and does the source of the training data matter, (2) is there an objective and repeatable way to choose variable settings and design a recognizer, which explicitly considers Type I and II errors in the process, and (3) can we extrapolate recognizer performance from the training dataset so that we can use it on new data with any degree of confidence? We use vocalizations of the wood frog (Lithobates sylvaticus) for all our experiments. Wood frogs are a common, North American pond-breeding anuran and are a model organism for research into wetland ecosystem structure, amphibian population, and community ecology (Figure 1).

| Recognizer development
Recognition of the target vocalization is accomplished in a two-step process. The first is detection, during which the recognizer model scans all sounds within the recording to identify the sounds that are potential target vocalizations. We define "sound" as any signal or noise in the recording; it may or may not be the target call. We use "vocalization" and "call" synonymously to refer to the true signal of the target species and "match" or "hit" when the recognizer model identifies a sound. Identifying the target vocalizations is done by comparing sounds to a model created by the program from annotated calls provided by the user. Signals are detected if they stand out against background noise and have roughly the same frequency range and temporal properties (call length, syllable structure, etc.) as the model. The second step involves computing a "score" statistic on sounds identified as potential target vocalizations at the detection step. This is a measure of similarity between the sound and the model (the similarity score can vary from 0 to 100, with 100 being a perfect match) and is generated by the Viterbi algorithm (Agranat, 2009). When the model encounters a sound, one of four outcomes occurs: true positive, false positive, true negative, or false negative (Figure 2).
True and false positives can be estimated by manually verifying the matches in the output, and false negatives can be determined by subtracting the number of true positives from the total number of vocalizations in the recording. True negatives are sounds that are not calls and not misidentified as calls.
The objective of recognizer development is to minimize the number of false positives and false negatives. There is an inevitable tradeoff between false positives and false negatives because to reduce false positives we must set a threshold so that only high-quality vocalizations are matched, and many lower-quality vocalizations are ignored.
When reducing false negatives, the threshold for concluding a sound is a call is lower, and therefore, many lower-quality sounds are included.
Score values can be used to distinguish between true- and false-positive matches. Ideally, a threshold should be established above which the match is certain to be a true positive and below which the match is certain to be a false positive. However, this is rarely attained in practice, and the objective is to set a threshold that results in large reductions in Type I errors with only small increases in Type II errors.

| Recognizer metrics
In the following experiments, we use precision and sensitivity as our common metrics of recognizer performance. Precision is also known as the positive predictive value in signal detection theory (Fawcett, 2006) and is calculated as the number of true positives divided by the total number of matches (i.e., true positives plus false positives). Sensitivity is the number of true positives divided by the total number of calls present in the recording (Miller et al., 2012). We conditioned the estimates of precision and sensitivity on an optimal score threshold, determined using the area under a receiver operating characteristic (ROC) curve. The optimum threshold for each recognizer was determined using Youden's J statistic (Youden, 1950), where J = sensitivity + true-negative rate − 1. We used the term "conditional" when referring to precision and sensitivity derived using the ROC threshold because if a different threshold were used, the precision and sensitivity would change. We estimated the "conditional" precision as the number of true-positive matches above the ROC-determined threshold divided by the total number of matches above the threshold (i.e., 1 − precision at the optimal score threshold = the Type I error rate). Similarly, the "conditional" sensitivity is estimated as the number of true-positive matches above the threshold divided by the total number of calls in the recording (i.e., 1 − sensitivity at the optimal score threshold = the Type II error rate).
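For illustration, the conditional metrics for a single recording could be computed along the following lines. This is a sketch only: the `hits` data frame and `n_calls` are hypothetical placeholders rather than our actual objects, although the pROC package is the one used in our analyses.

```r
library(pROC)

# Sketch: 'hits' is a hypothetical data frame of recognizer matches for one
# recording (score = Song Scope similarity score; is_call = 1 if manual review
# confirmed a true wood frog call, 0 otherwise); 'n_calls' is the total number
# of wood frog calls actually present in the recording.
conditional_metrics <- function(hits, n_calls) {
  roc_obj <- roc(response = hits$is_call, predictor = hits$score)

  # Youden's J = sensitivity + specificity - 1, evaluated at every candidate threshold
  j   <- roc_obj$sensitivities + roc_obj$specificities - 1
  thr <- roc_obj$thresholds[which.max(j)]

  kept <- hits[hits$score >= thr, , drop = FALSE]
  tp   <- sum(kept$is_call == 1)

  c(threshold   = thr,
    precision   = tp / nrow(kept),  # 1 - precision   = conditional Type I error rate
    sensitivity = tp / n_calls)     # 1 - sensitivity = conditional Type II error rate
}
```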

| Increasing training data
We assessed the effect of increasing training data on the identification of wood frog vocalizations. We started by collecting training data by annotating wood frog vocalizations from 28 sites in southern New Brunswick, Canada, recorded in 2012. The recordings from which the annotations were extracted were collected as part of a monitoring program and were recorded by Song Meter 1, SM2, and SM2+ units (Wildlife Acoustics, Concord, MA, USA). We annotated a total of 4,080 wood frog vocalizations with the primary objective of determining what training data to use to create a good recognizer. We made the assumption that there is variability in wood frog vocalizations among individuals, and this variability potentially affects the performance of the recognizer. There are three different levels across which training data can be collected and thus variability in vocalizations captured.
These are (1) within recordings (i.e., same five-minute recording at same site), (2) among recordings (i.e., different five-minute recordings at same site), and (3) among sites. There should be more variability in vocalizations among different sites, as they will all be different individuals, than within a recording, as they are likely to be the same individuals.
For within-recording variability, we used 1, 2, 4, 5, 6, 7, 8, 10, 11, and 12 calls (with the number of recordings held at 15 and the number of sites held at 28). For among-recording variability, we used 1, 2, 4, 6, 8, 9, 10, 11, 13, 14, and 15 recordings from each site (with the number of calls per recording held at 12 and the number of sites held at 28). For among-site variability, we used 1, 2, 5, 8, 11, 14, 17, 20, 23, 25, and 28 sites (with the number of calls per recording held at 12 and the number of calls per site held at 15). This allowed us to examine which sources of variability had the largest impact on recognizer performance. This initial approach resulted in 32 recognizer models. We then created another 11 recognizers to explore interactions among sources of variability: among-site (n = 3), among-recording (n = 6), and within-recording (n = 2) training data. This resulted in a total of 43 recognizer models. Occasionally, we were unable to find as many calls as we had targeted, so the exact number of calls per recording and recordings per site used in the recognizers had some variability, and the total number of calls was often lower than our target. We report the average achieved number of calls per recording and recordings per site. The details of the targeted and achieved training data sources and totals for each recognizer can be found in Appendix S1.
Each recognizer was tested on 40 different five-minute recordings (datasets Train and A-D, Table 1), and we manually reviewed all matches. To estimate the effect of increasing the total and type of training data on recognizer performance, we used beta regressions (because precision and sensitivity are values between zero and one) with a logit link function. The mean conditional precision and mean conditional sensitivity across the 40 recordings were dependent variables, and the amount and type of training data were independent variables. We used Akaike's information criterion adjusted for small sample sizes (AICc) for model selection (Burnham & Anderson, 2002). Analysis was done in R 3.1.3 using the packages betareg (Cribari-Neto & Zeileis, 2010) and AICcmodavg (Mazerolle, 2015). Plots were created using ggplot2 (Wickham, 2009).
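A sketch of this model-selection step is given below. The data frame `perf` and its column names are hypothetical placeholders (not our actual objects), and support for betareg fits by AICcmodavg's aictab is assumed here.

```r
library(betareg)      # beta regression with a logit link
library(AICcmodavg)   # AICc-based model selection

# 'perf' is a hypothetical data frame with one row per recognizer model:
#   precision - mean conditional precision across the 40 test recordings (0-1)
#   n_sites, n_recordings, n_calls - amount of training data at each level
cand <- list(
  sites_only = betareg(precision ~ n_sites, data = perf, link = "logit"),
  full       = betareg(precision ~ n_sites + n_recordings + n_calls,
                       data = perf, link = "logit")
)

# Rank candidate models by AICc (small-sample-corrected AIC)
aictab(cand.set = cand, modnames = names(cand))
```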

| Variable sensitivity analysis
We assessed the effect of changes in the variable settings of recognizer models on the identification of wood frog vocalizations. Song Scope provides nine user-adjustable variables that control how a recognizer model is built; we varied the settings of these variables and evaluated each resulting recognizer on a set of test recordings containing wood frog vocalizations. We estimated the signal-to-noise ratio of these recordings by randomly selecting two groups of 10 one-second segments, one group with wood frog calls in them (signal + noise) and the other without wood frog calls (noise only). We measured the dB level of each segment and subtracted the mean of the noise-only dB measurements from the mean of the signal + noise dB measurements to estimate the signal-to-noise ratio (Table 1). We used the mean conditional precision and conditional sensitivity from these recordings in the dataset to evaluate recognizer performance.
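The signal-to-noise estimate could be sketched roughly as follows. The file name and the `call_times`/`noise_times` vectors are illustrative placeholders, and the dB measurement shown (RMS level relative to full scale) is an assumption, as the text does not specify the measurement tool.

```r
library(tuneR)  # readWave() and extractWave() for handling the recordings

# RMS level (dB re full scale) of a one-second segment starting at 'start' seconds
segment_db <- function(start, wav) {
  seg <- extractWave(wav, from = start, to = start + 1, xunit = "time")
  samples <- seg@left / (2^(seg@bit - 1))   # scale samples to -1..1
  20 * log10(sqrt(mean(samples^2)))         # RMS level in dB
}

wav <- readWave("site01_2012-05-01_2100.wav")   # placeholder file name
# 'call_times' and 'noise_times': start times (s) of 10 segments with calls and
# 10 segments without calls, chosen at random by the analyst (hypothetical vectors)
signal_db <- sapply(call_times,  segment_db, wav = wav)
noise_db  <- sapply(noise_times, segment_db, wav = wav)

snr_db <- mean(signal_db) - mean(noise_db)   # mean(signal + noise) - mean(noise)
```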
After each recognizer was built and the training file set scanned, we manually reviewed all the matches. A subset of the recordings was reviewed by two people so that observer error could be estimated.
The observer error rate was estimated to be 0.1%. We used the coefficient of variation (CV = standard deviation/mean) of the conditional precision and conditional sensitivity to evaluate changes in the variable settings. The larger the CV, the more sensitive the recognizer metrics are to changes in that variable's setting.
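As a small sketch of how variables might be ranked by this criterion (the `settings_results` data frame and its columns are hypothetical placeholders):

```r
# Coefficient of variation (CV = sd/mean) of conditional precision across the
# settings tested for each variable; 'settings_results' is a hypothetical data
# frame with one row per recognizer (columns: variable, setting, precision).
cv <- function(x) sd(x) / mean(x)
sort(tapply(settings_results$precision, settings_results$variable, cv),
     decreasing = TRUE)   # larger CV = metric more sensitive to that variable
```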
Due to the multiple, almost identical, syllables in wood frog vocalizations, some variable settings resulted in conditional sensitivity values exceeding one, indicating that the recognizers were making multiple "true-positive" matches by matching multiple syllables in a single true call. Using these high sensitivity values to select variables would change the focus of the recognizer to syllables, rather than calls.
However, some true wood frog calls are single-syllable calls, so there is no single "correct" call type to model. To penalize these variable settings for making excess true matches, instead of concluding that these multi-syllable matches were false positives (which technically they are not), we randomly sampled N true-positive matches without replacement, where N equals the number of real calls in the recording, and recalculated the precision and sensitivity using this subset of the true positives. For example, if there were 1,000 true calls but 1,500 true positives (i.e., the recognizer matched the two different syllables in 500 true calls), we sampled N = 1,000 of the true-positive matches, scaling sensitivity between zero and one. We repeated this 1,000 times for each recording and used the mean of these randomizations as the conditional precision and conditional sensitivity values for the variable setting selection process.
We also assessed a recognizer model that was developed by the more conventional trial-and-error approach, where the best variable settings were chosen based on the training data used in the model. We termed this the "original" recognizer. The effort invested in evaluating all the variable settings was substantial, and we wanted to compare this labor-intensive approach with a "quick-and-dirty" approach (i.e., the "original" recognizer) to see whether the extra effort was warranted.
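A rough sketch of the resampling penalty described above is given here. The `matches` data frame and `n_calls` are hypothetical placeholders, and the way precision is recalculated from the subset of true positives is our interpretation of the procedure, not verbatim code.

```r
# 'matches' is a hypothetical data frame of recognizer matches for one recording
# above the score threshold, with a logical column 'true_positive'.
# 'n_calls' is the number of real wood frog calls present in the recording.
penalized_metrics <- function(matches, n_calls, n_rep = 1000) {
  tp_idx <- which(matches$true_positive)
  fp     <- sum(!matches$true_positive)

  out <- replicate(n_rep, {
    # keep at most n_calls true positives, sampled without replacement
    keep <- tp_idx[sample.int(length(tp_idx), size = min(length(tp_idx), n_calls))]
    tp   <- length(keep)
    c(precision   = tp / (tp + fp),
      sensitivity = tp / n_calls)   # now bounded at 1
  })

  rowMeans(out)   # mean over the randomizations (e.g., 1,000 per recording)
}
```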
We report mean error rates and bootstrapped 95% confidence intervals of the means.
All data analyses were done in R 3.1.3 (R Core Team 2015). The area under the receiver operating characteristic curve (AUROCC) was determined using the pROC package (Robin et al., 2011). Bootstrapped confidence intervals (bias-corrected and accelerated, BCa) were calculated using the boot package (Canty & Ripley, 2015). Plots were created using ggplot2 (Wickham, 2009).
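For the bootstrapped intervals, a minimal sketch using the boot package might look like the following; the vector name is a hypothetical placeholder.

```r
library(boot)

# 'precision_per_recording' is a hypothetical numeric vector: the conditional
# precision of one recognizer on each test recording.
mean_stat <- function(x, idx) mean(x[idx])

b <- boot(data = precision_per_recording, statistic = mean_stat, R = 10000)
boot.ci(b, type = "bca")   # bias-corrected and accelerated 95% CI of the mean
```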

| Evaluating recognizer performance
The primary goal in using recognizers is to accurately identify vocalizations at new sites and times. The acoustic characteristics of anuran calls have been shown to vary within a species among years (Wagner & Sullivan, 1995) and systematically among populations (Wycherley, Doran, & Beebee, 2002). In addition, it is of interest to know how well a recognizer created in one place/time performs at others to know whether centralized automated monitoring programs are feasible. We compared the three recognizers built in part 2 above and optimized on independent training data with a recognizer built using the more conventional approach of maximizing the fit to the training data used for calls in the model. To explore how the performance of these four recognizers varied, we used them on four additional datasets not used in recognizer creation (Table 1). We used the score thresholds determined from the training data to make the predictions on these new datasets.

| Increasing training data
For conditional precision, increasing the number of training sites and the number of calls from within the same recording had positive effects (Table 2A). For conditional sensitivity, only the number of sites had a positive effect (Table 2B).
The effect of among-recording variation on both conditional precision and conditional sensitivity was negative, but the 95% confidence intervals overlapped zero. The model including only among-site variation in training calls had the most support, but unexpectedly the full model, including within-recording, among-recording, and among-site variation in training calls, had almost as much support. The small difference in AICc and log-likelihood values between the top models for conditional precision indicates that among-site variability is driving the relationship, but that there was a minor role for additional training data from within recordings. For conditional sensitivity, the effect of among-site training data was clear, as seen by the change in AICc between the top models. In summary, the most efficient way to increase recognizer performance was to include vocalizations from more sites in the recognizer model, but even this effect was weak.

| Variable sensitivity analysis
All nine variables affected the performance of the recognizers, but performance was particularly sensitive to changes in the settings of four of them. Final recognizers performed according to the weights placed on the errors when the training dataset was reassessed, but there was considerable overlap in confidence intervals (Figures 8 and 9). The Type I recognizer had the highest mean precision of 0.87 (CI 0.5-1), closely followed by the balanced recognizer at 0.85 (CI 0.46-0.98).
The Type II recognizer had the highest sensitivity (mean 1.54, CI 1.35-1.74), and despite our efforts to impose a penalty for this at the setting selection stage, it overestimated the number of real calls by making separate hits on different syllables of the same call (i.e., we did not "correct" the results of the recognizer to reduce the sensitivity to below 1; we attempted to prevent this from occurring at the variable selection stage but failed). The ranks of the recognizer models for conditional sensitivity indicate that selecting variable settings empirically can lead to reductions in the Type II errors beyond that of using a trial-and-error approach on the training data. This was not the case for conditional precision.

| Evaluating recognizer performance
Precision (i.e., reduced Type I errors) varied across the test datasets A-D but showed only small differences among recognizers (Figure 8).
Not surprisingly, the recognizer designed to maximize sensitivity (i.e., reduced Type II errors) had the highest sensitivity across all four datasets ( Figure 9). Confidence intervals extending well above 1 again show the propensity of the recognizer to overestimate calls is not limited to the training data. There was little consistency in recognizer performance across datasets, and mean error rates were generally higher when recognizers were applied to new data. The errors were not related in any obvious way to the differences in sites and times between the training set and test datasets. In fact, the highest precision occurred at the sites furthest away from the sites the recognizer was developed from (dataset D, Figure 8).

| DISCUSSION
Our objective here was to improve the utility of sound recognition tools for surveying vocalizing anurans and to try to remove some of the barriers to widespread use in ecology. The need to monitor biological diversity at large spatial and temporal scales is becoming increasingly important (Yoccoz, Nichols, & Boulinier, 2001). While citizen science (Weir et al., 2009) and manual surveys by professional biologists are widely used, fully automated platforms using ARDs and sound recognition software could help to meet new monitoring challenges. We examined three critical components in sound recognition (training data, variable setting selection, and prediction to new data), and this provides guidance for the future use of recognizers for monitoring projects in general. The specific findings (i.e., the settings) for optimized wood frog recognizers are unlikely to apply to other species, but the process can be generalized and used to build optimal recognizers for other species.

| Increasing training data
We found that increasing training data resulted in only slight improvements to recognizer performance. The most rapid increases in performance were achieved by adding training data from different sites.
Adding calls from additional sites into the model could have resulted in small improvements in performance in two ways. First, adding additional sites to the model, especially when the recordings are from the same breeding season (i.e., same year), could help capture variation in wood frog vocalization characteristics by including more unique individuals. The male anuran vocalization contains signals to conspecific females and males that are indicators of competitive ability and fitness, such as size (Giacoma, Cinzia, & Laura, 1997;Howard & Young, 1998) and survivorship (Forsman, Hagman, & Pfenning, 2006), and are subject to sexual selection. In many anuran families, such as ranids (Bee, Perrill, & Owen, 2000) and bufonids (Zweifel, 1968), the dominant frequency of the call is negatively related to the size of the male.
Although data on wood frogs specifically are unavailable, it is quite probable that size and age add variability to call characteristics and that training data from different sites could capture more of that variability. Collecting training data from recordings made repeatedly at the same site, and from calls within the same recording, is increasingly likely to sample the same individuals and therefore adds little new variability.

TABLE 2 Beta-distributed generalized linear models with (A) conditional precision and (B) conditional sensitivity as the dependent variable. All intercepts, coefficients, and standard errors are reported on the untransformed logit (link) scale.

Another explanation for additional training calls being of little value is that wood frogs have a relatively simple call; the optimal feature vector length in Song Scope was four, meaning that the call could be described with four features. More complex anuran, bird, and mammal calls would require more features and could require additional training data to model well. Future research into the relationship between call complexity, variability within and between individuals, and optimal quantities of training data would provide additional insight and guidance for new monitoring programs.

| Variable sensitivity analysis
Our objective with the variable sensitivity analysis was to evaluate a method of choosing recognizer variable settings that is reproducible and explicitly considers Type I and Type II errors. By having an independent set of recordings upon which to evaluate the different models, we were able to reduce the mean Type I and II errors below those of a recognizer that was created using the trial-and-error approach of maximizing fit to the training data.
Maximum syllable gap had more influence on the sensitivity than any other variable. In the Type II error recognizer, this was set at 10 ms.
Although this setting was the only one that came close to detecting all wood frog calls at a variety of different chorus sizes, it resulted in an overestimation of the total number of calls as the recognizer made multiple correct hits on different syllables of the same call. This is because wood frog calls are made of 1-4 almost identical syllables. Any recognizer that detects all calls would have to detect the single syllable calls and thus risk making several matches on the multi-syllable calls.
All three other recognizers had overlapping confidence intervals, and the original recognizer had the next best sensitivity of 0.3.

FIGURE 8 The mean (open circles) and bootstrapped 95% confidence intervals (error bars) of precision (1 - Type I error rate) for the four final recognizer models. The balanced recognizer is the recognizer where Type I and II errors are weighted equally, Type II is the recognizer designed to minimize Type II errors, Type I is the recognizer designed to minimize Type I errors, and original is the recognizer designed using the trial-and-error approach. The "Train" dataset is the recordings used to select the variable settings. Test set A used the same sites used to build the recognizer and select the variable settings but from a different year. Test set B has different sites but from the same year. Test set C is from different sites and a different year. Test set D contains recordings from outside the study area in the USA recorded in 2015, specifically the states Connecticut, Massachusetts, Michigan, and New York (see Table 1). The "Total" dataset is the combined datasets A-D.

Overall, … the recording sample length. Therefore, the utility of the recognizer is likely to vary, not only as a function of the spectral properties of the species' call, but also of the intra- and interspecific context within which the recordings are made.
Other studies with Song Scope and species recognizers have reported "adjusting" the settings (Waddle et al., 2009), or did not refer to this part of the recognizer design process at all in the methods section (Brauer, Donovan, Mickey, Katz, & Mitchell, 2016; Holmes et al., 2014; Willacy et al., 2015). Under the assumption that wood frogs are not unique in the sensitivity of recognizer performance to variable settings, the field of recognizer development would be advanced by researchers describing what the final recognizer settings were (Brauer et al., 2016; Willacy et al., 2015) and how they arrived at those settings (Buxton & Jones, 2012). In an analogous study with species distribution models, Radosavljevic and Anderson (2014) tuned MaxEnt program settings with the goal of minimizing overfitting. They discovered that settings 2-4 times higher than the default setting were required to reduce overfitting. In situations such as these, where there is no clear intuitive or theoretical way to determine appropriate program settings beforehand, experimentally manipulating the settings and evaluating on new data represents the most likely way of arriving at the optimal choice.
While we are confident the variable sensitivity analysis identified the optimal settings for a wood frog recognizer based on our recording set, it is unknown how general the results are. Had we used a different set of recordings upon which to evaluate Type I errors we could have arrived at different optimal settings for the Type I error recognizer due to differences in background noise. We are also confident that we identified the most sensitive variables, and even if a different set of recordings was used, this list should not change. However, it is unlikely that recognizers for other species of anurans with very different call structures (i.e., Bufonids, Hylids, etc.) will follow the same rank of sensitivities, and they will certainly require different settings of those variables to construct optimal models. Researchers should use the process presented here to fine-tune their recognizers.

| Evaluating recognizer performance
Although transferability is often an implicit objective of ecological …

FIGURE 9 The mean (open circles) and bootstrapped 95% confidence intervals (error bars) of sensitivity (1 - Type II error rate) for the four final recognizer models. The balanced recognizer is the recognizer where Type I and II errors are weighted equally, Type II is the recognizer designed to minimize Type II errors, Type I is the recognizer designed to minimize Type I errors, and original is the recognizer designed using the trial-and-error approach. The "Train" dataset is the recordings used to select the variable settings. Test set A used the same sites used to build the recognizer and select the variable settings but from a different year. Test set B has different sites but from the same year. Test set C is from different sites and a different year. Test set D contains recordings from outside the study area in the USA recorded in 2015, specifically the states Connecticut, Massachusetts, Michigan, and New York (see Table 1). The "Total" dataset is the combined datasets A-D.

In summary, our data support the conclusions of Clement et al. (2014) that recognizers trained at one place and/or time will rarely be as effective in avoiding Type I errors when used at other places and/or times. However, models designed to reduce Type II error rates were almost as effective at different places and/or times. Researchers should, at minimum, hold some recordings back from use in recognizer creation for use in validation to guard against extreme overfitting (Guthery, Brennan, Peterson, & Lusk, 2005) and to obtain estimates of sensitivity/Type II error rates. The type of data held back for validation should be relevant to the objectives of the monitoring program and include sites and times that are of appropriate scales (temporal or geographic) to be a genuine test of out-of-sample predictive performance.

| Recommendations for recognizer creation
Our results provide a preliminary framework and some recommendations that researchers can use to develop recognizers for wood frogs specifically and, more generally, illustrate how our approach might be applied to other species. We provide a flowchart to allow readers to visualize the entire process (Figure 10). To summarize the main findings, we found variable settings to be far more important than training data in creating good recognizers. Adding training data from more sites seems to be the most effective way to increase recognizer performance, but again this effect was small when compared to investment in evaluating different variable settings. We showed that deciding which recognizer error rates matter most and selecting variable settings to match those goals can reduce the rate of false negatives but not the false-positive rate. Overall, we believe Type I errors are a function of the environmental recording conditions and largely, but not entirely, outside the control of the researcher, whereas Type II errors can be reduced or even eliminated with effort invested in selecting appropriate variable settings. Extrapolation of Type I error rates for recognizers built on training data from one place and/or time to other places and/or times is probably unjustified under most circumstances.

| Future research in automated bioacoustic monitoring
Bioacoustics research is advancing rapidly as improvements are made to hardware and to automated recognition approaches, including machine learning methods (e.g., Gingras & Fitch, 2013). The recent recognition of the problems caused by false-positive detections and the development of occupancy models that incorporate false positives as well as false negatives (McClintock, Bailey, Pollock, & Simons, 2010; Miller et al., 2011) provide a solid link between data collected and analyzed using an automated acoustic platform and occupancy models where uncertainty in parameter estimates can be quantified (Bailey, MacKenzie, & Nichols, 2014). Approaches that compare the quality of methods for obtaining data from recordings are needed to bridge this gap. Finally, our recommendations arise from work done exclusively on wood frogs, and a similar approach should be used on other taxa that are being monitored using automated sound detection to assess how general our conclusions are.

ACKNOWLEDGMENTS
We wish to thank the Department of National Defense/CFB Thompson for assistance in the field and for collecting/sharing recordings with us.