Automated call detection for acoustic surveys with structured calls of varying length

When recorders are used to survey acoustically conspicuous species, identifying calls of the target species in recordings is essential for estimating density and abundance. We investigate how well deep neural networks identify vocalisations consisting of phrases of varying lengths, each containing a variable number of syllables. We use recordings of Hainan gibbon Nomascus hainanus vocalisations to develop and test the methods. We propose two methods for exploiting the two-level structure of such data. The first combines convolutional neural network (CNN) models with a hidden Markov model (HMM) and the second uses a convolutional recurrent neural network (CRNN). Both models learn acoustic features of syllables via a CNN and temporal correlations of syllables into phrases either via an HMM or a recurrent network. We compare their performance to that of the commonly used CNNs LeNet and VGGNet, and of a support vector machine (SVM). We also propose a dynamic programming method to evaluate how well phrases are predicted. This is useful for evaluating performance when vocalisations are labelled by phrases, not syllables. Our methods perform substantially better than the commonly used methods when applied to the gibbon acoustic recordings. The CRNN has an F-score of 90% on phrase prediction, which is 18% higher than the best of the SVM, LeNet and VGGNet methods. HMM post-processing raised the F-score of these last three methods to as much as 87%. The number of phrases is overestimated by the CNNs and the SVM, leading to error rates of between 49% and 54%. With HMM post-processing, these error rates can be reduced to as low as 0.4%; similarly, the error rate of the CRNN's predictions is no more than 0.5%. CRNNs are better at identifying phrases of varying lengths composed of a varying number of syllables than simpler CNN or SVM models. We find a CRNN model to be best at this task, with a CNN combined with an HMM performing almost as well.
We recommend that models of these kinds be used for species whose vocalisations are structured into phrases of varying lengths.


| INTRODUCTION
Acoustic surveys are being used increasingly for wildlife surveys (Gibb et al., 2019). Acoustic recordings are now commonplace (Priyadarshani et al., 2018; Usman et al., 2020) and it is appropriate to ask how best to identify animal calls in these recordings.
Identifying calls manually is time-consuming and labour-intensive, can be subjective, and must be done by trained professionals (Chen & Maher, 2006; Somervuo et al., 2006). A variety of machine learning techniques has been used for automatic call detection or classification. However, these methods are either designed to detect simple calls of short duration (e.g. anuran calls; Alonso et al., 2017; Colonna et al., 2015) or calls of roughly fixed length (Stiffler et al., 2018; Zhang & Li, 2015). In this paper, we develop methods to tackle structured calls of the sort illustrated in Figure 1. Calls can be viewed as phrases, each of which contains a number of syllables. Both syllables and phrases can be of varying lengths, and there may be different numbers of syllables in each phrase. In general, the intervals between syllables are shorter than those between phrases, but this difference can be small. All these characteristics make call detection a challenging task.
We aim to use phrase detection to estimate animal abundance and density, which are key quantities required for management and conservation. To do this, one either has to identify which detected vocalisations came from which animals (in order to use the animal as the sampling unit) or to estimate vocalisation density per unit time without identifying which vocalisations come from which animals, and then separately estimate vocalisation rate in order to convert phrase density into animal density (Buckland, 2006;Buckland et al., 2001). When phrases are made up of multiple syllables of varying length, the variance in syllable production rate over any period is greater than that of phrase rate and so it is convenient to work with the phrase as the sampling unit, rather than the syllable. In addition, when vocalisations are identified to animals, this is often done by localising the vocalisation source using (primarily) estimates of direction to source from multiple detectors (Stevenson et al., 2015), and in this case, there is negligible additional information about location in individual syllables beyond that contained in the phrase as a whole, so here too it is convenient to work with the phrase as the sampling unit rather than the syllable.
Furthermore, because we are interested in abundance and density, we are interested in identifying all phrases, not just identifying whether there was at least one phrase. While presence/absence data (e.g. whether or not the species of interest was heard at least once in some time window) are useful for estimating occupancy and species range, it is a very inferior kind of data for estimating abundance and density.
However, detecting phrases in hierarchically structured vocalisations is challenging. Many existing methods perform segment-based prediction; that is, they divide audio files into segments with a small time window (e.g. 1 s) and then predict labels for each segment. The performance of these methods is sensitive to the segment length.
If the segment length is set shorter than the intervals between syllables within a phrase, then there might be many segments within phrases that are labelled positive but do not contain any vocalisations. This can result in a high false negative rate for phrase prediction even if each segment is correctly predicted. On the other hand, if the segment length is set too long, then a segment may contain multiple phrases, which leads to the underestimation of phrase prediction. This problem is exacerbated when phrases may be of varying lengths. Therefore, selecting the appropriate segment length is often a difficult design decision to make.
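As a toy, purely illustrative numerical example of this sensitivity (the syllable times below are made up), consider labelling 1-s versus 10-s segments and merging consecutive positive segments into predicted phrases:

```python
# Illustration (hypothetical numbers): how segment length changes the number
# of "phrases" recovered from segment-level labels. A segment is labelled 1
# if it overlaps any true vocalisation; consecutive positive segments are
# then merged into predicted phrases.

def segment_labels(events, total, seg_len):
    """Label each segment 1 if it overlaps any (start, end) event."""
    n = int(total / seg_len)
    labels = []
    for i in range(n):
        s, e = i * seg_len, (i + 1) * seg_len
        labels.append(int(any(s < ee and e > ss for ss, ee in events)))
    return labels

def count_phrases(labels):
    """Count runs of consecutive positive segments."""
    runs, prev = 0, 0
    for x in labels:
        if x and not prev:
            runs += 1
        prev = x
    return runs

# One true phrase = three syllables separated by 1.5-s gaps (invented times).
syllables = [(0.0, 1.0), (2.5, 3.5), (5.0, 6.0)]

print(count_phrases(segment_labels(syllables, 10, 1.0)))   # 3 "phrases"
print(count_phrases(segment_labels(syllables, 10, 10.0)))  # 1 phrase
```

With 1-s segments the three syllables of a single phrase are counted as three separate phrases; with one 10-s segment they collapse into one.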
We propose instead two methods for automatic identification of structured phrases of the sort described above. The first combines convolutional neural network (CNN) (LeCun et al., 2015) models with a hidden Markov model (HMM) and the second uses a convolutional recurrent neural network (CRNN). Both models learn acoustic features of syllables via a CNN and temporal correlations of syllables into phrases either via HMM post-processing or a recurrent network. The novelty of our proposed methods is that we adopt a sequence-to-sequence prediction strategy to address the hierarchical vocalisation problem. That is, we input a sequence of segments (comprising a fixed number of consecutive segments) and output a sequence of predictions, one corresponding to each segment in the input. This strategy is widely used in machine translation (Wu et al., 2016) and text recognition (Shi et al., 2017). Unlike many existing methods, our methods do not require predefined time intervals for phrases, and are less sensitive to segment or sequence length, as long as the segment length is reasonably small (e.g. 1 s) and the sequence length is larger than the maximum duration of a phrase. In addition, a CRNN is trained in an end-to-end manner to optimise the learning of acoustic and temporal features altogether, and HMM post-processing can be added to any pre-trained method.

KEYWORDS: acoustic survey, automated call detection, convolutional recurrent neural network, gibbon calls, hidden Markov model, machine learning

FIGURE 1 An illustrative example of phrases and syllables. A phrase consists of a number of syllables. Both syllables (in blue boxes) and phrases (in pink boxes) can be of varying lengths. In general, there is a shorter interval between syllables within a phrase and a longer interval between phrases, but the difference can be small
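This sequence-to-sequence framing can be sketched as follows. The code is a toy stand-in, not the network described in this paper: a trivial "detector" maps a sequence of T one-second segments to T labels by thresholding per-segment RMS energy (signal, noise level and threshold are all invented for illustration).

```python
import numpy as np

# Toy sketch of sequence-to-sequence prediction: the input is a sequence of
# T one-second segments and the output is a sequence of T labels, one per
# segment, rather than a single label for the whole window.

def seq2seq_predict(segments, threshold=0.5):
    """segments: array of shape (T, n_samples). Returns T binary labels."""
    energy = np.sqrt((segments ** 2).mean(axis=1))  # RMS per segment
    return (energy > threshold).astype(int)

rng = np.random.default_rng(0)
T, n = 6, 100
x = rng.normal(0.0, 0.1, size=(T, n))            # background noise
x[2:4] += np.sin(np.linspace(0, 40 * np.pi, n))  # a "call" in segments 2-3

labels = seq2seq_predict(x, threshold=0.5)
print(labels)  # one prediction per segment
```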
We also propose a new event-based evaluation method to assess the performance of methods at the phrase level.
The rest of the paper is organised as follows. Section 2 introduces the background to our work, followed by a description of the methods, the data, data pre-processing, our evaluation metrics and the computational experiments setting that we use to evaluate methods. Section 3 compares the performance of our methods to a number of the existing machine learning methods. In Section 4 we draw conclusions from the application of these methods to the Hainan gibbon data.

| MATERIALS AND METHODS
We propose two methods to identify phrases: a CNN with HMM post-processing, and a CRNN. To the best of our knowledge, we are the first to propose such methods for detecting phrases in recordings of calls with a two-level structure. Our CRNN combines a CNN and a recurrent neural network (RNN) and trains it in an end-to-end, sequence-to-sequence manner. We first describe the background of our work and introduce these methods below, followed by a description of the Hainan gibbon dataset that we use to develop and test the methods, and the acoustic data pre-processing. We then introduce the evaluation metrics used to assess performance, including the new method we propose for evaluating matching between predicted phrases and labelled phrases. Finally, we describe the computational experiments we conduct to test performance and robustness.

| Background
CNNs have become popular for acoustic signal processing because they enable automatic feature extraction, without manual feature selection. A variety of such methods has been used for animal sound detection and classification, including pre-trained CNNs and 1D multi-view CNNs (Xu et al., 2020). CNNs tend to perform better than traditional machine learning techniques such as SVMs, random forests and K-nearest neighbours (Florentin et al., 2020; Kiskin et al., 2020; Lostanlen et al., 2018; Mac Aodha et al., 2018; Salamon et al., 2017; Xu et al., 2020).
However, CNNs still require audio recordings to be split into segments of fixed length, and they produce a single prediction for each segment. Longer audio segments contain more information and so tend to result in better classification by a CNN; shorter segments allow CNNs to make predictions at a higher temporal resolution, but these predictions tend to be less accurate because each segment contains less information. For example, in the analysis of the Hainan gibbon data that we consider, Dufourq et al. (2020a) found that a CNN with 10-s segments (close to the length of the longest phrases in the data) achieved better performance than one with segments of 1, 1.5 or 2 s (shorter than the length of most phrases in the data). Moreover, CNNs gather information at high resolution but are unable to use broader-scale contextual information, such as whether or not the current point in a recording is preceded by a call signal.
RNNs, on the other hand, integrate information from a sequence of time windows but they lack the ability to extract information directly from recordings, or spectrogram representations of recordings. Like RNNs, HMMs can model long-term temporal correlation in animal sound recordings (Putland et al., 2018;Stowell et al., 2017), although they are unable to extract features from the acoustic data and require this to be done beforehand.
CNNs and RNNs are often combined to leverage the strengths of both. Applications include sound event detection in daily life (Çakır et al., 2017), vocabulary tasks (Sainath et al., 2015), text recognition (Shi et al., 2017) and, in ecology, bird, koala and whale sound detection (Cakir et al., 2017; Himawan et al., 2018; Madhusudhana et al., 2021). A variety of architectures have been used; for example, Madhusudhana et al. (2021) proposed using a CNN to extract acoustic features of whale notes and then employing an RNN to learn temporal features. To do so, they input a fixed-length sequence of segments and predict a single label. However, this requires some prior knowledge to select an appropriate sequence length, taking into account the duration of a vocalisation and the intervals between vocalisations. In this paper, we propose two methods to overcome the above limitations. They facilitate learning temporal correlations of small segments that are combined into syllables and then into phrases.

| Method 1: CNN + HMM
Our first method employs CNNs to extract visual features from spectrogram images and predict the presence of gibbon phrases in each 1-s segment. However, this treats each segment independently, which can result in many false positives and false negatives. To capture temporal patterns between these small segments, we use an HMM to model temporal correlations between consecutive segments and so improve the predictions from the CNN.
Normally, the CNN takes spectrogram images as input and outputs a probability (see Figure 2) measuring the confidence that a segment contains a gibbon syllable, or a binary classification (e.g. 1 or 0, indicating a detected gibbon syllable or not) obtained by thresholding this output. We also consider using the linear output from the last dense layer of the CNN, as these values are real-valued and we may assume a Gaussian distribution for them. The latter two forms of output (depending on which method is being used) are then treated as the observations in an HMM to model the temporal correlation in the occurrence of positive segments, as detailed below.

| CNN
We adapt two CNN architectures commonly used for animal call detection: VGGNet (Simonyan & Zisserman, 2015) and LeNet (Lecun et al., 1998). As shown in Figure 2, both CNNs are configured with the same input dimension for the grey-scale spectrogram images, that is, 32 × 32 × 1. They contain convolutional layers and dense layers: features are extracted from the input spectrogram images by the convolutional layers and sent to the dense layers for classification. A sigmoid function following the last dense layer produces a probability output, indicating the confidence that a segment contains a vocalisation. A confidence threshold can be set to decide class membership: if the confidence is greater than the threshold, we infer a positive detection. Here we treat the threshold as a hyperparameter and use grid search to find the optimal value automatically.
The search range goes from 0.1 to 0.9 with a step size of 0.1.
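A minimal sketch of this threshold grid search, with invented validation labels and probabilities standing in for CNN outputs:

```python
# Choose the confidence threshold in {0.1, ..., 0.9} that maximises the
# F-score on validation data. Labels and probabilities are made up.

def f_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs):
    grid = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    return max(grid, key=lambda th: f_score(y_true, [int(p > th) for p in probs]))

y_true = [0, 0, 1, 1, 1, 0]
probs  = [0.2, 0.4, 0.6, 0.9, 0.7, 0.3]
print(best_threshold(y_true, probs))  # → 0.4
```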
The LeNet has two convolutional layers with 32 and 64 filters respectively and a kernel size of 5 × 5. To reduce the computational cost of further processing, each convolutional layer is followed by a 2 × 2 max-pooling layer to halve the size of the convolutional features. After the last max-pooling layer, LeNet has three dense layers; each layer except the output layer has a rectified linear unit (ReLU) activation. Figure 2 shows the architecture of LeNet and our customised VGGNet.
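A back-of-envelope shape check for this LeNet variant can be written as follows. One assumption to flag: the paper does not state the convolution padding, so "same" padding is assumed here, with only the pooling layers halving the spatial size.

```python
# Shape bookkeeping for a 32x32x1 input through two 5x5 conv layers (32 and
# 64 filters), each followed by 2x2 max-pooling. Padding assumed "same".

def conv2d_shape(h, w, c_in, filters, kernel=5, padding="same"):
    # c_in is carried for clarity; with "same" padding the conv keeps h, w.
    if padding == "same":
        return h, w, filters
    return h - kernel + 1, w - kernel + 1, filters

def maxpool_shape(h, w, c, pool=2):
    return h // pool, w // pool, c

shape = (32, 32, 1)
for filters in (32, 64):
    shape = conv2d_shape(*shape, filters)
    shape = maxpool_shape(*shape)
    print(shape)  # (16, 16, 32) then (8, 8, 64)
```

The resulting 8 × 8 × 64 feature map is what the dense layers then classify.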

| Post-processing with a Hidden Markov Model
Approaches like those described above ignore the temporal correlation in vocalisations that is present in hierarchically structured data, as described in the Introduction. To deal with this, we add a post-processing step in which a supervised HMM is employed to learn temporal correlations between consecutive segments. This is then used to refine the predictions obtained from the previous techniques. We describe the process in more detail below.
HMMs contain Markov chains to model the evolution of unobserved states. Let o_t denote the observation at time step t and l_t the corresponding unobserved label (state); the joint probability of a sequence of T observations and states then factorises as

p(o_{1:T}, l_{1:T}) = p(l_1) ∏_{t=2}^{T} p(l_t | l_{t−1}) ∏_{t=1}^{T} q(o_t | l_t).   (1)

We set our HMM time step to 1 s to correspond to the CNN segment lengths.

FIGURE 2 LeNet and VGGNet architectures. The structure contains convolutional and max-pooling layers to extract information from images, and dense layers followed by a sigmoid function for a probability output
We consider two types of observations o_t: one is the binary prediction output and the other is the output extracted from the last dense layer of a CNN before the sigmoid function. We assume the second type gives finer-grained information on the prediction. For binary o_t, we assume the emission probability q(o_t | l_t) to be a Bernoulli distribution,

q(o_t | l_t) = Ber(o_t; θ_{l_t}),   (2)

where θ_{l_t} is the probability of observing o_t = 1 when the true state is l_t.

When o_t is the output from the last dense layer of the CNN, we assume that o_t has either a Gaussian or a Gaussian mixture distribution, with emission probability

q(o_t | l_t) = ∑_{k=1}^{K} w_{k,l_t} N(o_t; μ_{k,l_t}, σ²_{k,l_t}),   (3)

where μ_{k,l_t} and σ²_{k,l_t} are the mean and variance of the kth component when in state l_t, and w_{k,l_t} is the kth mixture weight when in state l_t (a single Gaussian corresponds to K = 1). We have considered various numbers of components K in the mixture distribution, with search range {2, 4, 8, 16, 32, 64, 128, 256, 512, 1,024}, and used the Bayesian information criterion (BIC) (Neath & Cavanaugh, 2012) to find the best K automatically during the training stage. The Viterbi algorithm is then used with these estimated parameters to predict the unobserved states in survey data in which l is unknown and o is obtained by applying the CNN to the survey data.
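The decoding step can be sketched as a minimal Viterbi decoder for a two-state HMM with Bernoulli emissions. The transition matrix, initial distribution and emission parameters below are illustrative values, not those estimated from the gibbon data (in practice they would be estimated from the labelled training recordings).

```python
import numpy as np

# Minimal Viterbi decoder: obs are binary per-segment CNN predictions,
# theta[l] = P(o_t = 1 | l_t = l), A[i, j] = P(l_t = j | l_{t-1} = i).
def viterbi_bernoulli(obs, pi, A, theta):
    n_states, T = len(pi), len(obs)

    def emis(o, l):
        return theta[l] if o == 1 else 1.0 - theta[l]

    log_delta = np.log(pi) + np.log([emis(obs[0], l) for l in range(n_states)])
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        new = np.empty(n_states)
        for l in range(n_states):
            scores = log_delta + np.log(A[:, l])
            best = int(np.argmax(scores))
            back[t, l] = best
            new[l] = scores[best] + np.log(emis(obs[t], l))
        log_delta = new
    path = [int(np.argmax(log_delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.95, 0.05])      # state 0 = no call, state 1 = call
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # calls tend to persist across segments
theta = np.array([0.2, 0.9])     # CNN false-positive rate 0.2 in silence

noisy = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
print(viterbi_bernoulli(noisy, pi, A, theta))
```

On this toy observation sequence the decoder smooths out the two isolated positives while keeping the three-segment run, which is the behaviour the post-processing step relies on.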
In order to capture the overall temporal patterns, we set the HMM sequence length to the length of a whole audio recording. We set the initial state probabilities based on the first label of each labelled recording in the training data.

| Method 2: Convolutional recurrent neural network
Here we present a CRNN that can learn temporal features in spectrogram images. We adopt a sequence-to-sequence CRNN; that is, we input a sequence of T seconds of spectrogram images and output the corresponding T seconds of predictions of gibbon phrase presence in each segment. The CRNN architecture is presented in Figure 3. It consists of convolutional layers learning features, recurrent layers learning temporal correlations between features, and a fully connected layer for phrase prediction.
For the convolutional layers, we adopt the customised VGGNet architecture described above, as VGGNet has shown promising performance when integrated with an RNN (Shi et al., 2017). To speed up the convergence of the CRNN, we add a batch normalisation layer after each of the last four convolutional layers (Ioffe & Szegedy, 2015). The output from the last convolutional layer has the form F ∈ ℝ^{T×W×H}, where T is the number of frames in the time axis, and W and H are the dimensions of the feature output for each second of spectrogram image from the convolutional layers, that is, 1 × 512.
For the recurrent layers, we first stack the CNN output, building T feature vectors of length 512, corresponding to the T seconds in the spectrograms. The feature vectors are then fed as input to recurrent layers for further processing. An RNN has a strong ability to capture contextual information within a sequence (Shi et al., 2017), which greatly mitigates the problem that a segment may cover only part of a gibbon phrase. We use two stacked gated recurrent units (GRUs) (Chung et al., 2014) with 256 hidden units. The GRU is a simplified version of long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997), which is specially designed to address the vanishing gradient problem (Shi et al., 2017), allowing the RNN to store long-range context information. The sequence length of the GRU is also T, eventually producing T feature vectors corresponding to the T seconds in the spectrogram. To mitigate overfitting and improve the generalisation capability of the network, we apply dropout to each recurrent layer with a rate of 0.5.
The fully connected layer is built on top of the GRU. It takes features from the GRU and outputs a sequence of predictions that are real numbers corresponding to the T seconds of the input spectrogram. A sigmoid function is then used to map these numbers into the interval [0, 1]. If the output is greater than a threshold, a positive label is inferred.
For Methods 1 and 2, phrase prediction is conducted on the segment-based predictions; that is, consecutive segments that share the same label are combined into phrases.
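This phrase-forming step amounts to a run-length merge; a minimal sketch (segment length assumed to be 1 s):

```python
# Merge consecutive positively labelled segments into predicted phrases,
# reported as (start_s, end_s) intervals.

def segments_to_phrases(labels, seg_len=1.0):
    phrases, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i * seg_len
        elif lab == 0 and start is not None:
            phrases.append((start, i * seg_len))
            start = None
    if start is not None:  # phrase runs to the end of the recording
        phrases.append((start, len(labels) * seg_len))
    return phrases

print(segments_to_phrases([0, 1, 1, 0, 0, 1, 1, 1, 0]))  # [(1.0, 3.0), (5.0, 8.0)]
```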

| Configuration and hyperparameter tuning
We adopt the architecture of Shi et al. (2017) for our CRNN in terms of the number of neurons within the recurrent layers, and configure the CRNN with the following hyperparameters and settings: sequence length, the choice of uni- or bi-directional GRU, optimiser, dropout rate, learning rate, batch size and number of epochs.
We run a grid search over combinations of values of each hyperparameter and choose those with the best F-score. We consider sequence length T, batch size and number of epochs jointly, since a long sequence length takes up more memory and a longer time to converge, which forces a smaller batch size and more epochs. A single combination can take up to 13 hr to run, and in total, the 72 combinations take about 400 hr.
A bi-directional GRU in a CRNN takes much more time to train and did not improve performance in our experiments, so we apply only a uni-directional GRU in the CRNN. We adopt ADAM, a robust, commonly used optimiser (Kingma & Ba, 2017). Table 1 lists the optimised configurations.

| Sound data description
We evaluate our methods using recordings of Hainan gibbons Nomascus hainanus, whose vocalisations are made up of phrases of varying lengths containing a variable number of syllables. Like most gibbon species, they live in forest that is too thick for visual surveys to be an effective means of surveying them. They are therefore surveyed acoustically (Deng et al., 2014), typically by people listening for their calls (Kidney et al., 2016), although surveying is increasingly being done by digital devices. We used the open-access dataset of Dufourq et al. (2020b), which contains 28 8-hr recordings of Hainan gibbon calls collected in Bawangling National Nature Reserve, Hainan, China, with eight Song Meter SM3 recorders. Recordings start at 5 a.m. or 6 a.m. and last 8 hr each day, with a sampling rate of 9.6 kHz and a bit depth of 16. A short vocal syllable (Dufourq et al., 2020a) that lasts between 0.2 and 2.75 s is the smallest acoustic unit we consider. A phrase, as demonstrated in Figure 4, consists of one to six consecutive syllables separated by short intervals, and lasts between 1 and 11 s (see Figure 5a).

FIGURE 3 Overview of the CRNN network structure with components (1) convolutional layers, which extract features from a T-second spectrogram image sequence, (2) recurrent layers, which take the convolutional layers' output feature vectors stacked over a channel axis and (3) a fully connected layer as output layer with a sigmoid function to map raw output into a probability

TABLE 1 Hyperparameter configuration
There are typically longer pauses between phrases than between syllables within a phrase (Dufourq et al., 2020a). Without taking account of the structured nature of phrases, it is difficult to detect and separate adjacent phrases successfully in a fully automatic manner, as the duration of a phrase and intervals between phrases may vary substantially.
According to Dufourq et al. (2020a), there are ambient noises such as bird calls and rain events, which could affect classifier performance. Following Lin et al. (2013), we calculated an approximate gibbon phrase signal-to-noise ratio (SNR) as SNR = P_phrase_signal − P_noise, where P_phrase_signal is the root mean square (RMS) amplitude of each gibbon phrase, and P_noise is the RMS amplitude over the 1-s segment immediately before the beginning of each gibbon phrase. The SNRs of the gibbon phrases are well distributed from around −20 to 30 dB with a mean of around 2.6 dB (see Figure 5b), which makes the detection task challenging, as only a few phrases were observed with high SNR. Phrases would typically be the unit of interest for monitoring and therefore, in these data, it is phrases, not syllables, that are labelled. Our task here is to predict phrases. There is a total of 9,199 s of gibbon sound, consisting of 1,858 labelled phrases in the 28 acoustic files, and 797,201 s without gibbon sounds.
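The SNR computation can be sketched as follows, on a synthetic signal (the tone frequencies and amplitudes are invented; only the 9.6 kHz sampling rate comes from the dataset):

```python
import numpy as np

# Approximate per-phrase SNR: RMS amplitude (in dB) of the phrase minus RMS
# amplitude of the 1 s of audio immediately before it.

def rms_db(x):
    return 20 * np.log10(np.sqrt(np.mean(np.square(x))))

def phrase_snr(audio, start, end, rate):
    phrase = audio[int(start * rate):int(end * rate)]
    noise = audio[max(0, int((start - 1) * rate)):int(start * rate)]
    return rms_db(phrase) - rms_db(noise)

rate = 9600                                 # the dataset's 9.6 kHz rate
t = np.arange(0, 3, 1 / rate)
audio = 0.01 * np.sin(2 * np.pi * 50 * t)   # low-level background
audio[rate:2 * rate] += 0.1 * np.sin(2 * np.pi * 800 * t[rate:2 * rate])

print(round(phrase_snr(audio, 1.0, 2.0, rate), 1))  # ~20 dB
```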

| Data pre-processing
We divide each 8-hr recording into 1-s segments without overlap. The audio datasets are labelled per phrase, so that the start and end times of each phrase are known.

| Evaluation metrics
The performance of the gibbon vocalisation detection algorithm can be evaluated in a variety of ways. The most appropriate measure of performance will depend on the intended use. We focus here on the use of acoustic detectors to monitor populations and estimate their distribution and abundance. At the simplest level, monitoring involves counting the number of phrases per unit time, the encounter rate. If, as is common, the unit of detection is a phrase, then we want our method to accurately predict the number of phrases in a recording in order to monitor the phrase encounter rate.
Most (but not all) methods of estimating absolute wildlife abundance are designed to cope with false negatives (e.g. missed calls) but are sensitive to false positives (e.g. using sounds that are not calls of the target type). This means that these methods are generally not biased by low recall, but they may be biased by low precision.
When evaluating our methods we need to take into account both recall and precision.
We evaluate the methods in three ways. For the first two, we adopt the commonly used metrics of precision, recall and F-score,

precision = TP/(TP + FP),  recall = TP/(TP + FN),  F-score = 2 × precision × recall/(precision + recall),

where TP, FP and FN are the numbers of true positives, false positives and false negatives. We use these measures when predicting at the segment level and when predicting at the event (phrase) level. One widely used sound event evaluation method is the collar, which aims to match the start and end of a predicted event to a true event (Mesaros et al., 2016). Here, an event, or phrase, refers to a sequence of consecutive segments that share the same predicted label. The collar method is suited to tasks that require precise start and end times to be identified, which can be error-prone in our setting. Therefore, we propose a new way to evaluate predicted against observed phrases, using dynamic programming for sequence alignment (Eddy, 2004; Sellers, 1974). Our third measure is simply how well we predict the total number of phrases in an audio file (the encounter rate accuracy). For each method, we calculate the encounter error rate: the absolute difference between the predicted and true numbers of phrases, as a proportion of the true number.
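A hedged sketch of such phrase-level matching by sequence alignment is given below. The exact dynamic programme and overlap rule of our implementation are not reproduced here; this version matches predicted to labelled phrases in temporal order (an LCS-style alignment) and counts a pair as matched when their overlap exceeds a threshold fraction of the labelled phrase. All intervals are invented.

```python
# Order-preserving matching of predicted phrases to labelled phrases via
# dynamic programming, plus the derived precision, recall and encounter
# error rate. Not the paper's exact algorithm; an illustrative variant.

def overlaps(pred, true, min_frac=0.5):
    inter = min(pred[1], true[1]) - max(pred[0], true[0])
    return inter > min_frac * (true[1] - true[0])

def match_phrases(preds, trues, min_frac=0.5):
    """Maximum number of order-preserving matched (pred, true) pairs."""
    m, n = len(preds), len(trues)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 1 if overlaps(preds[i - 1], trues[j - 1], min_frac) else 0
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + match)
    return dp[m][n]

preds = [(0.5, 2.0), (4.0, 4.4), (6.0, 8.0)]   # made-up predictions
trues = [(0.0, 2.0), (6.5, 8.5)]               # made-up labels
tp = match_phrases(preds, trues)
precision, recall = tp / len(preds), tp / len(trues)
print(tp, precision, recall)                   # 2 matched phrases

encounter_error = abs(len(preds) - len(trues)) / len(trues)
print(encounter_error)                         # 0.5
```

Changing `min_frac` makes the matching more or less strict, mirroring the customisable overlap threshold discussed later in the paper.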

| Computational experiments to test performance
We use fourfold cross-validation, splitting our 28 separate audio recordings equally into four folds. We use one fold for testing and the remaining three for training and validation, and iterate four times so that each fold is used for testing once. The results are averaged over these four runs. Splitting on audio files prevents data leakage. For the CRNN, when forming segments into sequences during training, we use 50% overlap; that is, consecutive sequences have half of their segments in common. This increases the amount of training data. Even though our dataset is imbalanced, with the majority being non-gibbon sound, we have not employed any over- or down-sampling techniques, as they might either generate noisy data or break the temporal correlation.
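Forming training sequences with 50% overlap can be sketched as follows (T and the stand-in segments are arbitrary):

```python
# Consecutive sequences of T segments share T // 2 segments, increasing the
# number of training examples without altering temporal order.

def overlapping_sequences(segments, T):
    step = T // 2
    return [segments[i:i + T] for i in range(0, len(segments) - T + 1, step)]

segs = list(range(10))  # stand-ins for 1-s spectrogram segments
for seq in overlapping_sequences(segs, 4):
    print(seq)
```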
We compare the performance of our proposed methods with a classic technique commonly used in animal sound detection, namely an SVM with a radial basis function (RBF) kernel (Salamon et al., 2017) applied to Mel-frequency cepstral coefficients (MFCCs) for phrase prediction. We also consider the CNNs with spectrogram images as another baseline. This aims to demonstrate the strength of HMM post-processing in learning temporal relationships, thus leading to improved performance.
Since the main purpose of HMM post-processing is to smooth out some of the variations in predictions, we also compare the HMM against a simple moving average method, which averages the probability outputs over a sequence of segments from the CNN. We applied a grid search with a training/validation set to decide the sequence length, with the search range {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1,024}, together with confidence thresholds from {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. However, the experiments show that the moving average post-processing does not improve performance: the optimised sequence length is always learned to be 1, and thus it leads to the same result as the CNN or SVM.
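A sketch of this moving-average baseline (probabilities invented): with a window length of 1 the output equals the input, which is why an optimised length of 1 reduces the baseline to the plain CNN or SVM.

```python
import numpy as np

# Smooth per-segment probabilities over a window of `length` segments
# before thresholding.

def moving_average(probs, length):
    kernel = np.ones(length) / length
    return np.convolve(probs, kernel, mode="same")

probs = np.array([0.1, 0.9, 0.2, 0.8, 0.9, 0.1])
print(moving_average(probs, 1))   # unchanged
print(moving_average(probs, 3))   # smoothed
```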
Experiments are done on a machine running Ubuntu 20.04 LTS with an Intel i7-9700 CPU, 32 GB of RAM and an Nvidia RTX 2060 Super 8-GB graphics processing unit.

| RESULTS
In this section, we compare methods with segment-based evaluation metrics, the proposed phrase-based evaluation metrics, and the encounter error rate based on the original gibbon dataset. We also compare the performance of different methods with simulated datasets, as described in the previous section.

| Segment-based performance
As shown in Figure 6, a CRNN with a sequence length of 400 achieves the best F-score, precision and recall among all the methods. Its F-score is between 10% and 37% higher than those of VGGNet, LeNet and the SVM with MFCC features. This result demonstrates the strength of a CRNN in learning temporal relationships between segments.
In contrast, the HMM, which also has the ability to learn temporal correlation, only improved the performance modestly. More specifically, an HMM with a Bernoulli emission probability increases F-scores on VGGNet, LeNet and SVM only by about 4%. The HMM with a Gaussian mixture model (GMM) emission probability increases the recall for the VGGNet and LeNet models but decreases the precision as well as the F-score. Therefore, we conclude that an HMM with a Bernoulli emission probability is better than an HMM with a GMM distributed emission probability for refining segment predictions.

FIGURE 6 The segment-based precision, recall and F-score for the CRNN with sequence length 400 (the best performance among different sequence lengths in our experiment), two CNNs (VGGNet and LeNet), CNNs with HMM post-processing, a baseline SVM model, and SVM with HMM (Bernoulli) post-processing

The effect of sequence length on the performance of the CRNN is shown in Figure 7. The F-score increases gradually with sequence length, up to a length of 400. The CRNN with a sequence length of 400 (CRNN-400) achieves the best F-score of 87.77%. There is a trade-off between precision and recall as functions of sequence length, although both show an overall upward trend.

Figure 8 compares the phrase prediction performance of each method. The CRNN achieves the best F-score and recall, and the second-best precision. HMM post-processing greatly improves the performance of the CNNs. Specifically, the HMM with a Bernoulli emission probability boosts LeNet's and the SVM's precision from 60% and 45% to 95% and 86% respectively. This makes it attractive for use with common abundance estimation methods that do not account for false positives. As with the segment-based results, all the CNNs perform better than the classic SVM.

| Robustness
First, we assess the impact of the SNR on performance. To do this, we collect gibbon phrases together and categorise their SNRs into bins, then calculate the detection and non-detection frequencies for phrases in each bin. Intuitively, a higher SNR indicates a stronger signal, which should lead to a higher proportion of detections. As we can see in Figure 10, the CRNN outperforms all the other methods at all levels of SNR. Its detection proportion is lower when the SNR is low, but is still much better than that of the other methods. The SVM is worst; for example, when the SNR is around −20 dB, its detection proportion can drop to zero. The HMM lowers the detection rates of LeNet and SVM: on investigating further, we found that although the HMM is better at dealing with false positive phrase predictions, it also smooths out some true positive phrase predictions.
As we do not have access to other suitably hierarchically structured datasets, we created perturbed versions of our test data: we employed commonly used data transformation methods, including time stretching, pitch shifting and random cropping of the test data only, to simulate adverse conditions in the real world, such as missing acoustic signals or malfunctioning microphones. We train our models on non-augmented data and then test them on the augmented data to see to what degree our methods are affected. We present our results in Figure 11 and observe that our methods are more robust to these perturbations than the LeNet or SVM. For pitch shifting, the LeNet with an HMM performs best; its F-score drops by no more than 8% on both segment- and phrase-based prediction when the pitch is shifted by 1.

| DISCUSSION

The methods we develop above perform substantially better at this task than CNNs and SVMs. A CRNN performs best at predicting both segments and phrases, and a CNN combined with an HMM performs next best. We note that an HMM is much more computationally efficient than a CRNN: the post-processing HMM with a Bernoulli emission probability takes 3.52 s to train and 0.31 s to predict, and the HMM with a GMM takes 1,205 s to train and 0.56 s to predict. The HMM with a GMM takes much longer than the HMM with a Bernoulli because we need to decide the number of Gaussian components in the GMM with BIC through a grid search. In comparison, the CRNN-400 takes about 10 hr to train and 62 s to predict. Also, HMMs can be added as a post-processing step to any pre-trained segment-based machine learning or deep learning method with no additional modelling.
CNNs and SVMs perform well if an appropriate segment length can be selected (one that is long enough to include phrases but not so long as to include multiple phrases). However, when intervals between phrases are variable and not consistently larger than intervals within phrases, the choice of segment length becomes difficult and case-specific, and an incorrect choice may lead to either over- or underestimating the number of phrases. Our methods perform well in such cases and are shown to be more robust to perturbations of the acoustic recordings.
We also proposed and implemented a way of evaluating prediction performance that measures how well phrases are predicted, rather than how good predictions are at the somewhat arbitrary time unit into which acoustic files are segmented for the application of machine learning (ML) methods. The method is also preferable to the commonly used collar-based method because it does not require phrase start times to be identified (something that can be error-prone), and it is less sensitive to annotation ambiguity. Users can customise the evaluation by changing the threshold that defines the overlap of predicted and labelled phrases, making the matching of phrases more or less strict.
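A simplified version of such an overlap-threshold evaluation can be sketched as follows. This greedy matcher is our own illustration, not the dynamic programming algorithm proposed in the paper, and the `min_overlap` default is an arbitrary choice.

```python
# Simplified sketch (greedy matching, not the paper's dynamic programming
# algorithm): count a predicted phrase as a true positive when its temporal
# overlap with an unmatched labelled phrase exceeds a user-chosen threshold.
def match_phrases(predicted, labelled, min_overlap=0.5):
    """predicted, labelled: lists of (start, end) times in seconds.
    min_overlap: required overlap as a fraction of the labelled phrase.
    Returns (true_positives, false_positives, false_negatives)."""
    unmatched = list(labelled)
    tp = 0
    for ps, pe in predicted:
        for lab in unmatched:
            ls, le = lab
            overlap = max(0.0, min(pe, le) - max(ps, ls))
            if overlap / (le - ls) >= min_overlap:
                unmatched.remove(lab)  # each labelled phrase matches at most once
                tp += 1
                break
    fp = len(predicted) - tp
    fn = len(unmatched)
    return tp, fp, fn

# Toy example: two of three predictions overlap a labelled phrase enough
pred = [(0.0, 2.0), (3.0, 3.5), (6.0, 8.0)]
lab = [(0.5, 2.5), (6.5, 8.5)]
print(match_phrases(pred, lab))  # → (2, 1, 0)
```

Raising `min_overlap` makes the matching stricter; note also that no phrase start times need to align exactly, only the overlap fractions.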
Finally, although we have only applied ML methods to phrases of variable length that contain a variable number of syllables, we anticipate that the CRNN and CNN with HMM methods will perform well on phrases comprising continuous vocalisations of variable length too, as there is nothing in these methods that requires phrases to be composed of separate syllables. When phrases are of variable length, methods based on recognising vocalisations in segments or windows of fixed length will tend to break phrases into multiple parts if segments or windows are small, and so over-estimate the number of phrases, or to combine phrases with periods of non-vocalisation if the segments or windows are large, and so under-estimate the number of phrases.
The CRNN and CNN with HMM methods do not suffer from this problem. Our methods might also be useful for other animal species whose calls share similar acoustic characteristics, including birds (Chen & Maher, 2006; Somervuo et al., 2006) and whales (Bergler et al., 2019; Jiang et al., 2019); however, this will need further validation.

CONFLICT OF INTEREST
None of the authors have a conflict of interest.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.13873.