Large-scale analysis of frequency modulation in birdsong databases

Birdsong often contains large amounts of rapid frequency modulation (FM). It is believed that the use or otherwise of FM is adaptive to the acoustic environment, and also that there are specific social uses of FM such as trills in aggressive territorial encounters. Yet temporal fine detail of FM is often absent or obscured in standard audio signal analysis methods such as Fourier analysis or linear prediction. Hence it is important to consider high resolution signal processing techniques for analysis of FM in bird vocalisations. If such methods can be applied at big data scales, this offers a further advantage as large datasets become available. We introduce methods from the signal processing literature which can go beyond spectrogram representations to analyse the fine modulations present in a signal at very short timescales. Focusing primarily on the genus Phylloscopus, we investigate which of a set of four analysis methods most strongly captures the species signal encoded in birdsong. In order to find tools useful in practical analysis of large databases, we also study the computational time taken by the methods, and their robustness to additive noise and MP3 compression. We find three methods which can robustly represent species-correlated FM attributes, and that the simplest method tested also appears to perform the best. We find that features representing the extremes of FM encode species identity supplementary to that captured in frequency features, whereas bandwidth features do not encode additional information. Large-scale FM analysis can efficiently extract information useful for bioacoustic studies, in addition to measures more commonly used to characterise vocalisations.


Introduction
Frequency modulation (FM) is an important component of much birdsong: various species of bird can discriminate the fine detail of frequency-chirped signals (Dooling et al., 2002;Lohr et al., 2006), and use fine FM information as part of their social interactions (Trillo & Vehrencamp, 2005;de Kort et al., 2009).Use of FM is also strongly species-dependent, in part due to adaptation of birds to their acoustic environment (Brumm & Naguib, 2009;Ey & Fischer, 2009) .Songbirds have specific musculature around the syrinx which endows them with independent fine control over frequency (Goller & Riede, 2012).They can control the two sides of their syrinx largely independently: a sequence of two tones might be produced by each side separately, or by one side alone, a difference shown by the absence/presence of brief FM "slurs" between notes (Marler & Slabbekoorn, 2004, e.g. Figure 9.8).Therefore, if we can analyse bird vocalisation recordings to characterise the use of FM across species and situations, this information could cast light upon acoustic adaptations and communicative issues in bird vocalisations.As Slabbekoorn et al. (2002) concluded, "Measuring note slopes [FM], as well as other more traditional acoustic measures, may be important for comparative studies addressing these evolutionary processes in the future." Frequency analysis of birdsong is typically carried out using the short-time Fourier transform (STFT) and displayed as a spectrogram.FM can be observed implicitly in spectrograms, especially at slower modulation rates.However, FM data are rarely explicitly quantified in bioacoustics analyses of birdsong (one exception is Gall et al. (2012)), although the amount of FM is partly implicit in measurements such as the rate of syllables and the bandwidth (e.g. in Podos (1997); Vehrencamp et al. (2013)).
The relative absence of fine FM analysis may be due to the difficulty in extracting good estimates of FM rates from spectrograms, especially with large data volumes.Some previous work has indicated that the FM data extracted from a chirplet representation can improve the accuracy of a bird species classifier (Stowell & Plumbley, 2012).However, there exists a variety of signal processing techniques which can characterise frequency-modulated sounds, and no formal study has considered their relative merits for bird vocalisation analysis.
In the present work we aim to facilitate the use of direct FM measurements in bird bioacoustics, by conducting a formal comparison of four methods for characterising FM.Each of these methods goes beyond the standard spectrogram analysis to capture detail of local modulations in a signal on a fine time-scale.To explore the merits of these methods we will use the machine learning technique of feature selection (Witten & Frank, 2005) for a species classification task.
In the present work our focus is on methods that can be used with large bird vocalisation databases.Many hypotheses about vocalisations could be explored using FM information, most fruitfully if data can be analysed at relatively large scale.For this reason, we will describe an analysis workflow for audio which is simple enough to be fully automatic and to run across a large number of files.We will consider the runtime of the analysis techniques as well as the characteristics of the statistics they extract.
The genus Phylloscopus (leaf warblers) has been studied previously for evidence of adaptive song variation.For example Irwin et al. (2008) studied divergence of vocalisation in a ring species (Phylloscopus trochiloides), suggesting that stochastic genetic drift may be a major factor in diversity of vocalisations.Mahler & Gil (2009) found correlations between aspects of frequency range and body size across the Phylloscopus genus.They also considered character displacement effects, which one might expect to cause the song of sympatric species to diverge, but found no significant such effect on the song features they measured.Linhart et al. (2012) studied Phylloscopus collybita, also finding a connection between song frequency and body size.Such research context motivated our choice to use Phylloscopus as our primary focus in this study, in order to develop signal analysis methods that might provide further data on song structure.However, we also conducted a larger-scale FM analysis using a database with samples representing species across the wider order of Passeriformes.We first discuss the FM analysis methods to be considered.

FM analysis methods
For many purposes, the standard representation of audio signals is the spectrogram, calculated from the magnitudes of the windowed short-time Fourier transform (STFT).The STFT is applied to each windowed "frame" of the signal (of duration typically 10 or 20 ms), resulting in a representation of variations across time and frequency.The spectrogram is widely used in bioacoustics, and a wide variety of measures are derived from this, manually or automatically: it is common to measure the minimum and maximum frequencies in each recording or each syllable, as well as durations, amplitudes and so forth (Marler & Slabbekoorn, 2004).Notable for the present work is the FM rate measure of Gall et al. (2012), derived from manual identification of frequency inflection points (i.e. points at which the modulation changes from upward to downward, or downward to upward) on a spectrogram.Trillo & Vehrencamp (2005) characterise "trill vigour" in a related manner but applicable only to trilled syllables.For fully automatic analysis, in Section 2.2 we will describe a method related to that of Gall et al. (2012) but with no manual intervention.
The spectrogram is a widespread tool, but it does come with some limitations.Analysing a 10 or 20 ms frame with the STFT implies the assumption that the signal is locally stationary (or pseudo-stationary), meaning it is produced by a process whose parameters (such as the fundamental frequency) do not change across the duration of the individual frame (Mallat, 1999, Section 10.6.3).However, many songbirds sing with very dramatic and fast FM (as well as AM), which may mean that the local stationarity assumption is violated and that there is fine-resolution FM which cannot be represented with a standard spectrogram.
Signal analysis is under-determined in general: many different processes can in principle produce the same audio signal.Hence the representations derived by STFT and LPC analysis are but two families of possible "explanation" for the observed signal.A large body of research in signal processing has considered alternative representations, tailored to various classes of signal including signals with fast FM.One recent example which was specifically described in the context of birdsong is that of Stowell & Plumbley (2012), which uses a kind of chirplet analysis to add an extra chirp-rate dimension to a spectrogram.A "chirplet" is a short-time packet of signal having a central frequency, amplitude, and a parametric chirp-rate which modulates the frequency over time.
More generally, the field of sparse representations allows one to define a "dictionary" of a large number of elements from which a signal may be composed, and then to analyse the signal into a small number of components selected from the dictionary (Plumbley et al., 2010).For the present purposes, notable is the method of Gribonval (2001) which applies an accelerated version of a technique known as Matching Pursuit specifically adapted to analyse a signal as a sparse combination of chirplets.
Alternative paradigms are also candidates for performing high-resolution FM analysis.One paradigm is that of spectral reassignment, based on the idea that after performing an STFT analysis it is possible to "reassign" the resulting list of frequencies and magnitudes to shift them to positions which are in some sense a better fit to the evidence (Fulop & Fitz, 2006).The distribution derivative method (DDM) of Muševič (2013, Chapter 10) is one such approach which is able to reassign a spectrum to find the best-matching parameters on the assumption that the signal is composed of amplitude-and frequency-modulated sinusoids.
Another approach is that of Badeau et al. (2006) which uses a subspace model to achieve high-resolution characterisation of signals with smooth modulations.However, there may be limitations on the rate of FM that can be reflected faithfully: this method relies on a smoothness assumption in the frameto-frame evolution of the sound which means that it is most suited to relatively moderate rates of FM, such as the vibrato in human singing.
In the following we will apply a selection of analysis techniques to birdsong recordings, and study whether the FM information extracted is a reliable signal of species identity.This is not the only application for which FM information is relevant: our aim is that this exploration will encourage other researchers to add high-resolution FM analysis to their toolbox.

Data
We first collected a set of recordings of birds in the genus Phylloscopus from a dataset made available by the Animal Sound Archive in Berlin.1This consisted of 45 recordings over 5 species, in WAV format, with durations ranging from 34 seconds to 19 minutes.In the following we will refer to this dataset as PhyllASA.
As a second dataset, we also considered a broader set of audio from the Animal Sound Archive, not confined to Phylloscopus but across the order Passeriformes (762 recordings over 84 species).We will refer to this as PassaASA.
Thirdly we collected a larger Phylloscopus dataset from the online archive Xeno Canto. 2 This consisted of 1390 recordings across 56 species, ranging widely in duration from one second to seven minutes.Our criteria for selecting files from the larger Xeno Canto archive were: genus Phylloscopus; quality level A or B (the top two quality ratings); not flagged as having uncertain species identity.In the following we will refer to this dataset as PhyllXC .
Note that the "crowdsourced" Xeno Canto dataset is qualitatively different from PhyllASA.Firstly it was compiled from various contributors online, and so is not as tightly controlled.The noise conditions and recording quality can vary widely.Secondly, all audio content is compressed in MP3 format (with original uncompressed audio typically unavailable).The MP3 format reduces filesize by discarding information which is considered unnecessary for audio quality as judged by human perception (International Standards Organisation, 1993).However, human and avian audition differ in important ways, including time and frequency resolution, and we cannot assume that MP3 compression is "transparent" regarding the species-specific information that might be important in bird communication.Hence in our study we used this large crowdsourced MP3 dataset only after testing experimentally the impact of compression and signal degradation on the features we measured (using the PhyllASA data).
For each dataset considered here, we resampled audio files to 48 kHz mono WAV format before processing, and truncated long files to a maximum duration of 5 minutes.All of the datasets contain an uneven distribution, with some species represented in more recordings than others (Table 1).This is quite common but carries implications for the evaluation of automatic classification, as will be discussed below.

Method
For all analysis methods we used a frame size of 512 samples (10.7 milliseconds, at 48 kHz), with Hann windowing for STFT, and the frequency range of interest was restricted to 2-10 kHz.For each recording in each dataset, we applied a fully automatic analysis using each of four signal processing techniques.Our requirement of full automation excludes a preprocessing step of manually segmenting of birdsong syllables from the background.We chose to use the simplest form of automatic segmentation, simply to select the 10% of highestenergy frames in each recording.More sophisticated procedures can be applied in future; however, as well as simplicity this method has an advantage of speed when analysing large databases.We analysed each recording using each of the following techniques (which we assign two-letter identifiers for reference): ss: a spectrographic method related to the method of Gall et al. (2012) but with no manual intervention, as follows.Given a sample of birdsong, for every temporal frame we identify the frequency having peak energy, within the frequency region of interest.We calculate the absolute value of the first difference, i.e. the magnitude of the frequency jump between successive frames.We then summarise this by the median or other statistics, to characterise the distribution over the depth of FM present in each recording.This method relies on the peak-energy within each frame rather than manual identification of inflection points in the pitch trace, which means that it is potentially susceptible to noise and other corruptions, but it remains a relatively robust technique which can be applied to a standard spectrogram representation.In the following we will refer to this method as the "simple spectrographic" method.
rm: the heterodyne (ring modulation) chirplet analysis of Stowell & Plumbley (2012), taking information from the peak-energy detection in each frame.3mp: the Matching Pursuit technique of Gribonval (2001), implemented using the open-source Matching Pursuit ToolKit (MPTK) v0.7. 4 For this technique the 10% highest-energy threshold is not applicable, since the method is iterative and could return many more results than there are signal frames: we automatically set a threshold at a number of results which recovers roughly the same amount of signal as the 10% threshold.
dd: the distribution derivative method (DDM) of Muševič (2013, Chapter 10), taking information from the peak-energy sinusoid detected in each frame.5 We also conducted a preliminary test with the subspace method of Badeau et al. (2006), but this proved to be inappropriate for the rapid FM modulations found in birdsong because of an assumption of smooth FM variation inherent in the method (Badeau, pers. comm.).
Each of these methods resulted in a list of "frames" or "atoms" for a recording, each with an associated frequency and FM rate.In order to characterise each recording as a whole, we selected summary statistics over these frames in a recording to use as features.We summarised the frequency data by their median, and by their 5-and 95-percentiles.The 5-and 95-percentiles are robust measures of minimum and maximum frequency; we also calculated the "bandwidth" as the difference between the 5-and 95-percentile.We summarised the FM data by their median, and also by their 75-and 95-percentiles.These percentiles were chosen to explore whether information about the relative extremes of FM found in the recording provide useful extra information.
So, for each recording and each analysis method we can extract a set of frequency and FM summary features.It remains to determine which of these features might be most useful in looking for signals of species identity in recorded bird vocalisations.We explored this through two interrelated approaches: feature selection, and automatic classification experiments.Through these two approaches, we were able to compare the different features against each other, and also compare the features as extracted by each of the four signal-processing techniques given above.
One approach that has been used to explore the value of different features is principal components analysis (PCA) applied to the features, to determine axes that represent the strongest dimensions of variance in the features (see e.g.Mahler & Gil (2009); Handford & Lougheed (1991)).This method is widespread and well-understood.However, it is a purely linear analysis which may fail to reflect nonlinear information-carrying patterns in the data; and more importantly for our purposes, PCA does not take into account the known species labels, and so can only ever serve as indirect illumination on questions about which features might carry such information.
In the field of data mining/machine learning, researchers instead use feature selection techniques to evaluate directly the predictive power that a feature (or a set of features) has with respect to some attribute (Witten & Frank, 2005).We used an information-theoretic feature selection technique from that field.In information gain feature selection, each of our features is evaluated by measuring the information gain with respect to the species label, which is the amount by which the feature reduces our uncertainty in the label: where H(•) is the Shannon entropy.The value H(Species) represents the number of binary bits of information that must typically be conveyed in order to identify the species of an individual (from a fixed set of species).The information gain IG(Species, Feature) then tells us how many of those binary bits are already encoded in a particular feature, i.e. the extent to which that feature reduces the uncertainty of the species identity.If a feature is repeatedly ranked highly, this means that it contains a stronger signal of species identity than lowerranked features and thus suggests it should be a useful measure.The approach just described is reminiscent of the information-theoretic method introduced by Beecher (1989), except that his concern was with signals of individual identity rather than species identity.
Having performed feature selection, we were then able to choose promising subsets of features which might concisely represent species information.To evaluate these subsets concretely we conducted an experiment in automatic species classification.For this we used a leading classification algorithm, the Support Vector Machine (SVM), implemented in the libsvm library version 3.1, choosing the standard radial basis function SVM classifier.The evaluation statistic we used was the weighted "area under the receiver operating characteristics curve" (the weighted AUC ), which summarises the rates of true-positive and falsepositive detections made (Fawcett, 2006).This measure is more appropriate than raw accuracy, when analysing datasets with wide variation in numbers per class as in the present case (ibid.).The AUC yields the same information as the Wilcoxon signed-rank statistic (Hanley & McNeil, 1982).The feature selection and classification experiments were all performed using Weka 3.6.0(Witten & Frank, 2005), and analysed using R version 2.13.1 (R Development Core Team, 2010).
An important issue when considering automatic feature extraction is the robustness of the features to corruptions that may be found in audio databases, such as background noise or MP3 compression artifacts.This has particular pertinence for the crowdsourced PhyllXC dataset, as discussed above.For this reason, we also studied our first dataset after putting the audio files through two corruption processes: added white noise (−45 dB relative to full-scale, judged by ear to be noticeable but not overwhelming), and MP3 compression (64 kbps, using the lame software library version 3.99.5).To quantify whether an audio feature was badly impacted by such corruption, we measured the Pearson correlations of the features measured on the original dataset with their corrupted equivalent.This test does not depend on species identity as in our main experimental tests, but simply on the numerical stability of the summary statistics we consider.
In this study we focussed on frequency and FM characteristics of sounds, both of which can be extracted completely automatically from short time frames.We did not include macro-level features such as syllable lengths or syllable rates, because reliable automatic extraction of these is complex.Rather, we compared the fine-detail FM analyses against frequency measures, the latter being common in the bioacoustics literature: our feature set included features corresponding to the lower, central and upper frequency, and frequency bandwidth.

Results
We first illustrate the data which is produced by the analysis methods tested, using a recording of Phylloscopus collybita (Chiffchaff) from PhyllASA as an example.Figure 1 shows a conventional spectrogram plot for our chosen excerpt.We can infer FM characteristics visually, but the underlying data (a grid of intensity "pixels") does not directly present FM for analysis.Figure 2 represents the same excerpt analysed by each of the methods we consider.Each of the plots appears similar to a conventional spectrogram, showing the presence of energy at particular time and frequency locations.However, instead of a uniform grid the image is created from a set of line segments, each segment having a location in time and frequency but also a slope.It is clear from Figure 2 that each of the methods can build up a portrait of the birdsong syllables, although some are more readable than others.The plot from mp appears more fragmented than the others.This can be traced back to the details of the method used, but for now we merely note that the apparent neatness of each representation does not necessarily indicate which method most usefully captures species-specific FM characteristics.
The relative speeds of the analysis methods described here are given in Table 2.The simple spectrogram method is by far the fastest, as is to be expected given its simplicity.All but one of the methods run much faster than realtime, though the difference in speed between the simple spectrogram and the more advanced methods is notable, and certainly pertinent when considering the analysis of large databases.
Features extracted by methods ss rm and dd were highly robust to the noise and MP3 degradations applied, in all cases having a correlation with the original features better than 0.95 (Figure 3).Method rm showed particularly strong robustness.The mp method, on the other hand, yielded features of very  collybita).The FM can be seen by eye but is not explicit in the underlying data, being spread across many "pixels".low robustness: correlation with the original features was never above 0.95, in some cases going as low as to be around zero.This indicates that features from the mp method may be generally unreliable when applied to the PhyllXC dataset considered next.
Our feature selection experiments revealed notable trends in the information gain (IG) values associated with certain features, with broad commonalities across the three datasets tested (see Appendix for details).In particular, the bandwidth features achieve very low IG values in all cases.Conversely, the median frequency feature performs strongly for all datasets and all methods.The FM features perform relatively strongly on PhyllASA, appearing generally stronger than frequency features, but this pattern does not persist into the other (larger) datasets.However, the 75-percentile of FM did generally rank highly in the feature selection results.
Based on the results of feature selection, we chose to take the following four feature sets forward to the classification experiment: • Three FM features (fm med, fm 75pc, fm 95pc); • Three frequency-based features (freq 05pc, freq med, freq 95pc); Table 2: Time taken to run each analysis method on our first dataset PhyllASA, expressed as a proportion of the total duration of the audio files (so that any number below 1 indicates faster than real-time processing).Times were measured on a laptop with Intel i5 2.5 GHz processor.
Method Time taken (relative to audio duration) ss 0.02 rm 0.40 mp 0.58 dd 1.22 • The "Top-2" performing features (freq med, fm 75pc); • All six FM and frequency-based features together.
We did not include the poorly-performing bandwidth features.This yielded an advantage that the FM and frequency-based features had the same cardinality, ensuring the fairness of our experimental comparison of the two feature types.
Results for the classification experiment with different extraction methods and different feature subsets are shown in Figure 4 and Table 3.This is a difficult classification task (across 56 species), and the average AUC score in this case peaks at around 70%.A repeated-measures factorial ANOVA confirmed, for both datasets, a significant effect on accuracy for both feature set (p < 2 × 10 −16 ) and method (p ≤ 1.2 × 10 −6 ), with no significant interaction term found (p > 0.07).
We conducted post-hoc tests for differences in AUC between pairs of methods and pairs of feature-sets, using paired t-tests with Bonferroni correction for all pairwise comparisons (this is a repeated-measures alternative to the Tukey HSD test).Means were found to be different (p < 0.0035) for all pairs of methods except ss vs. dd (ss ≈ dd > rm > mp ).For the choice of feature set, means were found to be different (p < 2.2 × 10 −6 ) for all pairs of feature sets except Top-2 vs. Freq (FM+Freq > Freq ≈ Top-2 > FM).

Discussion
The fine detail of frequency modulation (FM) is known to be used by various songbird species to carry information (Marler & Slabbekoorn (2004, Chapter 7); Brumm & Naguib (2009); Sprau et al. (2010); Vehrencamp et al. (2013)), but automatic tools for analysis of such FM are not yet commonly used.Our experiments have demonstrated that FM information can be extracted efficiently from large datasets, in a fashion which captures species-related information despite the simplicity of method (we used no source-separation, syllable segmentation or pitch tracking).This was explicitly designed for application on large collections: our experiments used up to 1390 individual recordings, larger numbers than in many bioacoustic studies.

Noise MP3
Figure 3: Squared Pearson correlation between audio features and their values after applying audio degradation, across the PhyllASA dataset.Each point represents one feature; features are grouped by analysis method and degradation type.We inspected the variation according to feature, and found no general tendencies; therefore features are collapsed into a single column per analysis method in order to visualise the differences in range.Note that the vertical axis is warped to enhance visibility at the top end of the scale.
Our results show an effect of the choice of summary features, both for frequency and for FM data.The consistently strongest-performing summary feature was the median frequency, which is similar to measures of central tendency used elsewhere in the literature and can be held to represent a bird's central "typical" frequency.On the contrary, we were surprised to find that bandwidth measurements as implemented in our study showed rather little predictive power for species identity, since bandwidth has often been discussed with respect to the variation in vocal capacities across avian species (Podos, 1997;Trillo & Vehrencamp, 2005;Mahler & Gil, 2009).In our case the upper frequency extent alone (represented by the 95-percentile) appears more reliable, which may reflect the importance of production limits in the highest frequencies in song.
The FM features, taken alone, were not as predictive of species identity as were the frequency features.However, they provided a significant boost in predictive power when appended to the frequency features.This tells us not only that FM features encode aspects of species identity, but they encode complementary information which is not captured in the frequency measurements.
In light of our results we note that Trillo & Vehrencamp (2005) explored a measure of "trill vigour": "Because of the known production constraint tradeoff between note rate and bandwidth of trilled songs (Podos 1997), we derived an index of trill vigour by multiplying the standardized scores of these two q q q q ss rm mp   parameters" (Trillo & Vehrencamp, 2005, p. 925).This index was not further pursued since in their study it yielded similar results as the raw bandwidth data.However, if we assume for the moment that each note in the trills studied by Trillo & Vehrencamp is one full sweep of the bandwidth of the trill (this is the case for all except "hooked" trills), then multiplying the bandwidth (in Hz) by the note rate (in sec −1 ) yields exactly the mean value of the instantaneous absolute FM rate (in Hz/sec).This "trill vigour" calculation is thus very close in spirit to our measurement of the median FM rate.Their comparison of bandwidth features against trill vigour features served for them as a kind of feature selection, although in their case the focus was on trills in a single species.
A further aspect of our study is the comparison of four different methods for extracting FM data.A clear result emerges from this, which is that the simplest method (ss) attains the strongest classification results (tied with method dd), and is sufficiently robust to the degradations we tested.This should be taken together with the observation that it runs at least 20 times faster than any of the other methods on the same audio data, to yield a strong recommendation for the ss method.
This outcome came as a surprise to us, especially considering the simplifying assumptions implicit in the ss method.It considers the peak-amplitude frequencies found in adjacent STFT frames (i.e. in adjacent "slices" of a spectrogram), which may in many cases relate to the fundamental frequency of the bird vocalisation, but can often happen to relate to a harmonic, or a chance fluctuation in background noise.It contains no larger-scale corrections for continuity, as might be used in pitch-tracking-type methods (though note that as we found with the method of Badeau et al. (2006), those methods can incur difficulties tracking fast modulations).
The statistical strength of simple methods has been studied elsewhere in the literature.For example Kershenbaum et al. (2013) found that bottlenose dolphin signature whistles could usefully be summarised by a strongly decimated representation of the pitch track: a so-called "Parsons code" based on whether the pitch is rising or falling at a particular timescale, and which completely omits the magnitude of such rises or falls.The method is not analogous to ours, but has in common that it uses suprisingly simple statistics to summarise temporal variation.Audio "fingerprinting" systems such as Shazam (Wang, 2003) also rely on highly-reduced summary data, customised to the audio domain of interest.
Our ss method relies on finding a temporal difference between adjacent frames, as does that of Kershenbaum et al. (2013).This is partly reminiscent of the "delta" features often added to MFCCs to reflect how they may be changing.Such deltas are common in speech recognition and are also used in some automatic species classification (for example Trifa et al. (2008)).However note that MFCC "deltas" represent differences in magnitude, not in frequency.
Separately from the classification experiment, we studied the effects of noise and MP3 degradation on our summary features.Such issues are pertinent for crowdsourced datasets such as PhyllXC.
Measures such as minimum and maximum frequency carry some risk of dependence on recording conditions, particularly when derived from manual inspection of spectrograms (Zollinger et al., 2012;Cardoso & Atwell, 2012).We have demonstrated that our automatic FM measures using methods rm, dd or ss are robust against two common types of degradation (noise and compression), with rm particularly robust.They are therefore suitable tools to explore the variation in songbirds' use of FM in the laboratory and in the field.
Future work: in this study we did not use any higher-level temporal modelling such as the temporal structure of trill syllables, nor did we use advanced methods for segmenting song/call syllables from background.We have demonstrated the utility of fully automatic extraction of fine temporal structure information, and in future work we aim to combine this with richer modelling of other aspects of vocalisation.We also look forward to combining fine FM analysis with physiological models of the songbird vocal production mechanism-as has already been done with linear prediction for the source-filter model (Markel, 1972)-but explicitly accounting for songbirds' capacity for rapid nonstationary modulation and their use of two separate sound sources in the syrinx.

Conclusions
In much research involving acoustic analysis of birdsong, frequency modulation (FM) has been measured manually, described qualitatively or left implicit in other measurements such as bandwidth.We have demonstrated that it is possible to extract data about FM on a fine temporal scale, from large audio databases, in fully automatic fashion, and that this data encodes aspects of ecologically pertinent information such as species identity.Further, we have demonstrated that a relatively simple technique based on spectrogram data is sufficient to extract information pertinent to species, which one might expect could only be extracted with more advanced signal-processing techniques.Our study provides evidence that researchers can and should measure such FM characteristics when analysing the acoustic characteristics of bird vocalisations.

Appendix: Feature selection results
We performed feature selection on each of our three datasets, evaluated using Information Gain (IG) as described in the main text (Table 4, Figure 5).The overall pattern of IG values shows broad similarities between PhyllASA and PhyllXC, indicating commonalities in species-dependent features.The IG values evaluated on PhyllXC are consistently lower than those in PhyllASA, suggesting that the species information in the former may be affected by noise and MP3 effects.However, the tendency to lower IG values may also be an artefact of differences in species distribution within each dataset.

Figure 1 :
Figure 1: Standard spectrogram for a short excerpt of Chiffchaff (Phylloscopuscollybita).The FM can be seen by eye but is not explicit in the underlying data, being spread across many "pixels".

Figure 2 :
Figure2: Time-frequency plots of the "chirp" data recovered by each method, for the same excerpt as in Figure1.

Figure 4 :
Figure 4: Performance of species classification across 56 species, evaluated using datasets PassaASA (upper) and PhyllXC (lower).Results are shown for each analysis method, and for four different subsets of the available features (see text for details).The horizontal dashed line indicates the baseline chance performance at 50%.

Table 1 :
Counts of species occurrence in our three datasets.Note that PhyllASA is a subset of PassaASA, as reflected in the counts.

Table 3 :
Marginal mean of the weighted area under the curve (AUC) scores for the results shown in Figure4.

Table 4 :
Ranked results of information-gain (IG) feature selection applied to each of our three datasets.Features are ranked in order of how strongly they predict species identity.Left to right: PhyllASA, PassaASA, PhyllXC.Overview of information gain (IG) values calculated during feature selection; as in Table4but ordered by feature type.See Table4for numerical values