How index selection, compression, and recording schedule impact the description of ecological soundscapes

Abstract Acoustic indices derived from environmental soundscape recordings are increasingly used to monitor ecosystem health and vocal animal biodiversity. Soundscape data can quickly become expensive to store and difficult to manage, so data compression or temporal down-sampling are sometimes employed to reduce data storage and transmission costs. These parameters vary widely between experiments, and the consequences of this variation remain mostly unknown. We analyse field recordings from North-Eastern Borneo across a gradient of historical land use. We quantify the impact of experimental parameters (MP3 compression, recording length, and temporal subsetting) on soundscape descriptors (Analytical Indices and a convolutional-neural-network-derived AudioSet Fingerprint). Both descriptor types were tested for their robustness to parameter alteration and their usability in a soundscape classification task. We find that compression and recording length both drive considerable variation in calculated index values. However, the effects of this variation and of temporal subsetting on the performance of classification models are minor: performance is much more strongly determined by acoustic index choice, with AudioSet Fingerprinting offering substantially greater (12%–16%) classifier accuracy, precision, and recall. We advise using the AudioSet Fingerprint in soundscape analysis, finding superior and consistent performance even on small pools of data. If data storage is a bottleneck to a study, we recommend Variable Bit Rate encoded compression (quality = 0), which reduces files to 23% of their original size without affecting most Analytical Index values. The AudioSet Fingerprint tolerates further compression to a Constant Bit Rate encoding of 64 kb/s (8% of the original file size) without any detectable effect. These recommendations allow the efficient use of restricted data storage whilst permitting comparability of results between different studies.


| INTRODUCTION
Animal vocalizations combine with abiotic and human-made sounds to form soundscapes. These soundscapes can be recorded and quantified across large temporal and spatial scales to monitor species populations or infer community-level metrics such as biodiversity (Eldridge et al., 2018; Gómez et al., 2018; Roca & Proulx, 2016). Monitoring is crucial to respond effectively to threats such as disease, species loss, and over-logging (Rapport, 1989; Rapport et al., 1998). Previously, the use of in situ expert listeners to monitor species presence and abundance was common (Huff et al., 2000), but this approach is costly and time-consuming, can damage habitats, and is prone to narrow focus and observer bias (Costello et al., 2016; Fitzpatrick et al., 2009). Advances in portable computing now permit remote recording of soundscapes, but produce a volume of data that is very time-consuming to review manually, leading to the development of automated, or semiautomated, methods of analysis (Towsey et al., 2016).
Soundscape composition is primarily assessed using acoustic indices which describe the soundscape in an abstracted form.
Analytical Indices are a type of acoustic index: summary statistics that describe the distribution of acoustic energy within a recording (Towsey et al., 2014). Over 60 have been designed to capture aspects of biodiversity (Buxton et al., 2018; Sueur et al., 2014). These are commonly used in combination to compare the occupancy of acoustic niches, temporal variation, and the general level of acoustic activity (Bradfer-Lawrence et al., 2019) across ecological gradients or in classification tasks (Gómez et al., 2018). Such approaches have provided novel insight into ecosystems across the world (Buxton et al., 2018; Eldridge et al., 2018; Fuller et al., 2015; Sueur et al., 2019) but are not foolproof and often have poor transferability (Bohnenstiehl et al., 2018; Mammides et al., 2017). This may result from a lack of standardization: differing index selection, data storage methods, and recording protocols all lead to unassessed variation in experimental outputs (Araya-Salas et al., 2019; Bradfer-Lawrence et al., 2019; Sugai et al., 2019).
The output vector from the AudioSet convolutional neural net (CNN; Gemmeke et al., 2017; Hershey et al., 2017) is an attractive replacement for Analytical Indices. This pretrained, general-purpose audio classification algorithm generates a multidimensional acoustic fingerprint of a soundscape which can be used as a more effective suite of acoustic indices. The AudioSet CNN is trained on two million human-labeled anthropogenic and environmental audio samples, potentially giving it both greater transferability and discrimination than typical ecoacoustic training datasets. Unlike Analytical Indices, however, extra analysis (such as training classifiers or predictive models) is necessary to relate the AudioSet Fingerprint to ecological processes and states.
In ecoacoustics, a continuous uncompressed or lossless recording is generally recommended (Browning et al., 2017;Villanueva-Rivera et al., 2011), but generates huge files. We considered two commonly used approaches to reducing storage requirements (Towsey, 2018). Firstly, MP3 compression, which is widely used in ecoacoustic studies (e.g., Saito et al., 2015;Sethi, Jones, et al., 2018;Zhang et al., 2016): This lossy encoding removes acoustic information inaudible to human listeners (Sterne, 2012) but is suspected of removing ecologically important data (e.g., Sugai et al., 2019;Towsey et al., 2016). Araya-Salas et al. (2019) have recently shown that ecological information is lost under high compression from recordings of isolated animal calls; however, it is not known if this extends to recordings of noisier whole soundscapes.
Secondly, recording schedules also vary among ecoacoustic studies (Sugai et al., 2019). Bradfer-Lawrence et al. (2019) showed that longer and more continuous schedules give more stable Analytical Index values. However, ecoacoustic composition varies with time of day (Bradfer-Lawrence et al., 2019; Fuller et al., 2015), and so restricting recording periods with temporal subsetting may reduce temporal variation and improve classification (Sugai et al., 2019) even with reduced data. Similarly, index calculation on longer recordings may average away anomalous calls and short-term patterns. By describing how well ecological information is retained in acoustic data under different recording decisions, we identified stronger standards to improve classifier accuracy, precision, and recall, and provided a basis for comparison among studies.

| Study area
Acoustic samples were collected in Sabah, Malaysian Borneo, at the Stability of Altered Forest Ecosystems (SAFE) project: a large-scale ecological experiment on the effects of habitat loss and fragmentation on tropical forests (Ewers et al., 2011), which included sites in the Kalabakan Forest Reserve (KFR). Historically, logging within KFR has been heterogeneous, reflecting habitat modifications in the wider area (Struebig et al., 2013), with higher than typical timber extraction rates. This is a diverse forest type from which we have recorded at least 175 species of bird and at least 50 species of amphibian across 26 sites. Habitat ranges from areas of grass and low shrub, through logged forest, to almost undisturbed primary forest.
We recorded continuously from a single recorder for a mean of 72 hr at each site (range: 70 to 75 hr) during February and March 2019 (Appendix S1: Supplementary 2a). No rain fell during the recording period, so no recordings were excluded due to confounding geophony (Zhang et al., 2016). At all three sites, we attached individual omnidirectional recorders (Hill et al., 2018) to trees (~50 cm diameter, 1-2 m above the ground); each recorded consecutive 20-min samples with no break period, stored as uncompressed ("raw", .wav format) files at 44.1 kHz and 16 bits.

| Compressing and resizing the raw audio
Continuous 20-min recordings were first split into recordings of 2.5, 5.0, and 10.0 min using the python package pydub (Robert & Webbie, 2018; Figure 1b), resulting in 8, 4, and 2 times as many recordings, respectively. The audio was then converted to lossy MP3 format using the fre:ac LAME encoder (Kausch, 2019) under the two standard LAME MP3 encoding techniques: constant bit rate (CBR) and variable bit rate (VBR) compression (Figure 1c). CBR reduces the file size to a specified number of kilobits per second; VBR varies the bitrate each second depending on analysis of the acoustic content and a quality setting (0 = highest quality, larger bitrate; 9 = lowest quality, smaller bitrate). Since bitrates are not directly comparable between VBR and CBR, and because storage savings are often the principal driver of compression choices, we used compressed file size as our measure of compression level. We used VBR0, CBR320, CBR256, CBR128, CBR64, CBR32, CBR16, and CBR8, which resulted in file sizes ranging between 41.6% (CBR320) and 1.04% (CBR8) of the original raw file size, and some reductions in Nyquist frequency (Table 1). We do not consider lossless compression, as its storage requirements remain much higher and the decompressed files are necessarily identical to the originals. Previous studies have also found that losslessly compressed audio behaves essentially identically to raw audio (Linke & Deretic, 2020).
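The splitting step can be sketched as follows. The study used pydub, but a stdlib-only sketch with Python's wave module performs the same fixed-length segmentation; here a synthetic 20-s file stands in for a 20-min field recording, so the 2.5/5.0/10.0 units below are seconds rather than minutes.

```python
import io
import struct
import wave

def split_wav(wav_bytes, segment_s):
    """Split a WAV byte stream into whole fixed-length segments."""
    src = wave.open(io.BytesIO(wav_bytes), "rb")
    params = src.getparams()
    frames_per_seg = int(segment_s * params.framerate)
    segments = []
    for i in range(params.nframes // frames_per_seg):
        src.setpos(i * frames_per_seg)
        buf = io.BytesIO()
        with wave.open(buf, "wb") as dst:
            dst.setparams(params._replace(nframes=frames_per_seg))
            dst.writeframes(src.readframes(frames_per_seg))
        segments.append(buf.getvalue())
    return segments

# Synthetic 20 s, 16-bit mono recording at 44.1 kHz (silence).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(struct.pack("<h", 0) * (20 * 44100))
recording = buf.getvalue()

# 2.5, 5.0, and 10.0 s segments give 8, 4, and 2 files respectively,
# mirroring the 8x/4x/2x recording counts described above.
counts = {s: len(split_wav(recording, s)) for s in (2.5, 5.0, 10.0)}
```

For the subsequent encoding step, the standard LAME command-line flags are `-b <kbps>` for CBR and `-V <0-9>` for VBR quality; these flag names are generic LAME options, not taken from the paper.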

| Analytical indices
We used R packages including seewave (ver 2.1.6; Sueur, Aubin, et al., 2008) to calculate six commonly used Analytical Indices: the Acoustic Complexity Index (ACI), Acoustic Diversity Index (ADI), Acoustic Evenness Index (AEve), Bioacoustic Index (Bio), Acoustic Entropy Index (H), and Normalized Difference Soundscape Index (NDSI; Appendix S1: Supplementary 3). These have been shown to capture diel phases, seasonality, and habitat type (Bradfer-Lawrence et al., 2019). The indices could not be calculated for all recordings due to file reading errors; however, this fault occurred in only 0.3% of recordings (Appendix S1: Supplementary 2b).

FIGURE 1 (caption, panels c-i): (c) all audio is compressed using nine lossy MP3 encoding techniques; (d) Analytical Indices and the CNN-derived AudioSet Fingerprint are calculated from audio of all lengths and compressions. Data analysis: (e) index covariance is found per index type, along with correlation with maximum frequency; (f) like-for-like differences between indices calculated from compressed audio and their uncompressed counterparts are found; (g) intragroup variance is compared across recording lengths; (h) indices of both types, all lengths, and all compressions are tested in a supervised random forest classification task; (i) the dataset is split into temporal sections and classification accuracy is found.
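As one concrete example of an Analytical Index, the Acoustic Complexity Index can be stated in a few lines. This is an illustrative stdlib-only sketch of the published formula (Pieretti et al., 2011), operating on a spectrogram given as rows of per-frequency-bin intensities over time; it is not the seewave implementation used in the analysis.

```python
def aci(spectrogram):
    """Acoustic Complexity Index: per frequency band, sum the absolute
    intensity differences between adjacent time steps, normalized by the
    band's total energy, then sum across bands."""
    total = 0.0
    for band in spectrogram:  # one row of intensities per frequency bin
        diffs = sum(abs(band[t + 1] - band[t]) for t in range(len(band) - 1))
        energy = sum(band)
        if energy > 0:
            total += diffs / energy
    return total
```

A constant band contributes nothing, while a band that flickers between silence and sound contributes strongly, which is why ACI is sensitive to the temporal smoothing that compression can introduce (see Discussion).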

| AudioSet fingerprint
The audio was downsampled to 16 kHz, converted to a log-scaled Mel-frequency spectrogram, and then passed through the "VGGish" convolutional neural network to produce a 128-dimensional AudioSet Fingerprint for each recording.

| Impact of compression: like-for-like differences

For each recording, we calculated D, the difference between an index value derived from compressed audio and its value from the raw audio (R_raw). D was not normally distributed (Appendix S1: Supplementary 5a), so medians and interquartile ranges are reported. We determined that an index had been altered as a result of compression when: (a) the interquartile range of D did not include zero difference or (b) the median D was more than ±5% of R_raw. We used Spearman rank correlation to test for a consistent trend in D with increasing compression. To reflect their common use cases, D for Analytical Indices was calculated from the univariate values, while for the AudioSet Fingerprint, which is intended as a multidimensional metric, D was calculated separately for each dimension and reported as the mean across all 128 values.
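The two-part alteration criterion can be made concrete as follows. This is an illustrative reimplementation (the function and variable names are ours, not from the study), assuming D is the per-recording difference between compressed and raw index values.

```python
import statistics

def is_altered(raw_values, compressed_values):
    """Flag an index as altered by compression when (a) the IQR of the
    differences D excludes zero, or (b) the median D exceeds +/-5% of
    the raw-audio index value R_raw."""
    d = [c - r for r, c in zip(raw_values, compressed_values)]
    quartiles = statistics.quantiles(d, n=4)
    q1, q3 = quartiles[0], quartiles[2]
    iqr_excludes_zero = not (q1 <= 0.0 <= q3)
    r_raw = statistics.median(raw_values)
    median_shift = abs(statistics.median(d)) > 0.05 * abs(r_raw)
    return iqr_excludes_zero or median_shift
```

With identical raw and compressed values the function returns False; a systematic shift in one direction trips the IQR criterion even when the shift is small.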

| Impact of recording schedule: recording length
Recordings of longer length may have reduced variance due to the smoothing of potentially important transient audio anomalies (such as nearby bird or cicada calls). We tested this by comparing the variance of the recording groups at different commonly used recording lengths. The index values are non-normally distributed, so we used Levene's test for homogeneity of variance (Figure 1g).
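Levene's W statistic is simple to state: it is an ANOVA on the absolute deviations of each observation from its group centre, so it compares spread rather than location. The stdlib-only sketch below shows the mean-centred variant of the statistic; a real analysis would use a standard implementation (e.g., scipy.stats.levene, or car::leveneTest in R) to obtain p-values from the F distribution.

```python
import statistics

def levene_W(*groups):
    """Levene's W statistic (mean-centred variant) for k groups."""
    # Z-values: absolute deviations from each group's mean.
    z = [[abs(y - statistics.fmean(g)) for y in g] for g in groups]
    k = len(groups)
    N = sum(len(g) for g in groups)
    zbar_i = [statistics.fmean(zi) for zi in z]          # per-group mean of Z
    zbar = sum(sum(zi) for zi in z) / N                  # grand mean of Z
    between = sum(len(g) * (zb - zbar) ** 2 for g, zb in zip(groups, zbar_i))
    within = sum((zij - zb) ** 2 for zi, zb in zip(z, zbar_i) for zij in zi)
    return (N - k) / (k - 1) * between / within
```

Two groups with identical spread give W = 0; the statistic grows as the groups' variances diverge, which is the pattern tested across recording lengths here.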

| Impact of parameter alteration on classification task
We used random forest classification models to assess how well the soundscapes were represented by each index type under each experimental parameter, using the RandomForest package (ver 4.6-14; Liaw & Wiener, 2002) in R (Figure 1h). Models were trained on a 24-hr period of data from each site and tested on the remaining 46+ hr of audio. We used 2,000 decision trees to ensure accuracy had stabilized. The model was trained and tested separately for each combination of experimental parameters, and classification accuracy, precision, and recall were recorded.
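The three performance metrics reported for the habitat classifiers can be computed from the predictions as follows; this toy sketch (not the authors' pipeline) uses overall accuracy and macro-averages precision and recall over the habitat classes, with hypothetical class labels for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Overall accuracy plus macro-averaged precision and recall."""
    classes = sorted(set(y_true))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)  # column total
        actual = sum(t == c for t in y_true)     # row total
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return accuracy, sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Hypothetical test-set labels for the three habitat classes.
y_true = ["cleared", "logged", "logged", "primary", "primary", "primary"]
y_pred = ["cleared", "logged", "primary", "primary", "primary", "logged"]
acc, prec, rec = classification_metrics(y_true, y_pred)
```

Note how confusion between logged and primary forest (the error mode reported in the Results) lowers all three metrics while leaving the cleared class unaffected.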

| Impact of temporal subsetting
Soundscapes typically show considerable diel variation in both abiotic and biotic components. To assess the impact of this variance on model performance, we split our recordings into four 6-hr sections centered on the key periods of Dawn (06:00), Noon (12:00), Dusk (18:00), and Midnight (00:00) and then further subdivided these into 3-hr (8 sections) and 2-hr (12 sections) blocks to test how further reductions affected the model (Figure 1i). We trained and tested the random forest model again on each of the temporally subset recordings, with each section used to build models individually, and determined accuracy, precision, and recall as before.
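The diel subsetting above can be sketched as assigning each recording to the block whose centre hour is nearest its start time on the 24-h clock; the function name and nearest-centre rule are our illustrative reading of the scheme, not the authors' code.

```python
def diel_section(hour, n_sections=4):
    """Return the centre hour of the diel block containing `hour`.
    n_sections=4 gives 6-h blocks centred on 0, 6, 12, 18
    (Midnight, Dawn, Noon, Dusk); 8 and 12 give 3-h and 2-h blocks."""
    width = 24 / n_sections
    centres = [i * width for i in range(n_sections)]

    def circ_dist(a, b):  # circular distance on the 24-h clock
        d = abs(a - b) % 24
        return min(d, 24 - d)

    return min(centres, key=lambda c: circ_dist(hour, c))
```

For example, a recording starting at 05:00 falls in the Dawn quarter, and one at 23:00 wraps around into the Midnight quarter.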
| Modeling the impact of index selection, compression, and recording length on the accuracy metrics

As the accuracy metrics are bound between 0% and 100%, we used a beta regression to model the relationship between each of the experimental parameters and the performance metrics (Douma & Weedon, 2019).
The model included pairwise interactions between file size, temporal subsetting, and recording length, and all interactions of the main effects and those pairwise terms with index selection. We observed that variance in the performance measures varied as an interaction of index choice and temporal subsetting (Appendix S1: Supplementary 8a), so we tested the inclusion of these terms in the precision component of the model. We first treated recording length and temporal subsetting as factors, but also tested a model considering them as continuous variables. The Akaike information criterion (AIC) was markedly lower for the beta regression model using factors and including the precision component (Appendix S1: Supplementary 8b).
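The variable-precision beta regression referred to above has the standard form (as described by Douma & Weedon, 2019), with a logit link for the mean and a log link for the precision component; the metrics, bound between 0% and 100%, are first rescaled to the open interval (0, 1):

```latex
y_i \sim \mathrm{Beta}\!\left(\mu_i \phi_i,\; (1 - \mu_i)\,\phi_i\right),
\qquad
\operatorname{logit}(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta},
\qquad
\log(\phi_i) = \mathbf{z}_i^{\top}\boldsymbol{\gamma},
```

where x_i holds the experimental parameters and their interactions, and z_i holds the index-choice and temporal-subsetting terms tested in the precision component.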

| RESULTS
Although Spearman pairwise correlations of Analytical Indices and Nyquist frequency were low on average (mean = 0.32, IQR = 0.22), we found some strongly correlated sets of indices (Figure 2). ADI, Bio, and NDSI all showed strong similarities and were closely correlated with maximum recordable frequency; AEve and H were also strongly correlated (Figure 2). Some features of the AudioSet Fingerprint correlated with each other and with maximum frequency, but in general these features were more weakly correlated (mean = 0.14, IQR = 0.18; Appendix S1: Supplementary 4b).

| Impact of compression
| Impact of compression: like-for-like differences

(Levene's test for homogeneity of variance; Appendix S1: Supplementary 6b.)

| Impact of index selection
Confirming prior findings, we showed that habitat classifiers derived from 5-min recordings performed best on raw audio, with accuracy declining under compression. For both index types, this decline reflected a decreased ability to differentiate logged and primary forest. Interestingly, classifiers from both index types showed better discrimination between cleared land and logged forest under strong compression. These patterns were repeated across recording lengths (Appendix S1: Supplementary 5a).

| Impact of temporal subsetting
Temporal subsetting poses a trade-off: as diel variation is reduced, so too are the recording hours available for analysis.
Temporally subsetting the day into quarters did not systematically alter classifier performance (Figure 4g,h,m,n), and the performance difference between AudioSet Fingerprints and Analytical Indices was largely maintained.

| Combined effects of parameter alterations on classification performance
Confirming prior findings, we found variance in classifier performance to increase as the days were cut into smaller temporal subsections; however, this effect was small compared with the contribution of index type (Figure 5). Temporal subsetting appeared to have minimal effect on the accuracy of the AudioSet Fingerprint classifier, which remained consistently high (70%-100%; Figure 5). The classifier trained on Analytical Indices, however, became much more unpredictable when temporal subsetting was used (20%-100%; Figure 5).

| DISCUSSION
Ecoacoustics is a new and rapidly expanding field of ecology, with great power to describe ecological systems, but methodological choices have proliferated that have poorly known impacts on ecoacoustic analysis. We have shown that the choice of acoustic index is key, and we confirm that a multidimensional generalist classifier (the AudioSet Fingerprint) outperforms Analytical Indices.
ACI and Bio both share a dependence on high-frequency or quieter sounds and were generally the most severely affected by compression. ACI measures frequency band-dependent changes in amplitude over time (Pieretti et al., 2011) and is reduced when there is minimal variation between time steps. Loss of "masked" sounds under low compression, and then of 16-24 kHz sound under CBR16, may reflect the loss of ecoacoustic temporal variation: this band includes the calling range of many invertebrates, birds, mammals, and amphibians (Browning et al., 2017). The Bio index similarly quantifies the spread of frequencies in the range 2-11 kHz, relative to the quietest 1 kHz band (Boelman et al., 2007). We found that even the highest rate of compression caused a comparatively small reduction in the overall accuracy of the classification task (5.8% and 3% for Analytical Indices and the AudioSet Fingerprint, respectively, for the 5-min recordings without temporal subsetting). In both cases, the reduction in accuracy was explained by a higher degree of overlap between primary and logged forests.

FIGURE 5 Classifier accuracy model predictions as a function of file size (x-axis), index type (columns), temporal subsetting (rows), and frame size (colors, see legend). Hexagon binning is used to show the distribution and density of the underlying data.
When audio is compressed, the whole signal is altered, but higher frequencies and quieter sounds are more severely altered and reduced than others (Sterne, 2012). Higher and quieter frequencies (akin to specific animal vocalizations) may therefore be more important for separating logged and primary forest, but less so for discerning cleared land from other forest types (which may depend more on overall amplitude). These proportionally small differences, while somewhat reassuring, should be considered with caution: they may be due to the large differences in habitat structure among our three habitat classes. Combined with our relatively small sample size, this means these findings may not be generalizable to areas of more closely related forest.
Both Analytical Indices and the AudioSet Fingerprint had similar changes in variance as a result of recording length. Transient vocalizers are therefore likely somewhat important in determining the AudioSet Fingerprint and of variable importance among the Analytical Indices. The ACI index was not impacted by recording length despite specifically quantifying how the soundscape changes over time (Pieretti et al., 2011). The ADI, AEve, and H did incur an alteration in variance as recording length changed; interestingly, these indices do not consider any temporal value but rather just the spread of frequency (Sueur, Pavoine, et al., 2008; Villanueva-Rivera et al., 2011).

| RECOMMENDATIONS AND CONCLUSION

3. If further compression is a necessity, use indices which describe the general energy of the system rather than those which are dependent on high-frequency or quieter sounds, such as ACI.
4. Temporal subsetting may be a useful alternative for capturing soundscape descriptors with AudioSet Fingerprinting when data storage costs are a bottleneck. However, temporal subsetting should be used with caution when using Analytical Indices owing to the variation in classification accuracy, precision, and recall.
There exists a trade-off between the quality and volume of data that can be stored in ecoacoustics. We have investigated the impact of compression along a gradient of habitat disturbance, providing evidence that compressed audio can be used without severely affecting either index type. The ability to use compression may reduce experimental costs, remove bottlenecks in study design, and help remote ecoacoustic recorders reach true autonomy.
Moreover, by providing a quantified description of how individual indices, and more broadly grouped index categories, respond to compression, we have enabled comparisons to be drawn between studies of compressed and noncompressed audio. Increasing the comparability of studies will become progressively more important as global ecoacoustic databases and recording sites grow and open up novel opportunities to explore datasets across huge temporal and geographic scales.

ACKNOWLEDGMENTS
We firstly thank Dr Henry Bernard at the Stability of Altered Forest Ecosystems (SAFE) project.

CONFLICT OF INTEREST
No conflict of interest to declare.

OPEN RESEARCH BADGES
This article has earned an Open Data Badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at AudioSet/ Analytical Index