Drone audition for bioacoustic monitoring

Multi‐rotor drones equipped with acoustic sensors have great potential for bioacoustically monitoring vocal species in the environment for biodiversity conservation. The bottleneck of this emerging technology is the ego‐noise from the rotating motors and propellers, which can completely mask the target sound and make sound recordings unusable for further analysis. The ego‐noise not only degrades the performance of bioacoustic monitoring but also impacts the behaviour of target species if the drone is too close to the target area. In this paper, we address this challenging problem by combining hardware and software solutions that minimize the impact of drone ego‐noise on bioacoustic monitoring. To collect the target sound from the ground, we used a shotgun microphone recording system suspended underneath the drone body with a wire rope (steel fishing line) of length 2 m. The suspended rope puts a large distance between the drone and the recorder, reducing the propeller sound perceived by the microphone. The shotgun microphone enables the sound to be picked up from the ground effectively while rejecting the drone sound from above. We further developed a software solution that aims to automatically recognize the bird species from the bird call recording and we proposed a noise‐augmented training scheme to improve the robustness of bird recognition in the presence of strong drone noise. We evaluated the performance of the system in a test problem of recognizing 20 bird species with in‐flight recordings, where a loudspeaker on the ground simulates bird calls. The recordings were obtained using a drone hovering at various altitudes ranging from 5 to 30 m. By combining the hardware and software solutions, the system recognizes birds robustly at an altitude of 30 m and signal‐to‐noise ratio −25 dB. This demonstrates the feasibility of our drone audition system for bioacoustic monitoring. 
The proposed method overcomes a long‐standing bottleneck problem in drone audition and promises new applications of bioacoustic monitoring in research and management.


| INTRODUCTION
The proliferation of affordable off-the-shelf drones offers great opportunities for wildlife population monitoring and biodiversity conservation (Besson et al., 2022; Gonzalez et al., 2016; Stephenson, 2020).
The use of drones for wildlife surveys has the advantages of low operational cost, low risk for operators and rapid data collection over large areas. Traditional approaches using drones for wildlife surveys are based on computer vision techniques (Corcoran et al., 2021; Ward et al., 2016). Drone-mounted optical sensors, such as high-resolution cameras and infrared thermal cameras, capture species and habitats in terrestrial environments from high altitudes. Computer vision techniques are then employed to detect, classify and count animals appearing in the video or images (Fu et al., 2018; Hodgson et al., 2018).
While effective, computer vision-based approaches have their own drawbacks, such as target occlusion by flora, illumination changes and camera movement. In particular, computer vision techniques cannot detect inconspicuous species that do not appear in the camera image, for example small birds perching below the forest canopy (Christie et al., 2016).
Passive acoustic monitoring is another approach to surveying wildlife in surrounding environments (Gibb et al., 2019). With distributed acoustic sensors picking up sound, machine listening algorithms are employed to analyse the sound recordings and estimate wildlife occupancy, density, behaviour and response to environmental change (Bradfer-Lawrence et al., 2019). Acoustic monitoring can effectively detect vocalizations of taxa such as birds, bats, frogs or insects, which are difficult to detect with cameras, and thus complements computer vision surveys. However, placing a large number of acoustic sensors at fixed points in the environment can be expensive and time-consuming when covering a large area or areas that are difficult to reach (Perez-Granados & Traba, 2021).
Drone audition techniques have emerged in recent years to fill the gap between computer vision and passive acoustic approaches for wildlife monitoring (Wilson et al., 2017). The sound captured by a flying drone carrying acoustic sensors can be used to detect and identify vocal species in the environment (Somers et al., 2020). The drone audition approach can be quick, flexible and cost-effective (especially when considering deployment and maintenance costs), and so provides a good complement to long-term acoustic monitoring at selected fixed locations. The mobility of the drone makes it easy to cover a large area or areas that are difficult to reach. Recent studies show great potential for using drones in bioacoustic monitoring of birds and bats (Broset, 2018; Michez et al., 2021).
The main challenge for drone audition techniques is the ego-noise produced by the rotating motors and propellers during drone flight (Wang & Cavallaro, 2018). Airborne microphones are typically mounted much closer to the motors and propellers than to target sound sources in the air or on the ground, leading to extremely noisy recordings with signal-to-noise ratio (SNR) lower than −20 dB. The target sound is completely masked by the ego-noise, rendering the drone recordings unusable for further analysis. This challenge has been recognized widely in previous studies (Michez et al., 2021).
To minimize the impact of ego-noise, some researchers proposed to suspend the microphone far away from the drone body, for example using a rope of length 8 m. This, however, degrades the manoeuvrability of the drone during operation (Wilson et al., 2017). Other general challenges include the weight of the payload, the limited flight time and the impact of weather conditions such as wind and rain.
The ego-noise not only degrades the performance of bioacoustic monitoring but also impacts the behaviour of target species if the drone is too close to the target area (Ednie et al., 2021; Wilson et al., 2022). To minimize this impact, the drone has to fly far above the target area, at a distance ranging from 10 to 100 m depending on the type of drone and the target species (Duporge et al., 2021). This large distance further increases the challenge of acoustic monitoring.
In this paper, we aimed to address this challenging ego-noise reduction problem. We proposed a system that combines three strategies to improve the bioacoustic monitoring performance: (i) suspending the microphone below the drone body to reduce the ego-noise level; (ii) using a shotgun microphone that effectively suppresses the ego-noise coming from the opposite direction; and (iii) developing a noise-robust machine learning algorithm for bioacoustic monitoring in the presence of ego-noise. We used bird recognition as a case study to show the effectiveness of the proposed drone audition system. To the best of our knowledge, this is the first time that machine learning algorithms have been applied for bioacoustic monitoring of birds with drones.

| RELATED WORK
In this section, we survey recent work relevant to our application: drone audition techniques, the use of drones for bioacoustic monitoring and their impact on wildlife, and advances in machine learning for automatic bird call analysis.

| Drone audition
Drone audition algorithms were first introduced for tasks such as search and rescue, aerial filming and human-drone interaction. The drone is equipped with multiple microphones to improve the acoustic sensing performance (Deleforge et al., 2019; Hioka et al., 2019). By exploiting the spatial discrimination between the ego-noise and the target sound source, microphone array techniques have been employed for sound enhancement and source localization, with time-frequency spatial filtering, beamforming and blind source separation algorithms (Wang & Cavallaro, 2018, 2020, 2021). In addition to algorithm design, some studies focussed on optimizing the microphone array configuration to improve the acoustic sensing performance. For instance, the microphone array can be mounted on top of the drone body (Clayton et al., 2023; Wang et al., 2018), beneath the drone body (Salvati et al., 2020), surrounding the drone body (Hoshiba et al., 2017) or extending away from the drone body (Hioka et al., 2019). While microphone array techniques work effectively to suppress the ego-noise, the algorithms typically require that the drone remain hovering to ensure a static acoustic scenario. Sound processing in dynamic scenarios, when the drone is flying around, is still an open problem (Wang & Cavallaro, 2022). In addition, microphone array techniques require a multichannel sound recording system, which is not currently available in commercial drone platforms. Due to these challenges, the application of microphone array approaches to bioacoustic monitoring is left to future work.

| Bioacoustic monitoring using drones
Table 1 lists existing studies that used drones for bioacoustic monitoring. Having emerged only in recent years, this approach has mainly been applied to surveys of birds and bats.
As for the drone, most studies (Ednie et al., 2021; Fischer et al., 2021, 2023; Michez et al., 2021; Wilson et al., 2017) used the DJI Phantom, a popular multi-rotor model on the commercial market, for the monitoring task. One exception (Broset, 2018) used a self-made fixed-wing drone, which generates less ego-noise but is not able to hover in the air. Multi-rotor drones, in contrast, can hover stably in the air but produce stronger ego-noise. For the recording device, all current studies of bird monitoring (Broset, 2018; Fischer et al., 2021, 2023; Michez et al., 2021; Wilson et al., 2017) used the Zoom H1 to record the bird calls, benefiting from its low weight (<100 g) and ease of operation. Recent studies of bat monitoring (Ednie et al., 2021; Michez et al., 2021) used the AudioMoth and Echometer to record the bat calls, due to their wide frequency responses. For instance, the AudioMoth is capable of recording sound at a sampling rate of up to 384 kHz (Hill et al., 2018), thus capturing most of the frequency range of bat sounds (9-200 kHz).
Interestingly, to reduce the impact of the ego-noise on sound recording, all studies on bird monitoring (Broset, 2018; Fischer et al., 2021, 2023; Michez et al., 2021; Wilson et al., 2017) suspended the recorder beneath the drone body with a long rope of 8 m. This strategy was initiated by Wilson et al. (2017) and followed in later works. Since the intensity of sound decays with distance, this set-up can suppress the ego-noise by up to 30 dB in comparison with a recorder mounted on the drone body (assuming the recorder-propeller distance is roughly 0.25 m and the power declines quadratically with distance). The disadvantage is that it is difficult to operate the drone with a recorder suspended at such a large distance, and the set-up is susceptible to wind in the natural environment.
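The 30 dB figure can be checked with the inverse square law; a minimal sketch (the function name is ours):

```python
import math

# Free-field level difference between two source-receiver distances.
# Power declines quadratically with distance, so the difference in dB
# is 10*log10((d_far/d_near)**2) = 20*log10(d_far/d_near).
def attenuation_db(d_near_m: float, d_far_m: float) -> float:
    return 20.0 * math.log10(d_far_m / d_near_m)

# Recorder ~0.25 m from the propellers vs. suspended 8 m below:
print(round(attenuation_db(0.25, 8.0), 1))  # 30.1
```

Each doubling of distance thus buys about 6 dB, which is why even a moderately long rope makes a large difference.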
All the studies listed in Table 1 relied on subjective listening for bird sound analysis and subjective inspection of the spectrogram for bat sound analysis. All studies, except Ednie et al. (2021), simulated wildlife vocalization with a loudspeaker playing pre-recorded sound, and used subjective listening and inspection as the primary means of data analysis. The employment of more advanced techniques, for example machine learning for automatic analysis, is still in its infancy.
Apart from birds and bats, drones can also be used to acoustically monitor other species. For instance, one study suspended hydrophones from a drone to the water surface to monitor the sound of whales in the sea (Frouin-Mouy et al., 2020); another study flew a drone at an altitude of 1 m and suspended a microphone close to the grassland surface to detect grasshoppers using machine learning techniques (Zhang, 2023).

TABLE 1 Note: These studies differ in terms of the type of drone and audio recorder, the set-up of the recorder on the drone, the flight altitude of the drone during the recording, the target species and the data analysis strategy.

| Impact of drones on wildlife
Drone ego-noise not only affects bioacoustic monitoring performance but also alters bird and bat behaviour, leading to avoidance of the drone, reduced detection rates and inaccurate surveys (Mulero-Pazmany et al., 2017). A few studies have investigated the impact of drone noise on wildlife and the optimal drone distance that minimizes this impact during bioacoustic monitoring (Duporge et al., 2021; Ednie et al., 2021; Vas et al., 2015; Wilson et al., 2022).
We reviewed these works for insights on the best monitoring distance of the drone in our applications.
The optimal monitoring distance varies remarkably depending on the type of drone and the target species. One study (Ednie et al., 2021) investigated the impact of small commercial drone flights on bats and reported that only 22% of bats were recorded in the presence of a drone (DJI Phantom 4) hovering at a height of 5-10 m, although the impact was smaller for certain species (e.g. big brown bat and silver-haired bat). Another study (Wilson et al., 2022) investigated the impact of small drones on seven bird species. It reported that a drone (DJI Mavic 2) hovering at 48 m above ground level for 3 min caused significant responses in two species, but little response in the other five.
There has also been work studying the optimal flight altitude to minimize acoustic drone disturbance to wildlife (Duporge et al., 2021; Vas et al., 2015). One study (Vas et al., 2015) experimented with a small drone (Cyleone Phantom) approaching three types of waterbirds (mallards, flamingos and greenshanks) and reported that the drone could fly to within 4 m of the birds without visibly modifying their behaviour, and that the birds tended to react more to drones approaching vertically. Another study (Duporge et al., 2021) experimented with seven common types of small drones (Inspire, Mavic 2, Mavic Mini, Mavic Pro, Mavic Platinum, Spark, Phantom) approaching 20 types of sea and ground species and advised an optimal hovering altitude ranging from 10 to 100 m, depending on the type of drone and the target species.
In summary, the response of wildlife to drone noise is species-specific and the optimal monitoring distance may vary from tens of metres to one hundred metres, depending on the type of drone and target species. Since the sound level attenuates with distance, the SNR at the audio recorder becomes extremely low at large monitoring distances. Designing a system that works robustly in these adverse circumstances is crucial to the success of applying drone audition to bioacoustic monitoring.

| Bird call analysis
Next, we discuss the state of the art of bird call analysis to identify appropriate bird call analysis algorithms and bird call datasets for our study. Automatic bird call analysis has advanced significantly thanks to progress in machine and deep learning (Stowell, 2022). Research in the field is driven by ongoing academic challenges, which have led to a proliferation of bird call recognition algorithms and real-life datasets. Currently, two series of challenges have attracted considerable attention from the research community: the bird audio detection challenge at DCASE (Detection and Classification of Acoustic Scenes and Events) and the bird call identification challenge at BirdCLEF. Table 2 summarizes the details of these challenges.
The DCASE 2016 bird audio detection challenge posed a binary classification task: detecting whether a test segment contains a bird call or not (Stowell et al., 2019). Bird call detection is a necessary preprocessing step before bird call classification. The BirdCLEF 2016-2022 challenges (Goeau et al., 2016, 2017, 2018; Kahl et al., 2019, 2020, 2021, 2022) and the DCASE 2021 bird event detection challenge (Morfi et al., 2021) were aimed at multiclass bird classification, which typically consists of two tasks: monospecies and soundscapes. Monospecies tasks aim at detecting the main species in the test segment; soundscape tasks aim at detecting all the active species in the test segment. The soundscape task is more realistic and also more challenging than the monospecies task. In this paper, we focus on the simpler case, that is, the monospecies task.

| MATERIALS AND METHODS
In this section, we introduce the prototype drone audition system we designed for bioacoustic monitoring, which consists of hardware and software modules. The hardware module aims to reduce the ego-noise with physical methods, while the software module aims to provide a bird recognition algorithm that is robust to ego-noise.

| Hardware design
As shown in Figure 1, the system consists of a drone with a recorder suspended beneath the drone body by a wire rope. The set-up is similar to those used in the existing literature (Fischer et al., 2021; Michez et al., 2021; Wilson et al., 2017), with the major differences being the recorder and the length of the rope.
For the drone, we used a DJI Mavic 2 Pro, which is a small and popular model available in the commercial market.The drone has dimensions 32.2 × 24.2 × 8.4 cm, a weight of 907 g, a payload capacity of 500 g and a flight time of up to 30 min.The drone was equipped with a standard Air-dropping Thrower release system, which can gently drop the payload to the ground before landing.
For the recorder, we used a Rode NTG4+ shotgun microphone in combination with a Tascam DR10X recorder (see Table 3). The NTG4+ shotgun microphone provides a supercardioid polar sensitivity pattern, as shown in Figure 2. The microphone suppresses sound coming from off-axis directions (105°-255°) by up to 20 dB. This makes it an excellent fit for our application when suspending the microphone beneath the drone body and pointing it downwards: the target sound on the ground comes from the on-axis (front) direction and the ego-noise comes from the off-axis (rear) direction.
In comparison with the Zoom H1, which is designed for stereo sound capture with two built-in cardioid microphones and is widely used in the literature (Fischer et al., 2021; Michez et al., 2021; Wilson et al., 2017), the NTG4+ provides better ego-noise rejection, although the Zoom H1 is smaller and lighter. In comparison with other shotgun microphones, one advantage of the NTG4+ is that it uses a rechargeable internal battery to power the microphone, thus reducing the weight of the recording system remarkably. We used a lightweight Tascam DR10X recorder, which is powered by one AAA battery and attaches firmly to the NTG4+ via a captive XLR connector.
For the rope, we used a steel fishing line 2 with a length of 2 m, which is different from the common set-up (8 m) in the literature (Fischer et al., 2021; Michez et al., 2021; Wilson et al., 2017). While a longer rope provides better noise reduction, it imposes challenges on the operation of the drone, especially when flying in the presence of wind in the natural environment. We further used a professional windshield (Rycote Super Softie) to protect the microphone from wind generated by the propellers and the natural environment.

| Software design
We developed a bird recognition pipeline that is robust to ego-noise.
Here, we first introduce the bird call dataset used for algorithm development; we then present the pipeline of the bird recognition system; we finally propose the noise-augmented training strategy to improve the robustness to ego-noise.
TABLE 2 Summary of academic challenges for bird call analysis. Note: These challenges differ in terms of the number of bird species, the target scenario, and the size of the training and testing datasets.

| Dataset
For bird recognition, we considered 20 bird species that are commonly heard in the UK. 3 The testing data were from the 'British Birdsong dataset', 4 a specific subset gathered from the Xeno-Canto collection. The SNR was computed as the energy ratio between the clean bird call and the noise in the frequency band [0, 10,000] Hz. It can be observed that, depending on the bird species, the bird call may occupy a typical frequency range and show sparsity in both time and frequency.
The drone ego-noise consists of harmonic components and full-band components.The harmonic noise originates from the rotating motors and the pitch of the harmonics is proportional to the rotation speed.
The full-band noise originates from the rotating propellers cutting the air and occupies the entire frequency band. From the audio spectrogram, it can be observed that, as the SNR declined, the bird calls became less distinguishable from the noisy background. At SNR lower than −20 dB, the bird calls were completely masked by the drone ego-noise. This illustrates the challenge of bird recognition at lower SNR.

| Bird recognition data processing pipeline
Figure 4 depicts the pipeline for bird recognition using a VGGish-based convolutional neural network. VGGish has proven successful in many sound event recognition tasks (Hershey et al., 2017). Each FCNN block sequentially consisted of a fully connected (FC) layer and a nonlinear (ReLU) layer. The FC layer consists of neurons that are connected to all the neurons in the previous layer, and is defined by its number of neurons. The decision block consisted of an FC layer and a nonlinear (Softmax) layer, which outputs the classification result. The number of neurons in this FC layer is given by the number of decision labels.
The parameters of the neural network are given in Table 5.The model contained 4.9 M learnable parameters.

| Noise-augmented training
We adapted a recently proposed noise-augmented training strategy to improve the robustness of the classifier to ego-noise (Mukhutdinov et al., 2023). We generated noisy training data by adding the ego-noise to the clean bird calls at various SNRs. Table 6 lists our choices of bird call and drone noise data combinations for training the DNN. The recognition performance of the classifier was measured by recognition accuracy: given I test segments of which J are correctly predicted, the recognition accuracy is defined as J/I.
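The noisy-mixture generation can be sketched as follows (a minimal sketch; `mix_at_snr` is our name, and the SNR here is broadband rather than band-limited as in the paper):

```python
import numpy as np

# Mix a clean bird call with drone ego-noise at a target SNR (dB):
# scale the noise so that 10*log10(P_clean / P_noise_scaled) == snr_db.
def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(clean)]                # match clip lengths
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(100000)            # stand-in for a 5-s clip at 20 kHz
noise = rng.standard_normal(100000)
noisy = mix_at_snr(clean, noise, -25.0)        # ego-noise dominates at -25 dB
```

During noise-augmented training, the target SNR would be drawn uniformly from the model's training SNR range for every minibatch.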

| RESULTS
We first evaluated all the trained models with simulated data to identify the best-performing models in low-SNR scenarios. Then, we validated the performance of these models using in-field recordings.

| Simulation results
Depending on the distance between the drone and the target sound source, the SNR at the microphone can be as low as −30 dB. We therefore evaluated the classification performance at different SNRs, varying from −35 to 20 dB in intervals of 5 dB. For each SNR, the evaluation data were generated by adding the clean bird call and the drone noise at that SNR. The drone noise was randomly extracted from the testing drone noise with the same length as the bird call clip.

| Set-up
We evaluated the recognition performance of the trained DNN models using field-recorded data. As illustrated in Figure 1, we placed a loudspeaker (Behringer EPA40) on the ground, pointing upwards, to emit a clean bird call sound. The median power of the loudspeaker sound was about 70 dB, measured by a sound level metre at a distance of 1 m (the dB reference level is the human threshold of hearing, i.e. 10⁻¹² W/m²). Most birds call at sound levels ranging from 50 to 80 dB. We then evaluated the bird recognition performance with data recorded by the drone hovering at altitudes varying from 5 to 30 m in intervals of 5 m. For each hovering altitude, the loudspeaker played a complete cycle of the benchmark bird data, which lasted 8 min. We then applied models MD0 and MD4 to the recorded data.
As shown in Figure 9, MD4 performed more robustly than MD0 as the hovering altitude of the drone increased. MD0 achieved nearly perfect recognition accuracy for both the original audio and the clean (bird-only) recording, but its performance dropped remarkably as the hovering altitude increased. MD4 performed worse than MD0 for both the original audio and the clean (bird-only) recording; however, its performance dropped much more slowly as the hovering altitude increased. Encouragingly, the recognition accuracy of MD4 at an altitude of 30 m (75%) was only 15 percentage points lower than for the bird-only recording (90%).
In summary, the in-field results confirmed the outperformance of the noise-augmented model (MD4) over the non-augmented model (MD0).

FIGURE 1 Drone audition system for bioacoustic monitoring. The drone with a suspended recorder at high altitude records the sound from a loudspeaker on the ground. (a, b) Illustration and real system. (c) Real device: DJI Mavic 2 Pro drone and Rode NTG4+ microphone with Tascam DR10X recorder.

TABLE 3 Specifications of the drone audition hardware system.
The British Birdsong dataset is a subset gathered from the Xeno-Canto collection to form a balanced dataset across 88 bird species, each with three recordings of variable duration. We selected the recordings of the 20 bird species from the British Birdsong dataset to construct the evaluation dataset, which was divided into clips of 5 s length, yielding 921 evaluation clips in total. For training, we used Xeno-Canto recordings of the same 20 bird species, excluding the British Birdsong dataset. For each bird species, we randomly extracted 1100 clips, each lasting 5 s, giving a total of 22,000 clips as the training dataset. All audio clips were resampled at 20 kHz. We additionally used drone ego-noise for training and evaluation. The ego-noise was recorded with our hardware while the drone was hovering in the air. The training noise is 290 s and the test noise is 140 s long. Table 4 summarizes the details of the training and testing datasets.
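Segmenting recordings into fixed-length clips, as described above, can be sketched as follows (the function name is ours; trailing audio shorter than one clip is simply dropped):

```python
import numpy as np

# Split a mono recording into non-overlapping fixed-length clips,
# discarding any trailing remainder shorter than one clip.
def split_into_clips(audio: np.ndarray, sr: int = 20000, clip_s: int = 5):
    n = sr * clip_s
    return [audio[i:i + n] for i in range(0, len(audio) - n + 1, n)]

rec = np.zeros(20000 * 17)           # a 17-s recording at 20 kHz
clips = split_into_clips(rec)
print(len(clips))                    # 3 full 5-s clips; the last 2 s are dropped
```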

Figure 3 depicts the time-frequency spectrograms of segments of bird calls from five bird species. The bird calls were mixed with the drone ego-noise at varying SNRs {0, −10, −20, −30} dB.

We computed a Mel spectrogram from each sound frame and then fed it as an image input to the CNN classifier. The audio recordings were divided into 5-s chunks; Mel spectrogram features were computed for every 5-s audio segment and used as input to the model. The STFT was employed to obtain the spectrogram from the input audio, with a window length of 40 ms, 50 per cent overlap, an FFT size of 2048 and a sample rate of 20 kHz. This process converted a 5-s audio recording into a 1025 × 250 dimensional spectrogram representation. Each frame of this spectrogram was converted into a 128-dimensional vector of log filter bank energies using a Mel filterbank with a frequency range of [300, 10,000] Hz. The conversion from STFT to Mel spectrogram reduces the dimension of the input data while providing better resolution at lower frequencies than at higher frequencies (Stowell et al., 2019). Min-max normalization was applied to the Mel band energies. Hence, each 5-s audio recording was represented by a 128 × 250 dimensional Mel spectrogram.

A convolutional neural network typically consists of a number of neural layers stacked upon each other in a deep architecture, as shown in Figure 4. The input layer receives and stores the original Mel spectrogram image. In our set-up, the input layer was followed by four CNN blocks, two fully connected neural network (FCNN) blocks and an output decision block. Each CNN block sequentially consisted of convolutional layers, nonlinear (ReLU) layers and pooling layers. The first two blocks each consisted of one convolutional, one ReLU and one MaxPooling layer. The third block consisted of two convolutional, one ReLU and one MaxPooling layer. The fourth block consisted of two convolutional, one ReLU, one MaxPooling and one GlobalAveragePooling layer. The
convolutional layer puts the input spectrogram image through a set of convolutional filters, each of which activates certain features of the input. The convolutional layer is defined by the number of filters, the size of the filters and the step size (stride) when traversing the input. The rectified linear unit (ReLU) layer allows for faster and more effective training by mapping negative values to zero and maintaining positive values, using the activation function f(x) = max(0, x). In this way, only the activated features are carried forward into the next layer. The pooling layer simplifies the output by performing nonlinear downsampling, reducing the number of parameters that the network needs to learn. For MaxPooling, the input is divided into rectangles (pools) and the largest value is taken from each pool. The MaxPooling layer is defined by the size of the pool and the stride when traversing the input. The GlobalAveragePooling layer performs downsampling by computing the mean over the time and frequency dimensions of the input. This layer essentially converts a 3D tensor of size T × F × C into a vector of size 1 × C, where T and F are the time-frequency dimensions and C denotes the number of channels.
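The global average pooling step described above amounts to a mean over the time and frequency axes; a minimal numpy sketch:

```python
import numpy as np

# GlobalAveragePooling: collapse a T x F x C feature tensor to a
# length-C vector by averaging over the time and frequency axes.
def global_average_pool(x: np.ndarray) -> np.ndarray:
    return x.mean(axis=(0, 1))

x = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # T=2, F=3, C=4
v = global_average_pool(x)
print(v)  # [10. 11. 12. 13.]
```

Because the output size depends only on C, this layer lets the network accept inputs of variable time-frequency extent.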
Due to the large amount of data, we employed a minibatch processing scheme that updates the weights of the neural network for subsets (batches) of training samples. The baseline model MD0 was trained with clean bird call data. For each model MD1-MD4, we generated the noisy training data on the fly in every minibatch, with an SNR uniformly sampled in the target SNR range. The drone noise was randomly extracted from the training noise data with the same length as the bird call clips. Convergence of model training becomes difficult as the SNR decreases. We employed a successive training strategy: models trained at higher SNR were used as initializations for training models at lower SNR. The first model was trained from scratch, using a binary cross-entropy loss function with the Adam optimizer. We set the minibatch size to 128 and the maximum number of epochs to 15, where one epoch refers to one entire pass through the training data. We randomly selected 10% of the training dataset for validation. Based on the validation performance, we selected the best model during training.

FIGURE 4 Bird recognition using a VGGish-like convolutional deep neural network.

Figure 5 depicts the recognition results obtained using the DNN models MD0-MD4 at various testing SNRs. The training SNR significantly impacted model performance in noisy conditions. For MD0 and MD1, which were trained at relatively high SNRs, the performance displayed similar trends with respect to testing SNR: both achieved their best performance for testing SNR above 0 dB, and their performance dropped quickly below a turning point (0 dB for MD0 and −10 dB for MD1). For MD2, MD3 and MD4, which were trained at relatively low SNRs, the performance showed a different trend: all achieved their best performance at testing SNR around −10 dB, and their performance dropped as the testing SNR either increased or decreased. MD3 and MD4 appeared to be the two most promising models for our low-SNR applications. Trained at SNR [−35, −15] dB, MD3 achieved the best performance among all the models at the extremely low SNR of −30 dB but lost performance for SNR above −5 dB. By comparison, MD4, trained at the wider SNR range of [−35, 0] dB, performed stably at most testing SNRs but worse than MD3 at SNR −30 dB. The performance of both models dropped quickly for testing SNR lower than −25 dB. Interestingly, the best performance of MD3 and MD4 (70% at testing SNR −10 dB) was higher than the best performance of MD0 (66% at testing SNR 20 dB). The simulation results demonstrated that noise-augmented training improves the robustness of the classifiers to strong ego-noise. In comparison with the non-augmented model (MD0), the noise-augmented model (e.g. MD4) achieved similar performance at high SNRs but much better performance at low SNRs. The outperformance of MD4 over MD0 at low SNRs is further verified by their confusion matrices, shown in Figure 6. In the remaining experiments, we used MD4 as our selected model, with MD0 as a reference.

Figure 7a depicts the SNR computed for every 5-s segment. The SNR varied significantly across the segments, with a median of 4.4 dB. This roughly corresponds to a drone hovering altitude of 3 m (with a suspension rope of length 2 m). From this value, we can infer the SNR at other hovering altitudes. According to the inverse square law, the sound level declines by 6 dB for every doubling of distance. Figure 7b illustrates the SNR variation with the drone hovering altitude. The SNR declined monotonically as the hovering altitude increased: about −5 dB at an altitude of 5 m and about −25 dB at an altitude of 30 m. As shown in Figure 5, the bird recognition performance of the DNN model dropped quickly when the SNR fell below −25 dB, indicating that an altitude of about 30 m is the operating limit of the current system.
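The altitude-to-SNR extrapolation above can be sketched as follows (the function name and defaults are ours, using the 4.4 dB reference measured at a 1 m microphone-to-loudspeaker distance):

```python
import math

# Predict the SNR at a given hovering altitude from a reference measurement,
# using the inverse square law (level falls 6 dB per doubling of distance).
# rope_m is the suspension rope length, so the microphone-to-ground distance
# is (altitude - rope length); the reference SNR of 4.4 dB was taken at 1 m.
def snr_at_altitude(alt_m, rope_m=2.0, snr_ref_db=4.4, d_ref_m=1.0):
    d = alt_m - rope_m                          # microphone-to-ground distance
    return snr_ref_db - 20.0 * math.log10(d / d_ref_m)

print(round(snr_at_altitude(5.0)))   # -5
print(round(snr_at_altitude(30.0)))  # -25
```

The two printed values match the −5 dB and −25 dB figures read off Figure 7b.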

FIGURE 7 (a) Signal-to-noise ratio (SNR) measured with the equipment at a microphone-loudspeaker distance of 1 m. The SNR was computed per 5-s segment. The median SNR across all short segments is about 4.4 dB. (b) SNR varying with the drone hovering altitude (based on the inverse square law).
in the presence of strong ego-noise. The prototype system, combining hardware and software, performed robustly at altitudes up to 30 m, demonstrating the feasibility of bioacoustic monitoring even at high hovering altitudes.

| DISCUSSION

With the capability of moving freely in the air, drone audition has great potential to advance acoustic monitoring of wildlife in large and hard-to-reach areas. The main bottleneck is the ego-noise generated by the rotating motors and propellers during flight. The ego-noise not only degrades the sound recording but also disturbs wildlife in the target area. On one hand, the drone needs to monitor the target area at close distance to capture high-quality sound from the target species; on the other hand, the drone needs to fly far away from the target area to reduce disturbance to the target species. Minimizing the influence of the ego-noise and finding a good balance for the monitoring distance are key to the wider application of drone audition techniques for bioacoustic monitoring. Research in this field is still at a preliminary stage. Most existing work employs an intuitive solution that suspends an audio recorder below the drone with a long rope (e.g. 8 m). While the long

FIGURE 8 Bird recognition performance by DNN models MD0 and MD4 for the benchmark bird data simulated at different testing signal-to-noise ratios (SNRs).

FIGURE 9 Bird recognition performance by DNN models MD0 and MD4 for the benchmark bird data recorded by the drone at different hovering altitudes. Here, 'original' means the original audio clips; 'clean' means the bird-only recording.