Drone Detection and Tracking System Based on Fused Acoustical and Optical Approaches

The increasing popularity of small drones has underscored the urgent need for an effective drone-oriented surveillance system that can work day and night. Herein, an acoustic and optical sensor-fusion-based system, termed multimodal unmanned aerial vehicle 3D trajectory exposure system (MUTES), is presented to detect and track drone targets. MUTES combines multiple sensor modules including a microphone array, a camera, and a lidar. The 64-channel microphone array provides semispherical surveillance with a high signal-to-noise ratio for sound source estimation, while the long-range lidar and the telephoto camera enable subsequent precise target localization in a narrower but higher-definition field of view. MUTES employs a coarse-to-fine, passive-to-active localization strategy for wide-range (semispherical) detection and high-precision 3D tracking. To further increase fidelity, an environmental denoising model is trained to select valid acoustic features of a drone target, thus overcoming the drawbacks of traditional sound source localization approaches under noise interference. The effectiveness of the proposed sensor-fusion approach is validated through field experiments. To the best of our knowledge, MUTES provides the farthest detection range, the highest 3D position accuracy, strong anti-interference capability, and acceptable cost for detecting unverified drone intruders.


Introduction
Micro unmanned aerial vehicles (UAVs), commonly referred to as drones, have experienced rapid development in recent years and have been integrated into various aspects of daily life. [1] Ranging from commercial products to do-it-yourself (DIY) toys, a wide variety of drones is easily accessible to the public. However, this accessibility raises concerns about potential threats to public security and citizens' privacy due to possible illegal uses. [2] Moreover, security analysis of drone systems reveals their vulnerability to wireless network attacks, [3] which can be exploited by hackers for malicious purposes, leading to loss or damage. Detecting drones is a challenging task compared to traditional large air targets due to their small size and low flying altitude. [4,20-22] Furthermore, sensor-fusion approaches have gained attention from both the academic and industrial communities (see Section 2 for a review); however, these approaches have not been fully developed or implemented. An ideal fusion system should integrate its sensing components cohesively, rather than detecting with each sensor separately and merely merging the predictions.
In this work, we propose an acoustic and optical sensor-fusion system, termed multimodal UAV 3D trajectory exposure system (MUTES), to detect and track drone targets. The microphone-array-based acoustic approach is intuitively appealing: much like a person noticing a drone flying overhead, one can hear the propeller noise and determine its direction before even seeing it. Sound source localization relies on the phase delay differences received at the microphone elements to estimate the direction of sound wave propagation, ensuring 24 h omnidirectional functionality over a wide area that is minimally affected by lighting conditions. Moreover, with numerous microphone units cooperating in the sound field, the total signal-to-noise ratio (SNR) of the system can be significantly improved, enabling the detection of weak sounds that may be inaudible to the human ear. Another ability of human auditory perception is to focus on a specific sound source, such as a speaker's voice, and filter out other sounds. [25] Analogous to this behavior, we trained a denoising model that filters out environmental noise and retains the acoustic features of drone targets, making the system more robust against noise interference. Conversely, cameras and lidar are known for their resolution and precision in detection and tracking but inherently suffer from a limited field of view (FOV). Similar to human responses, where people gaze at a drone with their eyes, which provide extremely high resolution, once they hear it, we expect the combination of acoustic and optical sensor modules to compensate for each other's limitations, thereby achieving a system capable of both wide-range detection and high-precision tracking. When an unknown drone intrudes into the surveillance area, the microphone array captures its acoustic features and estimates the coarse position of the sound source. The gimbal then steers the optical modules toward the suspicious direction for further verification, and tracks the target's movements once a drone is identified.
The novelty and contributions of this work can be summarized as follows: 1) We propose a novel multimodal sensor-fusion system for small-drone detection that combines a microphone array, a camera, and a lidar and incorporates a coarse-to-fine localization strategy. To the best of our knowledge, this solution is the first of its kind to provide a complete semispherical field of view, an extensive detection range (over 500 m), and high 3D positioning accuracy (error less than 1.5% of the range). 2) For the acoustic component of the proposed multimodal system, we introduce a denoising deep-learning model that effectively extracts drone acoustic features from background noise. By reducing environmental interference, such as bird chirping or human voices commonly encountered in surveillance scenarios, the model significantly enhances the sensitivity and robustness of the system's passive-to-active detection mechanism. 3) To optimize the performance of the microphone array and ensure real-time fusion at the hardware level, we designed and manufactured a microphone array with 256 microphone units, grouped into 64 output channels, arranged in a novel nonuniform pattern. This layout optimization effectively enhances the SNR through independent sampling and sideband reduction compared to a single microphone unit. The SNR is analyzed in detail, and the microphone array achieves a maximum drone detection distance of 1300 m.

Related Work
This section reviews existing studies on drone detection, ranging from single-sensor-based approaches to sensor-fusion-based approaches.

Camera Detection Approaches
Camera-based computer vision approaches have been universally applied for object detection and tracking over the last decade owing to the success of deep learning. The work in ref. [6] uses the drone-versus-bird challenge dataset to train deep-learning models for target classification and bounding box regression in surveillance videos. The performance of this approach is affected by environmental conditions and the drone size in the video; detection becomes less accurate as drones fly farther away from the surveillance point or as the background becomes more complex. The work in ref. [7] proposes an integrated detection strategy that combines a wide-angle static camera with a narrow-angle rotating camera. Frames from multiple cameras are overlaid so that the detection algorithm runs once, allowing suspicious targets on the wide-angle image plane to be verified by rotating zoom cameras. The work in ref. [9] presents an air-to-air drone image dataset recorded by another flying drone, together with an experimental evaluation of several deep-learning models on the proposed dataset. As in the aforementioned works, the results suggest that environmental background, target scale, and view angle greatly impact the performance of the algorithms. Camera-based methods face a fundamental trade-off between FOV and resolution. Since the image sensor has a fixed number of pixels, changing the camera's FOV changes the number of pixels covering the same object, thereby affecting resolution: increasing the FOV reduces angular resolution because fewer pixels are available to describe an object. To capture distant objects more clearly, telephoto lenses are often used to narrow the FOV. Consequently, the detection accuracy of camera-based approaches is inherently limited by this trade-off, and their effectiveness may be further restricted by the lack of depth information in certain applications.

Lidar Detection Approaches
A few works [16,17] have also investigated lidar-based drone detection. The major advantage of a lidar sensing system is its ability to intrinsically measure the ground-truth 3D coordinates of targets even under challenging ambient illumination. The work in ref. [16] uses a commercial Velodyne lidar mounted on the roof of a self-driving vehicle platform and tests the detection and tracking of a small UAV in a field trial. The 64-scan-line lidar demonstrates basic usability for distances up to 50 m with around three detected lidar points; however, the target goes undetected when flying in the gaps between the scan lines. The work in ref. [17] attempts to address the gap problem by combining the mechanical scanning lidar with a turret, which rotates the system and adjusts the angle of the scan lines to meet the trajectory path of the UAV. Experimental results show a maximum distance of up to 100 m with a 16-scan-line Velodyne lidar. While line-scanning lidar is capable of high-resolution detection, its limited FOV and lower resolution at long ranges may restrict its ability to detect small UAVs at longer distances.

Acoustic Detection Approaches
Acoustic detection offers another sensing approach by analyzing the sound signals emitted from the drone's motors, which can be used to either identify or localize the sound source. The work in ref. [21] investigated the use of a neural network to detect the presence of drones based on recorded sound events. Mel-frequency cepstrum coefficients and the Mel-spectrogram are selected as the acoustic features for model input, and the output is a binary classification of the existence of a drone. In terms of sensor configuration, microphone arrays are the more popular acoustic detection approach [18-20,22] due to their advantages over a single microphone. The array leverages the signals from multiple channels captured at different positions, which contribute to estimating the direction of sound arrival as well as improving the total SNR. The work in ref. [20] uses the Doppler shift and direction of arrival (DOA) estimation of the target to provide a total least squares estimate of the target trajectory under the assumption of constant target height, direction, and speed. The work in ref. [18] uses a tetrahedral microphone array to detect a military class I UAV with a beamforming algorithm within a selected frequency passband, achieving the best performance with a 99.5% probability of detection at ranges below 600 m. The works in refs. [19,22] deploy several distributed microphone cluster nodes, each with a pyramid structure composed of 4-5 microphone units similar to ref. [18]. The relative 3D position of the target can be estimated by calculating the time difference of arrival between each pair of sensors and using the results for pairwise triangulation. The drawbacks of acoustic-based detection approaches are their low detection precision, limited 3D localization accuracy, and susceptibility to various environmental noises, which may impact system effectiveness in more complex environments. Furthermore, the conventional microphone arrays used in these methods also face hardware challenges: insufficient elements limit detection range and resolution. [18,20,22] To improve resolution with a larger aperture in a uniform array, the number of elements needs to increase quadratically; otherwise, side lobe problems may arise that deteriorate detection performance.

Camera + Acoustic Hybrid Detection Approaches
As should be apparent from the earlier discussion, single-sensor detection methods inevitably fail under certain unsuitable conditions, such as poor illumination, noisy disturbance, and non-line-of-sight scenarios. Therefore, hybrid drone detection systems have been proposed to balance trade-offs related to range, accuracy, precision, and other aspects. [26,31-33] We mainly focus on vision and acoustic integration methods in the following. The work in refs. [29,30] implements an acoustic camera, which contains an array of more than 60 microphone units and an optical camera located in the center. By overlaying the sound intensity map onto the standard optical image, the system can visualize and track noise sources appearing inside the FOV of the camera. The work in ref. [33] sets up a pyramid-shaped camera group with 30 cameras and 3 microphones, with each camera pointing in a different direction to cover a 360° optical view; eight workstations are connected to process the large amount of data. The work in ref. [32] uses a tetrahedral four-microphone array along with a short-wave infrared (SWIR) gated-viewing system. The localization result from the microphone array is sent to the SWIR components for further identification and tracking. The SWIR camera has a quite narrow imaging FOV of 0.9° × 1.2° with a resolution of 640 × 480 pixels; thus, its availability relies heavily on the initial estimate from the microphone array. Overall, these methods rely heavily on static cameras to process visual information while the microphones provide cues about the drones. However, the fundamental limits in camera FOV are not well addressed, and the distance of the drone is not measured either.

System Overview
In light of the previous approaches, we have specifically designed an acoustic-optical hybrid system. The optical components, including a high-resolution camera and a lidar, are integrated on a gimbal system to facilitate a large addressable FOV. The overall experimental setup and detection scheme are illustrated in Figure 1. The system consists of the following hardware and software modules. Hardware: 1) a 64-channel microphone array and 2) a gimbalized lidar and camera. Software: 1) an environmental noise filtering module, 2) a sound source estimation module, 3) an optically aided precise 3D localization module, and 4) target tracking by multimodal data association. Figure 1a displays a full view of the experimental deployment, including the complete hardware modules and the test drone to be detected. The top view of the designed microphone array highlights the nonuniform distribution of the microphone units, which is discussed in detail in Section 3.1. As shown in Figure 1b, the combination of microphone array, lidar, and camera provides multilevel scopes of FOV, enabling the entire system to simultaneously monitor a wide area and track specific targets with high precision, depending on the stage of the detection workflow. Figure 1c shows the coarse-to-fine strategy of the fused target tracking. Once a drone enters the surveilled airspace, the emitted acoustic noise signal is captured by the microphone array. Through the proposed drone-aimed sound-source-localization algorithm, the approximate direction of the target is measured in the primary stage (Figure 1d). Simultaneously, the gimbal rotates to the estimated direction to steer the small-FOV optical devices toward the suspicious target. As higher-resolution observations and additional distance information are incorporated, the state estimation attains lower uncertainty and captures the 3D trajectory of the target (Figure 1e). The proposed detection method is explained in detail in Section 4.
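To make the workflow concrete, the sketch below strings the stages together as a control loop. All class and method names (mic_array.localize, gimbal.point_at, and so on) are hypothetical placeholders for illustration, not an actual API of the system.

```python
def surveillance_loop(mic_array, gimbal, camera, lidar, tracker):
    """Illustrative coarse-to-fine, passive-to-active loop (hypothetical interfaces)."""
    while True:
        frame = mic_array.capture_frame()
        direction = mic_array.localize(frame)         # coarse (az, el) from beamforming
        if direction is None:                         # nothing above detection threshold
            continue
        tracker.update(direction.as_measurement())    # 2D acoustic update
        gimbal.point_at(tracker.predicted_direction())
        detection = camera.detect_drone()             # fine angular position (az, el)
        if detection is not None:
            tracker.update(detection.as_measurement())
            distance = lidar.range_to(detection.bearing)
            if distance is not None:                  # 3D (az, el, dist) lidar update
                tracker.update(detection.with_distance(distance).as_measurement())
```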

Hardware Configuration
The microphone array consists of 64 elements, each composed of 4 microphones (SPH0642HT5H-1) with a flat frequency response in the range of 0.1-10 kHz and an omnidirectional response. The outputs of the 4 microphones are combined per element, giving a sensitivity of 0.0126 V Pa⁻¹ and ensuring that the SNR of an array element is twice the SNR of a single microphone. The SNR of each array element is 71 dB under the condition of 94 dB SPL at 1 kHz. The signal processing circuit of each element includes a 600-fold amplification circuit and a sampling circuit with 16-bit accuracy and a ±10 V range. Therefore, the sound response range of the element is 0-0.6 Pa and the response precision is 10 μPa. The size of the microphone array is 40 × 40 cm², which gives the array an angular resolution of 8.7° for a 5 kHz sound signal according to the Rayleigh criterion in array signal processing. [34] As shown in Figure 1, the arrangement of the elements is irregular, which reduces the sidebands as well as the number of microphones needed without compromising array performance (see Supporting Information for the optimization of the microphone array layout). Based on simulations of the beam pattern, we designed an optimization objective function with the element positions as variables and determined the optimal arrangement using a genetic algorithm (see Supporting Information).
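The quoted 8.7° figure can be reproduced with the half-power beamwidth approximation for a uniform aperture, θ ≈ 0.886 λ/D (the exact constant depends on the chosen resolution criterion); a quick back-of-the-envelope check, assuming c = 343 m s⁻¹:

```python
import math

c = 343.0   # speed of sound [m/s], assumed at ~20 C
f = 5_000.0 # signal frequency [Hz]
D = 0.40    # array aperture [m]

wavelength = c / f                       # ~0.069 m
# Half-power beamwidth of a uniform aperture: theta ~= 0.886 * lambda / D
theta_rad = 0.886 * wavelength / D
print(f"angular resolution ~= {math.degrees(theta_rad):.1f} deg")  # ~8.7 deg
```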
Optical sensing is achieved by a long-distance lidar (Tele-15, Livox Technology) with an angular accuracy of less than 0.03° and a telephoto camera with a narrow FOV of 15° × 10° and a resolution of 5472 × 3648 pixels. The whole assembly is installed on a gimbal, which is responsible for dynamically pointing the optical system toward the target. The Tele-15 is a Risley-prism lidar with incommensurable scanning. [35] Its point cloud distribution has a high density in the center and a lower density toward the edges, resembling the photoreceptor distribution of the human retina. This feature ensures sufficient point cloud density to identify a UAV at long distance, which cannot be achieved by a multiline Velodyne lidar. The Tele-15 provides distance information with an accuracy of 2 cm within 500 m. For the visual information of the camera, more than 100 pixels are available to perform UAV detection for an object with a size of 0.3 × 0.3 m at 500 m.
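The pixels-on-target claim can be sanity-checked from the camera specification; a back-of-the-envelope estimate assuming a simple pinhole model:

```python
import math

# Telephoto camera parameters (from the text)
h_fov_deg, h_px = 15.0, 5472

# Target: 0.3 m x 0.3 m drone at 500 m
size_m, dist_m = 0.3, 500.0
subtended_deg = math.degrees(2 * math.atan(size_m / (2 * dist_m)))  # ~0.034 deg

px_across = subtended_deg * (h_px / h_fov_deg)   # ~12.5 pixels per side
print(f"{px_across:.1f} px across, ~{px_across**2:.0f} px on target")  # >100 px
```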

Acoustic Detection Module
The acoustic signal processing in the sound-source-localization module is illustrated in Figure 2a. The time-sequential array signals are divided into frames, transferred to the frequency domain for filtering by the trained network, and then used iteratively for spatial localization of the target. The fundamental technique is beamforming, [36] which combines the signals from multiple microphone channels by adjusting their relative time delays and amplitudes so that the signals add constructively in the desired direction and destructively elsewhere. The plain (delay-and-sum) beamformer can be expressed as

$$ bf(t, \hat{x}_k) = \frac{1}{M} \sum_{m=1}^{M} w_m \, \tilde{a}_m(t - \tau_m) \tag{1} $$

In this equation, $\hat{x}_k$ is the $k$th unit vector of a steered direction within the source search space, $M$ is the number of microphones, $w_m$ is a weighting factor applied to the individual $m$th microphone, and $\tilde{a}_m(t)$ is the amplitude of the received signal at time $t$. The time delay of arrival is written as $\tau_m = (\hat{x}_k \cdot x_m)/c$, where $x_m$ is the location vector of the $m$th microphone and $c$ is the speed of sound, considering the wave propagation model of a far-field sound source. By steering the direction of the beam and scanning the surrounding space, it is possible to estimate an acoustic map representing the received signal power in each direction. We adopted the filter-and-sum beamforming method [37] in the frequency domain as an efficient spatial-temporal filtering procedure, where the time delay becomes a phase shift through the steering function [36] $s_m(f, \hat{x}_k) = \exp(-i 2\pi f \tau_m(\hat{x}_k))$. It is worth noting that while adaptive beamforming techniques such as multiple signal classification (MUSIC) or minimum variance distortionless response (MVDR) often provide superior spatial filtering performance, they are generally more computationally demanding due to the additional matrix factorization step; [38] these methods are therefore not suitable for real-time implementation on mobile computing platforms with a large number of microphones. The estimated signal power of our proposed method can be expressed as

$$ B(f, \hat{x}_k) = \Big| \sum_{m=1}^{M} w_m \, s_m(f, \hat{x}_k) \, \tilde{p}_m(f) \Big|^2 \tag{2} $$

where $\tilde{p}_m(f) = \Psi(f) p_m(f)$ is the masked spectrum filtered by the denoising mask $\Psi(f)$, which is predicted by the neural network named DroneFinderNet. The model is trained to recognize the noise patterns in the input spectrum and suppress them through the output mask, while retaining the features of the drone signals used for target detection. $p_m(f)$ is the spectrum of the $m$th microphone signal, i.e., the complex pressure amplitudes obtained by evaluating the discrete Fourier transform of the corresponding windowed frame. The final estimated source direction of the current frame is

$$ \hat{x}^{*} = \arg\max_{\hat{x}_k \in G} \sum_{f \in F} B(f, \hat{x}_k) \tag{3} $$

where $F$ is a set of selected frequency bins representing the spectral characteristics of the target drone, and $G$ is a set of selected spatial grids representing the potential search region. These two sets can be considered symbolic representations of temporal and spatial filtering, respectively, which are discussed in more detail in Supporting Information.
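To make Equations (1)-(3) concrete, the following is a minimal NumPy sketch of the frequency-domain filter-and-sum power map. The function name and array layout are our own; uniform weights are assumed, and the selection of frequency bins F and search grid G is left to the caller.

```python
import numpy as np

def filter_and_sum_power(frames, mic_pos, directions, fs, mask=None, c=343.0):
    """Frequency-domain filter-and-sum beamforming power map.

    frames:     (M, T) windowed time-domain snippets, one row per microphone
    mic_pos:    (M, 3) microphone coordinates [m]
    directions: (K, 3) unit vectors of candidate source directions
    mask:       optional (T//2+1,) denoising mask Psi(f) in [0, 1]
    returns:    (K,) signal power per steered direction (acoustic map)
    """
    M, T = frames.shape
    p = np.fft.rfft(frames, axis=1)                  # (M, F) spectra p_m(f)
    freqs = np.fft.rfftfreq(T, d=1.0 / fs)           # (F,)
    if mask is not None:
        p = mask[None, :] * p                        # masked spectra p~_m(f)

    # Far-field delays tau_m(x_k) = (x_k . x_m) / c for each direction
    tau = directions @ mic_pos.T / c                 # (K, M)
    # Steering function s_m(f, x_k) = exp(-i 2 pi f tau_m(x_k))
    steer = np.exp(-2j * np.pi * freqs[None, None, :] * tau[:, :, None])  # (K, M, F)

    # Coherent average over microphones (uniform weights w_m = 1/M),
    # power summed over all retained frequency bins
    bf = (steer * p[None, :, :]).mean(axis=1)        # (K, F)
    return (np.abs(bf) ** 2).sum(axis=1)
```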
The goal of DroneFinderNet is to distinguish relevant drone signals from irrelevant environmental noise such as bird or insect chirping, human conversation, and other ambient sounds. To accomplish this objective through supervised learning, the network performs pattern recognition on the spectrograms to suppress noise while retaining the frequency signatures of potential drone targets.
To achieve this, we utilize convolutional neural networks and long short-term memory, techniques typically employed in sound separation and speech enhancement tasks. The designed structure of the deep-learning system is shown in Figure 2b. We developed a drone audio dataset comprising audio pairs, each consisting of a clean drone sound and a noisy drone sound. To create this dataset, we used a microphone array platform placed outdoors to collect a raw audio set recording the sound emitted by the drone while flying in a relatively quiet environment. The drone was flown in different directions and at different heights to capture various sound patterns. To preserve signal quality, we limited the distance between the drone and the microphone array to 120 m. We then applied beamforming enhancement to the collected raw audio set, which contained 64 channels of audio pieces, to suppress unexpected environmental noises that appeared during recording. The beamformed audio pieces were treated as the ground truth for training the network model. Additionally, we prepared an interference audio set comprising a series of noise events such as wind, birds, and insects collected from the AudioSet public dataset. [39] The corresponding noisy audio was generated by mixing the beamformed audio with a randomly selected interference clip, and it was used as the training input for the model.
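As a concrete illustration of this CNN + LSTM mask-prediction design, the sketch below shows one plausible layout in PyTorch. Layer sizes, kernel counts, and the class name MaskNet are our own assumptions, not the published DroneFinderNet architecture; the model maps a noisy magnitude spectrogram to a per-bin mask in [0, 1] and would be trained by minimizing the distance between the masked noisy spectrum and the clean target spectrum.

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Illustrative CNN + LSTM mask estimator (assumed configuration)."""

    def __init__(self, n_freq_bins: int = 257, hidden: int = 128):
        super().__init__()
        self.conv = nn.Sequential(                        # local time-frequency patterns
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(16 * n_freq_bins, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_freq_bins)        # one mask value per bin

    def forward(self, spec_mag):                          # (B, T, F), F == n_freq_bins
        x = self.conv(spec_mag.unsqueeze(1))              # (B, 16, T, F)
        B, C, T, F = x.shape
        x, _ = self.lstm(x.permute(0, 2, 1, 3).reshape(B, T, C * F))
        return torch.sigmoid(self.head(x))                # mask in [0, 1], (B, T, F)
```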

Sensor-Fusion Tracking
After the direction is estimated by the acoustic detection module, the optical device is rotated accordingly via the steering gimbal for a more precise observation of the target. Once the target is captured within the optical FOV, its accurate angular position and depth can be estimated by object localization on red, green, blue (RGB) images and point clouds, respectively. In our scenario, traditional methods are sufficient for single-target recognition, given the contrast between the target and the clear-sky background; advanced methods such as deep-learning models may provide better performance. Target tracking is accomplished using a Kalman filter, which fuses measurements from the three different modules. To ensure consistency and simplicity, all target measurements are represented in the polar coordinate system. The measurement of the acoustic module and the camera is a 2D vector $Z = [x_a, x_e]$, where $x_a$ and $x_e$ are the observed azimuth and elevation. The measurement of the lidar is a 3D vector concatenating the angles with the additional distance measurement $x_d$. We use a first-order system with constant velocity to approximate the flying drone movement, so the target state is a 6D vector $X = [x_a, x_e, x_d, \dot{x}_a, \dot{x}_e, \dot{x}_d]$, representing the polar angles, distance, and the corresponding velocity components. The state extrapolation equation and the measurement equation are

$$ X_{n+1} = \begin{bmatrix} I & \Delta I \\ 0 & I \end{bmatrix} X_n + w_n, \qquad Z_n = \begin{bmatrix} I & 0 \end{bmatrix} X_n + v_n \tag{4} $$

where $\Delta$ is the time duration between the current measurement and the last estimation update of the Kalman filter. $I$ is a unit matrix; in the measurement matrix of Equation (4) its size is either $2 \times 2$ or $3 \times 3$ depending on the dimension of the measurement. $w_n$ is the process noise, and $v_n$ is the measurement error, set manually according to prior knowledge of the sensor performance.
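The following is a minimal NumPy sketch of this constant-velocity tracker. The class name, initial covariance, and noise magnitudes are assumptions rather than the authors' tuned values; measurements of length 2 (microphone array or camera) update only the angles, while length-3 lidar measurements also update distance.

```python
import numpy as np

class PolarKalmanTracker:
    """Constant-velocity Kalman filter over [azimuth, elevation, distance]
    plus their rates; a minimal sketch with assumed noise magnitudes."""

    def __init__(self, q=1e-3):
        self.x = np.zeros(6)            # [x_a, x_e, x_d, v_a, v_e, v_d]
        self.P = np.eye(6) * 1e3        # large initial uncertainty (assumed)
        self.q = q                      # process noise intensity (assumed)

    def predict(self, dt):
        F = np.eye(6)
        F[:3, 3:] = dt * np.eye(3)      # state extrapolation: pos += dt * vel
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.q * np.eye(6)

    def update(self, z, r):
        """z: 2D [az, el] from mic array / camera, or 3D [az, el, dist] from
        lidar; r: per-sensor measurement variance (prior knowledge).
        Note: azimuth wraparound is ignored in this sketch."""
        k = len(z)
        H = np.zeros((k, 6))
        H[:k, :k] = np.eye(k)           # measure the first k state components
        S = H @ self.P @ H.T + r * np.eye(k)
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x += K @ (np.asarray(z) - H @ self.x)
        self.P = (np.eye(6) - K @ H) @ self.P
```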

Environmental Noise Suppression
The drone audio dataset is randomly split with a ratio of 7:3 into a training set of 2100 samples and a testing set of 840 samples. Each sample pair contains a 3 s clean audio piece and a corresponding mixed noisy audio piece. All audio samples have a 22 kHz sampling rate, consistent with the practical sampling rate used in the experimental setup of the microphone array platform. More specifically, the interference audio clips in the preprocessing step are upsampled from 16 to 22 kHz so they can be mixed directly with the clean audio. We use the SNR between the clean audio and the enhanced audio as the metric to evaluate the performance of the trained DroneFinderNet models. The results show that the average SNR of the testing set is improved from 6.5 to 11.73 after enhancement by the denoising network model. The detailed evaluation and comparison of the noisy audio samples and the enhanced audio samples in the testing set are shown in Figure S2, Supporting Information. Furthermore, a video demonstration of the acoustic drone detection coping with apparent noise disturbance with the help of DroneFinderNet can be seen in Video S1, Supporting Information. In this experiment, a drone target flies across the sky while a bird near the test site chirps intensively. In the estimated acoustic maps, the detection result of the traditional beamforming version is severely misled by the sound source of the birdsong, whereas the proposed method greatly suppresses the noise burst and preserves the intensity of the target signals.
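A sketch of how such training pairs and the SNR metric can be produced, under our own conventions (the paper does not spell out its exact mixing gains or SNR formula): interference clips, assumed already resampled to the array's 22 kHz rate, are scaled and added to a clean beamformed clip at a chosen SNR, and enhancement quality is scored by treating the residual against the clean reference as noise.

```python
import numpy as np

def make_training_pair(clean, interference, snr_db):
    """Mix a clean beamformed drone clip with an interference clip at a
    chosen SNR; returns (noisy input, clean target). Clips are assumed
    to share the same sampling rate."""
    n = min(len(clean), len(interference))
    clean, interference = clean[:n], interference[:n]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(interference ** 2) + 1e-12
    # Scale so that 10*log10(p_clean / (gain^2 * p_noise)) == snr_db
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * interference, clean

def snr_db(clean, estimate):
    """SNR of an enhanced clip against its clean reference, treating the
    residual (estimate - clean) as noise (a common convention)."""
    residual = estimate - clean
    return 10 * np.log10(np.sum(clean ** 2) / (np.sum(residual ** 2) + 1e-12))
```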

Analysis of Mic Array SNR Improvement
SNR is a crucial factor in designing an artificial sensory system, as it determines the system's detection range. The noise in a single-channel microphone can be modeled as a combination of sensor noise and electronic noise, both assumed independent and spectrally flat. Let $x_m$ represent the noise signal of the $m$th channel with noise power $P$, and let $y$ represent the beamformed output, i.e., the summation and average over $M$ individual channels. The relationship between the noise power of the beamformed output and the noise power of an individual channel can then be expressed as

$$ P_y = \frac{P}{M} \tag{5} $$

To evaluate this relationship experimentally, we placed the array platform in a soundproof box and collected the received noise data, which was regarded as the static noise. We then randomly selected $m$ channels, with $m$ increasing gradually from 1 to 64, and calculated the noise power of the beamformed output for each selection. The resulting data, plotted as 64 blue dots in Figure 3a, fit well with the inverse proportional curve, which validates the relationship described in Equation (5). Specifically, it shows that for a constant input signal power, the SNR increases linearly with the number of channels utilized in the array system.
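Equation (5) is easy to verify numerically; a short simulation with independent Gaussian channel noise (a modeling assumption matching the flat-spectrum noise model above) shows the beamformed noise power falling as 1/M:

```python
import numpy as np

rng = np.random.default_rng(0)
P, n_samples = 1.0, 200_000
noise = rng.normal(0.0, np.sqrt(P), size=(64, n_samples))  # 64 independent channels

for m in (1, 4, 16, 64):
    y = noise[:m].mean(axis=0)      # beamformed output: average over m channels
    print(f"M={m:2d}  noise power ~ {np.mean(y**2):.4f}  (P/M = {P/m:.4f})")
```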
Spherical acoustic wave propagation causes sound intensity to attenuate in proportion to the square of the distance from the sound source; in decibels (dB), this inverse square law reads $I_d = I_1 - 20 \log_{10}(d)$, where $d$ is the relative distance and $I_1$ is a reference intensity near the source. A beamformer system addresses the issue of attenuated SNR by preserving the target signal power while suppressing the noise power. We conducted outdoor experiments to evaluate the limiting SNR of the system performance. As shown in Figure 3b, the measured drone signal power is plotted as scattered points, while the reference noise power, measured at ≈47 dB, is plotted as a dotted line representing the threshold. It should be noted that the drone flying altitude was limited to 500 m due to safety regulations in cities. From the trend of the curve fitted by the inverse square law, the power of the received target signal falls below the noise threshold at a distance of ≈1.3 km, where it becomes difficult to distinguish the signal from the noise and the detection reliability drops significantly. To the best of our knowledge, this is the maximum drone detection distance reported for acoustic approaches. Figure 3c,d provides an example spectrogram of a drone flying at a far distance of 481 m. On the spectrogram of the single-channel audio, the characteristics of the target sound are submerged in the dominant noise; however, on the spectrogram of the beamformed output audio, the frequency pattern of the received drone audio can still be clearly distinguished. The complete audio comparison of both is provided in Video S2, Supporting Information.
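The ≈1.3 km figure follows directly from the inverse square law by extrapolating the fitted received level until it meets the ≈47 dB noise floor. A small worked check, where the 100 m reference level is back-derived from the reported result rather than taken from the paper:

```python
def detection_range(i_ref_db, d_ref_m, noise_floor_db):
    """Distance at which the received level drops to the noise floor under
    the inverse square law I_d = I_ref - 20*log10(d / d_ref)."""
    return d_ref_m * 10 ** ((i_ref_db - noise_floor_db) / 20)

# Placeholder reference level: a ~47 dB noise floor and a ~1.3 km limit
# correspond to ~69 dB received at 100 m.
print(f"{detection_range(69.3, 100.0, 47.0):.0f} m")   # ~1.3 km
```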

UAV Trajectory Estimation
In a real-world field experiment, we assessed the performance of our multimodal UAV-tracking system. The experiment was carried out in an open field located in the same area as depicted in Figure 1. During the experiment, we controlled the target drone to fly at an altitude of 180 m relative to the ground and to orbit around the test site, with the detection system positioned at the center. The gimbal carrying the optical modules was set to its default orientation, pointing vertically upward. Upon activation of the detection system, the microphone array module estimated the acoustic sources in the surrounding space and provided the direction of the suspicious target, which in turn triggered the gimbal to adjust its orientation. Figure 4a depicts the direction estimation from the acoustic module, represented by the colored cone, as it tracked the movement of the flying target, represented by the blue circles. The gimbal adjusted its orientation accordingly after activation to maintain focus on the target. The optical equipment captured the target within its FOV and began acquiring measurements at t₁, which is ≈6 s in this experiment. Subsequently, additional measurements from the lidar and camera were merged with the acoustic measurements and fed into the Kalman tracker, which provided a 3D trajectory estimation with reduced uncertainty. This, in turn, enabled more precise control of the gimbal's steering adjustment, allowing it to keep pace with the target's movement, as shown in Figure 4b.
The experiment was conducted using the same drone as in Section 5.2, equipped with an onboard real-time kinematic (RTK) receiver to provide absolute location data of the flight trajectory. The RTK data are used as the ground truth and are transformed into the polar coordinate format for evaluation (Figure 4c). Along the timeline axis, it can be observed that, prior to the integration of multimodal data at t₁, the estimated trajectory exhibits larger variation; as more modalities are incorporated into the estimation, it gradually fits the ground truth curve. The top view of the full trajectory comparison on the real-world map is shown in Figure S5, Supporting Information. The average mean square error (MSE) in 3D Cartesian coordinates is 2.8 m at an average distance of 206 m. The MSE of the angle evaluation is presented in Table 1 to provide a more specific understanding of the performance of the individual components and the improvement achieved through fusion. As expected, the acoustic estimation shows the largest deviation, while the fused result exhibits the best accuracy of all. In this system, we utilized three sensors, each with unique advantages and limitations. The microphone array offers a semispherical FOV and can measure elevation and azimuth angles with an angular resolution proportional to the ratio of wavelength to aperture diameter. Distinct from the use of a large-scale array with an aperture of 2-3 m [19,20] or multiple distributed arrays [22] that primarily aim to detect fundamental frequencies below 1 kHz, [18,31] we designed a compact 40 cm array optimized for high-frequency wideband detection. By combining this array with DroneFinderNet, we enhance the acoustic detection module's robustness to noise and its ability to effectively leverage all useful acoustic signals emitted from the target source. However, the angular resolution of the microphone array is coarse, and it lacks target depth information. The camera on the gimbal hence provides precise angular positioning and color/texture information, but no depth information. The lidar, as a third sensor, obtains depth information; the nonrepetitive scanning of the Tele-15 ensures a sufficient number of target points and a high confidence level in distance estimation. We carefully optimized the hardware design of all three sensors and their synergy for different target angles, distances, and sizes. By combining the strengths of these three modalities, we created a powerful target detection and tracking system. For tracking multiple drones from different angles, the microphone array can track several targets simultaneously, but the precise localization subsystem can only track targets within a single FOV. To achieve precise localization of multiple drones, multiple sets of gimbal-mounted cameras and lidars could be added, or the gimbal motion could be accelerated to perform cross-tracking.
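For reference, a sketch of how the trajectory error against the RTK ground truth can be computed, assuming time-aligned trajectories expressed in the same sensor-centered local frame (the error convention here is our own):

```python
import numpy as np

def to_polar(xyz):
    """Sensor-centered Cartesian (east, north, up) -> [azimuth, elevation, distance]."""
    x, y, z = xyz.T
    dist = np.linalg.norm(xyz, axis=1)
    return np.stack([np.arctan2(y, x), np.arcsin(z / dist), dist], axis=1)

def mean_position_error(est_xyz, rtk_xyz):
    """Average Euclidean position error against the RTK ground truth."""
    return np.mean(np.linalg.norm(est_xyz - rtk_xyz, axis=1))
```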

Conclusion
Our results demonstrate that MUTES, which integrates a 64-channel microphone array, a camera, and a lidar, provides wide-range detection (90° × 360°) and high-precision 3D tracking of UAVs. A coarse-to-fine, passive-to-active localization strategy was implemented in MUTES: the well-designed microphone array captures acoustic features and estimates the coarse position of the sound source, and the optical modules are used for further verification and tracking. Additionally, we trained an environmental denoising model to extract drone acoustic features, overcoming the drawbacks of traditional sound-source-localization approaches. A Kalman filtering algorithm for the fusion of the three sensors proved effective, with accuracy validated against RTK ground truth. In terms of both hardware and algorithms, MUTES represents an innovative multimodal detection and tracking system. Furthermore, the progress in privacy protection, drone detection, and multimodal monitoring technologies demonstrated by MUTES has both theoretical and engineering significance. Moreover, the readiness of the system (without complicated engineering or prototyping components) and its modular design provide potential for wider deployment in practical drone detection tasks.

Figure 1 .
Figure 1. Overview of the drone detection scheme. a) Demonstration of the outdoor experimental scenario: the detection system is deployed in the center of a playground with a drone flying above the field. b) Field of view (FOV) of each sensor component utilized in the detection system. c) Timeline of the fusion strategy in the target-tracking Kalman filter. d) Direction estimation result obtained from the acoustic array component. e) Final 3D trajectory estimation obtained from the fused system.

Figure 2 .
Figure 2. The proposed method in the drone-sound-localization module.a) Pipeline of acoustic denoising and source-localization process.b) Dataset collection and training of the denoising network.

Figure 3 .
Figure 3. Performance analysis of the microphone array.a) The noise power declines inversely with respect to the number of channels used in experiments.b) The power of the beamformed acoustic signals collected outdoors changes with respect to the distance of a flying drone.The farthest experimental distance is limited to 500 m due to safety regulations.c,d) Comparison of the spectrograms from a single microphone channel and from the beamformed signals using 64 channels when the target drone is flying 481 m away.The color bar indicates the signal power in dBW.

Figure 4.
Figure 4. The complete process of the coarse-to-fine drone detection approach. a) Initial step: a target is detected and localized by the acoustic estimation; at this point, the gimbal is not yet steered to the region of interest. b) Final trajectory from fused estimation in 3D space. c) Comparison of azimuth and elevation between fused estimation and real-time kinematic (RTK) ground truth. t₁ represents the time when the optical device begins tracking the target.

Table 1 .
Comparison of estimation results from different sensors.