1. Introduction
Drones, as a form of robot, have been widely used in many applications, including search and rescue and military operations [1]. For example, in search and rescue operations, victims are often stranded in areas that are challenging to search [2]. In contrast to conventional means, drones can take advantage of their small size and manoeuvrability to navigate through environments that are difficult to access or to perform more challenging tasks, thereby improving the success rate of locating victims. Recently, drones have also been actively studied in the field of computational auditory scene analysis (CASA) [3,4], otherwise referred to as “drone audition” [5]. Here, drones equipped with auditory sensors help understand the acoustic scenes they are subjected to. This is particularly useful in search and rescue missions, as understanding the surrounding environment significantly influences the efficiency of the search, especially when the victim cannot be located visually due to factors such as unfavourable weather conditions, excessive vegetation coverage in mountainous terrain, or collapsed buildings [3,6].
Drone audition is often realised by attaching an array of microphones (known as a microphone array [7]) to a drone. Because audio from the sound source of interest arrives at each microphone at slightly different times (i.e., with phase differences), a microphone array can capture both the spectral and spatial information regarding the sound source. Sound source localisation (SSL), an essential aspect of CASA, primarily utilises this spatial information to determine the whereabouts of the sound source [5]. To date, numerous studies have attempted to perform SSL using drones. Many of these studies are based on the MUltiple SIgnal Classification (MUSIC) algorithm [8], along with the inclusion of a noise correlation matrix (NCM) to reduce the influence of the drone’s high levels of rotor noise contaminating the microphone recordings [9]. Common approaches include simply averaging segments of the microphone input signals [10,11], whereas more sophisticated approaches utilise multimodal information from various sensors, such as the rotor speed and the inertial measurement unit, to build an NCM specifically for reducing rotor noise [12]. A real-life drone system was also developed in [13], demonstrating real-time SSL on a flying UAV. Other approaches use the generalised cross-correlation phase transform (GCC-PHAT) as the baseline for SSL with drones. For example, the study in [14] utilised GCC-PHAT to perform SSL, along with statistics-based spatial likelihood functions for the target sound source and the rotor noise, to facilitate the reduction of rotor noise and thus also perform source enhancement. On the other hand, the study in [15] employed Wiener filtering (WF) to directly reduce the influence of rotor noise in the microphones’ input signals before applying GCC-PHAT. Recent studies have also proposed approaches using convolutional neural networks for source localisation [16,17,18].
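For readers unfamiliar with GCC-PHAT, the following is a minimal sketch of time-difference-of-arrival (TDOA) estimation between two microphone channels using the PHAT-weighted cross-correlation; the function name and parameters are illustrative and not taken from the cited works.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the TDOA between two channels with GCC-PHAT (illustrative sketch)."""
    n = sig.shape[0] + ref.shape[0]
    # Cross-power spectrum, whitened by its magnitude (the PHAT weighting).
    X = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(X / (np.abs(X) + 1e-12), n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    # Rearrange so that lag 0 sits in the middle of the correlation sequence.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs  # delay in seconds; sign indicates which channel leads
```

Combined with the known geometry of the microphone array, such pairwise TDOAs can be mapped to a direction-of-arrival estimate.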
As such, SSL for drone audition has attracted increasing attention. Most studies to date are based on a single drone with a single microphone array, which is used to estimate the direction of the sound source. However, in search and rescue missions, it is important to identify not only the direction of the sound but also the actual 3D location of the sound source, which is a limitation of most single-drone (single microphone array) approaches. SSL for 3D location estimation has been considered for robots, and various approaches have been proposed. For example, some studies continuously change the orientation of the robot (microphone array) to improve direction estimation results [19,20], or navigate the robot around the sound source, thus obtaining location information from multiple viewpoints to map the sound source location [21,22]. However, it should be noted that the robots in these approaches do not move in response to the estimated sound source location. More recently, some approaches utilise multiple robots, where the trajectory of a sound source can be obtained by calculating triangulation points from the directions estimated by the multiple microphone arrays (one per robot) before obtaining the sound source location through Kalman filtering [23,24,25].
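As a rough illustration of the triangulation step used in these multi-robot approaches, the sketch below (with hypothetical function and variable names) computes the point that is closest, in the least-squares sense, to the bearing rays estimated by several microphone arrays; in [23,24,25] the resulting points are further smoothed by Kalman filtering.

```python
import numpy as np

def triangulate_bearings(origins, directions):
    """Least-squares intersection of bearing rays from several arrays.

    origins    : (N, 3) array of microphone-array positions.
    directions : (N, 3) array of estimated bearing vectors (need not be unit length).
    Returns the 3D point minimising the summed squared distance to all rays.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        # Projector onto the plane orthogonal to the ray direction.
        P = np.eye(3) - np.outer(d, d)
        A += P
        b += P @ o
    return np.linalg.solve(A, b)
```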
Recently, studies on drone audition using multiple microphone arrays, either in the form of multiple arrays attached to a single drone [26] or multiple drones each carrying a single array [27], have also been proposed, most of which are based on triangulation of the direction estimations, or of other forms of SSL estimations, to obtain the sound source’s 3D location [27]. In this study, we focus on the approach using multiple drones. Notable studies include those in [27,28], which not only performed SSL to obtain the 3D source location but, in the case of a moving sound source, also performed sound source tracking (SST). For example, the study in [27] triangulated the direction estimations obtained from each drone using MUSIC to form triangulation points, which were then processed through a Gaussian sum filter. This process ensures that only points highly likely to represent the target sound source are considered. The method is called Multiple Triangulation and Gaussian Sum Filter Tracking (MT-GSFT). This study was later extended such that, instead of triangulating the estimated directions, the MUSIC spectrum from each drone is directly triangulated to form a direction likelihood distribution, from which the location of the sound source can then be obtained. The method is known as Particle Filtering with Integrated MUSIC (PAFIM) [28]. This allows more of the information in the MUSIC spectrum to be considered, as much can be lost when only the peaks of the MUSIC spectrum, where the estimated directions lie, are used. Although these studies have successfully demonstrated their effectiveness in simulations and outdoor experiments [29], there is still room for improvement [28]. As expected, due to the high levels of rotor noise, sound source tracking performance degrades significantly when rotor noise dominates over the target source signal, which can easily occur when (a) the target signal is too weak, or (b) the target source is too far away from the drones. As such, the placement of the drones relative to the target sound source is also an essential factor for consideration [28].
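To make the idea behind PAFIM more concrete, the following simplified sketch weights 3D position particles by the MUSIC-spectrum likelihood each drone assigns to the direction of each particle. It is only a schematic reading of [28]: resampling and the full observation model are omitted, and all names and parameters are hypothetical.

```python
import numpy as np

def pafim_like_update(particles, weights, likelihood_fns, sigma_move=0.5, rng=None):
    """One simplified particle-filter step in the spirit of PAFIM [28].

    particles      : (P, 3) candidate source positions in world coordinates.
    weights        : (P,) particle weights carried over from the previous frame.
    likelihood_fns : one callable per drone; each maps the (P, 3) particle
                     positions to the MUSIC-spectrum value observed by that
                     drone's array in the direction of each particle.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Prediction: random-walk motion model for a possibly moving source.
    particles = particles + rng.normal(0.0, sigma_move, particles.shape)

    # Update: multiply in each array's spatial likelihood, so the full MUSIC
    # spectrum (not only its peak direction) shapes the posterior.
    for lik in likelihood_fns:
        weights = weights * lik(particles)
    weights = weights / np.sum(weights)

    estimate = weights @ particles  # weighted mean as the 3D location estimate
    return particles, weights, estimate
```

A full implementation would also resample degenerate particle sets and account for the arrays' own motion; the sketch only conveys how per-array MUSIC spectra are fused into particle weights.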
Considering these issues, the studies in [27,28] were further extended to consider the optimal placement of the drones/microphone arrays while performing SST [30]. Exploiting the fact that the drones are mobile and can be navigated, the study in [30] developed a drone placement algorithm that continuously optimises the placement of the microphone arrays by navigating the drones so as to maximise SST effectiveness (hereafter referred to as array placement planning). Furthermore, the algorithm also enables the tracking of multiple sound sources by calculating the likelihood of each sound source’s presence based on its previous location estimates. In other words, the study in [30] is one of the first to perform active drone audition, in which the drones take an active role in the SST process. We note that while autonomous navigation for the optimal placement of multiple drones is not yet a widely explored area for sound source tracking, it has brought significant performance improvements in other applications, such as 3D visual map reconstruction from aerial images [31].
While optimising the placement of the drones is a highly effective way to improve SST performance, there is still a lack of direct treatment of the high levels of rotor noise itself. To address this issue, the study in [32] proposed an extension to MT-GSFT and PAFIM that introduces a WF designed using the drone’s rotor noise power spectral density (PSD) [33,34]. This reduces the influence of rotor noise at each microphone channel in a manner similar to that of [15]. The WF approach was adopted due to its effectiveness not only in SSL but also in sound source enhancement for drone audition [35,36,37]. In addition, the study in [32] proposed an NCM design that exploits the fact that the drone’s rotors and the microphone array are fixed in position relative to each other. Therefore, using the steering vectors or impulse response measurements, nulls can be placed towards the rotor locations to suppress the sound arriving from those directions. This is a well-proven approach in sound source enhancement, where the rotor noise’s influence can be reduced using beamforming [33,38].
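The sketch below illustrates these two ideas in simplified form under our own assumptions: a per-bin Wiener gain computed from an estimated rotor-noise PSD, and an NCM assembled from the steering vectors towards the rotor directions (which are fixed relative to the array). It is not a reproduction of the designs in [32,33,34]; the function names and parameters are hypothetical.

```python
import numpy as np

def wiener_gain(noisy_psd, rotor_noise_psd, floor=0.1):
    """Per time-frequency-bin Wiener gain from an estimated rotor-noise PSD."""
    gain = 1.0 - rotor_noise_psd / np.maximum(noisy_psd, 1e-12)
    return np.maximum(gain, floor)  # spectral floor to limit musical noise

def rotor_noise_ncm(rotor_steering_vectors, diag_load=1e-3):
    """Noise correlation matrix built from steering vectors towards the rotors.

    rotor_steering_vectors : (U, M) complex array, one steering vector per rotor,
                             measured or modelled in the array's body coordinates.
    """
    M = rotor_steering_vectors.shape[1]
    ncm = np.zeros((M, M), dtype=complex)
    for a in rotor_steering_vectors:
        ncm += np.outer(a, a.conj())       # rank-1 contribution per rotor
    return ncm + diag_load * np.eye(M)     # diagonal loading for invertibility
```

Such an NCM is typically used to whiten the rotor-noise component within MUSIC (e.g., via a generalised eigenvalue decomposition), which effectively places nulls towards the rotor directions.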
With several techniques proposed in recent years, particularly those from [30,32], this study aims to integrate these techniques and evaluate the performance improvements they bring to SST based on [27,28]. This includes SST not only for a single sound source but also for multiple sound sources. In addition, it has been noted in [32] that the simulations conducted for multidrone SST to date neglect the noise arriving from adjacent drones, on the assumption that it is irrelevant owing to the elevation difference between the drones (which typically operate well above the ground) and the target sound source (which is typically on the ground). However, this noise should not be ignored, given the loudness of the drones, particularly when they are grouped together to track the sound source(s). Therefore, this study also evaluates the degree to which noise from adjacent drones influences multidrone SST. Due to the range of techniques considered in this study, we evaluate them using the PAFIM method only.
Overall, this study makes the following contributions:
Evaluating the performance improvement brought by combining optimal microphone array placement and/or rotor noise reduction techniques with PAFIM;
A more complete/realistic numerical simulation setting;
Assessment of the evaluation results leading to design recommendations for further improving multidrone SST.
The rest of the paper is organised as follows. A brief description of the problem setting is introduced in Section 2, followed by an overview of PAFIM, the optimal drone placement algorithm, and the rotor noise reduction methods in Section 3. The simulation setup for evaluation is shown in Section 4, with the results and discussion presented in Section 5. Finally, we conclude with some remarks in Section 6.
2. Setting
The problem assumes N microphone arrays, each consisting of M sensors and mounted to a drone. Each ith microphone array receives a number of mutually uncorrelated sound sources, including K target sources $s_k(\omega, f)$ ($k = 1, \ldots, K$), spatially coherent noise generated by the U rotors on the drone (with the uth rotor noise denoted as $n_u^{(i)}(\omega, f)$), and ambient spatially incoherent noise, such as wind, in an outdoor free-field environment. For the multidrone scenario, we also consider the noise $n_j(\omega, f)$ radiated by each adjacent drone $j \neq i$ (i.e., all drones apart from the ith drone). Since the drones are, at most times, some distance away from each other, we assume the noise of any neighbouring drone received by the ith microphone array to be a point source, and the noise itself propagates from the jth drone to the ith microphone array in accordance with the steering vector $\boldsymbol{a}_j^{(i)}(\omega)$.
$\omega$ and $f$ denote the angular frequency and frame index, respectively. Here, we assume all drones carry the same number of rotors U. Using the $MN$-channel noisy recordings, the system aims to localise and track the K target source signals [27]. Assuming overdetermined cases (i.e., the number of microphones M exceeds the number of sources observed by each ith drone/microphone array), the short-time Fourier transform (STFT) of the ith ($i = 1, \ldots, N$) microphone array’s input signals are expressed in vector form as
$$\boldsymbol{x}^{(i)}(\omega, f) = \left[x_1^{(i)}(\omega, f), \ldots, x_M^{(i)}(\omega, f)\right]^\top = \sum_{k=1}^{K} \boldsymbol{a}^{(i)}\!\left(\omega, \theta_k^{(i)}, \phi_k^{(i)}\right) s_k(\omega, f) + \sum_{u=1}^{U} \boldsymbol{a}^{(i)}\!\left(\omega, \theta_u^{(i)}, \phi_u^{(i)}\right) n_u^{(i)}(\omega, f) + \sum_{j \neq i} \boldsymbol{a}_j^{(i)}(\omega)\, n_j(\omega, f) + \boldsymbol{v}^{(i)}(\omega, f),$$
where $(\cdot)^\top$ denotes the transpose, and $x_m^{(i)}(\omega, f)$ is the STFT of the mth microphone input signal. $\theta_k^{(i)}$ and $\phi_k^{(i)}$ are the azimuth and elevation directions (i.e., in spherical coordinates) from the kth target sound source to microphone array $i$ in its own body coordinates, respectively. Likewise, $\theta_u^{(i)}$ and $\phi_u^{(i)}$ indicate the azimuth and elevation directions from the uth rotor to microphone array $i$ in its own body coordinates, respectively. $\boldsymbol{a}^{(i)}(\omega, \theta, \phi)$ and $\boldsymbol{v}^{(i)}(\omega, f)$ are, respectively, the steering vector between a source located at directions $\theta, \phi$ and each microphone $m$ of array $i$, and the incoherent noise vector observed by the microphone array.
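As an illustration of the quantities appearing in this model, the sketch below constructs a far-field, free-field steering vector for a given azimuth and elevation in the array’s body coordinates, assuming plane-wave propagation and omnidirectional microphones; actual systems may instead use measured impulse responses, and sign conventions vary between references.

```python
import numpy as np

def steering_vector(mic_positions, azimuth, elevation, freq_hz, c=343.0):
    """Far-field, free-field steering vector for one frequency bin.

    mic_positions       : (M, 3) microphone coordinates in the array's body frame [m].
    azimuth, elevation  : source direction in the body frame [rad].
    """
    # Unit vector pointing from the array towards the source.
    u = np.array([np.cos(elevation) * np.cos(azimuth),
                  np.cos(elevation) * np.sin(azimuth),
                  np.sin(elevation)])
    # Relative propagation delay of each microphone for a plane wave from direction u.
    delays = -(mic_positions @ u) / c
    return np.exp(-1j * 2.0 * np.pi * freq_hz * delays)
```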
In addition to the microphone input signals, the state of each microphone array $i$ is described as
$$\boldsymbol{q}^{(i)}(f) = \left[x^{(i)}(f),\, y^{(i)}(f),\, z^{(i)}(f),\, \psi^{(i)}(f),\, \vartheta^{(i)}(f),\, \varphi^{(i)}(f)\right]^\top,$$
where $x^{(i)}(f)$, $y^{(i)}(f)$, and $z^{(i)}(f)$ indicate the center of microphone array $i$ in the three-dimensional world coordinates, and $\psi^{(i)}(f)$, $\vartheta^{(i)}(f)$, and $\varphi^{(i)}(f)$ indicate the yaw, pitch, and roll angles of microphone array $i$, respectively. The location of the kth sound source is defined as $\boldsymbol{p}_k(f) = \left[x_k(f),\, y_k(f),\, z_k(f)\right]^\top$.
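Since the source and rotor directions above are expressed in each array’s body coordinates, the sketch below shows one way to convert a source location from world coordinates into an array’s body frame using its state (position and yaw/pitch/roll) and to extract the corresponding azimuth and elevation; the intrinsic z-y-x rotation order is our assumption, as conventions differ between platforms.

```python
import numpy as np

def world_to_body_direction(source_pos, array_pos, yaw, pitch, roll):
    """Azimuth/elevation of a source seen from an array, in the array's body frame."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    # Body-to-world rotation Rz(yaw) @ Ry(pitch) @ Rx(roll); its transpose maps
    # world-frame vectors into the body frame.
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    d_world = np.asarray(source_pos) - np.asarray(array_pos)
    d_body = (Rz @ Ry @ Rx).T @ d_world
    azimuth = np.arctan2(d_body[1], d_body[0])
    elevation = np.arctan2(d_body[2], np.linalg.norm(d_body[:2]))
    return azimuth, elevation
```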