Article

Robust DOA Estimation Using Multi-Scale Fusion Network with Attention Mask

School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(11), 4488; https://doi.org/10.3390/app14114488
Submission received: 20 April 2024 / Revised: 22 May 2024 / Accepted: 23 May 2024 / Published: 24 May 2024

Abstract

To overcome the limitations of traditional methods in reverberant and noisy environments, a robust multi-scale fusion neural network with an attention mask is designed to improve direction-of-arrival (DOA) estimation accuracy for acoustic sources. It combines the benefits of deep learning and complex-valued operations to effectively deal with the interference of reverberation and noise in speech signals. The unique properties of complex-valued signals are exploited to fully capture inherent features, and rich information is preserved in the complex domain. An attention mask module is designed to generate distinct masks, based on the input, for selective focusing and masking. After that, the multi-scale fusion block efficiently captures multi-scale spatial features by stacking complex-valued convolutional layers with small kernels, and reduces the module complexity through special branching operations. Experimental results demonstrate that the model achieves significant improvements over other methods for speaker localization in reverberant and noisy environments. It provides a new solution for DOA estimation of acoustic sources in different scenarios, which has significant theoretical and practical implications.

1. Introduction

Direction-of-arrival (DOA) estimation plays a crucial role in various applications such as speech enhancement, speech separation, and speech recognition [1,2,3,4,5,6]. Accurate DOA estimation enables robust audio signal processing in reverberant and noisy environments. Traditional DOA estimation methods, including beamforming and multiple signal classification (MUSIC) [7,8,9,10], have shown limitations in complex acoustic scenarios due to their sensitivity to reverberation and noise. To address these challenges, recent research has explored deep neural networks for DOA estimation tasks [11,12,13,14,15,16].
Eigenmike32 is one of the most widely used microphone arrays in acoustic signal processing [17]. It is a compact sensor array with 32 electret microphones embedded in a rigid sphere with a diameter of 84 mm, which has a great advantage in sound source localization. The sound field can be decomposed into a set of spherical harmonic functions [18]; therefore, signals can be processed in the spherical harmonic domain. In recent years, many deep learning-based DOA methods have been proposed in the spherical harmonic domain. Most of them are based on real-valued neural networks (RVNNs).
However, speech signals after the short-time Fourier transform (STFT) are inherently complex-valued. Conventional neural networks typically concatenate the magnitude and phase components, or the real and imaginary components, and use them as real-valued inputs. Such convolutional neural networks (CNNs) cannot effectively model the complex-valued nature of acoustic signals, leading to suboptimal performance, particularly in reverberant and noisy environments. In contrast, a complex-valued neural network (CVNN) can capture both the magnitude and phase information inherent in audio signals, offering a more comprehensive representation of the underlying data. In the realm of speech processing, CVNNs have demonstrated promising performance across tasks [19,20,21,22,23]. For example, in speech recognition, CVNNs have shown improved robustness to background noise and reverberation, leading to higher recognition accuracy, especially in adverse acoustic conditions [24]. Similarly, in emotion recognition, CVNNs achieved better performance even when dealing with variations in speaking styles and environmental conditions [25]. By explicitly modeling the phase information inherent in speech signals, CVNNs can learn representations that are less sensitive to reverberation and noise than those of RVNNs, ultimately leading to improved performance in various speech processing tasks; this makes them well suited to DOA estimation in challenging acoustic environments. Moreover, CVNNs have been shown to exhibit faster convergence and better generalization than their real-valued counterparts, which is attributed to the fact that complex-valued representations capture richer information about the underlying data distribution, leading to more efficient learning and improved generalization to unseen data.
To better exploit the complex-valued nature of the features while limiting the computational cost, we propose a new CVNN-based DOA estimation method. The main contributions of this work are summarized as follows:
  • A robust multi-scale fusion network with attention mask (MF-AMnet) is proposed based on CVNN. By directly handling complex-valued inputs in the spherical harmonic domain, the proposed method preserves the rich information of the original signal while minimizing the data redundancy caused by concatenation.
  • A low-complexity multi-scale fusion block (MF) is designed to efficiently capture the inherent spatial patterns of the input feature maps by stacking complex-valued convolutional layers with small kernels. The module complexity is effectively reduced by a special branching operation while promoting information flow. Additionally, we adopt an attention mask (AM) module to dynamically assign varying weights to the input features. This enables the network to focus on the relevant information and shield the interference of reverberation and noise.
  • Experimental results on both simulated and real datasets demonstrate that our method has significantly enhanced the accuracy and stability of DOA estimation compared with other state-of-the-art methods.
The rest of the paper is organized as follows. Related work is described in Section 2. Section 3 presents spherical harmonic analysis of signals. In Section 4, the proposed approach is introduced. Section 5 describes the experimental setup and discusses the results obtained from synthetic and real datasets. Finally, conclusions are drawn in Section 6.

2. Related Work

For indoor room impulse responses, even a short analysis frame may contain multiple room reflections [26]. Acoustic reflections are highly correlated or even coherent, especially in a narrow frequency band. Subspace estimation algorithms and beamforming methods require the source signals to be independent, which does not hold for highly correlated reflections. A number of smoothing methods have been developed to preprocess the array covariance matrix in frequency [27], time [28], or space [29] for spherical microphone array processing. Another major limitation of subspace algorithms is their sensitivity to noise: the orthogonality between the signal and noise subspaces is destroyed when the signal is perturbed by noise, and subspace swapping occurs [30]. Maximum likelihood (ML) methods [31] handle coherent signals without smoothing operations, but they cannot easily localize multiple active sound sources simultaneously.
To improve estimation performance in highly reverberant environments, the direct-path dominance test combined with MUSIC (DPD-MUSIC) [32] was proposed. The algorithm identifies time–frequency bins (TF-bins) in which the direct sound of a target source is dominant, thus overcoming the problem of reverberant multipath distortion. However, TF-bins that pass the test may still carry false DOA information. The SHD-RMUSIC algorithm [33] improves robustness to noise by using noise-insensitive relative sound pressure values instead of the original values, but it does not work well at low signal-to-noise ratios (SNRs) and is not suitable for highly reverberant environments. Intensity vector-based DOA estimation methods [34] directly compute the direction of the energy flow rather than spatial spectra, which reduces the computational cost, but they are sensitive to phase mismatch and sensor noise. For spherical arrays, the acoustic field can be transformed into the spherical harmonic domain to generate pseudo intensity vectors (PIVs) [35], which are effective for estimating the direction of a single source in a noise-free environment. PIVs under non-ideal conditions were analyzed and subspace pseudo intensity vectors (SSPIVs) were proposed to reduce the effect of interference from the perspective of feature extraction. However, as with other array signal processing methods, the robustness is not sufficiently improved.
With spherical harmonic decomposition, the magnitudes and phases of the spherical harmonic coefficients can be fed into a CNN to obtain a classification of the DOA [36]. By converting the spherical harmonic function into the Kronecker product of two Vandermonde vectors, the decoupled spherical harmonic azimuth (SHA) and spherical harmonic elevation (SHE) features can be extracted [37], which reduces the dimensionality of the input feature. Based on [36], an M-SVM was utilized to pre-classify the sound source into the eight octants of 3D space, and the azimuth and elevation are then predicted between 0° and 90° by a CNN [38]. This approach offers superior noise and reverberation resistance through the use of spherical harmonic coefficients, while simultaneously reducing the complexity of the neural network by reducing the dimensionality of the input features. However, the use of RVNNs limits performance by overlooking the complex-valued nature of the spherical harmonic coefficients.
In recent years, CVNNs have been widely used in signal processing and other fields [39,40,41,42], such as speech enhancement and image segmentation. A speech signal, as a wave signal, has a phase component that represents the time difference and an amplitude that represents the energy of the wave [43]. A CVNN can deal directly with the phase and amplitude components and can accurately model the phase rotation and amplitude decay implicit in the signal. In addition, CVNNs show faster convergence and better generalization than RVNNs because complex-valued representations capture richer information about the underlying data distribution, leading to more efficient learning and improved generalization to unseen data [44]. Therefore, wave information can be more fully extracted with a CVNN than with an RVNN. In the field of speech processing, CVNNs have shown excellent potential in various tasks [24,25] and have great potential to be adapted to DOA estimation in challenging acoustic environments.
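To make this distinction concrete, the following minimal sketch (illustrative only, not the layers used in this paper) implements a complex-valued dense operation with two real weight matrices; the coupling of the real and imaginary parts is what lets a CVNN model phase rotation directly, rather than treating Re/Im as two unrelated channels.

```python
import numpy as np

def complex_dense(z, w_real, w_imag):
    """Complex-valued linear layer: (a + jb)(c + jd) = (ac - bd) + j(ad + bc).
    The cross terms couple real and imaginary parts, modeling phase rotation."""
    real = z.real @ w_real - z.imag @ w_imag
    imag = z.real @ w_imag + z.imag @ w_real
    return real + 1j * imag

# Toy usage with a unit-magnitude complex input.
rng = np.random.default_rng(0)
z = np.exp(1j * rng.uniform(-np.pi, np.pi, size=(1, 8)))
w_r = rng.standard_normal((8, 4))
w_i = rng.standard_normal((8, 4))
print(complex_dense(z, w_r, w_i).shape)   # (1, 4), complex output
```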
At present, CVNN models for sound source DOA estimation in the spherical harmonic domain have received little attention. The objective of this study is to design a more suitable CVNN-based DOA framework for speech localization.

3. Signal Analysis in the Spherical Harmonic Domain

Consider a spherical microphone array (SMA) with radius $r$ consisting of $L$ array elements. The center of the array coincides with the origin of the coordinate system, and the position of the $l$-th microphone is $\mathbf{r}_l = r(\cos\phi_l \sin\theta_l, \sin\phi_l \sin\theta_l, \cos\theta_l)^T$ in Cartesian coordinates, where $\theta_l$ and $\phi_l$ are the elevation and azimuth angles of the $l$-th microphone, respectively.
Assuming that the elevations and azimuths of the $D$ far-field sources are $\vartheta_d$ and $\varphi_d$, respectively, the model of the signals received by the SMA can be expressed as
$$\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t) + \mathbf{v}(t), \tag{1}$$
where $\mathbf{x}(t)$ is the received signal vector, $\mathbf{A} = [\mathbf{a}(\vartheta_1, \varphi_1), \ldots, \mathbf{a}(\vartheta_D, \varphi_D)]$ is the steering matrix, $\mathbf{s}(t)$ is the source signal vector, $\mathbf{v}(t)$ is the noise vector, and $t$ is the time index.
The steering matrix $\mathbf{A}$ can be decomposed as follows:
$$\mathbf{A} = \mathbf{Y}(\Omega)\,\mathbf{B}(kr)\,\mathbf{Y}^H(\Phi), \tag{2}$$
where $\mathbf{Y}(\Omega) = [\mathbf{y}(\theta_1, \phi_1), \ldots, \mathbf{y}(\theta_L, \phi_L)]^T$ is the spherical harmonic matrix corresponding to the microphone positions, $\mathbf{B}(kr) = \mathrm{diag}\{b_0(kr), b_1(kr), b_1(kr), \ldots, b_N(kr)\}$ is the mode strength matrix, $\mathbf{Y}(\Phi) = [\mathbf{y}(\vartheta_1, \varphi_1), \ldots, \mathbf{y}(\vartheta_D, \varphi_D)]^T$ is similar in structure to $\mathbf{Y}(\Omega)$ but depends only on the DOAs of the $D$ sources, $[\cdot]^T$ represents the transpose, $[\cdot]^H$ is the conjugate transpose, and $\mathbf{y}(\theta, \phi)$ is the spherical harmonic vector [39].
After the STFT, the signal model in (1) can be rewritten as
$$\mathbf{x}(\tau, k) = \mathbf{Y}(\Omega)\,\mathbf{B}(kr)\,\mathbf{Y}^H(\Phi)\,\mathbf{s}(\tau, k) + \mathbf{v}(\tau, k), \tag{3}$$
where $\tau$ is the time frame index, $k = 2\pi f / c$ is the wavenumber corresponding to the frequency $f$, $c$ is the speed of sound, and $\mathbf{x}(\tau, k)$, $\mathbf{s}(\tau, k)$, and $\mathbf{v}(\tau, k)$ are the short-time Fourier transforms of $\mathbf{x}(t)$, $\mathbf{s}(t)$, and $\mathbf{v}(t)$, respectively.
According to the orthogonality criterion for uniformly or approximately uniformly sampled spherical harmonic functions, $\mathbf{Y}^H(\Omega)\mathbf{Y}(\Omega) = \mathbf{I}$ [45], left multiplying (3) by $\mathbf{Y}^H(\Omega)$ gives the spherical harmonic coefficients
$$\mathbf{x}_{nm}(\tau, k) = \mathbf{B}(kr)\,\mathbf{Y}^H(\Phi)\,\mathbf{s}(\tau, k) + \mathbf{v}_{nm}(\tau, k), \tag{4}$$
where $\mathbf{I}$ is the identity matrix, $\mathbf{x}_{nm}(\tau, k) = \mathbf{Y}^H(\Omega)\mathbf{x}(\tau, k)$, and $\mathbf{v}_{nm}(\tau, k) = \mathbf{Y}^H(\Omega)\mathbf{v}(\tau, k)$.
Further left multiplication of the model (4) by $\mathbf{B}^{-1}(kr)$ leads to
$$\hat{\mathbf{x}}_{nm}(\tau, k) = \mathbf{Y}^H(\Phi)\,\mathbf{s}(\tau, k) + \hat{\mathbf{v}}_{nm}(\tau, k), \tag{5}$$
where $\hat{\mathbf{x}}_{nm}(\tau, k) = \mathbf{B}^{-1}(kr)\,\mathbf{x}_{nm}(\tau, k)$ and $\hat{\mathbf{v}}_{nm}(\tau, k) = \mathbf{B}^{-1}(kr)\,\mathbf{v}_{nm}(\tau, k)$.
In (5), the model steering matrix is $\mathbf{Y}^H(\Phi)$, which contains neither the microphone locations nor the mode strength components. The complex-valued $\hat{\mathbf{x}}_{nm}(\tau, k)$ is called the spherical harmonic complex-valued feature (SHC).
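As a concrete illustration of how the SHC feature in (4) and (5) can be computed, the sketch below projects one STFT frame onto the spherical harmonics and removes the mode strength. It assumes the mode-strength values $b_n(kr)$ and the microphone directions are given, and it omits any quadrature weighting; function and variable names are illustrative.

```python
import numpy as np
from scipy.special import sph_harm

def shc_feature(x_frame, theta_mic, phi_mic, mode_strength, order=3):
    """Sketch of Eqs. (4)-(5): project one STFT frame (L microphone values at a
    single frequency bin) onto the spherical harmonics and equalize by the
    mode strength to obtain the SHC feature.

    theta_mic, phi_mic : elevation/azimuth of the L microphones (radians)
    mode_strength      : assumed given b_n(kr) values for n = 0..order
    """
    L = x_frame.shape[0]
    num_coeff = (order + 1) ** 2
    Y = np.zeros((L, num_coeff), dtype=complex)
    col = 0
    for n in range(order + 1):
        for m in range(-n, n + 1):
            # SciPy convention: sph_harm(m, n, azimuth, polar angle)
            Y[:, col] = sph_harm(m, n, phi_mic, theta_mic)
            col += 1
    # Eq. (4): x_nm = Y^H(Omega) x  (uniform-sampling weights assumed absorbed)
    x_nm = Y.conj().T @ x_frame
    # Eq. (5): divide each coefficient of degree n by b_n(kr)
    b = np.concatenate([np.full(2 * n + 1, mode_strength[n])
                        for n in range(order + 1)])
    return x_nm / b
```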

4. The Proposed Method

Acknowledging the complex-valued representation of speech signals in the spherical harmonic domain, our study introduces a novel MF-AMnet, illustrated in Figure 1, to estimate DOAs. MF-AMnet uses the SHC features of the microphone array signals, which are processed by a series of key modules within the network. The network outputs unit vectors that represent the directions of the source signals. The specially designed multi-scale fusion block and attention mask address the challenges posed by the complex-valued nature of the features, offering a promising approach to sound source localization.
The AM module is a crucial component of MF-AMnet, depicted in Figure 2a. It operates on the input SHC and derives a mask from it using a series of complex-valued convolutional layers (C-Conv). The output of AM is normalized by a complex-valued sigmoid function, limiting it to the interval [0, 1]. The normalized output is then element-wise multiplied with the original SHC to obtain the masked SHC. This mask amplifies significant spatial features while attenuating noise and interference. It is important to note that residual connections are used between the first and last C-Conv layers to facilitate information dissemination and enhance the stability of the training process. The significance of the attention mechanism cannot be overlooked, as it enables the network to focus its computational resources on extracting and enhancing the spatial cues required for precise DOA estimation. AM serves as a pre-processing step for MF-AMnet: it filters out irrelevant information and amplifies signal components that indicate the location of sound sources, enabling robust and accurate DOA estimation. After the AM module, the multi-layer composite processing (MCP) module extracts information from the feature map comprehensively. This module is a crucial component and consists of three sub-modules: the MF block, the channel attention mechanism (CAM), and a pooling operation.
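A minimal sketch of the AM idea is given below, with complex convolutions built from pairs of real convolutions. The hidden width, depth, kernel sizes, and the particular choice of complex sigmoid are assumptions for illustration, not the exact configuration of MF-AMnet.

```python
import tensorflow as tf
from tensorflow.keras import layers

class CConv2D(layers.Layer):
    """Complex 2-D convolution from two real convolutions; inputs and
    outputs are (real, imag) tensor pairs."""
    def __init__(self, filters, kernel_size=3):
        super().__init__()
        self.conv_r = layers.Conv2D(filters, kernel_size, padding="same")
        self.conv_i = layers.Conv2D(filters, kernel_size, padding="same")

    def call(self, x):
        xr, xi = x
        # (xr + j xi)(wr + j wi) = (xr wr - xi wi) + j(xr wi + xi wr)
        return (self.conv_r(xr) - self.conv_i(xi),
                self.conv_i(xr) + self.conv_r(xi))

class AttentionMask(layers.Layer):
    """Sketch of the AM module: stacked C-Convs with a residual connection
    produce a mask squashed to [0, 1] that multiplies the input SHC."""
    def __init__(self, in_channels, hidden=32):
        super().__init__()
        self.c1 = CConv2D(hidden)
        self.c2 = CConv2D(hidden)
        self.c3 = CConv2D(in_channels)

    def call(self, x):
        xr, xi = x
        h1r, h1i = self.c1((xr, xi))
        h2r, h2i = self.c2((h1r, h1i))
        mr, mi = self.c3((h1r + h2r, h1i + h2i))     # residual connection
        # One possible "complex sigmoid": squash Re and Im independently.
        mr, mi = tf.sigmoid(mr), tf.sigmoid(mi)
        # Element-wise complex product of mask and input SHC.
        return xr * mr - xi * mi, xr * mi + xi * mr
```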
MF is essential for the network to distinguish complex spatial features, as shown in Figure 2b. It processes the input feature map through a series of stacked C-Conv layers to further extract information in each layer, which overcomes the limitations of a single scale and allows a global understanding of the input. Here, a channel shuffle operation [46] is used to divide the feature maps into two distinct sections, each containing unique information. One section undergoes further convolution processing to analyze the complexity of the spatial representation, while the other section is temporarily retained for subsequent fusion. This structure facilitates a thorough exploration of the feature map and enables the synergistic fusion of information collected at different scales. The shuffle operation enhances the flow and utilization of information, while also reducing the channel dimensionality and the number of network parameters. After the final C-Conv layer, the retained features are concatenated with the processed ones to form the output of MF. MF uses C-Conv and fusion operations to help the network better understand the interrelationships within the input feature map. This multi-scale fusion strategy has low complexity and allows the network to efficiently learn spatial features at different scales, improving its ability to capture complex-valued spatial patterns in the input signals.
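The following sketch, reusing the CConv2D layer from the AM example above, illustrates the shuffle–split–convolve–concatenate pattern of the MF block; the number of stacked C-Convs and their kernel sizes are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_shuffle(x, groups=2):
    """Shuffle channels across groups (ShuffleNet-style) to mix information."""
    _, h, w, c = x.shape
    x = tf.reshape(x, [-1, h, w, groups, c // groups])
    x = tf.transpose(x, [0, 1, 2, 4, 3])
    return tf.reshape(x, [-1, h, w, c])

class MFBlock(layers.Layer):
    """Sketch of the multi-scale fusion block: after a channel shuffle the
    feature map is split in two; one half is refined by stacked small-kernel
    C-Convs while the other half is retained and concatenated back."""
    def __init__(self, channels):
        super().__init__()
        # Two stacked C-Convs as an illustrative multi-scale stack.
        self.convs = [CConv2D(channels // 2) for _ in range(2)]

    def call(self, x):
        xr, xi = x
        xr, xi = channel_shuffle(xr), channel_shuffle(xi)
        keep_r, proc_r = tf.split(xr, 2, axis=-1)
        keep_i, proc_i = tf.split(xi, 2, axis=-1)
        for conv in self.convs:                      # stacked small-kernel C-Convs
            proc_r, proc_i = conv((proc_r, proc_i))
        # Fuse the retained branch with the multi-scale processed branch.
        return (tf.concat([keep_r, proc_r], axis=-1),
                tf.concat([keep_i, proc_i], axis=-1))
```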
After MF, the feature map is fed into CAM [47]. It regulates the importance of individual channels in feature mapping dynamically and identifies their relevance to the underlying target estimation task. It amplifies channels that contain critical spatial information while attenuating irrelevant channels. This facilitates the formation of compact and significant representations that enhance the capabilities of the network to better overcome complex acoustic scenarios. To improve training stability and simplify information propagation, the residual blocks (RBs) are used as pathways for gradient flow, ensuring smooth propagation of error signals throughout the network during training. RB includes a C-Conv with a kernel size of 1 and an average pooling operation [48]. This technique promotes efficient gradient propagation, mitigates the problem of gradient vanishing, and facilitates stable and fast learning dynamics. All C-Convs are followed by a complex-valued rectified linear unit activation [49].
After extracting features from the three MCP blocks, the resulting feature map is flattened and sent to the complex-valued fully connected layer (C-FC). The final output of the MF-AMnet consists of three numbers, which represent the unit vector pointing from the center of SMA to the sound source.
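The regressed unit vector can be converted back to an elevation–azimuth pair with a standard coordinate transformation; the small helper below (not code from the paper) shows one such conversion, with elevation measured from the positive z-axis as in the spherical harmonic convention.

```python
import numpy as np

def unit_vector_to_doa(v):
    """Convert a (possibly unnormalized) 3-D direction vector to
    (elevation, azimuth) in degrees."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    elevation = np.degrees(np.arccos(np.clip(v[2], -1.0, 1.0)))
    azimuth = np.degrees(np.arctan2(v[1], v[0])) % 360.0
    return elevation, azimuth

print(unit_vector_to_doa([1.0, 1.0, 0.0]))   # (90.0, 45.0)
```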

5. Experiments and Discussion

This section presents the experimental setup, the performance tests, and the comparison experiments. The experimental results are also analyzed and discussed.

5.1. Dataset

The training dataset is simulated using the spherical microphone array impulse response (SMIR) generator [50]. The SNR is set in the range [0 dB, 30 dB], which reasonably covers the SNR range of interest. Similarly, the RT60 is randomly selected in the range [0.2 s, 1 s]. The room size is set to room 1 (6 m × 8 m × 3 m). The positions of the array and the sound sources are random within the room under the following conditions: the center of the array is at least 1 m from the sound source and at least 0.5 m from the walls, floor, and ceiling, and the center of the sound source is at least 0.5 m from the walls, floor, and ceiling. Clean speech signals were randomly selected from the Librispeech development corpus [51] and processed with voice activity detection (VAD) [52]. The two-source dataset is generated in the same way as the single-source one, except that two clean speech signals are used instead of one.
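The sampling of one training configuration under these constraints can be sketched as follows; the actual SMIR-based room simulation call is omitted, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
ROOM = np.array([6.0, 8.0, 3.0])     # room 1 dimensions (m)
WALL_MARGIN, MIN_DIST = 0.5, 1.0     # distance constraints from the text

def random_position(margin):
    """Uniform position at least `margin` from the walls, floor, and ceiling."""
    return rng.uniform(margin, ROOM - margin)

def sample_configuration():
    """Draw SNR, RT60, and array/source positions for one training example."""
    snr_db = rng.uniform(0.0, 30.0)          # SNR in [0 dB, 30 dB]
    rt60 = rng.uniform(0.2, 1.0)             # RT60 in [0.2 s, 1 s]
    array_pos = random_position(WALL_MARGIN)
    while True:
        source_pos = random_position(WALL_MARGIN)
        if np.linalg.norm(source_pos - array_pos) >= MIN_DIST:
            break                            # source at least 1 m from the array
    return snr_db, rt60, array_pos, source_pos
```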
The test dataset is divided into two parts: simulated and real. The former is generated in the same way as the training set; in addition, room 2 (7 m × 10 m × 3 m) was added to test the robustness of the framework in different room environments. The latter is derived from the LOCATA dataset of the IEEE-AASP Sound Source Localization and Tracking Challenge [53]. The LOCATA dataset provides an objective benchmark for state-of-the-art sound source localization and tracking algorithms. It contains recordings from a range of real-world scenarios, including single and multiple sources, together with ground-truth source and sensor positions. We selected data from Task 1 and Task 2 as the publicly available real-world test set.
The STFT of the recorded signals is calculated using a 512-point Hanning window with 50% overlap between adjacent frames and a 512-point fast Fourier transform. The sampling frequency is 16 kHz, and the mode strength frequency range is chosen from 500 to 3875 Hz to avoid aliasing.
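For reference, this STFT configuration corresponds to the following SciPy call; the input signal here is a placeholder for one microphone channel.

```python
import numpy as np
from scipy.signal import stft

FS = 16000                                   # sampling frequency (Hz)
x = np.random.randn(FS)                      # placeholder: one second of one channel
f, t, X = stft(x, fs=FS, window="hann", nperseg=512, noverlap=256, nfft=512)
band = (f >= 500) & (f <= 3875)              # mode-strength frequency range
X_band = X[band, :]                          # TF-bins actually used as features
print(X_band.shape)
```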

5.2. Evaluation Indicators

The mean absolute error (MAE) is used to reflect the error between the predicted value and the ground truth:
$$\mathrm{MAE} = \frac{1}{D}\sum_{i=1}^{D} \angle\big((\hat{\vartheta}_i, \hat{\varphi}_i), (\vartheta_i, \varphi_i)\big), \tag{6}$$
where $D$ is the number of sources within a sample, $(\hat{\vartheta}_i, \hat{\varphi}_i)$ is the predicted DOA, $(\vartheta_i, \varphi_i)$ is the ground truth, and $\angle\big((\hat{\vartheta}_i, \hat{\varphi}_i), (\vartheta_i, \varphi_i)\big)$ denotes the angular distance between the predicted and actual DOA.
The root mean square error (RMSE) also measures the error between the predicted value and the true value and is sensitive to outliers in the data. It is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{D}\sum_{i=1}^{D} \angle\big((\hat{\vartheta}_i, \hat{\varphi}_i), (\vartheta_i, \varphi_i)\big)^2}. \tag{7}$$
The MAE and RMSE reported below are averaged over all samples. We also use Acc5°, Acc10°, and Acc15° to denote the percentage of samples that are estimated correctly, i.e., whose MAE is less than 5°, 10°, and 15°, respectively:
$$\mathrm{Acc} = \frac{K_c}{K} \times 100\%, \tag{8}$$
where $K$ and $K_c$ are the total number of samples and the number of correctly estimated samples, respectively.
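The three metrics can be computed from the great-circle angle between the estimated and true directions; the sketch below assumes elevation is measured from the positive z-axis and all angles are in degrees.

```python
import numpy as np

def to_unit_vector(elev_deg, azim_deg):
    el, az = np.radians(elev_deg), np.radians(azim_deg)
    return np.array([np.sin(el) * np.cos(az), np.sin(el) * np.sin(az), np.cos(el)])

def angular_distance(doa_est, doa_true):
    """Great-circle angle (degrees) between two (elevation, azimuth) pairs."""
    u, v = to_unit_vector(*doa_est), to_unit_vector(*doa_true)
    return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def evaluate(estimates, truths, threshold=5.0):
    """MAE, RMSE, and Acc@threshold over lists of (elevation, azimuth) pairs."""
    errs = np.array([angular_distance(e, t) for e, t in zip(estimates, truths)])
    mae = errs.mean()
    rmse = np.sqrt((errs ** 2).mean())
    acc = 100.0 * (errs < threshold).mean()
    return mae, rmse, acc

print(evaluate([(85.0, 44.0)], [(90.0, 45.0)]))   # toy single-sample example
```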

5.3. Baselines

MUSIC performs DOA estimation by searching for peaks of the spatial spectrum built from the noise subspace $\mathbf{V}_n$, which is obtained from the eigendecomposition of the signal covariance matrix. The DPD test is widely used to combat reverberation by selecting, from all TF-bins, those dominated by the direct sound of the source. Applying the DPD test before the subspace processing greatly improves the accuracy of the MUSIC algorithm in reverberant environments; this combination is referred to as DPD-MUSIC.
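For reference, the core of the MUSIC spectrum computation can be sketched as follows (a generic implementation, not the exact DPD-MUSIC pipeline); DOA estimates correspond to the largest peaks over the candidate grid.

```python
import numpy as np

def music_spectrum(R, steering, num_sources):
    """MUSIC pseudo-spectrum over a grid of candidate steering vectors
    (columns of `steering`), given the array covariance matrix R."""
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    Vn = eigvecs[:, :-num_sources]             # noise subspace (smallest eigenvalues)
    proj = np.linalg.norm(Vn.conj().T @ steering, axis=0) ** 2
    return 1.0 / proj                          # peaks indicate candidate DOAs
```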
Based on the spherical harmonic decomposition, the magnitude and phase of the SHC are taken as SHPM features. A CNN consisting of three convolutional layers and two fully connected (FC) layers outputs the estimate. To remain as comparable as possible with MF-AMnet, we use a regression output strategy. This approach is called SHPM-R, and its complex-valued version is denoted SHPM-CR.
For acoustic modeling, a multi-stream network based on the Fourier transform uses two independent branches to process the real and imaginary parts and then fuses them [54]. SHC features are used as inputs to this framework, which we call MS.
CV-CNN is a CVNN that uses multiple complex-valued layers, including C-Conv, C-FC, complex-valued batch normalization, and complex-valued dropout, to estimate DOA [55]. It has been shown to perform well under low SNR and with fewer snapshots.
The Complex Residual Angle Estimation Network (CVRAEN) has also been proposed for DOA estimation [56]. CVRAEN processes the input signal in two stages, using an initial feature extraction module and a deep feature extraction module that includes complex-valued linear layers and C-Convs.
SADOAnet is a deep learning framework with a sparse signal enhancement layer [57]. A sparse representation of the original signal is formed by using a binary mask. Then, the signal is enhanced by signal embedding and position coding. Finally, the DOA estimation output is obtained by a CNN.
According to the characteristics of circular harmonic domain signals, the TF-bins are filtered by the power spectrum of the zeroth-degree, zeroth-order component of the signal, and the DOA is then estimated by a neural network. We denote this circular harmonic domain localization method as SDL [58].

5.4. Experiments

Experimental results are presented and discussed next. All of the experiments mentioned below used the Adam optimizer with mean square error as the loss. The default number of training epochs was set to 100 and the batch size was 64. In addition, training used an early stopping strategy with a patience of 20 epochs.
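In Keras terms, this training setup corresponds roughly to the following sketch; the model and dataset here are stand-ins, and monitoring the validation loss for early stopping is an assumption rather than a detail stated in the text.

```python
import numpy as np
import tensorflow as tf

# Stand-in model and data; in practice these would be MF-AMnet and the SHC features.
model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                             tf.keras.layers.Dense(3)])
train_x, train_y = np.random.randn(1000, 16), np.random.randn(1000, 3)
val_x, val_y = np.random.randn(200, 16), np.random.randn(200, 3)

model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20)

model.fit(train_x, train_y,
          validation_data=(val_x, val_y),
          epochs=100, batch_size=64,           # settings stated in the text
          callbacks=[early_stop])
```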

5.4.1. Performance Comparison

On the simulated dataset generated as described in Section 5.1, MF-AMnet and the baseline models were trained and tested. We kept the same settings, such as the number of units in the C-Conv layers and the activation function. The experimental results are shown in Table 1.
As a traditional algorithm, DPD-MUSIC achieved an Acc5° below 50% in the experiment, indicating that its performance was poor under strong interference; in particular, it was severely affected by reverberation. Similar results were obtained by SHPM-R and SHPM-CR, with SHPM-CR performing slightly better, indicating that the CVNN models the signal more accurately. At the same time, using the SHC as a complex-valued input resulted in a smaller input dimension, which fundamentally reduced the number of network parameters and the computational overhead for all complex-valued methods. MS used the real and imaginary parts of the SHC as inputs, extracted features in separate branches, concatenated them, and produced the output via an FC layer. Its performance was not as good as that of the above two methods, probably because processing the real and imaginary parts independently did not make full use of the information: both parts contain unique and related azimuth and elevation components, and treating them as two separate features weakens the ability to represent the complete information. As complex-valued models, CV-CNN and CVRAEN improved the performance effectively. The MAE of both methods was less than 3°, indicating that the CVNNs estimated more accurately than the RVNNs. The higher Acc5° of CVRAEN may be due to its residual structure, which enables more accurate gradient descent and more stable network training. SADOAnet and SDL performed similarly and slightly worse than complex-valued approaches such as CV-CNN, indicating that the design of the estimation network is important. Among all methods, MF-AMnet shows the best performance, because it applies effective multi-scale processing after the sparse representation of the signal obtained with the attention mask, which allows better feature extraction from the sparse signal. Its MAE was reduced to 2.029°, indicating good estimation accuracy. The Acc5° of MF-AMnet exceeded 90%, reaching 94.07%, nearly 5% higher than the next best method, indicating that our method has good practical significance.
Figure 3 shows the RMSEs of all methods at different SNRs and RT60s. The RMSE of every method decreased with increasing SNR and increased with increasing RT60, showing a consistent trend: each method performed worst under the strongest reverberation and noise. Among them, MF-AMnet maintained a consistently low RMSE throughout, with an average RMSE of less than 3°. This shows that the performance of MF-AMnet was very stable, with strong robustness and generalization even in the presence of interference.

5.4.2. Ablation Experiments

In this section, we present the results of the ablation experiments on MF-AMnet, detailing the role of each module, including AM, MF, CAM, and RB, as shown in Table 2.
The base CVNN was derived from the structure of SHPM-CR, which consists of three C-Conv layers and two C-FC layers. “+” means that the corresponding module was added or changed step by step and the results were recorded; the last row corresponds to the complete model, i.e., MF-AMnet. As can be seen from Table 2, MF brought the largest performance improvement, indicating that multi-scale information extraction from the feature maps is very important for DOA estimation. With only AM added, Acc5° improved slightly, indicating that the SHC still contained a lot of noisy information even after invalid information was removed; the filtering effect of the mask makes the useful information in the SHC more explicit. The addition of CAM guided the network to attend to key information, and RB stabilized the network training, allowing the network to converge better.
The trends of Acc5° and MAE are shown in Figure 4. The utilization of each module had a positive effect, with MF playing the largest role. The effective action and cooperation of all these modules greatly improved the performance and stability of MF-AMnet.

5.4.3. Test on LOCATA Dataset

We also tested the performance of methods on the real-world dataset LOCATA.
Given the priority of accuracy, the better-performing complex-valued neural network methods were included in this experiment: CV-CNN, CVRAEN, and MF-AMnet. Table 3 shows the results for Task 1, and the results for Task 2 are shown in Table 4.
Compared to simulated data, real-world data contain more interference from other factors, such as microphone noise and vibration of the sound sources, so the demands on the algorithm are higher. Mean deviation and standard deviation are indicators of the stability of the algorithms. All methods achieved low values, indicating that the CVNN-based methods are quite stable. Among them, our method achieved values below 3° for both indicators, showing that MF-AMnet maintains its consistent performance even on real data. For the case of multiple sound sources, we also ran simulations and tests, and the results are shown in Table 4. Neural network models are data-driven; the slightly worse results on real data compared with simulated data may be because the tested models were trained on simulated data and then fine-tuned with only a small amount of real data. The degradation may also be due to mutual interference between the two sound sources, especially when they are close to each other. Overall, MF-AMnet kept its leading performance in the experiments, indicating that our algorithm has practical value and significance, even in the case of multiple sound sources. We randomly selected two of the recordings in Task 2, and Figure 5 shows the ground truth and the estimates of the methods.
As shown in Figure 5, the estimations are very close to the ground truth. Even when the sound sources are relatively close to each other, our method distinguished them accurately and produced the most precise estimation. The results of Task 2 prove once again that MF-AMnet has a good practical value.

5.4.4. The Robustness of Methods to Different Noise Types

In addition to white noise, experiments with diffuse noise [59] and pink noise [60] were added. The diffuse noise model takes indoor reverberation into account; for rooms with longer RT60, the diffuse noise field is a more accurate model. Pink noise is one of the most common noises in nature: its energy decays from low to high frequencies, and traffic noise is a typical example. Therefore, both types of noise are of practical significance.
Experiments on the robustness of the methods were conducted under different types of noise. At first, the models trained on the white-noise signals were used directly to test the two newly added noise types, and the results were not satisfactory. This may be because the models had learned inherent patterns of the white-noise features in the previous dataset. Therefore, a mixed-noise dataset was prepared. The new dataset contains the same number of samples for each noise type, and other settings such as SNR and RT60 are consistent with the previous dataset. The test performance of the new models trained on this dataset is shown in Table 5.
For diffuse and pink noise, the performance of all three methods degraded to varying degrees. Due to the irregular variation of diffuse noise, all methods performed relatively poorly under it: the Acc5° of CV-CNN and CVRAEN decreased to about 80%, while MF-AMnet maintained 88%. In pink noise, the Acc5° of all methods decreased by less than 5%. In addition, the MAE of MF-AMnet remained within 3° under all three noise types, reflecting good stability. The overall performance of the model is relatively stable, indicating that MF-AMnet has a certain ability to deal with different types of noise. White noise was used as the noise setting in the subsequent experiments.

5.4.5. Inference Time of Methods

Figure 6 presents the inference time required by the three methods for DOA estimation of a single sample. The inference time is the average over 100 independent inferences. The software platform includes Python 3.7, TensorFlow 2.4.1, and CUDA 11.0.
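Such a measurement can be reproduced with a simple timing loop; the warm-up call before timing below is our assumption rather than a detail stated in the paper.

```python
import time

def average_inference_time_ms(model, sample, runs=100):
    """Average wall-clock time (ms) of `runs` independent single-sample inferences."""
    model(sample)                              # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        model(sample)
    return 1000.0 * (time.perf_counter() - start) / runs
```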
Compared with CV-CNN, CVRAEN and MF-AMnet have higher inference times because of their residual structures; CVRAEN has the longest inference time due to its many layers. MF-AMnet captures multi-scale features, but the multi-branch design and attention mechanism also increase the computational load, so it requires a longer inference time than CV-CNN to achieve DOA estimation owing to its higher complexity. Overall, MF-AMnet achieves a good balance between performance and complexity.

6. Conclusions

In conclusion, the utilization of CVNNs in speech tasks represents a promising avenue for advancing the state of the art in audio signal processing. Our proposed MF-AMnet fully exploits the complex-valued nature of speech signals, thereby conferring enhanced DOA estimation performance through multi-scale feature fusion and an attention mask. Significant improvements in accuracy and precision are achieved in challenging acoustic environments, providing a robust foundation for applications related to DOA estimation in real-world scenarios. In the future, we intend to extend our method to real-time tracking to better accommodate non-stationary sound sources.

Author Contributions

Conceptualization, Q.H.; methodology, Y.Y. and Q.H.; software, Y.Y.; validation, Y.Y. and Q.H.; formal analysis, Q.H.; writing—original draft preparation, Y.Y.; writing—review and editing, Q.H. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Relevant information and codes are available from the corresponding author if required.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Borgström, B.J.; Brandstein, M.S. A Multiscale Autoencoder (MSAE) Framework for End-to-End Neural Network Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2418–2431. [Google Scholar] [CrossRef]
  2. Park, H.J.; Shin, W.; Kim, J.S.; Han, S.W. Leveraging Non-Causal Knowledge Via Cross-Network Knowledge Distillation for Real-Time Speech Enhancement. IEEE Signal Process. Lett. 2024, 31, 1129–1133. [Google Scholar] [CrossRef]
  3. Lee, Y.; Choi, S.; Kim, B.-Y.; Wang, Z.-Q.; Watanabe, S. Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 446–450. [Google Scholar] [CrossRef]
  4. Fraś, M.; Kowalczyk, K. Reverberant Source Separation Using NTF With Delayed Subsources and Spatial Priors. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1954–1967. [Google Scholar] [CrossRef]
  5. Li, J.; Li, C.; Wu, Y.; Qian, Y. Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1941–1953. [Google Scholar] [CrossRef]
  6. Liu, L.; Liu, L.; Li, H. Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1559–1572. [Google Scholar] [CrossRef]
  7. Schmidt, R. Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 1986, 34, 276–280. [Google Scholar] [CrossRef]
  8. Palanisamy, P.; Kishore, C. 2-D DOA estimation of quasi-stationary signals based on Khatri-Rao subspace approach. In Proceedings of the 2011 International Conference on Recent Trends in Information Technology (ICRTIT), Chennai, India, 3–5 June 2011; pp. 798–803. [Google Scholar] [CrossRef]
  9. Wang, X.; Amin, M. Design of optimum sparse array for robust MVDR beamforming against DOA mismatch. In Proceedings of the 2017 IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Curacao, The Netherlands, 10–13 December 2017; pp. 1–5. [Google Scholar] [CrossRef]
  10. Zhu, C.; Wang, W.-Q.; Chen, H.; So, H.C. Impaired Sensor Diagnosis, Beamforming, and DOA Estimation with Difference Co-Array Processing. IEEE Sens. J. 2015, 15, 3773–3780. [Google Scholar] [CrossRef]
  11. Zaken, O.B.; Kumar, A.; Tourbabin, V.; Rafaely, B. Neural-Network-Based Direction-of-Arrival Estimation for Reverberant Speech—The Importance of Energetic, Temporal, and Spatial Information. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1298–1309. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Qu, X.; Li, W.; Miao, H.; Liu, F. DOA Estimation Method Based on Unsupervised Learning Network With Threshold Capon Spectrum Weighted Penalty. IEEE Signal Process. Lett. 2024, 31, 701–705. [Google Scholar] [CrossRef]
  13. Xu, S.; Wang, Z.; Zhang, W.; He, Z. End-to-End Regression Neural Network for Coherent DOA Estimation with Dual-Branch Outputs. IEEE Sens. J. 2024, 24, 4047–4056. [Google Scholar] [CrossRef]
  14. Cai, R.; Tian, Q. Two-Stage Deep Convolutional Neural Networks for DOA Estimation in Impulsive Noise. IEEE Trans. Antennas Propag. 2024, 72, 2047–2051. [Google Scholar] [CrossRef]
  15. Labbaf, N.; Oskouei, H.R.D.; Abedi, M.R. Robust DoA Estimation in a Uniform Circular Array Antenna With Errors and Unknown Parameters Using Deep Learning. IEEE Trans. Green Commun. Netw. 2023, 7, 2143–2152. [Google Scholar] [CrossRef]
  16. Nie, W.; Zhang, X.; Xu, J.; Guo, L.; Yan, Y. Adaptive Direction-of-Arrival Estimation Using Deep Neural Network in Marine Acoustic Environment. IEEE Sens. J. 2023, 23, 15093–15105. [Google Scholar] [CrossRef]
  17. The Eigenmike Microphone Array. [Online]. 2013. Available online: http://www.mhacoustics.com/ (accessed on 22 May 2024).
  18. Williams, E.G. Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography. J. Acoust. Soc. Am. 2000, 108, 1373. [Google Scholar] [CrossRef]
  19. Zhao, S.; Nguyen, T.H.; Ma, B. Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6648–6652. [Google Scholar] [CrossRef]
  20. Shahhoud, F.; Deeb, A.A.; Terekhov, V.I. PESQ enhancement for decoded speech audio signals using complex convolutional recurrent neural network. In Proceedings of the 2024 6th International Youth Conference on Radio Electronics, Electrical and Power Engineering (REEPE), Moscow, Russia, 29 February–2 March 2024; pp. 1–6. [Google Scholar] [CrossRef]
  21. Guo, P.; Yu, M.; Shen, L.; Lin, Z.; An, K.; Wang, J. Single-Channel Blind Source Separation in Wireless Communications: A Complex-Domain Deep Learning Approach. IEEE Wirel. Commun. Lett. 2024; early access. [Google Scholar] [CrossRef]
  22. Saadati, M.; Toroghi, R.M.; Zareian, H. Multi-Level Speaker-Independent Emotion Recognition Using Complex-MFCC and Swin Transformer. In Proceedings of the 2024 20th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP), Babol, Iran, 21–22 February 2024; pp. 1–4. [Google Scholar] [CrossRef]
  23. Deb, S.; Dandapat, S. Emotion Classification using Dual-Tree Complex Wavelet Transform. In Proceedings of the 2017 14th IEEE India Council International Conference (INDICON), Roorkee, India, 15–17 December 2017; pp. 1–5. [Google Scholar] [CrossRef]
  24. Kong, Y.; Wu, J.; Wang, Q.; Gao, P.; Zhuang, W.; Wang, Y.; Xie, L. Multi-Channel Automatic Speech Recognition Using Deep Complex Unet. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 104–110. [Google Scholar] [CrossRef]
  25. Xiang, Y.; Tian, J.; Hu, X.; Xu, X.; Yin, Z. A Deep Representation Learning-Based Speech Enhancement Method Using Complex Convolution Recurrent Variational Autoencoder. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 781–785. [Google Scholar] [CrossRef]
  26. Shlomo, T.; Rafaely, B. Blind Localization of Early Room Reflections Using Phase Aligned Spatial Correlation. IEEE Trans. Signal Process. 2021, 69, 1213–1225. [Google Scholar] [CrossRef]
  27. Khaykin, D.; Rafaely, B. Acoustic analysis by spherical microphone array processing of room impulse responses. J. Acoust. Soc. Am. 2012, 132, 261–270. [Google Scholar] [CrossRef] [PubMed]
  28. Huleihel, N.; Rafaely, B. Spherical array processing for acoustic analysis using room impulse responses and time-domain smoothing. J. Acoust. Soc. Am. 2013, 133, 3995–4007. [Google Scholar] [CrossRef] [PubMed]
  29. Sun, H.; Teutsch, H.; Mabande, E.; Kellermann, W. Robust localization of multiple sources in reverberant environments using EB-ESPRIT with spherical microphone arrays. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 117–120. [Google Scholar] [CrossRef]
  30. Johnson, B.A.; Abramovich, Y.I.; Mestre, X. MUSIC, G-MUSIC, and Maximum-Likelihood Performance Breakdown. IEEE Trans. Signal Process. 2008, 56, 3944–3958. [Google Scholar] [CrossRef]
  31. Hu, Y.; Lu, J. Direction of arrival estimation of multiple acoustic sources using a maximum likelihood method in the spherical harmonic domain. Appl. Acoust. 2018, 135, 85–90. [Google Scholar] [CrossRef]
  32. Nadiri, O.; Rafaely, B. Localization of Multiple Speakers under High Reverberation using a Spherical Microphone Array and the Direct-Path Dominance Test. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1494–1505. [Google Scholar] [CrossRef]
  33. Hu, Y.; Abhayapala, T.D.; Samarasinghe, P.N. Multiple Source Direction of Arrival Estimations Using Relative Sound Pressure Based MUSIC. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 253–264. [Google Scholar] [CrossRef]
  34. Pavlidi, D.; Delikaris-Manias, S.; Pulkki, V.; Mouchtaris, A. 3D localization of multiple sound sources with intensity vector estimates in single source zones. In Proceedings of the 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015; pp. 1556–1560. [Google Scholar] [CrossRef]
  35. Hafezi, S.; Moore, A.H.; Naylor, P.A. Augmented Intensity Vectors for Direction of Arrival Estimation in the Spherical Harmonic Domain. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1956–1968. [Google Scholar] [CrossRef]
  36. Varanasi, V.; Gupta, H.; Hegde, R.M. A Deep Learning Framework for Robust DOA Estimation Using Spherical Harmonic Decomposition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1248–1259. [Google Scholar] [CrossRef]
  37. Huang, Q.; Fang, W. DOA estimation using two independent convolutional neural networks with residual blocks. Digit. Signal Process. 2022, 131, 103765. [Google Scholar] [CrossRef]
  38. Dwivedi, P.; Routray, G.; Hegde, R.M. Octant Spherical Harmonics Features for Source Localization using Artificial Intelligence based on Unified Learning Framework. IEEE Trans. Artif. Intell. 2024. early access. [Google Scholar] [CrossRef]
  39. Dong, Z.; He, H. A training algorithm with selectable search direction for complex-valued feedforward neural networks. Neural Netw. 2021, 137, 75–84. [Google Scholar] [CrossRef]
  40. Costanzo, S.; Flores, A. CVNN-Based Microwave Imaging Approach. In Proceedings of the 2023 IEEE Conference on Antenna Measurements and Applications (CAMA), Genoa, Italy, 15–17 November 2023; pp. 728–731. [Google Scholar] [CrossRef]
  41. Costanzo, S.; Flores, A. CVNN Approach for Microwave Imaging Applications in Brain Cancer: Preliminary Results. In Proceedings of the 2024 18th European Conference on Antennas and Propagation (EuCAP), Glasgow, UK, 17–22 March 2024; pp. 1–3. [Google Scholar] [CrossRef]
  42. Gan, J.; Li, Q.; Shao, H.; Wen, Z.; Yang, T.; Pan, Y.; Sun, G. A Zynq-Based Platform With Conditional-Reconfigurable Complex-Valued Neural Network for Specific Emitter Identification. IEEE Trans. Instrum. Meas. 2024, 73, 5502711. [Google Scholar] [CrossRef]
  43. Hirose, A. Complex-valued neural networks: The merits and their origins. In Proceedings of the 2009 International Joint Conference on Neural Networks, Atlanta, GA, USA, 14–19 June 2009; pp. 1237–1244. [Google Scholar] [CrossRef]
  44. Nitta, T. Solving the XOR problem and the detection of symmetry using a single complex-valued neuron. Neural Netw. 2003, 16, 1101–1105. [Google Scholar] [CrossRef] [PubMed]
  45. Roy, R.; Kailath, T. ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process. 1989, 37, 984–995. [Google Scholar] [CrossRef]
  46. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  47. Chen, Y.; Zhang, X.; Chen, W.; Li, Y.; Wang, J. Research on Recognition of Fly Species Based on Improved RetinaNet and CBAM. IEEE Access 2020, 8, 102907–102919. [Google Scholar] [CrossRef]
  48. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2012, 25, 84–90. [Google Scholar] [CrossRef]
  49. Tachibana, K.; Otsuka, K. Wind Prediction Performance of Complex Neural Network with ReLU Activation Function. In Proceedings of the 2018 57th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), Nara, Japan, 11–14 September 2018; pp. 1029–1034. [Google Scholar] [CrossRef]
  50. Jarrett, D.P.; Habets, E.A.P.; Thomas, M.R.P.; Naylor, P.A. Rigid sphere room impulse response simulation: Algorithm and applications. J. Acoust. Soc. Am. 2012, 132, 1462–1472. [Google Scholar] [CrossRef] [PubMed]
  51. Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar] [CrossRef]
  52. Kim, J.; Hahn, M. Voice Activity Detection Using an Adaptive Context Attention Model. IEEE Signal Process. Lett. 2018, 25, 1181–1185. [Google Scholar] [CrossRef]
  53. Löllmann, H.W.; Evers, C.; Schmidt, A.; Mellmann, H.; Barfuss, H.; Naylor, P.A.; Kellermann, W. The LOCATA Challenge Data Corpus for Acoustic Source Localization and Tracking. In Proceedings of the 2018 IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM), Sheffield, UK, 8–11 July 2018; pp. 410–414. [Google Scholar] [CrossRef]
  54. Loweimi, E.; Yue, Z.; Bell, P.; Renals, S.; Cvetkovic, Z. Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 876–890. [Google Scholar] [CrossRef]
  55. Hu, S.; Zeng, C.; Liu, M.; Tao, H.; Zhao, S.; Liu, Y. Robust DOA Estimation Using Deep Complex-Valued Convolutional Networks with Sparse Prior. In Proceedings of the 2023 6th International Conference on Information Communication and Signal Processing (ICICSP), Xi’an, China, 23–25 September 2023; pp. 234–239. [Google Scholar] [CrossRef]
  56. Zhang, Y.; Zeng, R.; Zhang, S.; Wang, J.; Wu, Y. Complex-Valued Neural Network with Multistep Training for Single-Snapshot DOA Estimation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  57. Zheng, R.; Sun, S.; Liu, H.; Chen, H.; Soltanalian, M.; Li, J. Antenna Failure Resilience: Deep Learning-Enabled Robust DOA Estimation with Single Snapshot Sparse Arrays. Invited paper for IEEE Asilomar conference 2024. arXiv 2024, arXiv:2405.02788. [Google Scholar] [CrossRef]
  58. SongGong, K.; Zhang, P.; Zhang, X.; Sun, M.; Wang, W. Multi-Speaker Localization in the Circular Harmonic Domain on Small Aperture Microphone Arrays Using Deep Convolutional Networks. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 8586–8590. [Google Scholar] [CrossRef]
  59. Habets, E.A.; Gannot, S. Generating sensor signals in isotropic noise fields. J. Acoust. Soc. Am. 2007, 122, 3464–3470. [Google Scholar] [CrossRef]
  60. Rajguru, C.; Brianza, G.; Memoli, G. Sound localization in web-based 3D environments. Sci. Rep. 2022, 12, 12107. [Google Scholar] [CrossRef]
Figure 1. Overall diagram of MF-AMnet.
Figure 2. Details of MF-AMnet. (a) The structure of AM module; (b) schematic diagram of MF block in MCP.
Figure 3. The performance of the methods at different SNRs and RT60s. (a) RMSE of the methods under different SNRs; (b) RMSE of the methods under different RT60s.
Figure 4. Trends in Acc5° and MAE.
Figure 5. The results of MF-AMnet and ground truth for recording 1 and recording 2 from LOCATA Task 2: (a) recording 1; (b) recording 2.
Figure 6. The inference time of CV-CNN, CVRAEN, and MF-AMnet.
Figure 6. The inference time of CV-CNN, CVREAN, and MF-AMnet.
Applsci 14 04488 g006
Table 1. Experimental performance of MF-AMnet and other methods.

Methods      Acc5° (%)   Acc10° (%)   Acc15° (%)   MAE (°)
DPD-MUSIC    43.65       62.39        71.73        9.972
SHPM-R       81.34       97.13        99.31        3.467
SHPM-CR      82.31       97.27        99.40        3.352
MS           76.30       95.74        98.94        3.769
CV-CNN       87.45       98.19        99.77        2.819
CVRAEN       89.21       98.06        99.68        2.690
SADOAnet     85.97       97.55        99.49        3.821
SDL          86.71       97.69        99.49        3.809
MF-AMnet     94.07       99.54        99.95        2.029
Table 2. Ablation results of MF-AMnet.

Configuration   Acc5° (%)   Acc10° (%)   Acc15° (%)   MAE (°)
Base CVNN       82.31       97.27        99.40        3.352
+AM             84.63       96.81        99.44        3.187
+MF             91.76       99.35        99.91        2.440
+CAM            92.36       99.03        99.81        2.285
+RB             94.07       99.54        99.95        2.029
Table 3. Performance of the methods on Task 1.

Methods     Mean Deviation (°)   Standard Deviation (°)
CV-CNN      2.82                 3.14
CVRAEN      2.97                 3.01
MF-AMnet    2.02                 2.58
Table 4. Performance of the methods on Task 2.

Methods     Acc5° (%)   MAE (°)   RMSE (°)
CV-CNN      85.64       3.039     3.685
CVRAEN      88.78       2.779     3.693
MF-AMnet    92.31       2.216     3.209
Table 5. Performance of the methods under different types of noise.

Methods     Noise                  Acc5° (%)   Acc10° (%)   Acc15° (%)   MAE (°)
CV-CNN      Gaussian white noise   94.07       99.54        99.95        2.029
            Diffuse noise          88.08       98.92        99.85        2.986
            Pink noise             92.56       97.42        99.38        2.676
CVRAEN      Gaussian white noise   89.21       98.06        99.68        2.690
            Diffuse noise          80.97       97.67        99.21        3.435
            Pink noise             85.05       97.50        99.44        3.139
MF-AMnet    Gaussian white noise   94.07       99.54        99.95        2.029
            Diffuse noise          88.08       98.92        99.85        2.986
            Pink noise             92.56       97.42        99.38        2.676