Parallel Net: Frequency-Decoupled Neural Network for DOA Estimation in Underwater Acoustic Detection

Yang, Zhikai; Zhang, Xinyu; Luo, Zailei; Shen, Tongsheng; Cui, Mengda; Li, Xionghui

doi:10.3390/jmse13040724


by Zhikai Yang †, Xinyu Zhang †, Zailei Luo *, Tongsheng Shen, Mengda Cui and Xionghui Li

Advanced Interdisciplinary Technology Research, National Innovation Institute of Defense Technology Center, Beijing 100071, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

J. Mar. Sci. Eng. 2025, 13(4), 724; https://doi.org/10.3390/jmse13040724

Submission received: 21 February 2025 / Revised: 25 March 2025 / Accepted: 2 April 2025 / Published: 4 April 2025

(This article belongs to the Topic Advances in Underwater Acoustics and Aeroacoustics)


Abstract

Under wideband interference conditions, traditional neural networks often suffer from low accuracy in single-frequency direction-of-arrival (DOA) estimation and face challenges in detecting single-frequency sound sources. To address this limitation, we propose a novel model called Parallel Net. The architecture adopts a frequency-parallel design: it first employs a recurrent neural network, the generalized feedback gated recurrent unit (GFGRU), to independently extract features from each frequency component, and then it fuses these features through an attention mechanism. This design significantly enhances the network’s capability in estimating the DOA of single-frequency signals. The simulation results demonstrate that when the signal-to-noise ratio (SNR) exceeds −10 dB, Parallel Net achieves a mean absolute error (MAE) below 2°, outperforming traditional frequency-coherent neural networks and the MUSIC algorithm, and reduces the error to half that of classical beamforming (CBF). Further validation on the SWellEx-96 experiment confirms the model’s effectiveness in detecting single-frequency sources under wideband interference. Parallel Net exhibits superior sidelobe suppression and fewer spurious peaks compared to CBF, achieves higher accuracy than MUSIC, and produces smoother and more continuous DOA trajectories than conventional neural network models.

Keywords:

DOA estimation; frequency decoupling; sparse Bayesian learning; GFGRU

1. Introduction

Direction-of-arrival (DOA) estimation is a fundamental technique in underwater acoustic detection, with its origins tracing back to Bartlett’s beamforming method proposed in the 1950s [1]. Traditional DOA estimation methods, such as classical beamforming, MUSIC algorithms, and maximum likelihood estimation, have been widely applied in both theoretical studies and practical scenarios. However, their performance in complex underwater environments is significantly limited by noise interference and multipath effects. In recent years, the rapid development of deep learning technology has provided new solutions to the problem of sound source localization. Numerous studies have explored the use of neural networks to improve localization accuracy and environmental adaptability. For example, Niu et al. [2] utilized feed-forward neural networks (FNNs) for sound source localization, while Huang et al. [3] combined TDNN [4] and CNN-FNN [5] structures to process time-domain and frequency-domain signals, demonstrating their effectiveness in shallow water environments. For more complex sound source signals in deep sea environments, convolutional neural networks such as Inception [6] and ResNet50 [7] have been widely employed to process array signal covariance matrices, achieving further improvements in localization accuracy [8,9,10]. Moreover, recurrent neural networks (RNNs) [11] have been applied to Bayesian processes [12] or enhanced with attention mechanisms to improve prediction accuracy [13,14,15]. In the context of DOA prediction, Niu and Liu explored the feasibility of using FNNs and CNNs for DOA estimation [16,17], but their studies were limited to single-frequency signals. Similarly, Xie [15] investigated the application of RNNs to signals received by a single-vector hydrophone, and Li examined the feasibility of using CNN-RNN combinations for array-based multi-frequency DOA prediction [18], though neither addressed multi-source scenarios comprehensively.

Currently, models leveraging broadband signals often treat neural networks [8,9,10,11,12] as a ‘black box’, typically operating within the framework of frequency-coherent methods [19], with architectures evolving toward increasing complexity [20]. While such networks exploit frequency interrelations to some extent, their high computational complexity and sensitivity to environmental changes limit their practicality. Additionally, preprocessing introduces extra uncertainty at the network input: (1) independent per-frequency normalization may amplify noise at frequencies that contain no signal; and (2) global normalization may obscure weak signals, leading to the loss of critical information. These limitations are particularly pronounced under complex interference conditions. Recent studies [21] have applied neural networks to enhance beamforming outcomes. Although beamforming methods are non-coherent, that work mainly investigates fusion-enhanced networks in the later stages of processing, while non-coherent alternatives receive limited experimental attention. Furthermore, existing studies lack targeted solutions when key frequency features are obscured by interference.

To address these challenges, this study proposes a Parallel Net model inspired by the group convolution structure [22,23]. Unlike traditional frequency-coherent methods [8,9,10,11,12], this model incorporates a frequency-incoherent [24] approach by introducing a grouping mechanism in RNNs. This allows the model to independently process multi-frequency information, thereby decoupling frequency interrelations and mitigating interference between frequency components. During the frequency information fusion phase, the model employs an attention mechanism [25] to suppress irrelevant information and emphasize key features, enhancing its adaptability to complex environments and improving localization accuracy for single-frequency signals. The experimental results show that the proposed method enhances DOA estimation performance under broadband interference, offering a robust and efficient approach to address the limitations of conventional methods in complex acoustic scenarios.

2. Architecture of Parallel Net

2.1. Model Input

As shown in Figure 1, suppose N sound sources impinge on an array with M elements. The signal received by the $m$-th array element can be expressed as follows:

x_{m}(t) = \sum_{n=1}^{N} s_{n}(t) a_{m,n} + n_{m}(t), \quad m = 1, 2, \dots, M,

(1)

where $s_{n}(t)$ represents the signal of the $n$-th sound source; $a_{m,n}$ is the array manifold (steering) coefficient between the $n$-th sound source and the $m$-th receiving array element; and $n_{m}(t)$ denotes the additive Gaussian white noise at the $m$-th element, which is assumed to be independent of the signal.

When N sound sources, $s_{1}(t), s_{2}(t), \dots, s_{N}(t)$, impinge on the array at frequencies $f_{1}, f_{2}, \dots, f_{N}$, the received signal at frequency f can be expressed as follows:

\begin{array}{l} X(f) = A(f) S(f) + N(f) = {[a(θ_{1}), a(θ_{2}), \dots, a(θ_{N})]}_{M \times N} \times {[s_{1}(f), s_{2}(f), \dots, s_{N}(f)]}_{1 \times N}^{T} + N(f) \\ = \begin{bmatrix} 1 & 1 & \dots & 1 \\ e^{-\mathrm{j} 2π d \sin θ_{1} / λ} & e^{-\mathrm{j} 2π d \sin θ_{2} / λ} & \dots & e^{-\mathrm{j} 2π d \sin θ_{N} / λ} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ e^{-\mathrm{j} 2π (M-1) d \sin θ_{1} / λ} & e^{-\mathrm{j} 2π (M-1) d \sin θ_{2} / λ} & \dots & e^{-\mathrm{j} 2π (M-1) d \sin θ_{N} / λ} \end{bmatrix} \begin{bmatrix} s_{1}(f) \\ s_{2}(f) \\ ⋮ \\ s_{N}(f) \end{bmatrix} + \begin{bmatrix} N_{1}(f) \\ N_{2}(f) \\ ⋮ \\ N_{M}(f) \end{bmatrix} \end{array}

(2)

where $A(f)$ is the array manifold matrix, $S(f)$ is the signal vector, $N(f)$ is the noise vector, $d$ is the element spacing, and $λ$ is the wavelength at frequency $f$. $A(f)$ is a Vandermonde matrix, and as long as $M \geq N$ (i.e., the number of sound sources does not exceed the number of array elements) and the sources produce distinct steering vectors, $A(f)$ has full column rank.
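The structure of Equation (2) can be illustrated with a minimal sketch that builds the steering vectors of a uniform line array and forms a noise-free snapshot X(f) = A(f)S(f) for one frequency bin. The function names, the sound speed of 1500 m/s, the 75 Hz tone, the 8-element array, and the half-wavelength spacing are all illustrative assumptions, not values taken from the paper.

```python
import cmath
import math

def steering_vector(theta_deg, num_elements, spacing, wavelength):
    """Column a(theta) of the array manifold matrix for a uniform line array."""
    phase = 2.0 * math.pi * spacing * math.sin(math.radians(theta_deg)) / wavelength
    # Element m carries the phase factor exp(-j * m * 2*pi*d*sin(theta)/lambda).
    return [cmath.exp(-1j * phase * m) for m in range(num_elements)]

def received_snapshot(thetas_deg, source_amps, num_elements, spacing, wavelength):
    """Noise-free X(f) = A(f) S(f) for a single frequency bin."""
    columns = [steering_vector(t, num_elements, spacing, wavelength)
               for t in thetas_deg]
    return [sum(col[m] * s for col, s in zip(columns, source_amps))
            for m in range(num_elements)]

# Illustrative values (assumptions): 1500 m/s sound speed, 75 Hz tone.
wavelength = 1500.0 / 75.0
x = received_snapshot([20.0, -35.0], [1.0 + 0j, 0.5j],
                      num_elements=8, spacing=wavelength / 2,
                      wavelength=wavelength)
```

The first array element acts as the phase reference, so its entry is simply the sum of the source amplitudes; every later element adds a progressively larger phase shift per source.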

The angular information of sound sources is embedded in the array-received signal, enabling the deep neural network to model complex information for extracting the incident angles of sound sources.

Typically, deep neural networks use single-frequency localization methods where the network input is the covariance matrix of the signal at frequency $f$. Compared to time-domain signals or $X(f)$, the covariance matrix has a smaller data volume and can reduce the impact of the phase of the source excitation. This allows for better performance with less training data. The covariance matrix $R(f)$ can be expressed as follows:

R(f) = E[X(f) X^{H}(f)].

(3)

For multi-frequency localization, the network input consists of the covariance matrices of multiple frequency points:

\{R(f_{1}), R(f_{2}), \dots, R(f_{F})\},

(4)

where F denotes the number of frequency points. If the number of array elements is M and the signal contains F frequency points, the input data dimension is M² × F × 2.

The covariance matrix at each frequency point is normalized by its maximum magnitude, as expressed in Equation (5):

R_{\mathrm{norm}}(f) = \frac{R(f)}{\max(|R(f)|)}

(5)
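The preprocessing in Equations (3) to (5) can be sketched as follows for a single frequency point: the expectation is approximated by averaging outer products over snapshots, the result is divided by its largest magnitude, and the real and imaginary parts are split into two channels. The function names and the toy 3-element snapshots are assumptions for illustration; the paper does not state how many snapshots are averaged during training.

```python
def sample_covariance(snapshots):
    """Estimate R(f) = E[X X^H] by averaging outer products over snapshots."""
    m = len(snapshots[0])
    r = [[0j] * m for _ in range(m)]
    for x in snapshots:
        for a in range(m):
            for b in range(m):
                r[a][b] += x[a] * x[b].conjugate()
    k = len(snapshots)
    return [[v / k for v in row] for row in r]

def normalize_by_max(r):
    """Divide every entry by the largest magnitude in the matrix."""
    peak = max(abs(v) for row in r for v in row)
    if peak == 0:
        raise ValueError("covariance matrix is all zeros")
    return [[v / peak for v in row] for row in r]

def to_real_channels(r):
    """Split a complex M x M matrix into an M x M x 2 real array."""
    return [[[v.real, v.imag] for v in row] for row in r]

# Toy data (assumption): two snapshots from a 3-element array.
snapshots = [[1 + 1j, 2 - 1j, 0.5j], [0.5 - 0.5j, 1j, 1 + 0j]]
r = sample_covariance(snapshots)
r_norm = normalize_by_max(r)
net_input = to_real_channels(r_norm)
```

Repeating this for each of the F frequency points and stacking the results gives the M² × F × 2 input described above.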

2.2. Details of the Parallel Net Model

Inspired by group convolution, this paper proposes the Parallel Net model based on the generalized feedback gated recurrent unit (GFGRU) [11] layer, which groups frequency points. Compared with a GRU [26], the GFGRU allows each hidden unit to learn the states of all hidden layers from the previous time step through a global reset gate, adaptively retaining information from all hidden layers of the previous time step. This results in faster convergence and higher accuracy during training. The gating mechanism of the GFGRU is shown in Equation (A1) of Appendix A.1, and the structure of the GFGRU is shown in Figure A1 of Appendix A.2.

As shown in Figure 2, Parallel Net consists of 10 grouped GFGRU layers and squeeze-and-excitation (SE) attention layers [23]. Compared to the multi-head attention in transformers, the SE layer is a computationally lightweight attention implementation method. It can selectively emphasize informative channels by assigning higher weights through threshold-based filtering. In the feature extraction phase, the sparse Bayesian learning network constructed by the GFGRU performs independent operations for each frequency to extract the location information of the sound source. The SE layer then weights each channel based on the activation features of each frequency point, focusing more on effective frequency points. Finally, the average value of the outputs from the Softmax [27] layer of each frequency point is used as the localization result. The detailed parameters are shown in Figure 3.

For comparison, two versions of the Parallel Net model are proposed, V1 and V2, while V3 serves as a baseline model based on the frequency-coherent approach [12]. The structural parameters of these three neural network models are shown in Figure 3, and the complexity evaluation of the different methods is given in Table A1.

V1: Consists of a sparse Bayesian learning network formed by 10 groups of GFGRU layers. Each group contains 256 hidden units, 10 time steps, and 2 stacked layers, and outputs a vector of size 360 representing the direction of arrival of the sound source. Thus, the combined output dimension of the grouped GFGRU layer is 10 × 360 = 3600.
V2: Built upon V1 by incorporating an SE attention layer. The SE layer applies channel-wise weighting with minimal computation, emphasizing effective frequency points for better localization.
V3: Serves as a frequency-coherent baseline model constructed using non-grouped GFGRU layers. The network contains 256 hidden units, 10 time steps, and 2 stacked layers, with the output dimension being 360.
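The channel-wise weighting that distinguishes V2 from V1 can be illustrated with a minimal sketch of a generic squeeze-and-excitation gate, in which each frequency group is treated as one channel. This follows the standard SE recipe (global average, a small bottleneck with ReLU, then a sigmoid) and is not the authors' exact layer: the weight matrices `w1` and `w2`, the bottleneck size, and the toy activations are hypothetical, since the actual parameters are given only in Figure 3.

```python
import math

def se_weights(features, w1, w2):
    """Return one gate value in (0, 1) per frequency group.

    features: F per-frequency activation vectors.
    w1, w2: hypothetical learned matrices of the SE bottleneck.
    """
    # Squeeze: one summary statistic (the mean) per frequency group.
    squeezed = [sum(v) / len(v) for v in features]
    # Excitation, first layer with ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, second layer with sigmoid gating.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def apply_se(features, gates):
    """Scale every frequency group by its gate value."""
    return [[g * x for x in v] for v, g in zip(features, gates)]

# Toy example (assumption): 3 frequency groups with 4 activations each.
features = [[2.0, 0.5, -1.0, 0.5], [0.1, 0.0, -0.1, 0.0], [1.0, 3.0, 0.0, 0.0]]
w1 = [[0.5, -0.2, 0.3], [-0.4, 0.1, 0.6]]
w2 = [[1.0, -1.0], [0.5, 0.5], [-0.3, 2.0]]
gates = se_weights(features, w1, w2)
weighted = apply_se(features, gates)
```

In the paper's setting there would be 10 groups of 360 activations; the gate lets the network down-weight frequency points that carry little source information before the per-frequency outputs are fused.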

2.3. Loss

In this study, DOA prediction is formulated as a classification problem where the output represents the directional probabilities of sound sources. For classification tasks, the Softmax [27] function is used to map the raw outputs of the network into a probability distribution, while the cross-entropy [28] loss function minimizes the discrepancy between the predicted and true distributions.

For a sample containing N frequency points, the Softmax outputs are averaged over the frequency points, and the predicted probability for the j-th output node is defined as follows:

\bar{P}_{j} = \frac{1}{N} \sum_{i=1}^{N} \frac{\exp(z_{i,j})}{\sum_{c=1}^{C} \exp(z_{i,c})}

(6)

where $z_{i,j}$ is the raw output value of the j-th node at the i-th frequency point, C is the total number of output nodes (i.e., the number of DOA classes, C = 360), and $\bar{P}_{j}$ represents the predicted probability for class j.

The cross-entropy loss is expressed as follows:

E = - \sum_{j=1}^{C} y_{j} \ln \bar{P}_{j}

(7)

where $y_{j}$ represents the ground-truth label of the sample for class j.
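Equations (6) and (7) can be sketched in a few lines: a Softmax is applied separately to the raw outputs of each frequency point, the resulting distributions are averaged, and the cross-entropy is taken against the label vector. The function names, the small `eps` guard inside the logarithm, and the 4-class toy input are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable Softmax over one frequency point's outputs."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def averaged_probabilities(logits_per_freq):
    """Average the per-frequency Softmax outputs (Equation (6))."""
    per_freq = [softmax(row) for row in logits_per_freq]
    n = len(per_freq)
    return [sum(p[j] for p in per_freq) / n for j in range(len(per_freq[0]))]

def cross_entropy(labels, probs, eps=1e-12):
    """Cross-entropy between labels and averaged probabilities (Equation (7))."""
    return -sum(y * math.log(p + eps) for y, p in zip(labels, probs))

# Toy example (assumption): 2 frequency points, 4 classes, uniform raw outputs.
logits = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
p_bar = averaged_probabilities(logits)
loss = cross_entropy([0, 1, 0, 0], p_bar)
```

Because each per-frequency Softmax already sums to one, the average is itself a valid probability distribution over the C classes.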

2.4. Dataset and Training Details

2.4.1. Training Set

The training set was generated through simulations in MATLAB-R2022b, following the experimental array configuration of the SWellEx-96 experiment [29]. The incident angles ranged from 1° to 360° in 1° increments, corresponding to noise-free sources. The sampling rate was 3276.8 Hz, with a frequency resolution of 0.8 Hz and a frequency band of 72–79 Hz, including 10 frequency points. By combining the incident angles of two sound sources, a total of $C_{360}^{2}$ = 64,980 training samples were generated. During training, a random frequency point was selected for each sample, and white noise with an amplitude of 0.1–1.4 times the signal strength was added.
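The figures quoted above can be checked with a short calculation. The sketch assumes a 4096-point transform (the time-domain block length listed in Algorithm 1) and assumes that the 10 frequency points are the consecutive bins starting at 72.0 Hz; neither the bin indices nor the pairing convention is stated explicitly in the paper. On the sample count, 64,980 equals the number of unordered angle pairs when the two sources may share the same angle (360 × 361 / 2), whereas pairs of strictly distinct angles would number 64,620.

```python
import math

fs = 3276.8          # sampling rate in Hz
n_fft = 4096         # block length, taken from the 27 x 4096 array in Algorithm 1
resolution = fs / n_fft

# Assumption: the band is covered by 10 consecutive bins starting at 72.0 Hz.
bins = list(range(90, 100))
freqs = [round(b * resolution, 1) for b in bins]

# Unordered pairs of two source angles drawn from 360 directions.
distinct_pairs = math.comb(360, 2)           # both angles different
pairs_with_repeats = distinct_pairs + 360    # same angle allowed for both sources
```

Under these assumptions the selected bins run from 72.0 Hz to 79.2 Hz in 0.8 Hz steps, so the 79 Hz tone falls in the highest of the 10 bins.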

Algorithm 1 outlines the process of generating simulated signals in this paper:

Algorithm 1. The process of generating simulated signals

1. Generate the received signal of the array according to the direction-of-arrival (DOA) angle θ: $X_{27 \times 10}^{θ}$
2. Add the received signals of two sound sources: $X_{27 \times 10} = X_{27 \times 10}^{θ_{1}} + X_{27 \times 10}^{θ_{2}}$
3. Apply the inverse Fourier transform to obtain the time-domain sequence: $F^{-1}\{X_{27 \times 10}\} = S_{27 \times 4096}$
4. Add noise: $S = S + N$
5. Apply the Fourier transform to obtain the frequency-domain signal: $F\{S_{27 \times 4096}\} = X_{27 \times 10}$
6. Calculate the covariance matrix at each frequency point f: $R(f) = E[X(f) X^{H}(f)]$, forming $\{R(f_{1}), R(f_{2}), \dots, R(f_{10})\}$
7. Take the real and imaginary parts of the matrix to form a real-number matrix of size 10 × 27 × 27 × 2 as the input to the network.

2.4.2. Testing Sets

The testing sets were generated using MATLAB simulations, with the array receiving signals at incident angles ranging from 1° to 360° in 1° increments. The scenarios were divided into cases where the sound sources have identical or different frequencies and include the following three types of signal combinations:

Two identical sound sources (72–79 Hz): Two sound sources with identical frequencies impinge on the array from different directions. The incident angles range from 1° to 360° in 1° increments for one source and from 360° to 1° in 1° increments for the other. Noise is added to all sources.
Three identical sound sources (72–79 Hz): Three identical sound sources impinge on the array from different directions. The incident angles of sources 1 and 2 range from 1° to 360° and from 360° to 1°, respectively. The third source impinges from the direction of 180°. All sources operate in the frequency band of 72–79 Hz, and noise is added to simulate real-world conditions.
Three distinct sound sources (72–79 Hz): Three sound sources with distinct frequency bands impinge on the array from different directions. The incident angles of sources 1 and 2 range from 1° to 360° and from 360° to 1°, respectively, with source 1 covering the frequency band of 72–75 Hz and source 2 covering 76–79 Hz. The third source, spanning the frequency band of 72–79 Hz, impinges from the direction of 180°. Figure 4 shows the frequency range representation of the three sound sources. Noise is added to all sources.

In the simulation, partial frequency loss was simulated to represent scenarios where only ocean ambient noise was received, as illustrated in Figure 5. After the time-domain data of each array element were transformed into the frequency domain using Fourier transform, among the selected 10 frequency points, frequencies from 78 Hz to 72 Hz were progressively replaced with white noise to “mask” the signal. Specifically, 2, 4, and 6 frequency points were masked in different cases. In this way, in scenarios with “three distinct sound sources,” the number of frequency points consistently followed: source 2 < source 1 < source 3. When masking 4 or 6 frequency points, source 2 was effectively reduced to a single-frequency signal at 79 Hz.
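The masking procedure can be sketched as follows. The sketch assumes that the 10 frequency points are stored in ascending order and that masking proceeds downward from the second-highest point, so that the highest point (the 79 Hz tone) is never masked; this indexing is an interpretation of the description above, and the function names and noise level are hypothetical.

```python
import random

def masked_indices(num_masked, num_points=10):
    """Indices to mask, counting down from the second-highest frequency point."""
    top = num_points - 2
    return list(range(top - num_masked + 1, top + 1))

def mask_with_noise(spectrum, num_masked, noise_std=1.0, rng=None):
    """Replace the selected frequency points with complex white noise.

    spectrum: list over frequency points, each a list of per-element
    complex values.
    """
    rng = rng or random.Random(0)
    out = [list(row) for row in spectrum]
    for f in masked_indices(num_masked, len(spectrum)):
        out[f] = [complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
                  for _ in out[f]]
    return out

# Toy spectrum (assumption): 10 frequency points, 3 array elements.
spectrum = [[complex(f, m) for m in range(3)] for f in range(10)]
masked = mask_with_noise(spectrum, num_masked=4)
```

With four points masked under this indexing, indices 5 to 8 are replaced, leaving a source confined to the upper half of the band with only the highest frequency point, which matches the single-frequency condition described for source 2.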

3. Experiments and Result Analysis

3.1. Evaluation Method

In this study, the performance of different models is evaluated under various scenarios using two key metrics: mean absolute error (MAE) and root mean square error (RMSE). These metrics are widely used in direction-of-arrival (DOA) estimation tasks to measure the deviation between the predicted and true DOA angles, providing a quantitative assessment of model accuracy.

From the model’s prediction results, the top n angles with the highest probabilities are selected to form a prediction set. The true angles of the sound sources constitute the label set. For each true angle in the label set (denoted as j = 1, 2, …, m), the closest angle in the prediction set is chosen as the predicted DOA.

For a testing set containing K samples, each with m sound sources, the mean absolute error (MAE) is computed as follows:

\mathrm{MAE} = \frac{1}{K \cdot m} \sum_{k=1}^{K} \sum_{j=1}^{m} | \hat{θ}_{k,j} - θ_{k,j}^{label} |,

(8)

The root mean square error (RMSE) is computed as follows:

\mathrm{RMSE} = \frac{1}{m} \sum_{j=1}^{m} \sqrt{\frac{1}{K} \sum_{k=1}^{K} {(\hat{θ}_{k,j} - θ_{k,j}^{label})}^{2}},

(9)

where $\hat{θ}_{k,j}$ represents the predicted angle for the $j$-th sound source in the $k$-th sample, and $θ_{k,j}^{label}$ is the corresponding true angle.
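A minimal sketch of Equations (8) and (9) is given below. It takes predictions that have already been matched to the labels and follows the formulas literally, using plain angle differences without wrap-around at 360°; the function names and the two-sample toy data are assumptions for illustration.

```python
import math

def mae(pred, true):
    """Mean absolute error over K samples and m sources (Equation (8))."""
    k, m = len(true), len(true[0])
    total = sum(abs(p - t)
                for ps, ts in zip(pred, true)
                for p, t in zip(ps, ts))
    return total / (k * m)

def rmse(pred, true):
    """Per-source RMSE over K samples, averaged over sources (Equation (9))."""
    k, m = len(true), len(true[0])
    per_source = [
        math.sqrt(sum((pred[i][j] - true[i][j]) ** 2 for i in range(k)) / k)
        for j in range(m)
    ]
    return sum(per_source) / m

# Toy data (assumption): K = 2 samples, m = 2 sources, angles in degrees.
pred = [[10.0, 200.0], [12.0, 198.0]]
true = [[11.0, 200.0], [12.0, 201.0]]
mae_value = mae(pred, true)
rmse_value = rmse(pred, true)
```

Note that the RMSE is computed per source first and then averaged, so a single poorly tracked source raises it more than it raises the MAE.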

The evaluation focuses on two primary scenarios:

Sound sources with identical frequencies.
Sound sources with different frequencies.

Additionally, real-world data from the SWellEx-96 experiment are used to validate the models on multi-frequency signal localization. The results and analyses for each scenario are detailed in the following sections.

Algorithm 2 describes the evaluation process for traditional methods (CBF/MUSIC) and the neural network (V1/V2/V3):

Algorithm 2. Evaluation Process

1. Obtain the prediction result of each sample: $P(θ), θ \in \{1, 2, \dots, 360\}$
2. If CBF or MUSIC: perform peak detection.
3. Select the top n values as candidate angles: $Θ_{\mathrm{top}\text{-}n} = \arg \mathrm{top}\text{-}n \, P(θ)$
4. Choose the angle closest to the label as the predicted value: $\hat{θ}_{j} = \arg \min_{θ_{i} \in Θ_{\mathrm{top}\text{-}n}} |θ_{i} - θ_{j}^{label}|$
5. Calculate MAE/RMSE
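The candidate selection and matching steps of Algorithm 2 can be sketched as follows. The peak detector shown here (strict local maxima on a circular angle axis) is an assumed stand-in, since the paper does not specify which peak-detection rule is applied to the CBF and MUSIC spectra; the function names and the toy spectrum are likewise hypothetical.

```python
def local_peaks(spectrum):
    """Indices of strict local maxima, treating the angle axis as circular."""
    n = len(spectrum)
    return [i for i in range(n)
            if spectrum[i] > spectrum[i - 1]
            and spectrum[i] > spectrum[(i + 1) % n]]

def top_n_angles(spectrum, n, use_peaks):
    """Return the n candidate angles with the largest P(theta).

    spectrum[i] holds P(theta) for theta = i + 1 degrees.
    use_peaks=True restricts candidates to local maxima (CBF/MUSIC case).
    """
    candidates = local_peaks(spectrum) if use_peaks else range(len(spectrum))
    ranked = sorted(candidates, key=lambda i: spectrum[i], reverse=True)
    return [i + 1 for i in ranked[:n]]

def match_to_labels(candidates, labels):
    """For each true angle, pick the closest candidate angle."""
    return [min(candidates, key=lambda c: abs(c - lab)) for lab in labels]

# Toy spectrum (assumption): peaks at 30 and 200 degrees, a shoulder at 31.
spectrum = [0.0] * 360
spectrum[29], spectrum[30], spectrum[199] = 0.9, 0.5, 0.7
network_candidates = top_n_angles(spectrum, n=2, use_peaks=False)
peak_candidates = top_n_angles(spectrum, n=3, use_peaks=True)
predicted = match_to_labels(network_candidates, labels=[28, 203])
```

Restricting CBF and MUSIC to detected peaks prevents neighbouring points on a single broad lobe from filling all n candidate slots.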

3.2. DOA Estimation for Identical Frequency Sound Sources

Testing sets 1 and 2 consist of sound sources with identical frequencies. This section analyzes the performance of V1, V2, and V3 using these testing sets. Testing set 1 contains two sound sources (m = 2), and the top two points (n = 2) with the highest probabilities from the DOA prediction results are selected to calculate the MAE and RMSE, as shown in Figure 6 and Figure 7.

Overall, the MAE and RMSE of all three models increase as the number of masked frequency points grows, when two identical sound sources are masked at the same frequency points. For SNR > −10 dB, V2 and V3 exhibit similar performance in terms of the MAE and RMSE. However, when 4 to 6 frequency points are masked, V3 experiences a more significant increase in the MAE, exceeding 0.1°, which is slightly higher than that of V2.

In scenarios with SNR < −10 dB, V1 and V2 exhibit notable increases in both the MAE and RMSE, while V3 demonstrates better performance in maintaining stability as the SNR decreases.

Testing set 2 contains three identical sound sources (m = 3). From the DOA prediction results, the top six points (n = 6) with the highest probabilities are selected to calculate the MAE and RMSE, as shown in Figure 8 and Figure 9. Similar to Figure 6 and Figure 7, the errors reach their maximum when 6 frequency points are masked.

For SNR > −15 dB, V3 demonstrates the most stable performance, with the MAE remaining below 0.4° and the RMSE remaining below 2.5°. However, V2 also performs competitively in this range, with MAE and RMSE values only slightly higher than those of V3. In scenarios with 0–2 masked frequency points, both V1 and V2 achieve low MAE values (approximately 1.5°), with V2 demonstrating better consistency as the number of masked points increases.

When 4–6 points are masked, V1 experiences a significant increase in the MAE, exceeding 2.5°, while V2 maintains lower error rates compared to V1. Compared to scenarios with two sound sources, adding a third source results in larger RMSE variations for V1 and V2, which increase with the number of masked frequency points. Notably, when four points are masked, the RMSE of V1 exceeds 5°, whereas V2 shows better adaptability and maintains a more moderate error increase.

These results indicate that for sound sources with identical frequency points, while V3 excels in handling multiple sources, V2 also demonstrates strong robustness and competitive performance, especially under lower masking conditions.

Figure 10 visualizes the DOA prediction results of V1, V2, and V3 under the condition of SNR = −20 dB, for two and three sound sources with identical frequency characteristics, when 6 frequency points are simultaneously masked. The x-axis represents the predicted DOA angles, while the y-axis corresponds to the sequence of test samples.

For both the two-source and three-source scenarios, V3 demonstrates relatively clear DOA prediction patterns. By contrast, while V1 and V2 also exhibit distinct angular variations, their predictions include spurious values at positions other than the true sources. These spurious predictions increase as the number of masked frequency points grows, introducing noise and complicating the determination of the actual source positions.

3.3. DOA Estimation for Distinct Frequency Sound Sources

In this section, the performance of V1, V2, and V3 is evaluated using testing set 3, which contains three sound sources with distinct angles and frequencies (m = 3). From the DOA prediction results, the top six points (n = 6) with the highest probabilities are selected to calculate the MAE and RMSE, as shown in Figure 11 and Figure 12.

The results indicate that the MAE and RMSE of all models are affected by the number of masked frequency points. When 4 frequency points are masked, all three models exhibit their highest MAE values. At this point, source 2 has only one usable frequency for DOA prediction, making it the weakest signal among the three sources. This has the greatest impact on V3, where the MAE exceeds 16° and the RMSE surpasses 22°.

Under higher SNR conditions (SNR > −10 dB), V2 demonstrates the best performance, maintaining an MAE below 1.5° and an RMSE under 5.2°. While V3 achieves an MAE of approximately 1° with no masked frequency points, its performance degrades significantly when frequency points are masked. Specifically, with two masked points, V3’s MAE exceeds 6°, and its RMSE surpasses 14°, making it the least robust model among the three in such scenarios.

These results highlight V2’s strong robustness and superior adaptability under higher SNR conditions, particularly in scenarios with masked frequency points. V2’s ability to handle weaker signals and maintain stable predictions makes it a better choice in multi-source environments.

Figure 13 visualizes the DOA prediction results of V1, V2, and V3 for sources 1, 2, and 3 under the condition of SNR = −20 dB. The x-axis represents the predicted DOA angles, while the y-axis corresponds to the sequence of the test data.

Overall, when frequency points are masked, the impact of frequency loss on V2’s prediction results is minimal, followed by V1, while V3 is the most affected. Specifically, V3 loses its ability to accurately predict the position of source 2 when 2 frequency points are masked. When 6 frequency points are masked, source 2 becomes entirely unobservable in V3’s predictions, as corroborated by Figure 11, indicating that V3 can no longer effectively detect source 2. By contrast, both V1 and V2 maintain relatively clear predictions for source 3, despite being affected by the masking.

Figure 14 illustrates the MAE and RMSE of three neural network models (V1, V2, and V3) and spectral estimation techniques (CBF and MUSIC) for source 2 when 4 frequency points are masked. The results show that the MAE and RMSE of V1 and V2 decrease as the SNR increases, stabilizing at SNR = −10 dB. For V2, the MAE remains close to 2°, and the RMSE stays under 15°. However, V3 exhibits a significantly different trend, with its MAE and RMSE increasing slowly as the SNR increases. Notably, V3’s MAE remains above 49°, and its RMSE exceeds 68° across all SNR conditions, far surpassing those of V2. Between the two spectral estimation techniques (CBF and MUSIC), CBF demonstrates superior performance for source 2, with an MAE below 5° at −10 dB SNR. As shown, CBF’s MAE remains double that of V2 at SNRs above −10 dB.

This indicates that V3 loses the ability to detect single-frequency targets, such as source 2, under these conditions. More detailed results are shown in Table A8 and Table A9.

Based on the above analysis, it can be concluded that when the target sound source has fewer frequencies within the detection band and broadband interfering sources are present simultaneously, the frequency-coherent network structure of V3 tends to ‘ignore’ the target source. By contrast, V1, which employs a frequency-incoherent network structure for information extraction, demonstrates strong adaptability to such scenarios. Building upon V1, V2 incorporates an attention mechanism during the prediction phase to fuse multi-frequency information, resulting in more stable predictions. This allows V2 to achieve relatively accurate DOA predictions even for sound sources with only a single frequency point.

3.4. Evaluation of DOA Models Using Data of SWellEx-96 Experiment

This study utilizes the Event-S59 data from the SWellEx-96 [29] experiment for further comparison. The SWellEx-96 experiment was conducted from 10 to 18 May 1996, approximately 12 km off Point Loma near San Diego, California. The experimental data (test data) were recorded on 13 May 1996, between 11:45 and 12:50, using the HLA North array with a sampling rate of 3276.8 Hz, under conditions with significant interference. The towed sound source emitted tones consisting of five sets of 13 tones, including a 79 Hz tone. Figure 15 shows the tracks of the sound source in the SWellEx-96 experiment and the interfering source.

The HLA North array is a horizontal array with a 240 m aperture deployed on the seafloor. The bearing from the first to the last array element was oriented 34.5 degrees clockwise from true North. The array elements were arranged in a slightly bow-shaped configuration, as illustrated in Figure 16.

This study performs DOA estimation for the first 60 min of Event-S59 data using three neural network models (V1, V2, and V3) as well as traditional methods, including CBF [30] and MUSIC [31]. The data were sampled at 3276.8 Hz with a frequency resolution of 0.8 Hz, covering a frequency band of 72–79 Hz. Within this band, the towed source had a single frequency point at 79 Hz, while the interfering source spanned the entire band of 72–79 Hz. The results are shown in Figure 17 and Figure 18, where green triangles represent the trajectory of the towed source, and red circles indicate the trajectory of the interfering source.

Figure 17 presents the DOA estimation results of V1, V2, V3, and CBF without frequency masking, using a total of 10 frequency points. Among the models, V2 achieves the best performance, with the towed source’s trajectory appearing the clearest and most continuous in Figure 17b. V1 follows as the second-best model. By contrast, V3 demonstrates the poorest performance, as shown in Figure 17c, where the trajectories of both the towed and interfering sources are the least distinct.

Although CBF provides the trajectory of the sound sources, it exhibits significant sidelobes and strong spurious peaks (mirror peaks) at angles symmetrical about the end fire direction.

Figure 18 presents the DOA prediction results of V1, V2, and V3 under the conditions where 6 frequency points are masked, alongside the results of MUSIC without frequency masking. From Figure 18a,c it can be observed that, compared to the unmasked condition, V1 demonstrates a more continuous and clearer trajectory for the target source than V3. However, the results of V1 contain a higher number of spurious points. By contrast, Figure 18b shows that V2 produces the fewest spurious points among the three networks, making it the best-performing model overall.

In Figure 18d, MUSIC, which performs DOA estimation using 10 snapshots, provides a very clear trajectory for the interfering source. However, it struggles to balance the relationship between the target source and the interfering source. Additionally, MUSIC fails to localize the towed source effectively when it is in the end fire direction of the array.

These results are consistent with the simulation findings, further indicating that the frequency-coherent V3 network performs poorly in localizing the single-frequency towed source (79 Hz) under broadband interference. The performance of V3 improves only when certain characteristic frequencies of the interfering source are replaced with white noise. By contrast, V2 demonstrates robust detection capability for single-frequency target sources within the frequency band. Regardless of changes in the frequency of the interfering source, V2 effectively balances the detection of single-frequency target sources with other interfering signals.

4. Conclusions

To address the limitations of existing frequency-coherent neural network structures in estimating the direction of arrival (DOA) of single-frequency sound sources under underwater broadband interference, this paper proposes the Parallel Net model. The model employs parallel GFGRU networks to independently extract information from each frequency point in a decoupled manner, followed by the use of an attention mechanism for multi-frequency information fusion.

Through simulations and analyses of array-received data from the SWellEx-96 experiment, with partial frequencies replaced by white noise, the DOA estimation performance of the proposed model and existing models was evaluated. The simulations considered scenarios with two or three sound sources of identical frequencies and three sound sources with distinct frequencies arriving from different angles.

The results demonstrate that Parallel Net improves DOA estimation accuracy for single-frequency sound sources under broadband interference compared to frequency-coherent neural network methods. Specifically, when the SNR > −10 dB, the MAE for single-frequency sources remains below 2°, outperforming frequency-coherent neural networks and reaching only half of the error of CBF. Validation using the SWellEx-96 experiment further confirmed the model’s robustness in detecting single-frequency targets under broadband interference. Parallel Net exhibits superior sidelobe suppression and fewer spurious peaks compared to CBF, achieves higher detection accuracy than MUSIC, and produces smoother and more continuous DOA trajectories than conventional neural network models.

Significantly, we observed that CBF delivers markedly more stable predictions for the 79 Hz single-frequency source 2 under challenging low-SNR conditions (<−10 dB). By contrast, MUSIC exhibits distinct advantages in suppressing mirror peaks. Future research will focus on developing a hybrid framework that integrates conventional methods (CBF and MUSIC) with neural networks to synergistically fuse their predictions, thereby enhancing source detection performance in low-SNR conditions and complex acoustic environments.

Author Contributions

These authors contributed equally to this work: Z.Y. and X.Z.; Conceptualization, Z.Y. and M.C.; methodology, Z.Y.; software, Z.Y., X.Z. and M.C.; validation, Z.Y., M.C. and X.Z.; formal analysis, Z.Y. and X.L.; writing—original draft preparation, Z.Y. and X.L.; writing—review and editing, Z.L.; visualization, Z.Y.; supervision, T.S. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Part of the code and the raw data that have been analyzed in the manuscript are available in https://blog.csdn.net/YANGN1?type=blog (accessed on 20 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Gating Mechanism of GFGRU

The gating mechanism of the GFGRU can be expressed by the following equations:

\begin{array}{l} z_{t} = σ(W_{z} x_{t} + U_{z} h_{t-1}) \\ r_{t} = σ(W_{r} x_{t} + U_{r} h_{t-1}) \\ \tilde{h}_{t}^{j} = \tanh (W^{j-1 \to j} h_{t}^{j-1} + r_{t}^{j} \odot \sum_{i=1}^{L} g^{i \to j} U^{i \to j} h_{t-1}^{i}) \\ h_{t} = (1 - z_{t}) \odot h_{t-1} + z_{t} \odot \tilde{h}_{t} \end{array}

(A1)

Appendix A.2. Structure of GFGRU

Figure A1 shows the cell of the GFGRU at time T = t, which consists of two layers.

Figure A1. Structure of GFGRU’s unit at time T = t.

Appendix A.3. Complexity Evaluations for Different Methods

Table A1 shows the complexity evaluation of the different methods in the paper. It reports the number of multiply-accumulate operations (MACs) and the parameter count of each model. Additionally, for an intuitive comparison of the three models, Table A1 provides the running time on an RTX 960 m GPU (with 4 GB of memory) and the maximum batch size under the 4 GB memory constraint. Since V1 and V2 involve a loop for each frequency point, their processing times are almost 10 times that of V3.

Table A1. Complexity evaluations for different methods.

Models	MACs(G)	Params(M)	Running Time	Max Batch Size
V1	2.73	167.6	1.01 s	120
V2	2.73	167.6	1.01 s	120
V3	1.28	117.7	0.12 s	1600

Appendix B

Visualization of DOA Estimation Results: CBF vs. MUSIC

Figure A2 and Figure A3 visualize the DOA estimation results of CBF and MUSIC at SNR = −20 dB in the simulation. They display cases with 0 and 6 masked frequency points. The scenarios include: “two identical sound sources”, “three identical sound sources”, and “three distinct sound sources”. The vertical axis represents sample indices, while the horizontal axis denotes angle values.

Under −20 dB SNR conditions, CBF demonstrates stable source 2 detection regardless of frequency masking, although it is significantly affected by mirror peaks. By contrast, while MUSIC exhibits less severe mirror-peak effects, it completely loses source 2 detection capability when 6 frequency points are masked.

Figure A2. CBF’s DOA estimation results for two and three sound sources with 0 or 6 frequency points masked at SNR = −20 dB.

Figure A3. MUSIC’s DOA estimation results for two and three sound sources with 0 or 6 frequency points masked at SNR = −20 dB.

Appendix C