3.1. Evaluation Method
In this study, the performance of different models is evaluated under various scenarios using two key metrics: mean absolute error (MAE) and root mean square error (RMSE). These metrics are widely used in direction-of-arrival (DOA) estimation tasks to measure the deviation between the predicted and true DOA angles, providing a quantitative assessment of model accuracy.
From the model’s prediction results, the top n angles with the highest probabilities are selected to form a prediction set. The true angles of the sound sources constitute the label set. For each true angle in the label set (indexed by j = 1, 2, …, m), the closest angle in the prediction set is taken as the predicted DOA.
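The selection and matching steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 1° angle grid, and the example probabilities are assumptions, not part of the original evaluation code.

```python
def select_top_n(probabilities, angle_grid, n):
    """Return the n grid angles with the highest predicted probabilities."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda k: probabilities[k], reverse=True)
    return [angle_grid[k] for k in ranked[:n]]


def match_to_labels(prediction_set, label_set):
    """For each true angle, pick the closest angle in the prediction set."""
    return [min(prediction_set, key=lambda a: abs(a - true))
            for true in label_set]


# Hypothetical example: a 1-degree grid over 0..180 with three non-zero bins.
angle_grid = list(range(181))
probabilities = [0.0] * 181
probabilities[40], probabilities[41], probabilities[120] = 0.9, 0.6, 0.8

prediction_set = select_top_n(probabilities, angle_grid, n=2)
predicted = match_to_labels(prediction_set, label_set=[41, 119])
```

In this sketch, nothing prevents two true angles from being matched to the same predicted angle; whether the original evaluation enforces a one-to-one assignment is not stated.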
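For a testing set containing K samples, each with m sound sources, the mean absolute error (MAE) is computed as follows:
$$\mathrm{MAE} = \frac{1}{Km} \sum_{i=1}^{K} \sum_{j=1}^{m} \left| \hat{\theta}_{i,j} - \theta_{i,j} \right|$$
Root mean square error (RMSE) is computed as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{Km} \sum_{i=1}^{K} \sum_{j=1}^{m} \left( \hat{\theta}_{i,j} - \theta_{i,j} \right)^{2}}$$
where $\hat{\theta}_{i,j}$ represents the predicted angle for the $j$-th sound source in the $i$-th sample, and $\theta_{i,j}$ is the corresponding true angle.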
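As a minimal sketch, both metrics can be computed by averaging the absolute and the squared angle errors over all K × m source estimates. The function name and the nested-list layout (K rows of m angles, in degrees) are assumptions made for illustration, and the errors are taken as plain differences without angle wrapping.

```python
import math


def mae_rmse(predicted, true):
    """Compute MAE and RMSE over K samples with m sources each.

    predicted, true: lists of K rows, each row holding m angles in degrees.
    """
    errors = [p - t
              for p_row, t_row in zip(predicted, true)
              for p, t in zip(p_row, t_row)]
    count = len(errors)  # equals K * m
    mae = sum(abs(e) for e in errors) / count
    rmse = math.sqrt(sum(e * e for e in errors) / count)
    return mae, rmse


# Hypothetical values: K = 2 samples, m = 2 sources.
predicted = [[40.0, 120.0], [42.0, 118.0]]
true = [[41.0, 119.0], [42.0, 121.0]]
mae, rmse = mae_rmse(predicted, true)
```

Because the RMSE squares each error before averaging, a few large misses raise it far more than they raise the MAE, which is why the two metrics can diverge in the results that follow.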
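The evaluation focuses on two primary scenarios: sound sources with identical frequencies (Section 3.2) and sound sources with distinct frequencies (Section 3.3).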
Additionally, real-world data from the SWellEx-96 experiment are used to validate the models on multi-frequency signal localization. The results and analyses for each scenario are detailed in the following sections.
Algorithm 2 describes the evaluation process for the traditional methods (CBF/MUSIC) and the neural networks (V1/V2/V3):
Algorithm 2. Evaluation Process
1. Obtain the prediction result of each sample.
2. If the method is CBF or MUSIC, perform peak detection.
3. Select the top n values as candidate angles.
4. Choose the angle closest to the label as the predicted value.
5. Calculate the MAE and RMSE.
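For CBF and MUSIC, the peak detection step matters because the largest values of a spatial spectrum often lie on the shoulders of a single lobe. The sketch below uses a simple local-maximum rule; the rule itself, the function name, and the example spectrum are assumptions for illustration, since the peak detector used in the original evaluation is not specified.

```python
def top_n_peaks(spectrum, angle_grid, n):
    """Return the angles of the n largest local maxima of a spatial spectrum.

    A point counts as a peak if it is higher than its left neighbour and
    at least as high as its right neighbour; the two end points are skipped.
    """
    peaks = [k for k in range(1, len(spectrum) - 1)
             if spectrum[k] > spectrum[k - 1] and spectrum[k] >= spectrum[k + 1]]
    peaks.sort(key=lambda k: spectrum[k], reverse=True)
    return [angle_grid[k] for k in peaks[:n]]


# Hypothetical spectrum: the two largest values (0.9 and 0.8) sit on one lobe.
angle_grid = [0, 10, 20, 30, 40, 50, 60, 70, 80]
spectrum = [0.1, 0.8, 0.9, 0.4, 0.2, 0.6, 0.5, 0.7, 0.1]
candidates = top_n_peaks(spectrum, angle_grid, n=2)
```

Here a plain top-2 selection would return 20° and 10°, both from the same lobe, whereas the peak-based selection returns the two strongest distinct lobes.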
3.2. DOA Estimation for Identical Frequency Sound Sources
Testing sets 1 and 2 consist of sound sources with identical frequencies. This section analyzes the performance of V1, V2, and V3 using these testing sets. Testing set 1 contains two sound sources (m = 2), and the top two points (n = 2) with the highest probabilities from the DOA prediction results are selected to calculate the MAE and RMSE, as shown in Figure 6 and Figure 7.
Overall, when two identical sound sources are masked at the same frequency points, the MAE and RMSE of all three models increase with the number of masked frequency points. For SNR > −10 dB, V2 and V3 exhibit similar performance in terms of the MAE and RMSE. However, when 4 to 6 frequency points are masked, the MAE of V3 increases more markedly, exceeding 0.1°, and is slightly higher than that of V2.
In scenarios with SNR < −10 dB, V1 and V2 exhibit notable increases in both the MAE and RMSE, while V3 remains more stable as the SNR decreases.
Testing set 2 contains three identical sound sources (m = 3). From the DOA prediction results, the top six points (n = 6) with the highest probabilities are selected to calculate the MAE and RMSE, as shown in Figure 8 and Figure 9. Similar to Figure 6 and Figure 7, the errors reach their maximum when 6 frequency points are masked.
For SNR > −15 dB, V3 demonstrates the most stable performance, with the MAE remaining below 0.4° and the RMSE remaining below 2.5°. However, V2 also performs competitively in this range, with MAE and RMSE values only slightly higher than those of V3. In scenarios with 0–2 masked frequency points, both V1 and V2 achieve low MAE values (approximately 1.5°), with V2 demonstrating better consistency as the number of masked points increases.
When 4–6 points are masked, V1 experiences a significant increase in the MAE, exceeding 2.5°, while V2 maintains lower error rates compared to V1. Compared to scenarios with two sound sources, adding a third source results in larger RMSE variations for V1 and V2, which increase with the number of masked frequency points. Notably, when four points are masked, the RMSE of V1 exceeds 5°, whereas V2 shows better adaptability and maintains a more moderate error increase.
These results indicate that for sound sources with identical frequency points, while V3 excels in handling multiple sources, V2 also demonstrates strong robustness and competitive performance, especially under lower masking conditions.
Figure 10 visualizes the DOA prediction results of V1, V2, and V3 under the condition of SNR = −20 dB, for two and three sound sources with identical frequency characteristics, when 6 frequency points are simultaneously masked. The x-axis represents the predicted DOA angles, while the y-axis corresponds to the sequence of test samples.
For both the two-source and three-source scenarios, V3 demonstrates relatively clear DOA prediction patterns. By contrast, while V1 and V2 also exhibit distinct angular variations, their predictions include spurious values at positions other than the true sources. These spurious predictions increase as the number of masked frequency points grows, introducing noise and complicating the determination of the actual source positions.
3.3. DOA Estimation for Distinct Frequency Sound Sources
In this section, the performance of V1, V2, and V3 is evaluated using testing set 3, which contains three sound sources with distinct angles and frequencies (m = 3). From the DOA prediction results, the top six points (n = 6) with the highest probabilities are selected to calculate the MAE and RMSE, as shown in Figure 11 and Figure 12.
The results indicate that the MAE and RMSE of all models are affected by the number of masked frequency points. When 4 frequency points are masked, all three models exhibit their highest MAE values. At this point, source 2 has only one usable frequency for DOA prediction, making it the weakest signal among the three sources. This has the greatest impact on V3, where the MAE exceeds 16° and the RMSE surpasses 22°.
Under higher SNR conditions (SNR > −10 dB), V2 demonstrates the best performance, maintaining an MAE below 1.5° and an RMSE under 5.2°. While V3 achieves an MAE of approximately 1° with no masked frequency points, its performance degrades significantly when frequency points are masked. Specifically, with two masked points, V3’s MAE exceeds 6°, and its RMSE surpasses 14°, making it the least robust model among the three in such scenarios.
These results highlight V2’s strong robustness and superior adaptability under higher SNR conditions, particularly in scenarios with masked frequency points. V2’s ability to handle weaker signals and maintain stable predictions makes it a better choice in multi-source environments.
Figure 13 visualizes the DOA prediction results of V1, V2, and V3 for sources 1, 2, and 3 under the condition of SNR = −20 dB. The x-axis represents the predicted DOA angles, while the y-axis corresponds to the sequence of the test data.
Overall, when frequency points are masked, the impact of frequency loss on V2’s prediction results is minimal, followed by V1, while V3 is the most affected. Specifically, V3 loses its ability to accurately predict the position of source 2 when 2 frequency points are masked. When 6 frequency points are masked, source 2 becomes entirely unobservable in V3’s predictions, as corroborated by Figure 11, indicating that V3 can no longer effectively detect source 3. By contrast, both V1 and V2 maintain relatively clear predictions for source 3, despite being affected by the masking.
Figure 14 illustrates the MAE and RMSE of three neural network models (V1, V2, and V3) and spectral estimation techniques (CBF and MUSIC) for source 2 when 4 frequency points are masked. The results show that the MAE and RMSE of V1 and V2 decrease as the SNR increases, stabilizing at SNR = −10 dB. For V2, the MAE remains close to 2°, and the RMSE stays under 15°. However, V3 exhibits a significantly different trend, with its MAE and RMSE increasing slowly as the SNR increases. Notably, V3’s MAE remains above 49°, and its RMSE exceeds 68° across all SNR conditions, far surpassing those of V2. Between the two spectral estimation techniques (CBF and MUSIC), CBF demonstrates superior performance for source 2, with an MAE below 5° at −10 dB SNR. As shown, CBF’s MAE remains double that of V2 at SNRs above −10 dB.
This indicates that V3 loses the ability to detect single-frequency targets, such as source 2, under these conditions. More detailed results are shown in Table A8 and Table A9.
Based on the above analysis, it can be concluded that when the target sound source has fewer frequencies within the detection band and broadband interfering sources are present simultaneously, the frequency-coherent network structure of V3 tends to ‘ignore’ the target source. By contrast, V1, which employs a frequency-incoherent network structure for information extraction, demonstrates strong adaptability to such scenarios. Building upon V1, V2 incorporates an attention mechanism during the prediction phase to fuse multi-frequency information, resulting in more stable predictions. This allows V2 to achieve relatively accurate DOA predictions even for sound sources with only a single frequency point.
3.4. Evaluation of DOA Models Using Data of SWellEx-96 Experiment
This study utilizes the Event-S59 data from the SWellEx-96 [29] experiment for further comparison. The SWellEx-96 experiment was conducted from 10 to 18 May 1996, approximately 12 km off Point Loma near San Diego, California. The experimental data (test data) were recorded on 13 May 1996, between 11:45 and 12:50, using the HLA North array with a sampling rate of 3276.8 Hz, under conditions with significant interference. The towed sound source emitted five sets of 13 tones, including a 79 Hz tone.
Figure 15 shows the tracks of the towed sound source and the interfering source in the SWellEx-96 experiment.
The HLA North array is a horizontal array with a 240 m aperture deployed on the seafloor. The bearing from the first to the last array element was oriented 34.5 degrees clockwise from true North. The array elements were arranged in a slightly bow-shaped configuration, as illustrated in Figure 16.
This study performs DOA estimation for the first 60 min of Event-S59 data using three neural network models (V1, V2, and V3) as well as traditional methods, including CBF [30] and MUSIC [31]. The data were sampled at 3276.8 Hz with a frequency resolution of 0.8 Hz, covering a frequency band of 72–79 Hz. Within this band, the towed source had a single frequency point at 79 Hz, while the interfering source spanned the entire band of 72–79 Hz. The results are shown in Figure 17 and Figure 18, where green triangles represent the trajectory of the towed source, and red circles indicate the trajectory of the interfering source.
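The stated sampling rate and frequency resolution imply the length of the analysis window, and they also suggest one frequency grid that is consistent with the 10 frequency points used in this section. The short calculation below is only a sketch: the bin placement (integer multiples of 0.8 Hz starting at 72.0 Hz) is an assumption and is not confirmed by the text.

```python
sampling_rate_hz = 3276.8
resolution_hz = 0.8

# Samples per analysis window implied by the two values above.
window_length = round(sampling_rate_hz / resolution_hz)

# Assumption: bins sit at integer multiples of the resolution, from 72.0 Hz.
first_bin = round(72.0 / resolution_hz)
num_points = 10
band_hz = [round(k * resolution_hz, 1)
           for k in range(first_bin, first_bin + num_points)]

# Bin closest to the 79 Hz tone of the towed source on this assumed grid.
tone_bin_hz = min(band_hz, key=lambda f: abs(f - 79.0))
```

Under this assumption, the 79 Hz tone does not fall exactly on a bin centre, and the towed source is represented by a single point at the upper edge of the band.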
Figure 17 presents the DOA estimation results of V1, V2, V3, and CBF without frequency masking, using a total of 10 frequency points. Among the models, V2 achieves the best performance, with the towed source’s trajectory appearing the clearest and most continuous in Figure 17b. V1 follows as the second-best model. By contrast, V3 demonstrates the poorest performance, as shown in Figure 17c, where the trajectories of both the towed and interfering sources are the least distinct.
Although CBF provides the trajectory of the sound sources, it exhibits significant sidelobes and strong spurious peaks (mirror peaks) at angles symmetrical about the endfire direction.
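The mirror peaks follow from the geometry of a line array: the inter-element phase depends only on the cosine of the bearing measured from the array axis, so two bearings mirrored about the endfire direction produce the same steering vector. The sketch below illustrates this for an idealized straight, uniformly spaced array. The element count, the spacing, and the sound speed are hypothetical values and do not describe the slightly bow-shaped HLA North array.

```python
import cmath
import math


def steering_vector(bearing_deg, num_elements=8, spacing_m=9.0,
                    freq_hz=79.0, sound_speed=1500.0):
    """Plane-wave phases on a straight, uniformly spaced line array.

    bearing_deg is measured from the array axis (endfire = 0 degrees).
    All default parameter values are hypothetical.
    """
    wavenumber = 2.0 * math.pi * freq_hz / sound_speed
    projection = math.cos(math.radians(bearing_deg))
    return [cmath.exp(-1j * wavenumber * spacing_m * k * projection)
            for k in range(num_elements)]


def cbf_power(look_deg, source_deg):
    """Normalized CBF output for a single noise-free plane wave."""
    weights = steering_vector(look_deg)
    signal = steering_vector(source_deg)
    output = sum(w.conjugate() * x for w, x in zip(weights, signal))
    return abs(output / len(weights)) ** 2


true_side = cbf_power(60.0, 60.0)     # looking at the actual source bearing
mirror_side = cbf_power(-60.0, 60.0)  # looking at the mirrored bearing
```

Both look directions return the same full-power response, so a beamformer on such an array cannot tell the true bearing from its mirror image without additional information such as array curvature.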
Figure 18 presents the DOA prediction results of V1, V2, and V3 under the conditions where 6 frequency points are masked, alongside the results of MUSIC without frequency masking. From Figure 18a,c it can be observed that, compared to the unmasked condition, V1 demonstrates a more continuous and clearer trajectory for the target source than V3. However, the results of V1 contain a higher number of spurious points. By contrast, Figure 18b shows that V2 produces the fewest spurious points among the three networks, making it the best-performing model overall.
In Figure 18d, MUSIC, which performs DOA estimation using 10 snapshots, provides a very clear trajectory for the interfering source. However, it struggles to balance the relationship between the target source and the interfering source. Additionally, MUSIC fails to localize the towed source effectively when it is in the endfire direction of the array.
These results are consistent with the simulation findings, further indicating that the frequency-coherent V3 network performs poorly in localizing the single-frequency towed source (79 Hz) under broadband interference. The performance of V3 improves only when certain characteristic frequencies of the interfering source are replaced with white noise. By contrast, V2 demonstrates robust detection capability for single-frequency target sources within the frequency band. Regardless of changes in the frequency of the interfering source, V2 effectively balances the detection of single-frequency target sources with other interfering signals.