Technical Note

A Novel Multi-Feature Fusion Model Based on Pre-Trained Wav2vec 2.0 for Underwater Acoustic Target Recognition

Zijun Pu, Qunfei Zhang, Yangtao Xue, Peican Zhu and Xiaodong Cui
1 School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
2 School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(13), 2442; https://doi.org/10.3390/rs16132442
Submission received: 29 May 2024 / Revised: 28 June 2024 / Accepted: 29 June 2024 / Published: 3 July 2024

Abstract

Although recent data-driven Underwater Acoustic Target Recognition (UATR) methods have come to dominate marine acoustics, they are hampered by complex ocean environments and rather small datasets. To tackle these challenges, researchers have turned to transfer learning to fulfill UATR tasks. However, existing pre-trained models are trained on audio speech data and are not directly suitable for underwater acoustic data, so further adaptation is required before they can be applied to the UATR task. Here, we propose a novel UATR framework called Attention Layer Supplement Integration (ALSI), which integrates a large pre-trained neural network with customized attention modules for underwater acoustics. Specifically, the ALSI model consists of two important modules, namely Scale ResNet and Residual Hybrid Attention Fusion (RHAF). First, the Scale ResNet module takes the Constant-Q Transform (CQT) feature as input and extracts the relatively important frequency information. Next, RHAF takes the temporal feature extracted by wav2vec 2.0 and the frequency feature extracted by Scale ResNet as input and uses the attention mechanism to integrate the time–frequency features with the temporal feature more thoroughly. The RHAF module helps wav2vec 2.0, which is trained on speech data, adapt to underwater acoustic data. Finally, experiments on the ShipsEar dataset demonstrate that our model achieves a recognition accuracy of 96.37%. In conclusion, extensive experiments confirm the effectiveness of our model on the UATR task.

1. Introduction

The Underwater Acoustic Target Recognition (UATR) task uses acoustic signals to identify underwater targets and plays an important role in ocean information sensing. It has two branches: active sonar image recognition [1,2] and passive ship-radiated noise recognition [3,4]. Passive ship-radiated noise recognition uses passive sonar systems to determine the direction and type of underwater targets and offers good covertness and mobility. However, it still faces many challenges, such as low Signal-to-Noise Ratio (SNR) and the adversarial behavior of targets. Therefore, many studies focus on ship-radiated noise recognition [5,6,7,8].
Ship-radiated noise recognition commonly uses time–frequency analysis techniques for feature extraction. These techniques include the Short-term Fourier Transform (STFT) [9], the Mel-spectrogram [10,11,12,13,14,15], Mel Frequency Cepstral Coefficients (MFCC) [16], and Gammatone Frequency Cepstral Coefficients (GFCC) [17]. However, because the time–frequency features currently in use lack sufficient time resolution, some researchers are exploring alternative features for UATR tasks. These works can be roughly divided into two categories: (1) Finding alternative acoustic features. The idea of this type of method is to explore more efficient time–frequency analysis methods, such as Boltzmann machines [18,19] and wavelet transforms [20,21]. (2) Feature fusion methods. The idea of this type of method is to improve the feature representation of the data through multi-feature fusion [5,22,23,24]. In particular, Alouani et al. proposed a method that integrates temporal and frequency information [25], which addresses the issue of insufficient temporal resolution in time–frequency features. Nonetheless, the feature fusion used in these works is simple concatenation, which easily confuses the information of different features.
Although the methods mentioned above achieve strong performance on ship-radiated noise recognition, they are still plagued by the following issues: (1) Insufficient time resolution. Existing methods still rely on time–frequency features, resulting in a loss of time-domain information. Using the temporal features of the raw wave as a supplement to time–frequency features can therefore significantly improve a model’s performance, as demonstrated in [25]. (2) Shallow feature fusion. Existing feature fusion methods still use simple fusion strategies such as concatenation, which leads to the confusion of information. In contrast, the attention mechanism [26,27] is an efficient feature fusion strategy that achieves deep integration of features by capturing the correlation between them, and it has proven effective in multimodal tasks [28,29,30,31].
Therefore, we propose an Attention Layer Supplement Integration (ALSI) model for UATR tasks. This architecture overcomes the issues mentioned above through fine-tuning of the pre-trained wav2vec 2.0 model [32] and multi-feature fusion. Since wav2vec 2.0 is trained on a large amount of speech data, it has a strong feature extraction ability for sound signals. On top of wav2vec 2.0, we further design the downstream UATR model so that it can adapt to underwater acoustic data. Specifically, the temporal embedding encoded by the pre-trained wav2vec 2.0 model is fused with time–frequency features at different granularities. The fused embeddings are then fed into a classifier to complete the UATR task.
Briefly, our contributions can be summarized as follows:
(1)
We propose an ALSI model, which can integrate high-resolution temporal information and frequency information to complete UATR tasks efficiently.
(2)
To enhance the information fusion efficiency of ALSI and the adaptability of wav2vec 2.0 on underwater acoustic data, we propose the Scale ResNet module to compress time-domain information and the Residual Hybrid Attention Fusion (RHAF) module to integrate different feature embeddings.
(3)
We conduct extensive experiments and meticulous analysis on a widely used public dataset. The combination of features and the model design produced strong performance, surpassing some existing research.
The rest of the article is arranged as follows: Section 2 summarizes the relevant work, Section 3 provides a detailed explanation of the proposed model, Section 4 presents the experimental design and analyzes the experimental results, Section 5 discusses the results and Section 6 draws conclusions.

2. Related Works

Ship-radiated noise recognition is one of the most popular and long-standing topics in the field of UATR. Feature extraction is a necessary step in ship-radiated noise recognition. Thus, many works focus on feature selection and fusion. Below we introduce their details.
Time–frequency feature extraction is still a necessary step in most existing works. Since STFT is one of the basic time–frequency analysis methods, Wang et al. proposed the AMNet model, which completes the UATR task using an STFT spectrogram [33]. However, STFT is ineffective at representing the low-frequency information of ship-radiated noise. To better characterize this low-frequency information, the CFTANet model was proposed, which enhances low-frequency features by concatenating Mel-spectrogram sub-bands [34]. Like the Mel-spectrogram, GFCC is also a time–frequency analysis method that can characterize the low-frequency information of ship-radiated noise [35]. Accordingly, Feng et al. proposed the WA-DS fusion model, which integrates GFCC to enrich the feature representation of ship-radiated noise [36]. In short, many works choose the Mel-spectrogram and GFCC as time–frequency features for UATR tasks [15,17,37,38], as they represent the low-frequency characteristics of ship-radiated noise more efficiently than STFT.
Although these feature extraction methods can effectively improve recognition performance, a single time–frequency spectrogram cannot provide sufficient information. Researchers therefore began to use feature fusion methods to enrich the feature representation; these methods can be roughly divided into two categories: (1) Fusion of time–frequency features. Most feature fusion algorithms use Fourier Transform-based features (STFT, Mel-spectrogram, MFCC, and GFCC are all calculated on the basis of the Fourier Transform) [4,5,39,40]. For example, Log-Mel, MFCC, and an optimized feature based on Center loss were integrated in [41]. Zhu et al. proposed integrating the Mel-spectrogram with the Constant-Q Transform (CQT) [42] and achieved strong performance [4]; this kind of feature fusion also appears in [43]. CQT is a wavelet-based time–frequency feature that represents the low-frequency information of ship-radiated noise more effectively than Fourier Transform-based features. However, due to its insufficient high-frequency resolution, CQT is rarely used alone in UATR tasks and needs to be combined with Fourier Transform-based features. (2) Fusion of temporal features and time–frequency features. The idea of this type of method is to use temporal features extracted from the raw wave to supplement the temporal information of time–frequency features. A representative work is [25], mentioned above.
Nevertheless, these methods still suffer from insufficient temporal information. We therefore propose a multi-feature fusion method that integrates the temporal feature, the CQT feature, and the Mel-spectrogram. Firstly, integrating the temporal feature with the CQT feature addresses the insufficient temporal information. Secondly, integrating the Mel-spectrogram with the CQT feature addresses the CQT feature’s insufficient high-frequency resolution mentioned above.

3. Method

We propose the ALSI model to enhance the recognition accuracy of ship-radiated noise, as shown in Figure 1. First, ALSI takes the raw wave as input and converts it into a temporal feature and time–frequency features. These features are then fused at different granularities through two integration layers consisting of RHAF blocks. Finally, the fused embedding is sent to the classifier to complete the UATR task.

3.1. Time–Frequency Feature Extraction

ALSI takes the original waveform and two time–frequency features as input. The time–frequency features include the CQT feature and the Mel-spectrogram.
We randomly selected a 5 s sound clip from each of the 12 sound classes in the public dataset ShipsEar [44] for analysis. To make the difference between the two features clear, the data are normalized. As shown in Figure 2a, the CQT feature has a high resolution of low-frequency information while retaining the textural properties. It can also suppress impulse noise, resulting in a cleaner spectrogram. As shown in Figure 2b, the textural detail of the Mel-spectrogram is precise, but the high-frequency pulse component still exists. Thus, the CQT feature plays the more important role in feature fusion, and the Mel-spectrogram serves as a supplementary feature. The CQT feature $X_{CQT}$ is calculated as
$$X_{CQT}(k, n) = \sum_{j = n - N_k/2}^{\,n + N_k/2} x(j)\, a_k^{*}\!\left(j - n + \frac{N_k}{2}\right), \quad k = 1, 2, 3, \ldots, K,$$
where $k$ is the index of the CQT filter, $K$ is the number of CQT filter banks, $N_k$ denotes the window length, and $a_k^{*}(n)$ is the complex conjugate of $a_k(n)$, which is calculated as
$$a_k(n) = \frac{1}{N_k}\, \omega\!\left(\frac{n}{N_k}\right) \exp\!\left(i\, 2\pi n \frac{f_k}{f_s}\right),$$
where $f_k$ is the center frequency of the $k$-th CQT filter, $f_s$ is the sample rate, and $\omega(\cdot)$ denotes the window function.
Another time–frequency feature is the Mel-spectrogram, which can be calculated as follows:
$$M_i = \sum_{k=0}^{N-1} |X(k)|^2 F_i(k), \quad 0 \le i \le B,$$
where $X(k)$ is the Fast Fourier Transform (FFT) of the input radiated noise, $B$ is the number of Mel filters, and $F_i(\cdot)$ denotes the Mel filter banks.
After feature extraction, the CQT and Mel features are organized into tensors and fed into the ALSI model. The CQT tensor is $E_c \in \mathbb{R}^{b \times t \times f_c}$ and the Mel-spectrogram tensor is $E_m \in \mathbb{R}^{b \times t \times f_m}$, where $b$ is the batch size, $t$ is the frame length, and $f_c$ and $f_m$ denote the number of frequency bins of the CQT feature and the Mel-spectrogram, respectively.
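As a concrete illustration, the sketch below shows how these two time–frequency features could be computed with librosa; the hop length, number of CQT bins, number of Mel filters, and the (time, frequency) transposition are assumptions chosen only to match the tensor shapes above, not the paper’s exact settings.

```python
# Minimal sketch of CQT and Mel-spectrogram extraction (assumed parameters).
import librosa
import numpy as np
import torch

def extract_features(wav_path, sr=22050, hop=512, n_cqt_bins=84, n_mels=128):
    x, _ = librosa.load(wav_path, sr=sr)                      # raw wave, also fed to wav2vec 2.0
    cqt = np.abs(librosa.cqt(x, sr=sr, hop_length=hop, n_bins=n_cqt_bins))
    mel = librosa.feature.melspectrogram(y=x, sr=sr, hop_length=hop, n_mels=n_mels)
    mel = librosa.power_to_db(mel)                            # log-Mel in dB
    # librosa returns (frequency, time); transpose to (time, frequency) so a batch
    # stacks into E_c in R^{b x t x f_c} and E_m in R^{b x t x f_m}
    return (torch.from_numpy(x).float(),
            torch.from_numpy(cqt.T).float(),
            torch.from_numpy(mel.T).float())
```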

3.2. Scale ResNet Block

To obtain frequency-domain information, we propose the Scale ResNet block, which compresses the time-domain information in the CQT feature, as shown in Figure 3.
The Scale ResNet block takes the CQT feature as input, which is first processed by a Convolutional Neural Network (CNN) layer. The feature maps from this CNN layer are sent to the Residual Block for further compression of time-domain information. We design three such Residual Blocks to progressively compress the time-domain information in the CQT feature. After that, the CQT feature is embedded from $E_c \in \mathbb{R}^{b \times t \times f}$ into the frequency embedding $E_f \in \mathbb{R}^{b \times f}$ through a CNN layer and an adaptive average pooling layer. This frequency embedding describes the frequency components of the ship-radiated noise. Finally, the frequency embedding compressed by the Scale ResNet block is involved in the next step of feature fusion.
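The following is a minimal PyTorch sketch of this idea; the channel width, kernel sizes, time-axis strides, and pooling size are illustrative assumptions rather than the paper’s exact configuration.

```python
# Sketch of a Scale ResNet-style block: stride along the time axis compresses
# time-domain information, and adaptive pooling keeps a frequency embedding.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch, time_stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, stride=(time_stride, 1), padding=1)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(ch)
        self.down = nn.Conv2d(ch, ch, 1, stride=(time_stride, 1))  # match shapes for the skip path

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.down(x))          # residual connection

class ScaleResNet(nn.Module):
    def __init__(self, f_bins=84, ch=32):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(3)])  # three Residual Blocks
        self.head = nn.Conv2d(ch, 1, 1)
        self.pool = nn.AdaptiveAvgPool2d((1, f_bins))   # collapse the remaining time frames

    def forward(self, cqt):                             # cqt: (b, t, f_c)
        x = cqt.unsqueeze(1)                            # (b, 1, t, f_c)
        x = self.blocks(torch.relu(self.stem(x)))
        x = self.pool(self.head(x))                     # (b, 1, 1, f_c)
        return x.flatten(1)                             # frequency embedding E_f in R^{b x f_c}

# Example: ScaleResNet(f_bins=84)(torch.randn(4, 200, 84)) -> tensor of shape (4, 84)
```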

3.3. Residual Hybrid Attention Fusion Block

The RHAF module receives two different embeddings encoded by preceding network layers as input. It combines Multi-head Self Attention (MHSA) and Multi-head Cross Attention (MHCA) to better capture the correlation between sequences, as shown in Figure 4.
We denote the first input embedding of the RHAF module as the major embedding $E_{mj} \in \mathbb{R}^{b \times l}$ and the second as the minor embedding $E_{mi} \in \mathbb{R}^{b \times f}$, where $l$ denotes the embedding length. First, $E_{mj}$ and $E_{mi}$ are transformed into $E_{mj} \in \mathbb{R}^{b \times l \times d}$ and $E_{mi} \in \mathbb{R}^{b \times l \times d}$ by a group of linear layers. Then $E_{mj}$ and $E_{mi}$ are encoded into the embeddings $Q: E_{mj}^{q} \in \mathbb{R}^{b \times l \times d_q}$, $K: E_{mj}^{k} \in \mathbb{R}^{b \times l \times d_k}$, and $V: E_{mj}^{v}, E_{mi}^{v} \in \mathbb{R}^{b \times l \times d_v}$ by another group of linear layers, where $d_q = d_k = d_v$ is the dimension of the embeddings $Q$, $K$, and $V$. For the MHSA part, $Q$, $K$, and $V$ are encoded from the same embedding, so MHSA captures the correlations between different values within the major embedding. The attention map of MHSA is calculated as follows:
$$\mathrm{Energy} = \frac{E_{mj}^{q} \left(E_{mj}^{k}\right)^{T}}{\sqrt{d_q}},$$
$$A_s = \mathrm{Softmax}(\mathrm{Energy}) \cdot E_{mj}^{v}.$$
For the MHCA part, the attention map is generated from $E_{mj}^{q}$ and $E_{mj}^{k}$ and then multiplied by $E_{mi}^{v}$. Therefore, MHCA can extract the correlation information between different embeddings. The attention map of MHCA is calculated as follows:
$$A_c = \mathrm{Softmax}(\mathrm{Energy}) \cdot E_{mi}^{v}.$$
To merge both MHSA and MHCA, we introduce a hyper-parameter μ to perform a weighted sum of the two attention results:
$$A = A_s + \mu A_c.$$
To let the module learn features more effectively, we introduce a residual connection during the attention process, which can be calculated as
$$\mathrm{Attention}_{\mathrm{output}} = E_{mj} + A.$$
The last step is to concatenate the outputs of two embeddings and use a Multi-layer Perceptron (MLP) to reshape the output to be the same size as the input.
For an RHAF block, $Q$ and $K$ come from the same embedding, while $V$ comes from both input embeddings. As a result, the RHAF block can capture correlations not only within the same embedding but also between different sequences. This design draws inspiration from how the human brain integrates information: it often first identifies important features within a single piece of information, then compares the relevance of these features across multiple pieces of information, and finally integrates the crucial features. We therefore design a form of mixed attention using MHSA and MHCA to imitate this process, with the hyper-parameter $\mu$ controlling the weight of the integrated features.
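A minimal single-head sketch of this block is given below (the paper uses multi-head attention); the sequence length, projection dimension, and the way the flat embeddings are lifted to shape (b, l, d) are assumptions made only for illustration.

```python
# Single-head sketch of the RHAF idea: shared attention map, values from both
# the major and minor embeddings, weighted sum A = A_s + mu * A_c, and a residual.
import torch
import torch.nn as nn

class RHAF(nn.Module):
    def __init__(self, major_dim, minor_dim, seq_len=8, d=64, mu=2.0):
        super().__init__()
        self.mu, self.seq_len, self.d = mu, seq_len, d
        # lift the flat embeddings E_mj (b, major_dim) and E_mi (b, minor_dim) to (b, seq_len, d)
        self.proj_major = nn.Linear(major_dim, seq_len * d)
        self.proj_minor = nn.Linear(minor_dim, seq_len * d)
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v_major = nn.Linear(d, d)
        self.v_minor = nn.Linear(d, d)
        self.mlp = nn.Linear(seq_len * d, major_dim)   # reshape back to the major input size

    def forward(self, e_mj, e_mi):
        mj = self.proj_major(e_mj).view(-1, self.seq_len, self.d)
        mi = self.proj_minor(e_mi).view(-1, self.seq_len, self.d)
        q, k = self.q(mj), self.k(mj)                           # Q and K come from the major embedding only
        energy = torch.softmax(q @ k.transpose(1, 2) / self.d ** 0.5, dim=-1)
        a_s = energy @ self.v_major(mj)                         # self-attention term within the major embedding
        a_c = energy @ self.v_minor(mi)                         # cross-attention term onto the minor embedding
        out = mj + (a_s + self.mu * a_c)                        # weighted sum plus residual connection
        return self.mlp(out.flatten(1))                         # output has the same size as the major input

# Example: RHAF(major_dim=768, minor_dim=84)(torch.randn(4, 768), torch.randn(4, 84)) -> (4, 768)
```

In the Macro fusion branch described next, the temporal embedding from wav2vec 2.0 and the frequency embedding from Scale ResNet would presumably take the roles of the two inputs in the paired RHAF blocks.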

3.4. Multi-Stage Supplement Integration

To improve underwater radiated noise recognition accuracy, we propose the ALSI model based on the multi-feature fusion method, as shown in Figure 1. ALSI takes the raw wave as input and processes it into one temporal feature and two time–frequency features: the CQT feature and the Mel-spectrogram. The temporal feature is encoded by wav2vec 2.0, and the time–frequency features are processed by Scale ResNet and ResNet18 for further fusion. The ALSI model has three fusion branches, the details of which are as follows:
  • Macro fusion. In this fusion branch, we aim to integrate frequency and temporal embedding. Since wav2vec 2.0 already provides high-resolution temporal features, the time-domain information in the CQT features is unimportant in this fusion branch. Therefore, we compress the time-domain information in the CQT feature using Scale ResNet. Scale ResNet encodes the CQT feature into frequency embedding, which will then be integrated with temporal embedding. There are two RHAF blocks in the Macro fusion branch. These two blocks take frequency embedding and temporal embedding as input. Then, we combine these two RHAF blocks’ outputs into one embedding and use an MLP to encode it into another embedding for the next stage of fusion.
  • Fine-grained fusion. In this branch of integration, detailed textural features in the CQT spectrogram participate in the fusion process. The Fine-grained fusion branch uses an intact CQT spectrogram to extract textural feature embedding using ResNet18. And then, the embedding from the previous fusion branch and the textural feature embedding together form the input for the second fusion branch. Similar to the Macro fusion, the Fine-grained fusion also consists of two RHAF blocks. These two blocks take the textural and previous fused embedding as input. Then, we combine these two RHAF blocks’ outputs into one embedding and also use an MLP to encode it into another embedding for the final fusion branch.
  • Comprehensive integration. In this branch, ResNet18 is first used to extract textural features from the Mel-spectrogram and encode them as embeddings. The Mel-spectrogram can provide more detailed textural features than the CQT feature. The textural embeddings are then concatenated with the output of the embedding from the previous branch and integrated using an MLP for information fusion. It is important to note that RHAF is not used in this branch because there is a semantic gap between the Mel-spectrogram and the CQT spectrogram. Since a large amount of CQT information has already been integrated into the existing embeddings, attending to the Mel-spectrogram at this point would lead to semantic confusion and potentially degrade the model’s performance.
Finally, after the three branches of integration, we obtain an embedding that combines high-resolution temporal information and frequency information. The last step of ALSI is to use a classifier to complete the UATR task. A structural sketch of this pipeline is given below.
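The sketch below only mirrors the data flow of the three branches; every encoder is a placeholder (LazyLinear) standing in for wav2vec 2.0, Scale ResNet, ResNet18, and the RHAF-plus-MLP fusion blocks, and all dimensions are assumptions chosen to make the sketch run.

```python
# Structural sketch of the three-branch ALSI data flow (placeholder encoders only).
import torch
import torch.nn as nn

class ALSISkeleton(nn.Module):
    def __init__(self, n_classes=7, emb=256):
        super().__init__()
        self.temporal_enc = nn.LazyLinear(emb)        # stand-in for wav2vec 2.0
        self.freq_enc = nn.LazyLinear(emb)            # stand-in for Scale ResNet on CQT
        self.cqt_texture_enc = nn.LazyLinear(emb)     # stand-in for ResNet18 on the CQT spectrogram
        self.mel_texture_enc = nn.LazyLinear(emb)     # stand-in for ResNet18 on the Mel-spectrogram
        self.macro_fuse = nn.Linear(2 * emb, emb)     # stand-in for the two RHAF blocks + MLP
        self.fine_fuse = nn.Linear(2 * emb, emb)      # stand-in for the two RHAF blocks + MLP
        self.comprehensive = nn.Linear(2 * emb, emb)  # plain MLP; no RHAF at this stage
        self.classifier = nn.Linear(emb, n_classes)

    def forward(self, wave, cqt, mel):
        t = self.temporal_enc(wave)                              # temporal embedding
        f = self.freq_enc(cqt.flatten(1))                        # frequency embedding (time compressed)
        z = self.macro_fuse(torch.cat([t, f], dim=-1))           # 1) Macro fusion
        tex = self.cqt_texture_enc(cqt.flatten(1))               # CQT textural embedding
        z = self.fine_fuse(torch.cat([z, tex], dim=-1))          # 2) Fine-grained fusion
        mel_tex = self.mel_texture_enc(mel.flatten(1))           # Mel textural embedding
        z = self.comprehensive(torch.cat([z, mel_tex], dim=-1))  # 3) Comprehensive integration
        return self.classifier(z)                                # logits for the UATR classes

# Example: ALSISkeleton()(torch.randn(2, 44100), torch.randn(2, 100, 84), torch.randn(2, 100, 128))
```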

3.5. Model Learning

We denote the output of ALSI as $z_i$, $i = 1, \ldots, N$, where $N$ is the number of categories. The output is then transformed into probabilities for each class using softmax:
$$\mathrm{Pred} = \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{N} e^{z_c}}.$$
Since this is a classification task, cross-entropy is selected as the loss function:
$$H(P, Q) = -\sum_{i} P(i) \cdot \log Q(i),$$
where $P$ represents the one-hot label and $Q$ denotes the prediction. The loss is used to update the gradients of each layer of the network through backpropagation and thereby optimize the weights of every node. This completes one step of the learning process.
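A minimal training-step sketch of this loop is shown below; the Adam optimizer and learning rate are assumptions (the paper does not specify them here), and nn.CrossEntropyLoss already applies log-softmax internally, so the explicit softmax is only used for reporting probabilities.

```python
# One optimization step: forward pass, cross-entropy loss, backpropagation, weight update.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()          # cross-entropy H(P, Q) with a built-in log-softmax

def train_step(model, optimizer, wave, cqt, mel, labels):
    model.train()
    optimizer.zero_grad()
    logits = model(wave, cqt, mel)         # z_i, i = 1, ..., N
    loss = criterion(logits, labels)       # labels are integer class indices, not one-hot vectors
    loss.backward()                        # backpropagate gradients through every layer
    optimizer.step()                       # update the weights of every node
    probs = torch.softmax(logits, dim=-1)  # Pred = softmax(z_i), used only for reporting
    return loss.item(), probs

# e.g. model = ALSISkeleton(n_classes=7); optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```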

4. Experiments

In this section, we first introduce the experiment setup and dataset. Then, the experimental results between our model and multiple baseline models are analyzed. Furthermore, we discuss the effectiveness and superiority of each module of our model by ablation experiments.

4.1. Experiments Setup

We designed the experiments to demonstrate the model’s performance from the following aspects: (1) We compare the model’s performance with existing work; (2) We verify the impact on model performance of fusing different time–frequency features with the temporal feature; (3) We use ablation experiments to validate the effectiveness of the RHAF module; (4) The hyper-parameter $\mu$ in RHAF controls the weight of the integration, and we investigate how the model’s performance is affected by $\mu$; (5) We analyze the performance of models with different structures.
The dataset used in the above experiments is ShipsEar. Specifically, we do not use the conventional five-class partition of the dataset; instead, we treat each type of ship-radiated noise as a separate category. To ensure data balance, we select the data for training and testing from the following categories: Fishboat (Class 0), Motorboat (Class 1), Mussel boat (Class 2), Natural ambient noise (Class 3), Ocean liner (Class 4), Passengers (Class 5), and RORO (Class 6). The sample size of each category is shown in Table 1. The experiments are conducted on a server with two Nvidia RTX 3090 GPUs, each with 22 GB of VRAM (Nvidia, Santa Clara, CA, USA). The operating system is Ubuntu 18.04, with Python 3.9 and PyTorch 2.0.0.
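For reference, the sketch below shows one way the ShipsEar recordings could be cut into the 2 s, 22,050 Hz samples listed in Table 1; the file layout, non-overlapping segmentation, split ratio, and label mapping are assumptions, not the authors’ exact preprocessing.

```python
# Sketch of dataset preparation: segment recordings into 2 s clips and split stratified by class.
import librosa
from sklearn.model_selection import train_test_split

SR, CLIP_SEC = 22050, 2
CLASSES = ["Fishboat", "Motorboat", "Mussel boat", "Natural ambient noise",
           "Ocean liner", "Passengers", "RORO"]

def segment(wav_path, label):
    y, _ = librosa.load(wav_path, sr=SR)
    n = SR * CLIP_SEC
    clips = [y[i:i + n] for i in range(0, len(y) - n + 1, n)]  # non-overlapping 2 s clips
    return [(c, label) for c in clips]

# file_list below is a hypothetical list of (path, class_index) pairs:
# samples = [pair for path, lbl in file_list for pair in segment(path, lbl)]
# train, val = train_test_split(samples, test_size=0.2, stratify=[l for _, l in samples])
```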

4.2. Results and Analysis

4.2.1. Model Performance

To evaluate the model, Table 2 compares the performance of the proposed ALSI model with existing work. Based on the results, our model performs better than most existing work. Notably, the LSTM-based model [9] produces many false positives, which inflates its recall while lowering its precision and accuracy. The confusion matrix in Figure 5a shows that some data in the Motorboat and Passengers classes are confused with other categories, while very few misclassifications occur in the remaining categories. The accuracy curves in Figure 5b show that the model converges quickly thanks to the rich input features. Figure 6 shows the classification precision, recall, and F1-score for each type of ship-radiated noise. The accuracy for each category is quite similar, and the model classifies most of the data effectively even with a relatively small amount of data.

4.2.2. Performance with Different Input Features

We first conducted ablation experiments to validate the effectiveness of the features we selected. We test four conditions for the UATR task: (1) Only the temporal information provided by wav2vec 2.0; (2) Fusion of the temporal information from wav2vec 2.0 and CQT features; (3) Fusion of the temporal information from wav2vec 2.0 and Mel-spectrogram feature; (4) Fusion of the temporal information from wav2vec 2.0, CQT features, and Mel-spectrogram features.
The test results in Table 3 show that the best performance is achieved when all three features are input simultaneously, owing to the richer feature representation. The fusion of temporal information with the CQT feature performs somewhat worse, while the fusion of temporal information with the Mel-spectrogram and the use of temporal information alone perform unsatisfactorily.
Secondly, as discussed earlier, CQT contains less noise than the Mel-spectrogram, which makes it more suitable for providing frequency information in this task. Accordingly, Table 3 shows that models incorporating CQT in the fusion perform better than models incorporating the Mel-spectrogram, and that fusing the CQT feature with the temporal feature outperforms fusing the Mel-spectrogram with the temporal feature.
Additionally, t-SNE plots are drawn from the outputs of the proposed model with different input features, as shown in Figure 7. Comparing them shows that adding more features enriches the feature representation. Comparing Figure 7a with the other panels, we can see that increasing the feature input greatly enhances the distinctiveness of the embedding. Comparing Figure 7b with Figure 7c shows that the embedding with CQT added is more distinctive than the embedding obtained by adding the Mel-spectrogram. Finally, comparing Figure 7b–d shows that the embedding with multiple fused features is the most distinctive.
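The visualization could be reproduced along the lines of the sketch below; the perplexity, initialization, and plotting details are assumptions, and the embeddings are assumed to be taken from the layer before the classifier.

```python
# Sketch of a t-SNE projection of the fused embeddings, colored by class label.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(embeddings, labels, title):
    # embeddings: (n_samples, emb_dim) array collected from the layer before the classifier
    pts = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    plt.scatter(pts[:, 0], pts[:, 1], c=labels, cmap="tab10", s=4)
    plt.title(title)
    plt.show()
```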
Moreover, over-fitting also affects model performance, so we verify through ablation experiments that integrating more features with the temporal feature can restrain over-fitting. The last column of Table 3 shows that over-fitting is most severe when only wav2vec 2.0 is used, with a gap between training and validation accuracy of up to 14.38%. This gap is effectively reduced after adding time–frequency features, and combining all three types of features restrains over-fitting best. Since CQT expresses frequency information more clearly, its training–validation gap is smaller than that obtained by combining the Mel-spectrogram. It can therefore be concluded that increasing the feature inputs reduces over-fitting while improving the model’s recognition performance.

4.2.3. Model Performance with Different Fusion Methods

To investigate the impact of feature fusion techniques on the model, three ablation experiments were performed in three different scenarios: (1) retaining the RHAF of the first stage and replacing the RHAF of the second stage with feature concatenation; (2) retaining the RHAF of the second stage and replacing the RHAF of the first stage with feature concatenation; (3) retaining RHAF in both stages.
Comparing the first two rows of Table 4, it can be seen that fusion in the first stage is the primary source of performance; using RHAF fusion in the second stage alone cannot provide sufficient performance and can only add a finishing touch on top of the first-stage fusion. Finally, merging the three features using RHAF in both stages yields the best performance.

4.2.4. Performance with Different Hyper-Parameters

In the previous section, we introduced a hyper-parameter μ in RHAF, which played a significant role in the feature fusion process. To investigate the selection of this hyper-parameter μ , we designed ablation experiments specifically for it.
Since the model contains two stages of RHAF, we set $\mu$ to 1, 2, and 3 in each stage to verify the change in performance, as shown in Table 5. The experimental results demonstrate that performance is best when $\mu$ is set to 2 in both stages. When $\mu$ is set to 1 in the first stage, the cross-attention term that is added to the fused features receives insufficient weight during fusion; when $\mu$ is set to 3, the cross-attention term dominates and weakens the contribution of the original features, leading to a decrease in performance. Therefore, setting $\mu$ to 2 in the first stage is the best choice, and the selection of $\mu$ in the second stage can be explained similarly.

4.2.5. Performance with Different Model Structures

The structure designed in this paper integrates features through a serial (cascaded) connection. To investigate the impact of the model structure on performance, a parallel fusion model was also designed. Because the parallel model adds too much ambiguous attribute information to the temporal features at once, the features become confused in representing the data and the model does not converge, as seen in Table 6. This indicates that information should be added to a feature gradually rather than all at once. Since the parallel model did not converge, its structure is not elaborated here.

5. Discussion

The ablation experiments above show that which features are fused, how they are fused, and which hyper-parameters are chosen in the fusion model all affect the model’s performance. First, choosing low-noise, high-resolution features is essential, and CQT is one such feature; integrating frequency information from the CQT feature with the temporal embedding extracted by wav2vec 2.0 can further improve model performance. Meanwhile, textural features in the time–frequency spectrogram are also important, so integrating the embeddings extracted from the CQT feature and the Mel-spectrogram by ResNet18 supplements detailed feature information. Secondly, integrating embeddings with different semantics by directly adding them hurts model performance; we therefore designed the RHAF block, and the ablation experiments show that it is efficient because it better focuses on the correlations between time–frequency and temporal features. Moreover, different features have different importance during fusion, so choosing an appropriate set of hyper-parameters can balance the relationship between the attribute features. Overall, the experiments verify that our proposed ALSI model performs well on the UATR task.

6. Conclusions

This article proposes the ALSI model based on a pre-trained model. The model integrates the temporal information provided by wav2vec 2.0 and the time–frequency information provided by the CQT feature and the Mel-spectrogram. We first design the Scale ResNet module to compress temporal information, and we propose the RHAF feature fusion module to fuse multiple features and improve the adaptability of wav2vec 2.0 to underwater acoustic data. The experimental results show that the proposed model performs well on the ShipsEar dataset, achieving a recognition accuracy of 96.37% and surpassing existing work. The efficiency of the proposed modules is validated through ablation experiments, and the optimal hyper-parameters and model structure are explored. Future work will improve the fusion module to further enhance the model’s ability to represent features.

Author Contributions

Conceptualization, X.C., Q.Z. and Z.P.; methodology, Z.P. and Y.X.; software, Z.P. and Y.X.; validation, Z.P.; formal analysis, Z.P.; investigation, Z.P.; resources, Q.Z.; data curation, Z.P. and Y.X.; writing—original draft preparation, Z.P.; writing—review and editing, X.C. and P.Z.; visualization, Z.P.; supervision, Q.Z. and X.C.; project administration, Q.Z. and X.C.; funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UATR: Underwater Acoustic Target Recognition
STFT: Short-term Fourier Transform
FFT: Fast Fourier Transform
MFCC: Mel Frequency Cepstral Coefficients
GFCC: Gammatone Frequency Cepstral Coefficients
GPT: Generative Pre-Train
CQT: Constant-Q Transform
ALSI: Attention Layer Supplement Integration
RHAF: Residual Hybrid Attention Fusion
MHA: Multi-head Attention
MHSA: Multi-head Self Attention
MHCA: Multi-head Cross Attention
MLP: Multi-Layer Perceptron
CNN: Convolutional Neural Network

References

  1. Lei, H.; Li, D.; Jiang, H. Multi-feature fusion sonar image target detection evaluation based on particle swarm optimization algorithm. J. Intell. Fuzzy Syst. 2023, 46, 739–751. [Google Scholar] [CrossRef]
  2. Yin, Z.; Zhang, S.; Sun, R.; Ding, Y.; Guo, Y. Sonar Image Target Detection Based on Deep Learning. In Proceedings of the 2023 International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE), Ballar, India, 29–30 April 2023; pp. 1–9. [Google Scholar]
  3. Liu, S.; Fu, X.; Xu, H.; Zhang, J.; Zhang, A.; Zhou, Q.; Zhang, H. A Fine-Grained Ship-Radiated Noise Recognition System Using Deep Hybrid Neural Networks with Multi-Scale Features. Remote Sens. 2023, 15, 2068. [Google Scholar] [CrossRef]
  4. Zhu, P.; Zhang, Y.; Huang, Y.; Zhao, C.; Zhao, K.; Zhou, F. Underwater acoustic target recognition based on spectrum component analysis of ship radiated noise. Appl. Acoust. 2023, 211, 109552. [Google Scholar] [CrossRef]
  5. Zhang, W.B.; Lin, B.; Yan, Y.; Zhou, A.; Ye, Y.; Zhu, X. Multi-Features Fusion for Underwater Acoustic Target Recognition based on Convolution Recurrent Neural Networks. In Proceedings of the 2022 8th International Conference on Big Data and Information Analytics (BigDIA), Guiyang, China, 24–25 August 2022; pp. 342–346. [Google Scholar]
  6. Yang, H.; Huang, X.; Liu, Y. InfoGAN-Enhanced Underwater Acoustic Target Recognition Method Based on Deep Learning. In Proceedings of the 2022 International Conference on Autonomous Unmanned Systems (ICAUS 2022), Xi’an, China, 23–25 September 2022; Lecture Notes in Electrical Engineering. Springer: Singapore, 2023; pp. 2705–2714. [Google Scholar]
  7. Liu, D.; Yang, H.; Hou, W.; Wang, B. A Novel Underwater Acoustic Target Recognition Method Based on MFCC and RACNN. Sensors 2024, 24, 273. [Google Scholar] [CrossRef] [PubMed]
  8. Yang, S.; Jin, A.; Zeng, X.; Wang, H.; Hong, X.; Lei, M. Underwater acoustic target recognition based on knowledge distillation under working conditions mismatching. Multimed. Syst. 2024, 30, 12. [Google Scholar] [CrossRef]
  9. Yang, H.; Xu, G.; Yi, S.; Li, Y. A New Cooperative Deep Learning Method for Underwater Acoustic Target Recognition. In Proceedings of the OCEANS 2019, Marseille, France, 17–20 June 2019; pp. 1–4. [Google Scholar]
  10. Feng, L.; Shen, T.; Luo, Z.; Dexin, Z.; Guo, S. Underwater target recognition using convolutional recurrent neural networks with 3-D Mel-spectrogram and data augmentation. Appl. Acoust. 2021, 178, 107989. [Google Scholar]
  11. Cui, X.; He, Z.; Xue, Y.; Tang, K.; Zhu, P.; Han, J. Cross-Domain Contrastive Learning-Based Few-Shot Underwater Acoustic Target Recognition. J. Mar. Sci. Eng. 2024, 12, 264. [Google Scholar] [CrossRef]
  12. Wei, Z.; Ju, Y.; Song, M. A Method of Underwater Acoustic Signal Classification Based on Deep Neural Network. In Proceedings of the 2018 5th International Conference on Information Science and Control Engineering (ICISCE), Zhengzhou, China, 20–22 July 2018; pp. 46–50. [Google Scholar]
  13. Xing, G.; Liu, P.; Zhang, H.; Tang, R.; Yin, Y. A Two-Stream Network for Underwater Acoustic Target Classification. In Proceedings of the 6th International Conference on Robotics and Artificial Intelligence, Singapore, 20–22 November 2020; pp. 248–252. [Google Scholar]
  14. Ma, Y.; Liu, M.; Zhang, Y.; Zhang, B.; Xu, K.; Zou, B.; Huang, Z. Imbalanced Underwater Acoustic Target Recognition with Trigonometric Loss and Attention Mechanism Convolutional Network. Remote Sens. 2022, 14, 4103. [Google Scholar] [CrossRef]
  15. Yi, Z.; Li, P.; Xiong, S.; Qiong, Y.; Ma, Y.; Liu, M. Multiresolution Convolutional Neural Network for Underwater Acoustic Target Recognition. In Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 20–24 October 2021; pp. 846–850. [Google Scholar]
  16. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780. [Google Scholar]
  17. Guo, T.; Song, Y.; Kong, Z.; Lim, E.; López-Benítez, M.; Ma, F.; Yu, L. Underwater Target Detection and Localization with Feature Map and CNN-Based Classification. In Proceedings of the 2022 4th International Conference on Advances in Computer Technology, Information Science and Communications (CTISC), Suzhou, China, 22–24 April 2022; pp. 1–8. [Google Scholar]
  18. Luo, X.; Feng, Y. An Underwater Acoustic Target Recognition Method Based on Restricted Boltzmann Machine. Sensors 2020, 20, 5399. [Google Scholar] [CrossRef]
  19. Luo, X.; Feng, Y.; Zhang, M. An Underwater Acoustic Target Recognition Method Based on Combined Feature With Automatic Coding and Reconstruction. IEEE Access 2021, 9, 63841–63854. [Google Scholar] [CrossRef]
  20. Kim, K.; Pak, M.; Pil, C.B.; Ri, C. A method for underwater acoustic signal classification using convolutional neural network combined with discrete wavelet transform. Int. J. Wavelets Multiresolution Inf. Process. 2021, 19, 2050092:1–2050092:26. [Google Scholar] [CrossRef]
  21. Khishe, M. DRW-AE: A Deep Recurrent-Wavelet Autoencoder for Underwater Target Recognition. IEEE J. Ocean. Eng. 2022, 47, 1083–1098. [Google Scholar] [CrossRef]
  22. Zhang, Q.; Da, L.; Zhang, Y.; Hu, Y. Integrated neural networks based on feature fusion for underwater target recognition. Appl. Acoust. 2021, 182, 108261. [Google Scholar] [CrossRef]
  23. Ke, X.; Yuan, F.; Cheng, E. Integrated optimization of underwater acoustic ship-radiated noise recognition based on two-dimensional feature fusion. Appl. Acoust. 2020, 159, 107057. [Google Scholar] [CrossRef]
  24. Wang, X.; Liu, A.; Zhang, Y.; Xue, F. Underwater Acoustic Target Recognition: A Combination of Multi-Dimensional Fusion Features and Modified Deep Neural Network. Remote Sens. 2019, 11, 1888. [Google Scholar] [CrossRef]
  25. Alouani, Z.; Hmamouche, Y.; Khamlichi, B.E.; Seghrouchni, A.E.F. A Spatio-temporal Deep Learning Approach for Underwater Acoustic Signals Classification. In Proceedings of the 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Madrid, Spain, 29 November–2 December 2022; pp. 1–7. [Google Scholar]
  26. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  27. Wen, Z.; Lin, W.L.; Wang, T.; Xu, G. Distract Your Attention: Multi-Head Cross Attention Network for Facial Expression Recognition. Biomimetics 2023, 8, 199. [Google Scholar] [CrossRef] [PubMed]
  28. Hua, J.; Cui, X.; Li, X.; Tang, K.; Zhu, P. Multimodal fake news detection through data augmentation-based contrastive learning. Appl. Soft Comput. 2023, 136, 110125. [Google Scholar] [CrossRef]
  29. Zhu, P.; Hua, J.; Tang, K.; Tian, J.; Xu, J.; Cui, X. Multimodal fake news detection through intra-modality feature aggregation and inter-modality semantic fusion. Complex Intell. Syst. 2024. [Google Scholar] [CrossRef]
  30. Wu, Y.; Zhan, P.; Zhang, Y.; Wang, L.; Xu, Z. Multimodal Fusion with Co-Attention Networks for Fake News Detection. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 2560–2569. [Google Scholar] [CrossRef]
  31. Qian, S.; Wang, J.; Hu, J.; Fang, Q.; Xu, C. Hierarchical Multi-modal Contextual Attention Network for Fake News Detection. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 153–162. [Google Scholar] [CrossRef]
  32. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  33. Wang, B.; Zhang, W.; Zhu, Y.; Wu, C.; Zhang, S. An Underwater Acoustic Target Recognition Method Based on AMNet. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5501105. [Google Scholar] [CrossRef]
  34. Yang, S.; Jin, A.; Zeng, X.; Wang, H.; Hong, X.; Lei, M. Underwater acoustic target recognition based on sub-band concatenated Mel spectrogram and multidomain attention mechanism. Eng. Appl. Artif. Intell. 2024, 133, 107983. [Google Scholar] [CrossRef]
  35. Lian, Z.; Wu, T. Feature Extraction of Underwater Acoustic Target Signals Using Gammatone Filterbank and Subband Instantaneous Frequency. In Proceedings of the 2022 IEEE 6th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Beijing China, 3–5 October 2022; pp. 944–949. [Google Scholar] [CrossRef]
  36. Feng, H.; Chen, X.; Wang, R.; Wang, H.; Yao, H.; Wu, F. Underwater acoustic target recognition method based on WA-DS decision fusion. Appl. Acoust. 2024, 217, 109851. [Google Scholar] [CrossRef]
  37. Yao, Y.; Zeng, X.; Wang, H.; Liu, J. Research on Underwater Acoustic Target Recognition Method Based on DenseNet. In Proceedings of the 2022 3rd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), Xi’an, China, 15–17 July 2022; pp. 114–118. [Google Scholar] [CrossRef]
  38. Dong, Y.; Shen, X.; Yan, Y.; Wang, H. Small-scale Data Underwater Acoustic Target Recognition with Deep Forest Model. In Proceedings of the 2022 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Xi’an, China, 25–27 October 2022; pp. 1–5. [Google Scholar] [CrossRef]
  39. Tan, J.; Pan, X. Underwater acoustic target recognition based on convolutional neural network and multi-feature fusion. In Proceedings of the Third International Conference on Computer Vision and Pattern Analysis (ICCPA 2023), Hangzhou, China, 31 March–2 April 2023; Volume 12754. [Google Scholar]
  40. Qi, P.; Sun, J.; Long, Y.; Zhang, L.; Tianye. Underwater Acoustic Target Recognition with Fusion Feature. In Proceedings of the Neural Information Processing: 28th International Conference, ICONIP 2021, Sanur, Indonesia, 8–12 December 2021; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer International Publishing: Cham, Switzerland, 2021; pp. 609–620. [Google Scholar]
  41. Li, J.; Yang, H. The underwater acoustic target timbre perception and recognition based on the auditory inspired deep convolutional neural network. Appl. Acoust. 2021, 182, 108210. [Google Scholar] [CrossRef]
  42. Schörkhuber, C.; Klapuri, A. Constant-Q transform toolbox for music processing. In Proceedings of the 7th Sound and Music Computing Conference, Barcelona, Spain, 21–24 April 2010. [Google Scholar]
  43. Chen, L.; Liu, F.; Li, D.; Shen, T.; Zhao, D. Underwater Acoustic Target Classification with Joint Learning Framework and Data Augmentation. In Proceedings of the 2022 5th International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 27–30 May 2022; pp. 23–28. [Google Scholar] [CrossRef]
  44. Santos-Domínguez, D.; Torres-Guijarro, S.; Cardenal-López, A.; Pena-Gimenez, A. ShipsEar: An underwater vessel noise database. Appl. Acoust. 2016, 113, 64–69. [Google Scholar] [CrossRef]
  45. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar]
  46. Yang, H.; Li, J.; Shen, S.; Xu, G. A Deep Convolutional Neural Network Inspired by Auditory Perception for Underwater Acoustic Target Recognition. Sensors 2019, 19, 1104. [Google Scholar] [CrossRef]
  47. Qi, P.; Yin, G.; Zhang, L. Underwater acoustic target recognition using RCRNN and wavelet-auditory feature. Multimed. Tools Appl. 2024, 83, 47295–47317. [Google Scholar] [CrossRef]
Figure 1. Pipeline of the ALSI model. The model takes three kinds of features as input and consists of three levels of feature fusion: Macro fusion, Fine-grained fusion, and Comprehensive integration. The first two fusion levels take the temporal embedding extracted by wav2vec 2.0 and the CQT spectrogram as input. The last level merges the information from the first two levels with the embedding of the Mel-spectrogram. Arrows of different colors represent the information transfer process of each dominant feature.
Figure 2. The time–frequency spectrograms of the 12 categories of data samples in the ShipsEar dataset. (a) CQT feature. (b) Mel-spectrogram.
Figure 3. The Scale ResNet block. This block compresses the time-domain information and retains only the frequency-domain information as the frequency embedding.
Figure 4. Residual Hybrid Attention Fusion. This block contains both MHSA and MHCA. Through this block, different attribute information can be aligned efficiently.
Figure 5. Performance of ALSI. (a) Confusion matrix. (b) Accuracy curves.
Figure 6. The performance of the model on each category.
Figure 7. t-SNE plots for different input features. (a) wav2vec 2.0 only. (b) wav2vec 2.0 + CQT. (c) wav2vec 2.0 + Mel-spectrogram. (d) wav2vec 2.0 + CQT + Mel-spectrogram.
Table 1. Sample size and label information of each category.
Category | Label | Size
Fishboat | Class 0 | 1683
Motorboat | Class 1 | 1751
Mussel boat | Class 2 | 1667
Natural ambient noise | Class 3 | 1420
Ocean liner | Class 4 | 1875
Passengers | Class 5 | 2078
RORO | Class 6 | 1513 1
1 Each sample is a two-second wav file with a sample rate of 22,050 Hz.
Table 2. Performance of the proposed model compared with existing work.
Model | Feature(s) | Accuracy | Precision | Recall | F1-Score
Yamnet [16] | MFCC | 78.72% | 68.94% | 83.58% | 0.7264
VGGish [45] | Log-Mel | 86.57% | 83.30% | 85.82% | 0.8427
ADCNN [46] | Raw Wave | 93.58% | 92.36% | 97.49% | 0.9469
CRNN9 [10] | 3D Log-Mel | 91.17% | 77.64% | 95.43% | 0.8463
LSTM-based [9] | STFT | 94.77% | 91.32% | 98.14% | 0.9449
DRW-AE [21] * | Wavelet | 94.49% | - | - | -
ResNet18 [47] * | 3D fusion features | 94.30% | - | - | -
ALSI (ours) | Raw Wave, CQT, Log-Mel | 96.37% | 96.74% | 96.36% | 0.9607
* No other results were mentioned in the literature.
Table 3. Ablation experiments of different input features.
Temporal | Frequency (CQT) | Frequency (Mel) | Accuracy | Acc Diff 1
✓ 2 | - | - | 82.60% | 14.38%
✓ | ✓ | - | 94.34% | 5.840%
✓ | - | ✓ | 90.54% | 8.730%
✓ | ✓ | ✓ | 96.37% | 3.540%
1 Accuracy difference between training and validation. 2 ✓ indicates that the feature is used in the model.
Table 4. Ablation experiments of different fusion methods 1.
Level 1 | Level 2 | Level 3 | Accuracy
✓ 2 | - | - | 90.45%
- | ✓ | - | 57.26%
✓ | ✓ | - | 96.37%
1 When MLP is used in all fusion phases, the model does not converge and is not listed in the table. 2 ✓ indicates that the module is included in the model.
Table 5. Performance with different hyper-parameters.
Level 1 | Level 2 | Accuracy
1 | 1 | 94.38%
1 | 2 | 94.13%
1 | 3 | 93.88%
2 | 1 | 93.03%
2 | 2 | 96.37%
2 | 3 | 93.96%
3 | 1 | 91.42%
3 | 2 | 95.06%
3 | 3 | 94.47%
Table 6. Performance with different model structures.
Structure | Accuracy
Parallel | 14.14% (did not converge)
Concatenation | 96.37%