Article

An End-to-End Underwater Acoustic Target Recognition Model Based on One-Dimensional Convolution and Transformer

Ocean College, Jiangsu University of Science and Technology, Zhenjiang 212100, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(10), 1793; https://doi.org/10.3390/jmse12101793
Submission received: 23 August 2024 / Revised: 4 October 2024 / Accepted: 7 October 2024 / Published: 9 October 2024
(This article belongs to the Section Ocean Engineering)

Abstract

Underwater acoustic target recognition (UATR) is crucial for defense and ocean environment monitoring. Although traditional methods and deep learning approaches based on time–frequency domain features have achieved high recognition rates in certain tasks, they rely on manually designed feature extraction processes, leading to information loss and limited adaptability to environmental changes. To overcome these limitations, we proposed a novel end-to-end underwater acoustic target recognition model, 1DCTN. This model directly used raw time-domain signals as input, leveraging one-dimensional convolutional neural networks (1D CNNs) to extract local features and combining them with Transformers to capture global dependencies. Our model simplified the recognition process by eliminating the need for complex feature engineering and effectively addressed the limitations of long short-term memory (LSTM) networks in handling long-term dependencies. Experimental results on the publicly available ShipsEar dataset demonstrated that 1DCTN achieves a remarkable accuracy of 96.84%, setting a new benchmark for end-to-end models on this dataset. Additionally, 1DCTN stood out among lightweight models, achieving the highest recognition rate, making it a promising direction for future research in underwater acoustic recognition.

1. Introduction

Acoustic target recognition is a core component of underwater detection technology, having significant applications in national defense, marine environment monitoring, underwater resource exploration, and navigation [1]. Traditional acoustic target recognition methods involve extracting distinguishable features, followed by target identification using classifiers or template matching [2,3,4,5,6]. However, due to the complexity of the underwater environment, such as the variability of acoustic channels, absorption and scattering effects, and diverse marine environmental noise [7], traditional methods have notable limitations in extracting distinguishable and robust features, making high-precision acoustic target recognition challenging.
In recent years, data-driven machine learning methods, especially deep learning techniques, have made significant advancements in acoustic target recognition, owing to their powerful representation and generalization capabilities. Studies [8,9,10,11,12] employed the spectral features of sonar signals, inputting them into convolutional neural networks (CNNs) for recognition, which significantly improved performance compared to traditional methods. To mitigate issues such as overfitting and vanishing gradients in neural networks, Zheng et al. [13] proposed a sparsely structured network (GoogLeNet). This network extracted more abstract features from the time–frequency spectrum, thereby enhancing the distinction between background noise and target signals. To achieve multi-feature fusion, Hong et al. [14] developed a residual network with three-channel input based on ResNet18, which achieved a recognition accuracy of 94.3% on the ShipsEar dataset. To mitigate the impact of residuals on networks, Xue et al. [15] introduced a residual network (ResNet) incorporating a channel attention mechanism (CAM) after the residual blocks, thereby enhancing recognition performance. Wang et al. [16] proposed a multi-branch CNN utilizing attention mechanisms to accelerate network training and improve recognition rates. Inspired by the Long Short-Term Memory (LSTM) network’s ability to utilize temporal information to learn context, Han et al. [17] designed a hybrid network combining one-dimensional convolution and LSTM for underwater target recognition. To overcome the sequential dependency issue of LSTM, Li et al. [18] introduced the Transformer model into underwater target recognition, proposing a Spectrogram Transformer Model (STM). Feng et al. [19] designed a new layer-wise aggregation model based on the Transformer to enhance recognition accuracy. Although these deep learning models achieved significant success in acoustic target recognition, they all relied on manually designed time–frequency input features. The inevitable loss of fine details in the original waveform during time–frequency transformation became a bottleneck in further improving the accuracy of acoustic target recognition.
In contrast, end-to-end recognition using time-domain signals offered advantages such as preserving complete information, reducing human biases, and simplifying the processing pipeline. This approach was better suited to the complex and variable underwater environment. Doan et al. [20] proposed an underwater target recognition method based on a dense convolutional neural network (DCNN), which automatically extracted audio features without the need for specialized domain knowledge or expert intervention. Hu et al. [21] improved the CNN model by introducing depthwise separable convolutions and dilated convolutions, leading to a significant enhancement in time-domain signal recognition performance. Song et al. [22] compared the recognition capabilities of CNN and LSTM models for complex acoustic signals. The experimental results showed that the CNN, using time-domain signals as input, achieved an accuracy 5% higher than that of the LSTM. To leverage the strengths of one-dimensional convolution and LSTM, Kamal et al. [23] proposed a combined CNN and LSTM model. Experiments in the shallow waters of the Indian Ocean demonstrated that this end-to-end deep learning model achieved a recognition accuracy of 95.2%, fully validating the effectiveness of end-to-end recognition in the time domain. Despite these successes, current models have primarily relied on CNN and LSTM architectures for sequence relationship modeling. Recently, Transformer networks have become a research focus due to their powerful parallel processing capabilities and ability to model long-range dependencies. Through the multi-head self-attention mechanism, Transformer networks effectively overcame the limitations of LSTM, offering new possibilities for the further development of end-to-end models in time-domain signal recognition.
Building upon this analysis, we proposed the 1DCTN model, an end-to-end underwater acoustic target recognition approach using raw time-domain waveforms. By combining the local feature extraction capabilities of 1D CNNs with the Transformer’s ability to model long-range dependencies, 1DCTN effectively captured both local and global structures in acoustic signals. This approach not only overcame the limitations of existing methods but also enhanced accuracy. The primary contributions of this paper are as follows:
  • The 1DCTN model directly processed time-domain signals, streamlining the recognition process by eliminating the need for complex feature engineering. This novel model overcame the inherent limitations of time–frequency domain representation methods, introducing a new way to preserve the full information contained in the raw waveforms.
  • The 1DCTN model introduced a new method that effectively combined the local feature extraction capabilities of 1D CNNs with the long-range dependency modeling of the Transformer, addressing the limitations of LSTM in managing long-term dependencies and enhancing recognition accuracy.
  • The 1DCTN model was lightweight, achieving optimal recognition accuracy with low computational complexity, making it an effective solution for resource-constrained scenarios in real-world applications.
  • Comprehensive validation on the public dataset ShipsEar fully demonstrated the advantages of the 1DCTN model.

2. Materials and Methods

2.1. One-Dimensional Convolution

While classic convolutional neural networks (CNNs) typically use 2D convolution for image processing, 1D convolution is more suitable for sequential data like acoustic signals. The 1D convolution operation efficiently extracts local features from time series data, which is crucial for identifying short-term and long-term dependencies within the sequence. Additionally, compared to the matrix convolution operations in 2D convolution, 1D convolution has fewer model parameters.
In the processing of one-dimensional signals, the 1D convolution operation retains the weight-sharing property of 2D convolution. During the convolution process, the kernel is multiplied element-wise with successive segments of the input signal, and the products are summed to produce each new output value. This process can be viewed as a form of smoothing, denoising, or feature extraction for the signal. The calculation process for 1D convolution is
$$X_j^l = f\left( \sum_{i=1}^{M} \omega_{ij}^{l} * X_i^{l-1} + b_j^{l} \right)$$
where $X_i^{l-1}$ represents the input feature map at layer $l-1$, $*$ denotes the 1D convolution operation, $f$ is the activation function, and $\omega_{ij}^{l}$ and $b_j^{l}$ represent the weights and bias in the 1D convolutional kernel, respectively.
As shown in Figure 1, the process of the 1D convolution operation involves an input vector of length 8 and a convolutional layer with 3 kernels. Each kernel has a size of 3 and a stride of 1. After the operation of each kernel, the corresponding bias b is added, resulting in 3 output vectors.
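To make the operation in Figure 1 concrete, the following minimal PyTorch sketch reproduces the same configuration (an input vector of length 8, three kernels of size 3, stride 1); the tensor values are random and purely illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8)                         # one sample, one channel, input vector of length 8
conv = nn.Conv1d(in_channels=1, out_channels=3,  # 3 kernels -> 3 output vectors
                 kernel_size=3, stride=1, bias=True)
y = conv(x)                                      # each kernel slides along x, sums the products, and adds its bias b
print(y.shape)                                   # torch.Size([1, 3, 6]) without padding
```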

2.2. Multi-Head Self-Attention Mechanism

The multi-head self-attention mechanism is a powerful extension of the self-attention mechanism, first introduced by Vaswani et al. [24] in the 2017 paper “Attention Is All You Need”. This mechanism allows the model to simultaneously focus on different subspaces of the input sequence, thereby capturing richer and more diverse information. The self-attention mechanism excels at capturing long-range relationships and global dependencies in sequential data, making it the core algorithm of Transformer networks.
Specifically, multi-head self-attention (MHSA) involves the parallel computation of multiple scaled dot-product attentions. As shown in Figure 2, each attention head independently computes self-attention, and these results are then concatenated and linearly transformed to produce the final output. Given an input sequence $X$, multiple linear transformations are first applied to obtain the query ($Q$), key ($K$), and value ($V$) matrices. The calculation formulas are
$$Q_i = X^T W_i^Q$$
$$K_i = X^T W_i^K$$
$$V_i = X^T W_i^V$$
where $W_i^Q$, $W_i^K$, and $W_i^V$ are the weight matrices for the i-th attention head. Next, the attention output for each head is calculated using the following formula
$$H_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$$
where $H_i$ represents the self-attention output of the i-th head, and $d_k$ refers to the dimensionality of both the query and key vectors in the attention mechanism of the Transformer model. Finally, the outputs of all heads are concatenated and passed through a linear transformation to produce the final output, as expressed by the following formula
$$\mathrm{MHSA}(Q, K, V) = W^0 \, \mathrm{Concat}(H_1, H_2, \ldots, H_m)$$
where $W^0$ represents the weight matrix for the linear transformation.
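The computation above can be sketched directly in PyTorch as follows; the sequence length, model dimension, head dimension, and random weight initialization are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """X: (seq_len, d_model); W_q/W_k/W_v: one (d_model, d_k) matrix per head."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q_i, K_i, V_i = X @ Wq_i, X @ Wk_i, X @ Wv_i       # per-head linear projections
        d_k = Q_i.shape[-1]
        scores = Q_i @ K_i.transpose(-2, -1) / d_k ** 0.5  # scaled dot-product attention
        heads.append(F.softmax(scores, dim=-1) @ V_i)      # H_i for this head
    return torch.cat(heads, dim=-1) @ W_o                  # concatenate heads, apply output projection

# toy dimensions: 4 heads, model dimension 128, head dimension 32, 100 time steps
m, d_model, d_k = 4, 128, 32
X = torch.randn(100, d_model)
W_q = [torch.randn(d_model, d_k) for _ in range(m)]
W_k = [torch.randn(d_model, d_k) for _ in range(m)]
W_v = [torch.randn(d_model, d_k) for _ in range(m)]
W_o = torch.randn(m * d_k, d_model)
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)      # (100, 128)
```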

2.3. 1DCTN Model Architecture

In this study, we proposed a novel network architecture called 1DCTN, which integrates the strengths of 1D-CNNs and the Transformer. The 1DCTN leverages the efficiency of 1D-CNNs in local feature extraction and parameter sharing, while incorporating the Transformer’s multi-head attention mechanism to capture long-term dependencies. This hybrid architecture effectively handles both local and global temporal features, thereby enhancing the model’s ability to understand and process complex signal patterns.
As shown in Figure 3, the 1DCTN model structure consists primarily of two modules: 1D convolution and multi-head self-attention computation. Since the raw time-domain data often have varying magnitudes, using them directly for training can affect the model’s convergence speed and recognition performance. Therefore, we normalized the data using the entire dataset
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where $\mu$ and $\sigma$ represent the mean and standard deviation of the samples, and $\epsilon$ is a small constant set to $1 \times 10^{-8}$ to ensure numerical stability and avoid division by zero. By using the overall mean and standard deviation of the entire dataset, we ensured that all samples were processed consistently, maintaining uniformity across the dataset. The normalized data are then fed into the 1D convolution layer for dimensionality reduction and local feature extraction. This process can be represented as follows
$$x_e = f_{\mathrm{1Dconv}}(\hat{x}, c) = \sum_i \hat{x}_i c_i + b$$
where $c$ represents the convolution coefficients and $b$ is the bias.
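A minimal sketch of this dataset-level normalization, assuming the data are held in a single NumPy array; the $\epsilon$ value is taken from the text.

```python
import numpy as np

def normalize(dataset: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    mu, sigma = dataset.mean(), dataset.std()          # statistics over the entire dataset
    return (dataset - mu) / np.sqrt(sigma ** 2 + eps)  # epsilon avoids division by zero
```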
After obtaining the local features from the 1D convolution output, the data are passed to the Transformer encoder section to extract global features of the signal, which represent long-range dependencies and overall temporal patterns across the entire input sequence. The core of the 1DCTN model consists of three stacked Transformer encoder layers, all sharing the same structure and parameter settings. The output computation process for the l-th Transformer encoder layer can be represented as follows
$$\hat{z}^l = \mathrm{LN}\left(\mathrm{MHSA}(z^{l-1}) + z^{l-1}\right)$$
$$z^l = \mathrm{LN}\left(\mathrm{FFN}(\hat{z}^l) + \hat{z}^l\right)$$
where $\mathrm{LN}$ denotes the layer normalization operation, $z^l$ is the output of the l-th Transformer encoder layer, and $\mathrm{FFN}$ stands for feedforward neural network. The formula for the layer normalization operation is
$$\mathrm{LN}(z) = \gamma \frac{z - \mu_z}{\sigma_z} + \beta$$
where $\gamma$ and $\beta$ are learnable parameter vectors, and $\mu_z$ and $\sigma_z$ represent the mean and standard deviation of $z$, respectively.
After passing through the Transformer encoder, the output feature matrix is sent to a Global Average Pooling (GAP) layer, which averages the entire feature matrix. Finally, the classification is performed using a Multilayer Perceptron (MLP), with softmax as the activation function for the output. The formula is
$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}}, \quad i \in \{1, \ldots, C\}$$
where $x_i$ represents the i-th output of the fully connected layer and $C$ denotes the number of signal classes to be identified.
The cross-entropy loss function is used to measure the difference between the model’s predicted probability distribution and the true distribution. The cross-entropy loss function is defined as follows
$$\xi(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$
where $y_i$ and $\hat{y}_i$ represent the true label and the predicted probability for the i-th class, respectively. The parameter structure of the 1DCTN model is shown in Table 1. In the table, B represents the batch size, L indicates the sequence length, and M denotes the number of output classes.
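The layer configuration in Table 1 translates into the following PyTorch sketch. The activation functions, pooling factors (chosen so that the sequence reaching the encoder has length L/64, as in the table), and the final MLP head are assumptions where the table leaves details unspecified.

```python
import torch
import torch.nn as nn

class OneDCTN(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 128):
        super().__init__()
        # Three Conv1D + MaxPool stages: 32 -> 64 -> 128 filters, 5 x 1 kernels, padding 2
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
        )
        # Transformer encoder: 3 layers, 4 heads, FFN dimension 128, dropout 0.1
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=128, dropout=0.1)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.classifier = nn.Linear(d_model, num_classes)   # MLP head -> (B, M)

    def forward(self, x):                  # x: (B, 1, L) raw waveform
        z = self.features(x)               # (B, 128, L/64) local features
        z = z.permute(2, 0, 1)             # reshape to (L/64, B, 128) for the encoder
        z = self.encoder(z)                # global dependencies via multi-head self-attention
        z = z.mean(dim=0)                  # global average pooling over time -> (B, 128)
        return self.classifier(z)          # class logits; softmax is applied in the loss

logits = OneDCTN(num_classes=5)(torch.randn(2, 1, 16000))   # two 1 s segments at 16 kHz
```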

3. Dataset and Preprocessing

3.1. Dataset Description

The underwater acoustic target signal data used in our experiments are sourced from the ShipsEar [25] dataset. This dataset was recorded using the DigitalHYD SR-1 recorder and includes acoustic signals from various ships sailing near the port of Vigo, Spain, from the fall of 2012 to the summer of 2013. According to the original dataset annotations, the ships were categorized into five classes (four types of vessels and one background noise category), as detailed in Table 2.

3.2. Spectral Analysis of the Dataset

To gain a deeper understanding of the dataset’s characteristics and to establish a foundation for subsequent processing, we conducted a detailed spectral analysis of the signals. The spectrum was divided into three main intervals: 0–100 Hz, 100–8000 Hz, and 8000–25,000 Hz.
As illustrated in Figure 4, the primary energy and characteristics information of the ship signals (categories A, B, C, and D) are concentrated below 8000 Hz. Notably, in the mid-frequency range of 100–8000 Hz, there are significant individual differences between the signals, reflecting the unique characteristics of different types of ships. The E-type signal represents environmental noise, with consistent amplitude across various frequency bands, displaying the typical characteristics of broadband environmental noise.

3.3. Data Processing and Dataset Partitioning

Based on the spectral analysis and to reduce the computational complexity, we resampled the dataset to 16,000 Hz. To augment the limited original data, we segmented the ShipsEar recordings into 1 s intervals without overlap, resulting in 9166 samples, as detailed in Table 3. For model training, the samples were randomly divided into training, validation, and test sets with a ratio of 8:1:1.
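A sketch of this preprocessing step using torchaudio; the file name is a placeholder, samples shorter than 1 s at the end of a recording are discarded, and the 8:1:1 split is drawn at random as described above.

```python
import torch
import torchaudio

def segment_recording(path: str, target_sr: int = 16000) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                           # (channels, samples)
    wav = torchaudio.functional.resample(wav, sr, target_sr)  # resample to 16,000 Hz
    wav = wav.mean(dim=0)                                     # mix down to mono
    n = wav.shape[0] // target_sr                             # number of whole 1 s windows
    return wav[: n * target_sr].reshape(n, 1, target_sr)      # (n, 1, 16000), no overlap

segments = segment_recording("shipsear_recording.wav")        # placeholder file name
perm = torch.randperm(len(segments))                          # random 8:1:1 split
n_tr, n_va = int(0.8 * len(segments)), int(0.1 * len(segments))
train_idx, val_idx, test_idx = perm[:n_tr], perm[n_tr:n_tr + n_va], perm[n_tr + n_va:]
```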

4. Experiments and Results

4.1. Experimental Setup

We implemented and trained the 1DCTN model using the PyTorch 2.0.1 deep learning framework, with the Python interface. The model employed the Stochastic Gradient Descent (SGD) optimizer and was trained for 100 epochs with a batch size of 32 and an initial learning rate of 0.001. The experiments were conducted on a system with an Intel Xeon(R) Gold 6230R CPU @ 2.10 GHz, 128 GB of RAM, and an NVIDIA RTX 3060 GPU with CUDA capability 8.6, using the PyCharm 2023.1 integrated development environment.
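Under these settings, the training loop reduces to the standard PyTorch pattern sketched below; `train_loader` is a hypothetical DataLoader yielding (waveform, label) batches of size 32, and `OneDCTN` refers to the sketch in Section 2.3.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = OneDCTN(num_classes=5).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)   # SGD, initial learning rate 0.001
criterion = nn.CrossEntropyLoss()                           # cross-entropy loss from Section 2.3

for epoch in range(100):                                    # 100 training epochs
    model.train()
    for waveforms, labels in train_loader:                  # hypothetical DataLoader, batch size 32
        optimizer.zero_grad()
        loss = criterion(model(waveforms.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
```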
To comprehensively evaluate the classification performance of the 1DCTN model, we used several key metrics, including accuracy, recall, precision, and F1-score. These metrics were calculated using the following formulas
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $TP$, $TN$, $FP$, and $FN$ stand for true positive, true negative, false positive, and false negative, respectively.
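These metrics can be computed, for example, with scikit-learn as sketched below; since the paper does not state how the per-class scores are averaged, macro averaging is assumed here.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

# toy example with the five ShipsEar classes labelled 0-4
print(evaluate([0, 1, 2, 3, 4, 0], [0, 1, 2, 3, 4, 1]))
```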

4.2. Comparative Evaluation

4.2.1. Performance Comparison with Time–Frequency Features

To evaluate the effectiveness of 1DCTN and explore the potential advantages of directly using time-domain signals compared to time–frequency features, we used mel spectrograms, mel-frequency cepstral coefficients (MFCCs), and raw time-domain signals as model inputs. For the different input features, we kept the overall model architecture unchanged, making adjustments only to the input layer to accommodate the different data formats.
Figure 5 illustrates the training loss and validation accuracy curves for the three input features throughout the training process. The model trained with raw time-domain signals demonstrated notably faster convergence within the first 20 epochs compared to those using mel spectrograms or MFCCs. This rapid initial convergence suggests that the model can more quickly extract meaningful features from raw time-domain data. In terms of validation accuracy, the time-domain model consistently outperformed those trained with time–frequency features across the entire training period. This superior performance indicates enhanced generalization capability, suggesting that raw time-domain signals provide richer and more discriminative information.
Table 4 presents a comprehensive comparison of accuracy, precision, recall, and F1-score for the three types of input features on the test set. The results clearly demonstrate that using raw time-domain signals as input yields superior performance across all evaluation metrics.
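For reference, the two time–frequency baselines can be produced with torchaudio as sketched below; the spectrogram parameters (FFT size, hop length, number of mel bands and coefficients) are assumptions, since the paper does not report them.

```python
import torch
import torchaudio.transforms as T

waveform = torch.randn(1, 16000)                 # one 1 s segment at 16 kHz
mel = T.MelSpectrogram(sample_rate=16000, n_fft=1024, hop_length=256, n_mels=64)(waveform)
mfcc = T.MFCC(sample_rate=16000, n_mfcc=20,
              melkwargs={"n_fft": 1024, "hop_length": 256, "n_mels": 64})(waveform)
print(mel.shape, mfcc.shape)                     # only the input layer of the model changes
```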

4.2.2. Performance Comparison with LSTM Network Architectures

The architecture of a neural network plays a critical role in underwater acoustic target recognition tasks. While LSTM networks have been widely used for sequential data processing, Transformer architectures have recently shown superior capabilities in capturing long-range dependencies. To evaluate the advantages of the Transformer over LSTM in our proposed model, we conducted a comparative experiment. The 1D convolutional component of the network was kept constant, while the sequence processing module was alternated between the Transformer and LSTM.
Figure 6 illustrates the training loss and validation accuracy curves for both the 1DCTN and 1D-CNN LSTM models. The results reveal significant differences in performance and learning dynamics. The 1DCTN model demonstrated superior convergence characteristics, including faster convergence speed and lower final loss values. This suggests that the Transformer architecture is more efficient at extracting relevant features from underwater acoustic signals, potentially due to its ability to capture complex patterns and long-range dependencies more effectively. In terms of validation accuracy, the 1DCTN model not only achieved higher accuracy early in the training process but also maintained its superiority throughout, indicating enhanced generalization capability.
Quantitatively, the 1DCTN model achieved an accuracy of 96.84% on the test set, outperforming the LSTM model by 2.93%. To provide a more comprehensive comparison, we visualized the recognition results on the test set using confusion matrices (Figure 7) and category-wise performance metrics (Figure 8). The confusion matrices show that the 1DCTN model exhibits clearer diagonal dominance, indicating more accurate classifications across all categories. The bar charts comparing precision, recall, and F1-scores for each category reveal that the 1DCTN network consistently outperforms the 1D-CNN LSTM network across all metrics and categories.
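The LSTM baseline keeps the convolutional front end of the 1DCTN sketch from Section 2.3 and swaps the Transformer encoder for an LSTM, as in the minimal variant below; the hidden size is an assumption.

```python
import torch.nn as nn

class OneDCNNLSTM(OneDCTN):                       # reuse the conv layers and classifier head
    def __init__(self, num_classes: int):
        super().__init__(num_classes)
        self.encoder = None                       # drop the unused Transformer encoder
        self.lstm = nn.LSTM(input_size=128, hidden_size=128)   # sequence module replacement

    def forward(self, x):
        z = self.features(x).permute(2, 0, 1)     # (L/64, B, 128)
        z, _ = self.lstm(z)                       # sequential modelling instead of self-attention
        return self.classifier(z.mean(dim=0))     # same pooling and classification head
```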

4.2.3. Comparison with Other Lightweight Models

Lightweight models play a crucial role in underwater target recognition due to their high real-time performance, low energy consumption, wide adaptability, and cost-effectiveness. To further evaluate the performance of the proposed 1DCTN model, we compared it with several lightweight models reported in the current literature. The comparisons focused on three key metrics: accuracy, number of model parameters, and floating-point operations (FLOPs).
The results in Table 5 demonstrated that the 1DCTN model achieved an excellent balance between accuracy, model complexity, and computational efficiency. It achieved the highest accuracy, 96.8%, among the compared models, outperforming the next best model, ResNet18, by 1.9%. While the 1DCTN model did not have the fewest parameters (0.45 M compared to Autoencoder-decoder’s 0.18 M), it still maintained a relatively lightweight structure. The slight increase in parameters compared to some other models is justified by the significant improvement in accuracy. Additionally, it maintained reasonable computational complexity (0.27 G FLOPs) while directly processing time-domain signals.
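The complexity figures can be reproduced roughly as sketched below: parameter count with plain PyTorch, and FLOPs with a third-party profiler (thop is used here as one common choice; the paper does not state which tool was used, and profilers differ in whether they report MACs or FLOPs).

```python
import torch

model = OneDCTN(num_classes=5)                   # sketch from Section 2.3
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"parameters: {n_params / 1e6:.2f} M")

try:
    from thop import profile                     # pip install thop
    macs, _ = profile(model, inputs=(torch.randn(1, 1, 16000),))
    print(f"~{macs / 1e9:.2f} GMACs (FLOPs up to a factor of 2)")
except ImportError:
    print("install thop to estimate FLOPs")
```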

5. Conclusions

This study introduced a novel end-to-end underwater acoustic target recognition model, 1DCTN, which combined the local feature extraction capabilities of 1D CNNs with the long-range dependency modeling of Transformers. By directly processing raw time-domain signals, 1DCTN preserved complete acoustic information, overcoming the limitations of traditional time–frequency domain methods. On the ShipsEar dataset, 1DCTN achieved a recognition accuracy of 96.84%, setting a new benchmark for end-to-end models on publicly available underwater acoustic datasets. Moreover, while maintaining a lightweight structure, 1DCTN outperformed comparative models in accuracy. These advancements pave the way for new approaches in end-to-end underwater acoustic signal processing. Future research will explore the model’s performance in diverse acoustic environments, its potential for real-time applications, and its applicability to other underwater acoustic tasks.

Author Contributions

Conceptualization, K.Y. and B.W.; methodology, K.Y.; software, B.C.; validation, K.Y., Z.F. and B.C.; formal analysis, Z.F.; investigation, K.Y.; resources, B.W.; data curation, K.Y.; writing—original draft preparation, K.Y.; writing—review and editing, K.Y., B.W. and Z.F.; visualization, K.Y.; supervision, B.W.; project administration, Z.F.; funding acquisition, B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 52071164.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are openly available at http://atlanttic.uvigo.es/underwaternoise/ (accessed on 5 May 2024) in ref. [25].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cho, H.; Gu, J.; Yu, S.C. Robust Sonar-Based Underwater Object Recognition Against Angle-of-View Variation. IEEE Sens. J. 2016, 16, 1013–1025.
  2. Wei, X.; Li, G.H.; Wang, Z.Q. Underwater Target Recognition Based on Wavelet Packet and Principal Component Analysis. Comput. Simul. 2011, 28, 8–290.
  3. Das, A.; Kumar, A.; Bahl, R. Marine Vessel Classification Based on Passive Sonar Data: The Cepstrum-Based Approach. IET Radar Sonar Navig. 2013, 7, 87–93.
  4. Meng, Q.; Yang, S.; Piao, S. The Classification of Underwater Acoustic Target Signals Based on Wave Structure and Support Vector Machine. J. Acoust. Soc. Am. 2014, 136, 2265.
  5. Jahromi, M.S.; Bagheri, V.; Rostami, H.; Keshavarz, A. Feature Extraction in Fractional Fourier Domain for Classification of Passive Sonar Signals. J. Signal Process. Syst. 2019, 91, 511–520.
  6. Ke, X.; Yuan, F.; Cheng, E. Underwater Acoustic Target Recognition Based on Supervised Feature-Separation Algorithm. Sensors 2018, 18, 4318.
  7. Erbe, C.; Marley, S.A.; Schoeman, R.; Smith, J.N.; Trigg, L.E.; Embling, C.B. The Effects of Ship Noise on Marine Mammals—A Review. Front. Mar. Sci. 2019, 6, 606.
  8. Kirsebom, O.S.; Frazao, F.; Simard, Y.; Roy, N.; Matwin, S.; Giard, S. Performance of a Deep Neural Network at Detecting North Atlantic Right Whale Upcalls. J. Acoust. Soc. Am. 2020, 147, 2636–2646.
  9. Yin, X.H.; Sun, X.D.; Liu, P.S.; Wang, L.; Tang, R.C. Underwater Acoustic Target Classification Based on LOFAR Spectrum and Convolutional Neural Network. In Proceedings of the 2nd International Conference on Artificial Intelligence and Advanced Manufacture (AIAM), Manchester, UK, 15–17 October 2020; ACM: New York, NY, USA, 2020; pp. 59–63.
  10. Jiang, J.; Shi, T.; Huang, M.; Xiao, Z. Multi-Scale Spectral Feature Extraction for Underwater Acoustic Target Recognition. Measurement 2020, 166, 108227.
  11. Miao, Y.; Zakharov, Y.V.; Sun, H.; Li, J.; Wang, J. Underwater Acoustic Signal Classification Based on Sparse Time–Frequency Representation and Deep Learning. IEEE J. Ocean. Eng. 2021, 46, 952–962.
  12. Liu, F.; Shen, T.; Luo, Z.; Zhao, D.; Guo, S. Underwater Target Recognition Using Convolutional Recurrent Neural Networks with 3-D Mel-spectrogram and Data Augmentation. Appl. Acoust. 2021, 178, 107989.
  13. Zheng, Y.; Gong, Q.; Zhang, S. Time-Frequency Feature-Based Underwater Target Detection with Deep Neural Network in Shallow Sea. J. Phys. Conf. Ser. 2021, 1756, 012006.
  14. Hong, F.; Liu, C.; Guo, L.; Chen, F.; Feng, H. Underwater Acoustic Target Recognition with a Residual Network and the Optimized Feature Extraction Method. Appl. Sci. 2021, 11, 1442.
  15. Xue, L.; Zeng, X.; Jin, A. A Novel Deep-Learning Method with Channel Attention Mechanism for Underwater Target Recognition. Sensors 2022, 22, 5492.
  16. Wang, B.; Zhang, W.; Zhu, Y.; Wu, C.; Zhang, S. An Underwater Acoustic Target Recognition Method Based on AMNet. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
  17. Han, X.C.; Ren, C.; Wang, L.; Bai, Y. Underwater Acoustic Target Recognition Method Based on a Joint Neural Network. PLoS ONE 2022, 17, e0266425.
  18. Li, P.; Wu, J.; Wang, Y.; Lan, Q.; Xiao, W. STM: Spectrogram Transformer Model for Underwater Acoustic Target Recognition. J. Mar. Sci. Eng. 2022, 10, 1428.
  19. Feng, S.; Zhu, X. A Transformer-Based Deep Learning Network for Underwater Acoustic Target Recognition. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1505805.
  20. Doan, V.S.; Huynh-The, T.; Kim, D.S. Underwater Acoustic Target Classification Based on Dense Convolutional Neural Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1500905.
  21. Hu, G.; Wang, K.; Liu, L. Underwater Acoustic Target Recognition Based on Depthwise Separable Convolution Neural Networks. Sensors 2021, 21, 1429.
  22. Song, X.; Cheng, J.; Gao, Y. A New Deep Learning Method for Underwater Target Recognition Based on One-Dimensional Time-Domain Signals. In Proceedings of the 2021 OES China Ocean Acoustics (COA), Harbin, China, 14–17 July 2021; pp. 1048–1051.
  23. Kamal, S.; Chandran, C.S.; Supriya, M.H. Passive Sonar Automated Target Classifier for Shallow Waters Using End-to-End Learnable Deep Convolutional LSTMs. Eng. Sci. Technol. Int. J. 2021, 24, 860–871.
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  25. Santos-Domínguez, D.; Torres-Guijarro, S.; Cardenal-López, A.; Pena-Gimenez, A. ShipsEar: An Underwater Vessel Noise Database. Appl. Acoust. 2016, 113, 64–69.
Figure 1. Illustration of the 1D convolution process.
Figure 2. Multi-head self-attention computation process.
Figure 3. Architecture of the proposed 1DCTN model.
Figure 4. Spectra of ship signals in different categories. (a) Spectrum of ship signal in category A. (b) Spectrum of ship signal in category B. (c) Spectrum of ship signal in category C. (d) Spectrum of ship signal in category D. (e) Spectrum of ship signal in category E.
Figure 5. Training and validation curves for different input features. (a) Training loss curves for different input features. (b) Validation accuracy curves for different input features.
Figure 6. Training and validation curves. (a) Training loss curves for 1DCTN and 1D-CNN LSTM. (b) Validation accuracy curves for 1DCTN and 1D-CNN LSTM.
Figure 7. Confusion matrices on the test set. (a) Confusion matrix for 1D-CNN LSTM. (b) Confusion matrix for 1DCTN.
Figure 8. Comparison of performance metrics for each category. (a) Precision comparison between 1D-CNN LSTM and 1DCTN. (b) Recall comparison between 1D-CNN LSTM and 1DCTN. (c) F1-score comparison between 1D-CNN LSTM and 1DCTN.
Table 1. Layer configuration and output shape of the 1DCTN model.

| Layer | Output Shape | Configuration |
|---|---|---|
| Input | (B, 1, L) | - |
| Conv1D + MaxPool | (B, 32, L) | 32 filters, 5 × 1 kernel, pad 2 |
| Conv1D + MaxPool | (B, 64, L/2) | 64 filters, 5 × 1 kernel, pad 2 |
| Conv1D + MaxPool | (B, 128, L/4) | 128 filters, 5 × 1 kernel, pad 2 |
| Reshape | (L/64, B, 128) | - |
| Transformer Encoder | (L/64, B, 128) | 3 layers, 4 heads, dropout 0.1, FFN dim 128 |
| Global Avg Pooling | (B, 128) | - |
| MLP | (B, M) | - |
Table 2. Categories and ship types in the ShipsEar dataset.

| Category | Ship Types |
|---|---|
| A | fishing boats, trawlers, mussel boats, tugboats, dredgers |
| B | motorboats, pilot boats, sailboats |
| C | passenger ferries |
| D | ocean liners, ro-ro vessels |
| E | background noise recordings |
Table 3. Number of samples per category after data processing.

| Category | Acoustic Signal Serial Number | Number of Samples |
|---|---|---|
| A | 15, 28, 46–49, 66, 73–76, 80, 93–96 | 1808 |
| B | 26, 27, 29, 30, 50–52, 56, 57, 68, 70, 72, 77, 79 | 1304 |
| C | 6, 10, 40, 42, 43, 52–54, 59–65, 67 | 2632 |
| D | 18–20, 22, 24, 25, 58, 69, 71, 78 | 2282 |
| E | 81–92 | 1140 |
Table 4. Performance comparison (%) of different input features on the test set.

| Feature | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Mel | 90.24 | 91.03 | 90.97 | 90.99 |
| MFCCs | 92.71 | 92.35 | 92.75 | 92.54 |
| Time domain | 96.84 | 96.85 | 96.84 | 96.84 |
Table 5. Comparison of 1DCTN with other lightweight underwater acoustic target recognition (UATR) models.

| Model | Feature | Acc (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| CRNN-9 [12] | Mel | 91.4 | 0.95 | 2.57 |
| ResNet18 [14] | CCTZ | 94.9 | 0.33 | 0.11 |
| Autoencoder-decoder [6] | RSSD | 93.3 | 0.18 | 0.41 |
| AMNET-N [16] | STFT | 92.2 | 0.51 | 0.14 |
| 1DCTN | Time-Domain | 96.8 | 0.45 | 0.27 |

