A Novel Deep Learning Method for Underwater Target Recognition Based on Res-Dense Convolutional Neural Network with Attention Mechanism

Jin, Anqi; Zeng, Xiangyang

doi:10.3390/jmse11010069


by Anqi Jin and Xiangyang Zeng *

School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2023, 11(1), 69; https://doi.org/10.3390/jmse11010069

Submission received: 16 December 2022 / Revised: 25 December 2022 / Accepted: 29 December 2022 / Published: 2 January 2023

(This article belongs to the Section Ocean Engineering)


Abstract:

Long-range underwater targets must be accurately and quickly identified for both defense and civil purposes. However, the performance of an underwater acoustic target recognition (UATR) system can be significantly affected by factors such as lack of data and ship working conditions. As the marine environment is very complex, UATR relies heavily on feature engineering, and manually extracted features are occasionally ineffective in the statistical model. In this paper, an end-to-end UATR model based on a convolutional neural network and an attention mechanism is proposed. Using raw time-domain data as input, the network model combines residual neural networks and densely connected convolutional neural networks to take full advantage of both. On this basis, a channel attention mechanism and a temporal attention mechanism are added to extract the information in the channel dimension and the temporal dimension. Experiments on a measured dataset of ship-radiated noise from four types of ships show that the proposed method achieves the highest correct recognition rate, 97.69%, under different working conditions and outperforms other deep learning methods.

Keywords:

underwater acoustic target recognition; deep learning; attention mechanism; dense convolutional neural network; working condition

1. Introduction

With the development of science and technology, underwater acoustic target recognition (UATR) is widely used in marine economics and military activities. UATR technology is an information processing technology that uses passive target radiation noise received by sonar, active target echoes, and other sensor information to extract target features and identify target types or ship types [1,2,3]. It is inevitable that ships will radiate noise to the surrounding environment when sailing in the ocean. Ship-radiated noise, as the main analysis object of underwater acoustic technology, contains rich ship target feature information. Traditional underwater acoustic target recognition technology is limited to manual feature extraction using expert knowledge and shallow classifiers [4,5,6]. Handcrafted features can be classified into time-domain features, frequency-domain features, time-frequency domain features, and time-domain modulation features [7,8,9]. However, due to the strong time-varying and non-Gaussian nature of underwater acoustic signals, object classification performance based on traditional feature frameworks is poor, generalization is limited, and processing efficiency is low.

Deep learning, with its strong autonomous data processing and feature learning capabilities, has attracted widespread attention since its initial proposal [10,11,12,13]. In recent years, deep learning, as a brain-inspired machine learning method, has become a research hotspot in the field of UATR. Kamal et al. [14] used deep belief network (DBN) structures for underwater target recognition and obtained a maximum correct recognition rate of 90.23% for a classification problem with 40 classes. By comparing convolutional neural networks with conventional recognition methods, Ferguson et al. [15] found that convolutional neural networks recognize surface targets in shallow waters more robustly and over a wider range. Muhammad Irfan et al. [16] achieved excellent results on a real dataset from Ocean Networks Canada by modeling multiple features with separable convolutional autoencoder networks. Inspired by the human ear’s perception of timbre, Yang et al. [17,18] proposed a series of auditory timbre perception convolutional neural networks. Good recognition results were achieved by modeling the frequency content of the time-domain signal with one-dimensional convolutional kernels, followed by feature extraction using two-dimensional convolution. To solve the degradation problem of deep networks, He et al. [19] proposed the ResNet model, which researchers [20,21,22] introduced into UATR with good recognition results. Doan et al. [23] introduced DenseNet into UATR using time-domain data as network input and achieved higher recognition accuracy than other networks at low signal-to-noise ratios. Unlike the general single-target recognition task, Sun et al. [24] used ResNet and DenseNet for the multi-target recognition task; their method can effectively identify synthetic multi-target ship signals when multiple spectra are used as network inputs. Gao et al. [25] combined a deep convolutional generative adversarial network (DCGAN) and a densely connected convolutional network (DenseNet) to perform classification using the generated samples, which effectively alleviated the problem of insufficient underwater acoustic data.

The attention mechanism in deep neural networks is similar to the human attention mechanism: its core goal is to select, from the many pieces of information available, the information that is most critical to the current task. The attention mechanism plays a crucial role in improving the learning ability of neural networks [26,27,28]. A representative example is SENet [29], a channel-level attention mechanism that learns a weight for each feature map through the loss function and uses the weights to recalibrate the importance of the feature maps. In UATR, Xue et al. [30] proposed a deep learning method with two layers of channel attention that learns the intrinsic characteristics of the target and can effectively distinguish targets under different working conditions. The S-ResNet model proposed in [31] achieves a good balance between classification accuracy and model complexity, and its validity was demonstrated on five different types of underwater target data collected from sea trials and lake tests. These UATR methods use only channel attention, which has certain limitations. Transformer models based on the self-attention mechanism have also been introduced into UATR. Peng et al. [32] used the Fbank spectrum, MFCC parameters, and STFT spectrum as network inputs and cut the spectrum into many blocks to fit the transformer network; using pre-trained parameters and data augmentation, they demonstrated that the transformer model can be applied to UATR. Sheng et al. [33] used a transformer network to extract the global and local features of the spectrogram and validated the feasibility of the method on several public data sets. The transformer model requires a large amount of data, and as more data are measured, the model will become more suitable for UATR.

Ship-radiated noise in the actual marine environment is affected by multiple factors, including environmental noise and the ocean channel. During operation, a ship exhibits various working conditions according to the needs of the actual situation. Different working conditions lead to differences in the radiated noise, which affect the accuracy of target identification. To solve this problem, this paper combines the advantages of ResNet and DenseNet and proposes a new network called Channel-Temporal Attention Res-DenseNet (CTA-RDnet).

In this paper, we propose a new UATR method that uses temporal and channel attention mechanisms to extract ship-radiated noise features. The main contributions of this paper are as follows:

A new end-to-end network called Channel-Temporal Attention Res-DenseNet (CTA-RDnet) is constructed. It uses a residual module to ensure that the network does not degrade and densely connected modules to reuse features multiple times.
The raw time-domain underwater acoustic data are used as input. Channel attention captures the importance of different channel features for the current task, and temporal attention captures the importance of individual time points in the feature sequence, yielding features with a high generalization ability.
Ablation studies and comprehensive experiments on measured ship-radiated noise datasets with four working conditions validate the effectiveness of the proposed network model.

The structure of the article is as follows: Section 2 provides an overview of the theory of ResNet and DenseNet. Section 3 describes the CTA-RDNet-based UATR method in detail. Section 4 presents the experimental methods and results on the real data set. Section 5 provides a discussion and summary.

2. Related Work

2.1. ResNet Model

When the number of layers in a model increases, network performance can degrade rather than improve. To address this degradation problem, Kaiming He’s team proposed the ResNet model [19]. Unlike ordinary neural networks, ResNet is built from residual units that use cross-layer connections (also called shortcut connections). As shown in Figure 1, a residual unit contains two types of connections: a nonlinear mapping connection, as in an ordinary neural network, and a bypassing shortcut connection. The shortcut generally bypasses 2 to 3 layers, but it can span more layers if necessary. If the input of a residual unit is denoted as $x$, the nonlinear mapping as $F(\cdot)$, and the result computed by the residual unit as $H(x)$, then with the cross-layer shortcut connection added their relationship is

$$H(x) = F(x) + x \tag{1}$$

where $F(x)$ is called the residual mapping, which is learned over the training iterations of the model.

The residual unit used in this paper is a bottleneck unit with the structure shown in Figure 2. The convolutional kernel in its middle layer has size 3 × 1, and the kernels on both sides have size 1 × 1; this shape resembles the neck of a bottle, hence the name bottleneck unit.
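The relationship H(x) = F(x) + x can be illustrated with a small NumPy sketch. The two-layer mapping used for F, the weight shapes, and the random inputs are illustrative assumptions and not the bottleneck unit of Figure 2.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, w1, w2):
    """Compute H(x) = F(x) + x with a toy two-layer residual mapping F."""
    f = w2 @ relu(w1 @ x)  # F(x): hypothetical nonlinear mapping
    return f + x           # the shortcut connection adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
h = residual_unit(x, w1, w2)

# With the weights of F set to zero, the unit reduces to the identity mapping.
h_identity = residual_unit(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

Because the shortcut passes x through unchanged, a unit whose residual mapping contributes nothing still returns its input, which is one way to see why adding such units need not make a deeper network worse.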

2.2. DenseNet Model

The DenseNet model was proposed by Dr. Huang’s team, building on the ResNet model [34]. Usually, to enhance recognition performance and generalization ability, a model is designed to be “wider,” whereas the DenseNet model focuses on the features themselves. Its dense connection structure makes the feature maps of each hidden layer directly available to all subsequent layers, so features are reused multiple times while redundant feature extraction is avoided. The DenseNet model consists of a preprocessing module, densely connected modules, transition layers, and finally a classification module, where the densely connected module is the core of the model. In the densely connected module, each feature map is reused multiple times, which largely avoids learning redundant features, so the network can be trained with fewer convolutional kernels, greatly reducing the number of training parameters and speeding up model training. The structure of a densely connected module is shown in Figure 3. A densely connected module with L layers has a total of L(L + 1)/2 connections: each layer’s input is the concatenation of the feature maps output by all previous layers, and each layer passes its own feature maps to every subsequent layer.

In the densely connected module, the feature map $x_{l}$ output by the $l$-th layer is calculated from the feature maps $x_{0}, x_{1}, \dots, x_{l-1}$ of all preceding layers as

$$x_{l} = H_{l}([x_{0}, x_{1}, \dots, x_{l-1}]) \tag{2}$$

In the densely connected module, assume that each nonlinear operation $H_{l}$ outputs $k$ feature maps. The number of feature maps input to layer $l$ is then $k_{0} + k(l - 1)$, where $k_{0}$ is the number of input channels of the module and $k$ is called the growth rate. In general, the hyperparameter $k$ tends to be small, which makes the network model “narrower”, thus reducing the number of parameters needed to train the DenseNet model while preserving the recognition performance of the network as much as possible.
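The channel bookkeeping of a densely connected module can be checked with a few lines of Python. This is only a sketch: the values k0 = 64, k = 12, and four layers are hypothetical and are not taken from Table 1.

```python
def dense_block_channels(k0, k, num_layers):
    """Return the channel count seen by each layer and the block's output channels."""
    # Layer l (1-indexed) receives the module input plus the k maps of each earlier layer.
    inputs = [k0 + k * (l - 1) for l in range(1, num_layers + 1)]
    total_out = k0 + k * num_layers  # concatenation of the input and all layer outputs
    return inputs, total_out

def dense_block_connections(num_layers):
    """Number of direct connections in a block with num_layers layers: L(L + 1)/2."""
    return num_layers * (num_layers + 1) // 2

inputs, total_out = dense_block_channels(k0=64, k=12, num_layers=4)
connections = dense_block_connections(4)
```

With a small growth rate the channel count grows only linearly with depth, which is the sense in which the network stays “narrow”.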

3. Materials and Methods

The core issue in UATR is feature extraction. Good features should generally have two characteristics: strong generalization ability and invariance to different working conditions of the target. With the help of the residual structure, the ResNet model can extend a neural network to hundreds of layers while alleviating the vanishing gradient problem, so that deep network models can still be trained well. The DenseNet model uses dense connections to make the most of every feature extracted by the network and, because it has fewer parameters than the ResNet model, is easier to train. The attention mechanism can effectively improve the efficiency of feature extraction.

Building on these models, this section proposes and describes the CTA-RDNet model for the UATR task.

3.1. The Structure of CTA-RDNet

CTA-RDNet performs well in extracting ship-radiated noise features because it combines ResNet and DenseNet and adds a channel attention mechanism and a temporal attention mechanism. The structure of CTA-RDNet is shown in Figure 4. The entire network consists of three modules: a preprocessing module, a feature extraction module, and a classification module. The preprocessing module consists of a 3 × 1 convolutional layer and a 3 × 1 maximum pooling layer. The feature extraction module consists of a residual module, three attention layers, three dense blocks, and three transition layers to fully extract the deep features of ship-radiated noise. The classification module consists of an average pooling layer and a fully connected layer and performs classification with softmax. The attention layer is detailed in Section 3.2. The specific parameter settings for each layer of CTA-RDNet are shown in Table 1. The pseudo-code of the CTA-RDNet model is shown in Table 2.

From the structure of the CTA-RDNet model in Table 1, it can be seen that when the time-domain waveform of the underwater acoustic signal is input to the network, the data first pass through a convolutional layer, which increases the dimensionality of the feature map to ensure feature complexity and reduces the size of the feature map to half of its original size. The feature data then pass through a maximum pooling layer, which further compresses the feature map to half the size of the previous layer’s output to reduce the computational effort. Next, the feature data enter a residual block consisting of bottleneck units, which changes the number of channels of the feature map but not its size, in order to extract sufficient and efficient features; placing the residual block in a shallow layer of the network facilitates gradient propagation. An attention layer consisting of channel attention and temporal attention is then applied to obtain important channel and temporal information. After that, the feature data enter an intermediate stage consisting of three transition layers alternating in series with three densely connected modules, which also contains two attention layers. Considering the characteristics of the attention layers, we place these two attention layers after transition layers to filter out unimportant features, activate the effective features from the transition layers, and improve the utilization of the effective features.

The structure of the dense module and the transition module is shown in Figure 5. In the transition layer, the size of the feature map is compressed to half of the original size to reduce the computation. In the densely connected module, the dense connections make full use of the features extracted by the previous layers, and the module does not change the size of the feature map. The feature data then pass through an average pooling layer, which integrates all the previously extracted features and compresses the size of the feature map to 1 × 1 to reduce the computational effort. Finally, the feature data pass through a classification layer that uses the softmax function as the activation function to complete the underwater acoustic target classification task.
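The size changes described above can be followed with a short Python trace. It is a sketch under stated assumptions: the input length of 4800 samples is hypothetical, the exact order of dense blocks, transition layers, and attention layers is a guess (Table 1 and Figure 4 are authoritative), and each downsampling stage is assumed to halve the length exactly.

```python
def trace_lengths(input_length):
    """Return (stage, length) pairs for the temporal size of the feature map."""
    # (stage name, downsampling factor); a factor of 1 leaves the length unchanged.
    stages = [
        ("conv", 2),
        ("max_pool", 2),
        ("residual_block", 1),
        ("attention_1", 1),
        ("dense_1", 1), ("transition_1", 2), ("attention_2", 1),
        ("dense_2", 1), ("transition_2", 2), ("attention_3", 1),
        ("dense_3", 1), ("transition_3", 2),
    ]
    trace, length = [], input_length
    for name, factor in stages:
        length //= factor
        trace.append((name, length))
    trace.append(("global_avg_pool", 1))  # average pooling collapses the time axis
    return trace

trace = trace_lengths(4800)  # hypothetical input length
```

Five halving stages reduce the length by a factor of 32 before the final pooling, regardless of where the length-preserving blocks sit between them.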

3.2. Attention Mechanism Structure in CTA-RDNet

In our daily lives, our brain receives a large amount of input through various senses, which are of different types and carry different semantic meanings, and our brain is able to process them in an orderly manner. In this process, attention mechanisms play a crucial role in the complex cognitive functions of our brain, which enable the selection of information through the perception of associations between information and events, allowing our brain to “consciously” process and respond to information. The common ones are visual attention, auditory attention, and verbal attention.

When we encounter a scene in our daily lives, such as hearing a sound or seeing a picture, we focus on the important areas and process them quickly. This process [35,36] can be expressed as

$$\mathrm{Attention} = f(k(x), x) \tag{3}$$

Here, $k(x)$ denotes the generation of attention, which corresponds to the process of focusing on important regions, and $f(k(x), x)$ processes the input $x$ based on the attention $k(x)$, which corresponds to processing the critical areas and acquiring information. Almost all existing attention mechanisms, for example the self-attention model [27], can be written in this form. Attention mechanisms are widely used in a variety of information processing and analysis tasks, such as speech signals [37,38] and visual signals [39,40], where they greatly improve the nonlinear representation ability of neural networks and the extraction and abstraction of high-level semantics.

Inspired by previous studies, this paper proposes a channel-temporal attention mechanism in which the two types of attention are applied in series, channel attention first and temporal attention second. This subsection details the two attention mechanisms used by the model.

3.2.1. Channel Attention

A signal passing through multiple convolution kernels produces multiple channels of features. The channel attention mechanism automatically learns the importance of each feature channel of the ship-radiated noise data and assembles these values into a weight matrix. The larger the weight, the more important the corresponding channel and the higher its relevance to the key information. Within a channel, the same weight is applied at every point in the time dimension, which allows the neural network to focus on certain feature channels. In simple terms, the information in each channel is aggregated by average pooling and then passed through two fully connected layers that extract the correlation between channels and construct the attention information, so that the characteristics of underwater acoustic signals at the channel level can be obtained. The structure of the channel attention module is shown in Figure 6.

For one-dimensional underwater acoustic time-domain data, applying channel attention to the features of different signal segments allows adaptive weighting for information filtering, highlighting the features that carry important information and suppressing the invalid ones. First, an underwater acoustic feature map $H = [h_{1}, h_{2}, \dots, h_{C}]$ of size $C \times L$ is input, where $C$ is the number of channels and $L$ is the feature length. Global average pooling then compresses the one-dimensional feature of length $L$ in each channel into a real number, giving a global feature $v$ of size $C \times 1$, whose $c$-th element $v_{c}$ can be expressed as

$$v_{c} = F_{sq}(h_{c}) = \frac{1}{L} \sum_{i=1}^{L} h_{c}(i) \tag{4}$$

The correlation between channels is modeled by two fully connected layers forming a bottleneck structure to generate the weights $s$:

$$s = F_{ex}(v, W) = \sigma(g(v, W)) = \sigma(W_{2}\, \delta(W_{1} v)) \tag{5}$$

where $W_{1} \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_{2} \in \mathbb{R}^{C \times \frac{C}{r}}$, with $r$ the reduction ratio. To reduce model complexity and enhance generalization, the first fully connected layer performs dimensionality reduction and is followed by the ReLU activation function $\delta$; the second fully connected layer restores the original channel dimension, and the Sigmoid activation function $\sigma$ produces normalized weight values between 0 and 1.

Finally, the features of each channel are weighted with the corresponding weights to output a weighted feature map:

$$h_{c}' = F_{scale}(h_{c}, s_{c}) = s_{c} h_{c} \tag{6}$$

Thus, we get the features after attention.
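Equations (4) to (6) can be sketched in NumPy as follows. The sizes C = 16, L = 100, and r = 4 and the random weights are hypothetical, biases are omitted, and W1 is taken with shape (C/r, C) and W2 with shape (C, C/r) so that the product W2 δ(W1 v) is defined; this is an illustration, not the trained module of Figure 6.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(H, W1, W2):
    """H: (C, L) feature map; W1: (C//r, C); W2: (C, C//r)."""
    v = H.mean(axis=1)                          # Eq. (4): global average pooling -> (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ v, 0.0))   # Eq. (5): FC, ReLU, FC, Sigmoid -> (C,)
    return s[:, None] * H, s                    # Eq. (6): one weight per channel

rng = np.random.default_rng(0)
C, L, r = 16, 100, 4  # hypothetical sizes
H = rng.standard_normal((C, L))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
H_prime, s = channel_attention(H, W1, W2)
```

Each channel is multiplied by a single scalar, so the weight is identical at every time point of that channel.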

3.2.2. Temporal Attention

The ship-radiated noise signal is a time series signal, and the input to the network is a one-dimensional time-domain signal. Learning from the original waveform enables the network to exploit the fine time structure of the signal and obtain more information. The channel attention mechanism attends to the “what”: which channel features are meaningful. We use temporal attention to attend to the “when”: which time points of the features are important. The structure of the temporal attention module is shown in Figure 7. The temporal attention mechanism acts on the time dimension: it keeps the time dimension unchanged and compresses the channel dimension. For a two-dimensional plane, one weight is learned for each pixel, and a weight matrix is obtained for the H × W feature map; for a one-dimensional underwater acoustic sequence, one weight is learned at each time point, and the mechanism can therefore be referred to as the “temporal attention mechanism”.

A weight matrix is obtained for a feature map of length $L$, and the weights are the same across the $C$ channels. In the 1D temporal attention module, the feature map $H'$ output from the channel attention module is used as input. Maximum pooling and average pooling along the channel dimension give two $1 \times L$ feature descriptors, which are concatenated along the channel dimension to obtain a $2 \times L$ feature. A one-dimensional convolution reduces this to one channel, the sigmoid activation function then yields the temporal attention weight $M_{s}$, and finally the weighted features $H''$ are obtained by multiplying $M_{s}$ element-wise with $H'$:

$$M_{s}(H') = \sigma(f^{7}([\mathrm{AvgPool}(H'); \mathrm{MaxPool}(H')])) = \sigma(f^{7}([H_{avg}^{\prime s}; H_{\max}^{\prime s}])) \tag{7}$$

$$H'' = M_{s}(H') \otimes H' \tag{8}$$

where $\sigma$ is the Sigmoid activation function, $f^{7}$ denotes a 1D convolution with a kernel size of 7, and $\otimes$ denotes element-wise multiplication.

Applying channel attention and temporal attention in sequence allows the temporal and cross-channel relationships of the features to be used to tell the network what to focus on and when to focus. More specifically, this combination emphasizes useful channels and time points and enhances the informativeness of underwater acoustic signal features, so as to obtain the best possible features given the limited samples and changing working conditions.
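Equations (7) and (8) can be sketched in NumPy in the same spirit. The feature map size, the random kernel, the zero ("same") padding, and the absence of a bias term are assumptions made for illustration; only the kernel size of 7 comes from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_attention(H, kernel):
    """H: (C, L) feature map; kernel: (2, 7) weights of the 1D convolution f^7."""
    # Average and max pooling over the channel axis give two descriptors of length L.
    pooled = np.stack([H.mean(axis=0), H.max(axis=0)])  # (2, L)
    # Zero-padded cross-correlation per descriptor, summed into a single channel.
    conv = sum(np.correlate(pooled[c], kernel[c], mode="same") for c in range(2))
    M = sigmoid(conv)                                    # Eq. (7): one weight per time point
    return M[None, :] * H, M                             # Eq. (8): shared across channels

rng = np.random.default_rng(0)
H_prime = rng.standard_normal((16, 100))  # hypothetical output of channel attention
kernel = rng.standard_normal((2, 7)) * 0.1
H_double_prime, M = temporal_attention(H_prime, kernel)
```

Here the weight varies along time and is shared by all channels, the mirror image of the channel attention case.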

4. Experiment and Analysis

4.1. Data Preparation

The ship-radiated noise dataset used in this paper was collected by our group in September 2018. The data set contains four different types of targets, as shown in Figure 8. The names of the four ships are New Century, Shu hang, Guo tai, and Tin Ship. Two hydrophone line arrays were used to collect the radiation noise signal of the ship during navigation; each line array contains eight elements. The real-time position of the ship is recorded by GPS to facilitate the determination of the ship’s working status when performing data processing.

The receiving array (submersible marker) is placed in the east–west ferry channel and the north–south sand carrier channel side. It is easy to collect and obtain the radiation noise of various ships. The ship’s sailing trajectory is shown in Figure 9 below. There are two formations in the diagram: A and B. The ships start from the starting point indicated by the pentagram and sail along the numbers 1, 2, 3, 4, 5, 6, 7, 8. There are four ships in operation.

Even when the same underwater acoustic target navigates in the same marine environment, different working conditions can lead to different radiated noise, to the extent that the accuracy of UATR is affected. The complexity of the marine environment also increases the complexity of the working and generation conditions. Therefore, it is important to study the differences in the radiated noise of underwater acoustic targets under typical working conditions and thus extract robust features to improve the performance of the recognition system. In this paper, we choose straight-line running, turning, acceleration, and deceleration as the typical conditions for the study, and the measured data are labeled to obtain the noise under the four working conditions. Straight-line running means going straight ahead at a constant speed. The data used in this paper are all from the same array element. Details of the data are shown in Table 3.

If features are extracted directly from the acoustic signal data collected by the sonar and then used for classification and identification, environmental factors associated with the signal will affect the features and the desired effect will not be achieved. Therefore, the data must be preprocessed to make the information in the sample data more prominent and to reduce the impact on feature extraction. Since the original datasets are all measured at sea, some data suffer from problems such as excessive noise and blank segments. First, poorly captured audio is manually eliminated, and the blank segments left during capture are removed from the remaining audio. Since the audio still contains considerable noise, it is also subjected to adaptive noise reduction to facilitate the subsequent feature extraction. The experiments involve four types of targets, each under four working conditions (straight line, turning, acceleration, and deceleration), for a total of 16 radiated noise segments, each lasting 10 min. Each frame lasts 0.1 s with no frame overlap, so each target contains 6000 samples for each condition. Because the receiving distance varies widely between signals, the gradient values would be affected during backpropagation, which hinders model convergence, so the data are normalized. There are a total of 24,000 samples for each working condition and 24,000 samples for each target. The total number of samples is 96,000.
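The framing arithmetic above can be reproduced with a short NumPy sketch. The sampling rate of 1000 Hz and the synthetic noise are hypothetical stand-ins for the measured data, and per-frame zero-mean, unit-variance scaling is only one possible reading of the normalization step, which the text does not specify.

```python
import numpy as np

def frame_signal(signal, fs, frame_seconds=0.1):
    """Split a 1D signal into non-overlapping frames of frame_seconds each."""
    frame_len = int(round(fs * frame_seconds))
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def normalize_frames(frames, eps=1e-8):
    """Scale each frame to zero mean and unit variance (an assumed scheme)."""
    mean = frames.mean(axis=1, keepdims=True)
    std = frames.std(axis=1, keepdims=True)
    return (frames - mean) / (std + eps)

fs = 1000  # hypothetical sampling rate in Hz
segment = np.random.default_rng(0).standard_normal(fs * 600)  # one 10 min segment
frames = normalize_frames(frame_signal(segment, fs))

n_targets, n_conditions = 4, 4
total_samples = n_targets * n_conditions * frames.shape[0]
```

One 10 min segment yields 6000 frames, and the 16 target-condition segments together give the 96,000 samples stated above.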

4.2. Experimental Settings

To validate the results, we compare three models (simple CNN, ResNet, and DenseNet) with the proposed CTA-RDnet. The settings for each model are shown below:

(1): The CTA-RDnet model is set up as described in the previous section. For all neural network models, the loss function is the cross-entropy loss, the optimizer is Adam, the learning rate is 0.0001, the batch size is 128, and the number of training epochs is 100.
(2): The simple CNN model consists of two convolutional layers, two pooling layers, and a fully connected layer.
(3): The ResNet model consists of a convolutional layer and a pooling layer at the input, four residual blocks in the middle, and a final classification layer in series with each other.
(4): The DenseNet model consists of a preprocessing layer at the input, a classification layer at the output, and four densely connected modules in the middle with alternating connections to the three transition layers.
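The shared training settings can be collected in a small configuration, together with a NumPy version of the softmax cross-entropy loss. This is a sketch only: the dictionary keys are hypothetical names, and the loss function is a generic textbook implementation rather than the code used in the experiments.

```python
import numpy as np

# Settings stated for all models; the key names are illustrative.
TRAIN_CONFIG = {"optimizer": "Adam", "learning_rate": 1e-4, "batch_size": 128, "epochs": 100}

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# A batch of uniform predictions over the four ship classes.
logits = np.zeros((TRAIN_CONFIG["batch_size"], 4))
labels = np.zeros(TRAIN_CONFIG["batch_size"], dtype=int)
loss = softmax_cross_entropy(logits, labels)
```

For an untrained four-class model that predicts uniformly, the loss equals ln 4, which is a useful sanity check at the start of training.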

All models were trained on a workstation configured with two NVIDIA 3090 GPUs, an Intel i9-10900K CPU, and 128 GB of memory.

4.3. Mixed Working Condition Experiment

4.3.1. Comparison of Different Methods

In the mixed working condition experiments, both the training and test sets contain data from all working conditions, so this is an overall experiment. Table 4 shows the recognition accuracy of the four deep learning models under mixed working conditions. The CTA-RDnet model achieved the highest recognition accuracy, 96.79%. This is 9.23 percentage points higher than the simple CNN model, 5.21 percentage points higher than DenseNet, and 5.76 percentage points higher than ResNet. The results of DenseNet are slightly better than those of ResNet. Figure 10 shows the confusion matrix of the recognition results. Since the Tin ship differs considerably from the other three types of ships in material and size, it has the best recognition result. The ROC curve is plotted from the classification results on the test dataset and is shown in Figure 11.
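The baseline accuracies implied by these margins can be recovered with simple arithmetic. The values below are derived only from the figures quoted in this paragraph, assuming the margins are absolute differences in accuracy; Table 4 remains the authoritative source.

```python
cta_accuracy = 96.79  # CTA-RDnet accuracy under mixed working conditions (%)
margins = {"simple CNN": 9.23, "DenseNet": 5.21, "ResNet": 5.76}

# Subtract each reported margin to obtain the implied baseline accuracy (%).
implied = {name: round(cta_accuracy - margin, 2) for name, margin in margins.items()}
```

The implied DenseNet accuracy is about half a percentage point above the implied ResNet accuracy, consistent with the statement that DenseNet performs slightly better.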

Overall, the recognition accuracy of the models decreases as the size of the training set decreases. The proposed CTA-RDnet model maintains a high recognition rate of 93.90% when the training set contains 16,000 samples, and the recognition rate drops sharply when the sample size decreases to 10,000. Figure 12 shows the recognition speed of the four networks. Owing to its simplicity, the simple CNN has the fastest recognition speed, but its recognition accuracy is not high. CTA-RDnet retains a comparatively fast recognition speed while achieving the highest recognition rate.

The CTA-RDnet model obtains good recognition accuracy under mixed working conditions; however, these results do not show how much the attention mechanism and the Res-Dense structure each contribute. Next, the effectiveness of the proposed attention mechanism is investigated using ablation experiments.

4.3.2. Ablation Experiments

In order to verify the effectiveness of the proposed attention mechanism, we designed four models for experiments, as follows:

(1): Only the residual blocks, dense blocks, and transition layers are retained, and all attention layers are removed, which is a simplified version of CTA-RDnet.
(2): Remove the temporal attention module while keeping only the channel attention module.
(3): Keep only the temporal attention module and remove the channel attention module.
(4): The complete CTA-RDnet model.

The details of the four models are shown in Table 5.
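The four ablation variants differ only in which attention modules are kept, which can be written down as a small configuration sketch. The class and function names are hypothetical; the variant names follow those used in the discussion of the results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationVariant:
    name: str
    channel_attention: bool
    temporal_attention: bool

VARIANTS = [
    AblationVariant("RDnet", False, False),     # (1) all attention layers removed
    AblationVariant("CA-RDnet", True, False),   # (2) channel attention only
    AblationVariant("TA-RDnet", False, True),   # (3) temporal attention only
    AblationVariant("CTA-RDnet", True, True),   # (4) complete model
]

def attention_stages(variant):
    """List the attention stages of a variant in the order they are applied."""
    stages = []
    if variant.channel_attention:
        stages.append("channel")
    if variant.temporal_attention:
        stages.append("temporal")
    return stages

stage_table = {v.name: attention_stages(v) for v in VARIANTS}
```

Keeping the backbone fixed and toggling one module at a time is what allows the accuracy differences to be attributed to each attention mechanism.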

The ablation study uses the mixed working condition experiment. The experimental results are shown in Table 6.

Compared to ResNet and DenseNet, RDnet has higher recognition accuracy. CA-RDnet performs better than RDnet, improving the recognition accuracy by 2.57%. TA-RDnet is also better than RDnet, improving the recognition accuracy by 1.38%. The proposed CTA-RDnet performs better than the other models in the comparison, with 96.79% recognition accuracy. In summary, the proposed channel attention mechanism and temporal attention mechanism each improve the performance of the model, and the fact that CTA-RDnet outperforms all the comparison models supports the effectiveness of the proposed channel-temporal attention module.

4.4. The Effect of Working Conditions on Model Performance

During operation, a ship works under various conditions according to actual needs, and its radiated noise changes with the working condition. In the straight-line condition, the engine output power remains essentially unchanged, the whole hull maintains a relatively stable state, and the radiated noise is also relatively stable. In the deceleration condition, the engine output power decreases, the resistance between the hull and the water surface increases, and the radiated noise decreases. In the acceleration condition, the engine power increases and the vibration and friction of various onboard equipment change, which leads to an increase in the radiated noise. In the turning condition, the attitude of the ship is very different from that in the other three conditions, and the radiated noise changes accordingly. This subsection compares the performance of the four neural network models on the measured data and investigates their recognition performance when the working conditions of the training and test data differ.

4.4.1. Matching of Working Conditions

Working condition matching means that the training and test data come from the same working condition, which is one of four types: straight-line, turning, acceleration, or deceleration. Figure 13 displays the recognition performance of the four neural network models (CNN, ResNet, DenseNet, and CTA-RDnet) under matched working conditions.
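The matched protocol can be sketched as a per-condition split: all training and test samples are drawn from one working condition. The Python sketch below uses placeholder sample records; the 80/20 split ratio, the fixed seed, and the `matched_split` helper are assumptions made for illustration and are not stated in the text.

```python
import random

CONDITIONS = ("straight-line", "turning", "acceleration", "deceleration")
SHIPS = ("Tin Ship", "Shu hang", "Guo tai", "New Century")


def matched_split(samples, condition, test_fraction=0.2, seed=0):
    """Split the samples of ONE working condition into train and test sets.

    `samples` is a list of (signal_id, ship, condition) tuples. The split
    ratio and the seed are assumptions made for this illustration.
    """
    subset = [s for s in samples if s[2] == condition]
    rng = random.Random(seed)
    rng.shuffle(subset)
    n_test = int(len(subset) * test_fraction)
    return subset[n_test:], subset[:n_test]


# Toy stand-in data: 10 placeholder samples per ship and condition.
toy = [(f"{ship}-{cond}-{i}", ship, cond)
       for ship in SHIPS for cond in CONDITIONS for i in range(10)]

train, test = matched_split(toy, "turning")
print(len(train), len(test))
```

Because both subsets are filtered by the same condition before splitting, any accuracy difference between conditions reflects how separable the radiated noise is in that condition rather than a distribution shift.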

The four neural network models (CNN, ResNet, DenseNet, and CTA-RDnet) can all successfully perform the UATR task under each of the four working conditions. When the ship runs in a straight line, the radiated noise is the most stable and the easiest to distinguish, and the correct recognition rate of each model is the highest. The correct recognition rate in the deceleration and turning conditions is slightly lower than in the straight-line condition, indicating that the radiated noise changes in these conditions but does not seriously affect classification. In the acceleration condition, the correct recognition rate of each model decreases considerably, which indicates that acceleration has a relatively large effect on the radiated noise. The reason is that as the speed of the ship increases, the engine power increases rapidly, resulting in a dramatic change in the radiated noise characteristics of the engine. At the same time, the vibration state of the equipment carried on board also changes, which has a greater impact on the overall radiated noise of the ship and thus reduces its separability.

The correct recognition rate of the CTA-RDnet model is about 5% higher than that of the ResNet and DenseNet models, indicating the superiority of the proposed model for the UATR task.

4.4.2. Mismatching of Working Conditions

This subsection discusses the recognition performance of the four neural network models (CNN, ResNet, DenseNet, and CTA-RDnet) when the working conditions of the training samples and the test samples differ, that is, under mismatched working conditions. Because the straight-line working condition gives the best recognition results and the acceleration working condition gives the worst, each of these two conditions is used in turn as the training set, with the remaining working conditions used as test sets. The experimental results are shown in Figure 14 and Figure 15.
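The mismatched protocol pairs one training condition with each of the other conditions as a test set. A minimal Python sketch of how those pairs can be enumerated follows; `mismatched_protocol` is a hypothetical helper, not part of the authors' code.

```python
CONDITIONS = ("straight-line", "turning", "acceleration", "deceleration")


def mismatched_protocol(train_condition):
    """Pair one training condition with every other condition as a test set."""
    if train_condition not in CONDITIONS:
        raise ValueError(f"unknown working condition: {train_condition!r}")
    return [(train_condition, c) for c in CONDITIONS if c != train_condition]


# The two training conditions chosen in the text.
for train_condition in ("straight-line", "acceleration"):
    for train_c, test_c in mismatched_protocol(train_condition):
        print(f"train on {train_c:<13} -> test on {test_c}")
```

Each training condition yields three train/test pairs, which is consistent with one figure per training condition.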

Figure 14 and Figure 15 show the results. In practice, ships operate under a variety of working conditions, so a mismatch between the training set and the test set is common. Under such a mismatch, the four neural network models (CNN, ResNet, DenseNet, and CTA-RDnet) can still complete the UATR task, but the correct recognition rate is lower than under matched working conditions. For example, when the training samples come from the straight-line condition and the test samples come from the acceleration condition, the correct recognition rate of each model decreases by up to 18.96% compared with testing on the straight-line condition. When the acceleration condition is used for training, the situation is alleviated: the correct recognition rate decreases by about 10% compared with testing on the acceleration condition. This indicates that under acceleration the state of the vessel itself changes greatly and the radiated noise becomes unstable and less distinguishable. When the training and test samples come from the straight-line and turning conditions, respectively, the correct recognition rate remains high and differs little from the matched straight-line case, which indicates that although the ship moves differently in the two conditions, its radiated noise characteristics are relatively similar. The CTA-RDnet model has the highest correct recognition rate, which indicates that it handles the working condition mismatch problem better than the other models.

4.5. Critical Analysis and Discussion

In this paper, a CTA-RDnet model is proposed and its performance is studied under different working conditions. Researchers have introduced various network models and attention mechanisms into UATR with good results. Related studies have focused on the effects of noise, data imbalance, and data time on UATR performance, with little attention paid to the effect of working conditions on recognition. This study combines ResNet and DenseNet with the attention module and achieves better recognition results than other deep learning methods under mixed, matched, and mismatched working conditions. In practice, the working conditions of a target change with its task, so the radiated noise of the same target changes as well. Underwater acoustic target recognition systems therefore often encounter inconsistencies between the working conditions contained in the training data and the current working conditions of the target, which can reduce the correct recognition rate in practical applications. The proposed CTA-RDnet is able to extract features that are robust to the working condition and improves the performance of the recognition system under mismatched working conditions. However, under certain mismatches, such as straight-line to acceleration, the proposed method is still not accurate enough, although it is more effective than the other methods. Subsequent studies will focus on mismatches involving the acceleration condition.

Table 7 lists the abbreviations used in the paper.

In practical applications, UATR performance is influenced not only by the working conditions but also by the marine environment. The underwater environment is highly complex and varies from location to location. In complex marine environments, the target radiated noise and its characteristics are highly susceptible to interference noise and channel variations. Underwater noise arises mainly from three sources: the underwater acoustic channel, the collection equipment, and the environment. In some cases, the marine environment has a greater impact on recognition performance than the target working conditions. Ideally, the model would identify targets in different sea areas without requiring data from those areas. For now, this is difficult, even when the same ship is sailing in different waters: data collected from the new sea area are needed to retrain the model; otherwise, the results may not be good. In future research, we will apply the model to different sea areas for identification.

5. Conclusions

This paper proposes CTA-RDnet for underwater acoustic target recognition and uses it to study ship-radiated noise under different working conditions. The model combines the advantages of ResNet and DenseNet and adds a channel-temporal attention mechanism on this basis. It improves recognition performance through feature reuse on the one hand and alleviates the gradient dispersion problem with the help of the residual structure on the other. The attention mechanism further improves the model's performance, and the ablation experiments demonstrate the validity of the proposed channel-temporal attention mechanism. The recognition performance of the model was investigated under mixed, matched, and mismatched working conditions. The experimental results show that the proposed CTA-RDnet achieves a recognition accuracy of up to 97.69% under different working conditions, outperforming several other deep learning models. CTA-RDnet can therefore provide effective recognition of underwater acoustic targets under multiple working conditions. In addition to the working conditions, the ocean channel also has a great impact on recognition performance. In future work, CTA-RDnet will be extended and validated on other datasets from different ocean areas.

Author Contributions

Conceptualization, A.J.; methodology, A.J.; software, A.J.; validation, A.J.; formal analysis, A.J.; investigation, A.J.; resources, A.J.; data curation, A.J.; writing—original draft preparation, A.J.; writing—review and editing, X.Z.; visualization, A.J.; supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant number 52271351.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal relationship that could appear to have influenced the work reported in this paper.


Figure 1. Residual unit.

Figure 2. Bottleneck Structure.

Figure 3. Densely connected structure.

Figure 4. Schematic diagram of the proposed CTA-RDnet.

Figure 5. Structure of dense blocks and transition layers. (a) Dense block structure. (b) Transition structure.

Figure 6. The structure of the channel attention.

Figure 7. The structure of the temporal attention.

Figure 8. Four types of targets. (a) Tin Ship. (b) Shu hang. (c) Guo tai. (d) New Century.

Figure 9. Ship trajectory map. The numbers 1–8 represent the order of the ship’s track.

Figure 10. The confusion matrix of the CTA-RDnet.

Figure 11. The ROC curve of the CTA-RDnet.

Figure 12. Prediction time of each model.

Figure 13. The recognition accuracy of different models under matching working conditions.

Figure 14. The recognition accuracy of different models under mismatching working conditions (straight-line).

Figure 15. The recognition accuracy of different models under mismatching working conditions (acceleration).

Table 1. Parameter settings for each layer of the CTA-RDnet.

Layer	Parameters of Each Layer
pre-processing layer	3 × 1 conv, 16, stride 2; 3 × 1 max pool, stride 2
residual block	$\begin{bmatrix} 1 \times 1\ \text{conv},\ 16 \\ 3 \times 1\ \text{conv},\ 16 \\ 1 \times 1\ \text{conv},\ 64 \end{bmatrix} \times 3$
attention layer1	channel attention; temporal attention
transition layer1	1 × 1 conv; 2 × 1 average pool, stride 2
dense block1	$\begin{bmatrix} 1 \times 1\ \text{conv} \\ 3 \times 1\ \text{conv} \end{bmatrix} \times 4$
transition layer2	1 × 1 conv; 2 × 1 average pool, stride 2
attention layer2	channel attention; temporal attention
dense block2	$\begin{bmatrix} 1 \times 1\ \text{conv} \\ 3 \times 1\ \text{conv} \end{bmatrix} \times 6$
transition layer3	1 × 1 conv; 2 × 1 average pool, stride 2
attention layer3	channel attention; temporal attention
dense block3	$\begin{bmatrix} 1 \times 1\ \text{conv} \\ 3 \times 1\ \text{conv} \end{bmatrix} \times 2$
classification layer	average pool; fully connected, softmax
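Table 1 contains five stride-2 stages (one convolution, one max pool, and three average pools), so the temporal length of the feature map is reduced by roughly a factor of 32 before the final average pooling. The Python sketch below tracks that reduction. It assumes that each stride-2 stage maps a length n to ceil(n / 2), since the padding is not specified in Table 1, and the input length of 4096 samples is a hypothetical value rather than one taken from the paper.

```python
import math

# Stride-2 stages listed in Table 1, in order of appearance.
STRIDE_2_STAGES = [
    "pre-processing conv",
    "pre-processing max pool",
    "transition layer1 average pool",
    "transition layer2 average pool",
    "transition layer3 average pool",
]


def temporal_lengths(input_length: int) -> list[tuple[str, int]]:
    """Track the sequence length after every stride-2 stage.

    Assumes each stage maps length n to ceil(n / 2); the padding scheme
    is an assumption, not a value given in Table 1.
    """
    lengths = []
    n = input_length
    for stage in STRIDE_2_STAGES:
        n = math.ceil(n / 2)
        lengths.append((stage, n))
    return lengths


# 4096 samples is a hypothetical input length used only for illustration.
for stage, n in temporal_lengths(4096):
    print(f"{stage:<32} {n}")
```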

Table 2. Pseudocode for the CTA-RDnet model.

The CTA-RDnet model
Input: x // input time-domain signal
Output: model
Begin
Read(x); // read the input x
Preprocess; // 3 × 1 conv + 3 × 1 max pooling
Shallow Feature Extraction; // three modules: one residual block, one attention layer, and one transition layer
Deep Feature Extraction; // seven modules: three dense blocks, two attention layers, and two transition layers
Average Pooling; // global average pooling
Classification; // fully connected layer, softmax
Create model;
Return model
End
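The module counts in the pseudocode comments can be cross-checked against the layer order in Table 1. The Python sketch below does only that bookkeeping; the grouping of Table 1 rows into the shallow and deep stages is inferred from the comments, and the names `SHALLOW`, `DEEP`, and `count_kind` are hypothetical.

```python
# Module order follows Table 1; the grouping follows the pseudocode comments.
SHALLOW = ["residual block", "attention layer1", "transition layer1"]
DEEP = [
    "dense block1", "transition layer2", "attention layer2",
    "dense block2", "transition layer3", "attention layer3",
    "dense block3",
]
PIPELINE = ["pre-processing layer", *SHALLOW, *DEEP, "classification layer"]


def count_kind(modules, kind):
    """Count the modules whose name starts with the given kind."""
    return sum(name.startswith(kind) for name in modules)


print(len(SHALLOW), len(DEEP))
print({k: count_kind(DEEP, k) for k in ("dense", "attention", "transition")})
```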

Table 3. Details of the dataset used.

Ship	Straight-Line	Turning	Acceleration	Deceleration	Sum
Tin Ship	6000	6000	6000	6000	24,000
Shu hang	6000	6000	6000	6000	24,000
Guo tai	6000	6000	6000	6000	24,000
New Century	6000	6000	6000	6000	24,000
Sum	24,000	24,000	24,000	24,000	96,000
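The row, column, and grand totals in Table 3 follow directly from 6000 samples per ship and working condition. A short Python check of those totals is given below; the variable names are illustrative only.

```python
SHIPS = ("Tin Ship", "Shu hang", "Guo tai", "New Century")
CONDITIONS = ("straight-line", "turning", "acceleration", "deceleration")
SAMPLES_PER_CELL = 6000  # every ship/condition cell in Table 3

counts = {(s, c): SAMPLES_PER_CELL for s in SHIPS for c in CONDITIONS}

per_ship = {s: sum(counts[s, c] for c in CONDITIONS) for s in SHIPS}
per_condition = {c: sum(counts[s, c] for s in SHIPS) for c in CONDITIONS}
total = sum(counts.values())

print(per_ship["Tin Ship"], per_condition["turning"], total)
```

The dataset is balanced across both ships and working conditions, so plain accuracy is a reasonable summary metric for the comparisons that use it.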

Table 4. Experimental results of different models.

Model	Accuracy (%)
Simple CNN	87.56
ResNet	91.13
DenseNet	91.58
CTA-RDnet	96.79
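Another way to read Table 4 is in terms of error rate (100 minus accuracy), which shows how much of each baseline's remaining error CTA-RDnet removes. The Python sketch below performs this derived calculation; the relative error reduction is not a figure reported in the paper, and `relative_error_reduction` is a hypothetical helper.

```python
# Accuracies (%) copied from Table 4 (mixed working conditions).
MIXED_ACCURACY = {
    "Simple CNN": 87.56,
    "ResNet": 91.13,
    "DenseNet": 91.58,
    "CTA-RDnet": 96.79,
}


def error_rate(model: str) -> float:
    """Error rate in percent, i.e. 100 minus the accuracy."""
    return 100.0 - MIXED_ACCURACY[model]


def relative_error_reduction(baseline: str, model: str = "CTA-RDnet") -> float:
    """Fraction of the baseline's error that the given model removes."""
    return (error_rate(baseline) - error_rate(model)) / error_rate(baseline)


for baseline in ("Simple CNN", "ResNet", "DenseNet"):
    print(f"{baseline}: {relative_error_reduction(baseline):.1%} fewer errors")
```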

Table 5. Four models for the ablation experiment.

Model	Channel Attention	Temporal Attention
Res-Densenet (RDnet)	×	×
CA-RDnet	√	×
TA-RDnet	×	√
CTA-RDnet	√	√

Table 6. Experimental results of the ablation model.

Model	Accuracy (%)
Res-Densenet (RDnet)	93.04
CA-RDnet	95.61
TA-RDnet	94.42
CTA-RDnet	96.79

Table 7. List of abbreviations.

Abbreviation	Full Name
UATR	Underwater Acoustic Target Recognition
CTA-RDnet	Channel-Temporal Attention Res-DenseNet
CNN	Convolutional Neural Network

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

