**3. Architecture of camResNet in Underwater Acoustic Target Recognition Method**

The camResNet model is excellent for extracting classification-related feature information because it adds the channel attention mechanism based on the ResNet model. The process of the camResNet model includes three steps: feature structure building, feature extraction, and feature classification, as shown in Figure 2.

**Figure 2.** The architecture of the camResNet model.

*3.1. Architecture of camResNet*

The low-dimensional underwater acoustic signal limits the ability of convolution networks to extract high-dimensional abstract features. So, the feature structure building module decomposes the input acoustic signal into base signals using a set of one-dimensional convolutions as deep convolution filters, which can obtain high-dimensional input data. N different convolution kernels are set in the deep convolution filters *F*(*F*<sub>1</sub>, *F*<sub>2</sub>, · · ·, *F<sub>N</sub>*), and each convolution layer contains a two-dimensional convolution kernel. The output of the feature module contains 16 groups of signals, so 16 one-dimensional convolution layers are needed. The specific formula is as follows:

$$y\_i^m = f(\mathbf{x}^m \times \boldsymbol{\omega}\_i^m + b\_i^m) \tag{3}$$

where *x<sup>m</sup>* is the *m*-th input sample, *ω<sup>m</sup><sub>i</sub>* denotes the convolution kernel of the *i*-th output channel of the *m*-th sample, *b<sup>m</sup><sub>i</sub>* denotes the bias function of the *i*-th output channel of the *m*-th sample, and *y<sup>m</sup><sub>i</sub>* is the *i*-th channel output value of the *m*-th sample. The symbol × means dot product. Finally, the output feature group of the *i*-th layer is *y<sup>m</sup><sub>i</sub>*, formed through the ReLU function *f*(·).
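A minimal NumPy sketch of Eq. (3) may clarify the feature structure building step; the filter length of 64, the zero biases, and the random weights are illustrative assumptions, not the trained values:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def feature_structure_building(x, kernels, biases):
    """Decompose a 1-D signal into one base signal per filter, following
    Eq. (3): y_i = f(x * w_i + b_i) with f = ReLU.

    x       : (L,)   input acoustic frame
    kernels : (C, K) one kernel per output channel (C = 16 in the paper)
    biases  : (C,)   one bias per output channel
    returns : (C, L - K + 1) high-dimensional representation
    """
    C, K = kernels.shape
    out_len = x.shape[0] - K + 1
    y = np.empty((C, out_len))
    for i in range(C):
        for t in range(out_len):
            y[i, t] = x[t:t + K] @ kernels[i]  # dot product, the "x" in Eq. (3)
        y[i] += biases[i]
    return relu(y)

rng = np.random.default_rng(0)
frame = rng.standard_normal(800)          # one 0.1 s frame (800 points)
w = rng.standard_normal((16, 64)) * 0.1   # 16 filters of length 64 (assumed size)
b = np.zeros(16)
base_signals = feature_structure_building(frame, w, b)
print(base_signals.shape)                 # (16, 737)
```

The 16 rows are the 16 groups of base signals that form the high-dimensional input to the feature extraction module.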

The number and frequency of the spectrum are the primary basis for underwater acoustic signal target recognition. The spectrum energy that will shift with the change of distance between the target and the hydrophone is called unstable spectra. The spectrum energy that will not shift with the change of distance between the target and the hydrophone is called stable spectra. The camResNet model can extract the stable spectrum of the underwater acoustic target as the feature to recognize the target category accurately when the spectra of the target are shifted due to the Doppler effect. The stable spectra contain many harmonic signals. The fundamental frequency is the shaft frequency signal of the propeller, and the relationship of the harmonic groups is the multiplier. For a *B*-bladed propeller, every *B* pulses form a set with period *T*, and the repetition period of the pulses is *T*/*B*.
The (2*N* + 1)-th set of pulses in the time domain signal is selected, and its *k*-th Fourier transform is denoted as *F*<sup>(*k*)</sup><sub>*N*</sub>(*ω*). The specific formula of the power spectral density of this random process is as follows [28]:

$$\begin{aligned} \mathbf{s}(\omega) &= E\{\mathbf{s}\_k(\omega)\} = E\left\{ \lim\_{N \to \infty} \frac{1}{(2N+1)T} \left| F\_N^{(k)}(\omega) \right|^2 \right\} \\ &= \left| \mathbf{g}(\omega) \right|^2 \left\{ (2N+1)(\mathcal{U}\_1 - \mathcal{U}\_2) + \sum\_{p=-2N}^{2N} (2N+1-|p|) \cos \omega p T [\mathcal{U}\_2 + \mathcal{U}\_3 \cdot 2 \cos \omega (T/2) + \mathcal{U}\_4 \cdot \cos \omega (T/4)] \right\} \end{aligned} \tag{4}$$

where *E*{·} is the expected value, *ω* denotes the angular frequency, and *g*(*ω*) is the Fourier spectrum representing the time domain waveform. The specific formulas of the *U* terms are as follows:

$$\begin{cases} \mathcal{U}\_1 = \overline{a\_0^2} + \overline{a\_1^2} + \overline{a\_2^2} + \overline{a\_3^2} \\ \mathcal{U}\_2 = \overline{a}\_0^2 + \overline{a}\_1^2 + \overline{a}\_2^2 + \overline{a}\_3^2 \\ \mathcal{U}\_3 = \overline{a}\_0 \cdot \overline{a}\_2 + \overline{a}\_1 \cdot \overline{a}\_3 \\ \mathcal{U}\_4 = \overline{a}\_0 \cdot \overline{a}\_1 + \overline{a}\_1 \cdot \overline{a}\_2 + \overline{a}\_2 \cdot \overline{a}\_3 + \overline{a}\_3 \cdot \overline{a}\_0 \end{cases} \tag{5}$$

where *a<sub>i</sub>* denotes the amplitude of pulse number *i* in a set of signals, and *a<sub>i</sub>* with an overline denotes the average value of *a<sub>i</sub>*. The fundamental frequency and the first group of harmonic signals can be used as stable signal characteristics because the modulation spectrum of the actual vessel radiation noise decays rapidly with the increasing number of groups of spectra. The multidimensional information obtained with the feature structure building module is called the original information, which is the input of the feature extraction module. The feature extraction module contains two ResNet models with the channel attention mechanism. A convolution kernel size of 1 × 64 is a good trade-off between the quality of the recognition and the computational cost of the model for underwater acoustic signals. The first layer of the residual network contains two convolutions. Each convolution operation maps 16 sets of base signals to another 16 sets of base signals to extract the deep features of the signal. The convolution operation consists of 16 convolution layers, each containing 16 different filters *F*(*F*<sub>1</sub>, *F*<sub>2</sub>, · · ·, *F<sub>N</sub>*). So, 16 × 16 one-dimensional convolution layers are needed. The specific formula is as follows:

$$y\_i^m = \sum\_{k=1}^N f(\mathbf{x}\_k^m \times \boldsymbol{\omega}\_{ik}^m + b\_{ik}^m) \tag{6}$$

where *x<sup>m</sup><sub>k</sub>* denotes the input value of the *k*-th channel in the *m*-th sample, *ω<sup>m</sup><sub>ik</sub>* denotes the *k*-th convolution kernel of the *i*-th layer convolution of the *m*-th sample, *b<sup>m</sup><sub>ik</sub>* denotes the *k*-th bias function of the *i*-th layer convolution of the *m*-th sample, and *y<sup>m</sup><sub>i</sub>* is the output of the *i*-th layer convolution of the *m*-th sample. The symbol × means dot product. The output feature group of the *k*-th convolution of the *i*-th convolution layer is formed through the activation function *f*(·), which uses the ReLU function.

Finally, all the convolution outputs in the *i*-th layer are summed up as the convolution output value of the *i*-th layer. The second convolution is the same as the first convolution operation in order to obtain deeper underwater acoustic features. A channel attention mechanism is added to each one-residual network to enhance the stable spectrum features and further enhance the network's performance in extracting underwater acoustic signals.
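A minimal NumPy sketch of Eq. (6) together with the residual shortcut described above; the 'same' padding, the kernel length of 64, and the random weights are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def multi_channel_conv(x, w, b):
    """Eq. (6): each output channel i sums ReLU-activated, 'same'-padded
    dot products over all N input channels.

    x : (N, L)     input base signals
    w : (C, N, K)  K-tap kernel for each (output, input) channel pair
    b : (C, N)     bias per (output, input) channel pair
    returns (C, L)
    """
    C, N, K = w.shape
    L = x.shape[1]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, K - 1 - pad)))
    y = np.zeros((C, L))
    for i in range(C):
        for k in range(N):
            conv = np.array([xp[k, t:t + K] @ w[i, k] for t in range(L)])
            y[i] += relu(conv + b[i, k])   # f(.) applied inside the sum, as in Eq. (6)
    return y

def residual_block(x, w1, b1, w2, b2):
    """Two stacked convolutions plus the identity shortcut of ResNet."""
    return multi_channel_conv(multi_channel_conv(x, w1, b1), w2, b2) + x

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 400))
w1, w2 = (rng.standard_normal((16, 16, 64)) * 0.01 for _ in range(2))
b = np.zeros((16, 16))
out = residual_block(x, w1, b, w2, b)
print(out.shape)  # (16, 400)
```

The shortcut addition `+ x` is what makes the block residual: the convolutions only need to learn the correction to the identity mapping.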

Section 3.2 describes the channel attention mechanism of the feature extraction module in detail.

The feature classification uses a fully convolutional network to map the high-dimensional features from the output of the feature extraction module to a lower dimension with the size of the classification class. The details are listed as follows.

Stage 1: In feature structure building, the data shape of the input layer is a four-dimensional matrix 64 × 1 × 1 × 800. The shape changes from 64 × 1 × 1 × 800 to 64 × 16 × 1 × 800 by the convolutional layer. The batch normalization layer is applied, followed by a ReLU activation function and max pooling with a stride of 2 × 1.

Stage 2: The feature extraction module contains two residual modules, called block-1 and block-2. The input shape of block-1 is 64 × 16 × 1 × 400. The shape changes from 64 × 16 × 1 × 400 to 64 × 16 × 1 × 400 by two convolutions with a convolution kernel of 64 × 1 and a stride of 1 × 1. Batch normalization is applied after each convolution, and the two convolutions are connected using the activation function ReLU. Finally, the channel attention mechanism, marked with the dashed yellow box in Figure 2, is added; it is described in detail in Section 3.2. The obtained data are summed with the original data as the output of block-1.

Stage 3: The input shape of block-2 is 64 × 16 × 1 × 400. The shape changes from 64 × 16 × 1 × 400 to 64 × 16 × 1 × 200 by convolution with a convolution kernel of 64 × 1 and a stride of 2 × 1. Batch normalization and a ReLU activation function are applied. The second convolution does not change the shape of the data and adds the channel attention mechanism. The obtained data are summed with the original data as the output of block-2.

Stage 4: This paper uses a fully convolutional network model, in which a cubic convolutional network is used to map high-dimensional features to low-dimensional features in the decision module.
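Assuming the stage shapes read as above (input 64 × 1 × 1 × 800, and four target classes as in the dataset of Section 4.1), the shape bookkeeping through the four stages can be checked with a short script:

```python
# Shape bookkeeping for the four stages (batch, channels, height, width).
def stage1(shape):
    b, c, h, w = shape
    return (b, 16, h, w // 2)      # 1 -> 16 channels, then max pooling stride 2x1

def block1(shape):
    return shape                   # two stride-1x1 convolutions keep the shape

def block2(shape):
    b, c, h, w = shape
    return (b, c, h, w // 2)       # first convolution uses stride 2x1

def stage4(shape, n_classes=4):
    return (shape[0], n_classes)   # fully convolutional head -> class scores

s = (64, 1, 1, 800)                # input: batch 64, one channel, 800 points
s = stage1(s); assert s == (64, 16, 1, 400)
s = block1(s); assert s == (64, 16, 1, 400)
s = block2(s); assert s == (64, 16, 1, 200)
s = stage4(s); assert s == (64, 4)
print(s)  # (64, 4)
```

Each 800-point frame is thus reduced to one score per vessel class.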

#### *3.2. Structure of Channel Attention Mechanism Based on Underwater Acoustic of camResNet*

The changes in the distance between the target and the hydrophone lead to a Doppler effect, i.e., a frequency shift. Doppler frequency compensation is challenging, as the underwater acoustic channel acts as a low-frequency filter. The method in this paper can extract the stable spectral features under the Doppler frequency shift by the channel attention mechanism, which can automatically acquire the critical information in each feature channel by learning to enhance the valuable features and suppress the less useful features for the current task.

The amount of information in the channels differs, and the channel attention mechanism increases the weight of the channels with high information, which can improve the model's capability. First, the information is squeezed out of each channel; then, a lightweight gating system is added to optimize the channel information and output the channel weights. The channel attention mechanism of this paper is divided into two parts. Figure 3 shows the channel attention mechanism model. The first part is the primary part, which weighs each channel, and the second part is the auxiliary part for information extraction, which obtains another form of channel information after transposing the data.

**Figure 3.** Channel attention mechanism network model.

The first part analyzes the waveform features in each channel separately. First, process the data with a convolution kernel *H* × *W* and a stride of *W*; the shape changes from *H* × *W* × *C* to 1 × 1 × *C*, where *H* represents the length of the input data and *W* represents the width of the input data. The specific formula is as follows:

$$\mathbf{x}^{(m+1)} = \sum\_{i=1}^{M} \sum\_{k=1}^{N} f\left(\mathbf{x}\_{k}^{(m)} \times \boldsymbol{\omega}\_{ik}^{(m)} + \boldsymbol{b}\_{ik}^{(m)}\right) \tag{7}$$

where *x*<sup>(*m*)</sup><sub>*k*</sub> denotes the *k*-th channel data in the input information of the channel attention mechanism module, *ω*<sup>(*m*)</sup><sub>*ik*</sub> denotes the weight of the *k*-th channel of the *i*-th layer of convolution, *b*<sup>(*m*)</sup><sub>*ik*</sub> denotes the bias of the *k*-th channel of the *i*-th layer of convolution, and *x*<sup>(*m*+1)</sup> denotes the output value of *x*<sup>(*m*)</sup> after one convolution.

The data of each channel characterize the global features of each channel. In order to be able to learn the nonlinear characteristics between the channels independently, this paper uses a gating system with an activation function. The specific formula is as follows.

$$\mathbf{x}^{(m+3)} = \sum\_{i=1}^{M} \sum\_{k=1}^{N} \sigma \left( \delta \left( \mathbf{x}\_{k}^{(m+1)} \times \boldsymbol{\omega}\_{ik}^{(m+1)} + \boldsymbol{b}\_{ik}^{(m+1)} \right) \times \boldsymbol{\omega}\_{ik}^{(m+2)} + \boldsymbol{b}\_{ik}^{(m+2)} \right) \tag{8}$$

where *x*<sup>(*m*+1)</sup><sub>*k*</sub> is the global feature of size 1 × 1 × *C*, and *ω*<sup>(*m*+1)</sup><sub>*ik*</sub> and *ω*<sup>(*m*+2)</sup><sub>*ik*</sub> are the weights of the network mapping. In order to obtain the features of the network channel, convolutional mapping is used, and the number of feature points before mapping is *r* times that after mapping, so *ω*<sup>(*m*+1)</sup><sub>*ik*</sub> ∈ *R*<sup>(*c*/*r*)×*c*</sup> and *ω*<sup>(*m*+2)</sup><sub>*ik*</sub> ∈ *R*<sup>*c*×(*c*/*r*)</sup>. *δ* is the ReLU activation function, and *σ* is the sigmoid activation function.

The second part synthesizes the signal characteristics in all channels. Process the data with a convolution kernel 1 × 64 and the stride of 1; the shape changes from *H* × *W* × *C* to *H* × *W* × 1. The multi-layer convolutional network has a solid ability to extract sufficient recognition information, and the output of the network contains a large number of stable signals with a small number of unstable signals. One-dimensional data of the same size are extracted from the network's output as the channel weights of the original signal, which can effectively enhance the spectrum energy contained in the channel.

The two parts of the channel attention mechanism weigh the signal features from different perspectives. Finally, the two weighted pieces of information are fused as the output of the channel attention mechanism.
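A rough NumPy sketch of the two-part mechanism, with *H* = 1 as in camResNet; using a channel mean as the squeeze and a single 1 × *C* linear collapse for the second part are simplifying assumptions standing in for the paper's convolutions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def channel_attention(x, w1, w2, w_collapse):
    """Two-part channel attention sketch for data of shape (C, L).

    Part 1 (Eqs. (7)-(8)): squeeze each channel to one scalar, pass the C
    scalars through a bottleneck gate (reduce by r, ReLU, expand, sigmoid),
    and rescale every channel by its weight.
    Part 2: collapse the C channels into a single length-L weight curve and
    rescale every time step.  The two weighted signals are summed.
    """
    squeeze = x.mean(axis=1)                    # (C,) global feature per channel
    gate = sigmoid(w2 @ relu(w1 @ squeeze))     # (C,) channel weights in (0, 1)
    part1 = x * gate[:, None]
    position = sigmoid(w_collapse @ x).ravel()  # (L,) one weight per time step
    part2 = x * position[None, :]
    return part1 + part2

rng = np.random.default_rng(2)
C, L, r = 16, 400, 4
x = rng.standard_normal((C, L))
w1 = rng.standard_normal((C // r, C)) * 0.1      # reduction weights, w^(m+1)
w2 = rng.standard_normal((C, C // r)) * 0.1      # expansion weights, w^(m+2)
w_collapse = rng.standard_normal((1, C)) * 0.1   # collapses channels to one curve
out = channel_attention(x, w1, w2, w_collapse)
print(out.shape)  # (16, 400)
```

The bottleneck ratio *r* keeps the gate lightweight: the two mappings have *C*²/*r* parameters each instead of *C*².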

#### **4. Model Evaluation**

*4.1. Dataset*

The eight hydrophones are fixed at the same level in eight different places at the same interval. This paper randomly selects four sets of hydrophones at equal intervals as input data. The data used in the experiments contain four classes of vessels, and the third of the four types of signals is the radiated noise of the iron vessel, while the first, second and fourth types are vessels of the same material and similar hull size.

To study the recognition effect of camResNet under different Doppler frequency shifts, four different working conditions were intercepted in each class of experimental data. Each class of the data obtained has four modes of operation: straight ahead at a constant speed, straight-ahead acceleration, straight-ahead deceleration, and turning. Figure 4 shows the spectrogram of different working conditions by the fourth type of vessel.

**Figure 4.** The spectrogram by the fourth type of vessel. (**a**) the spectrogram of straight motion. (**b**) the spectrogram of decelerating motion. (**c**) the spectrogram of accelerating motion. (**d**) the spectrogram of turning movement.

Figure 4a is the time-frequency relationship of the signal by the vessel of straight motion. It shows that there is acceleration when the vessel is just starting, and the frequency shifts to high frequency. The speed reaches stability within a brief period, and a stable spectrum characteristic appears, which contains line and continuous spectra. The formula with the Doppler shift is as follows:

$$f = f\_0 \cdot \frac{1}{1 - \frac{u \cos \theta}{v}} \tag{9}$$

where *f*<sub>0</sub> is the original frequency of the vessel, *v* is the speed of the underwater acoustic signal propagating in the channel, *u* is the speed of the vessel motion, and *f* is the frequency after the Doppler shift. *θ* is the angle between the vessel's direction of motion and the line connecting the ship and the hydrophone. The signal will have a stable frequency shift when the vessel movement speed is constant. In the passive recognition process, the stable spectrum feature after the frequency shift is the primary information for recognizing the target. However, when the target accelerates, *u* keeps changing, and *f* varies with the change of *u*. Figure 4b,c are time-frequency diagrams of the ship in the motion state of acceleration and deceleration. The low-frequency spectra are the stable spectra, and the spectrum above 400 Hz will change with time. Figure 4d is the time-frequency diagram by the vessel of turning, and a large number of unstable spectra appear in the time-frequency diagram because *θ* keeps changing.
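Eq. (9) can be illustrated numerically; the sound speed of 1500 m/s and the example vessel speed are assumed values, not measurements from the paper:

```python
import math

def doppler_shift(f0, u, v=1500.0, theta=0.0):
    """Eq. (9): observed frequency for a source closing at speed u.
    v is the sound speed in water (~1500 m/s assumed) and theta the angle
    between the vessel track and the line to the hydrophone (radians)."""
    return f0 / (1.0 - u * math.cos(theta) / v)

# A 200 Hz line from a vessel approaching head-on at 10 m/s shifts upward:
f = doppler_shift(200.0, 10.0)
print(round(f, 2))  # 201.34
```

As the text notes, a constant *u* gives a stable shifted line, while a changing *u* (acceleration) or changing *θ* (turning) smears the line across frequencies.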

To further observe the energy distribution of the frequencies with the vessel for different operating conditions, Figure 5 shows the power spectral density for the different operating conditions by the fourth type of vessel, which is the Fourier transform of the correlation function with the 0.5 s window length.

**Figure 5.** The power spectrum density by the fourth type of vessel. (**a**) The power spectrum density of straight motion. (**b**) The power spectrum density of decelerating motion. (**c**) The power spectrum density of accelerating motion. (**d**) The power spectrum density of turning movement.


Figure 5 shows the power spectrum density by the fourth type of vessel. A set of resonant waves at a fundamental frequency of 200 Hz occur stably under four different operating conditions. High-frequency points are shifted when the vessel is in an accelerated motion. The high-frequency spectral density varies significantly, and the low-frequency spectral density is more stable than the high frequency under different working conditions. Figure 5b,c show the acceleration and deceleration. Compared with Figure 5a, the power spectral density in high frequency is higher than in the straight motion, and some frequency points in the high frequency are changed. Figure 5d shows turning, and many spectral density power spikes appear in the high frequency compared with Figure 5a.

The same class of targets contains different Doppler shift signals, which will increase the difficulty of recognition, with the original signal compressed or broadened. This method extracts the stable features of the same class of vessels under different working conditions.

To study the difference between the categories with four types, the straight motion working condition of each type of vessel is chosen to exhibit a time-frequency relationship. Figure 6 shows the pictures and time-frequency diagrams of the four types of vessels, containing class I, class II, and class III and IV vessels. The background noise of the four vessels has relatively apparent differences, but there are similar low-frequency spectra.

As can be observed in Figure 6b,d, the clear line spectra in the low-frequency band are very similar. Figure 6h has two clear line spectra, respectively similar to the line spectra in Figure 6b,d. No clear line spectrum is observed in Figure 6f, but the energy distribution at low frequencies is similar to that in Figure 6h. Figure 6 shows that the spectra of the different vessel types are very similar, with the spectrum energy concentrated in the low frequencies and continuous. So, it is difficult to distinguish the vessel category with traditional methods.

#### *4.2. Data Pre-Processing*

There are 800 feature points (0.1 s) in a frame, with no overlap between frames. If the maximum feature point of a frame is less than 0.1, that small-value frame is eliminated, ensuring that the recognition results are not affected by such sample points. After eliminating the small samples, 7097 samples remain. After normalizing the samples, 1/4 of the data are used as the test set and 3/4 as the training set: the prepared data have 5322 samples as the training set and 1774 samples as the test set. In total, 200 samples are randomly selected as the validation set in each class, so the validation set contains 800 samples. The training method is a batch method, in which 64 samples are randomly selected in each batch, and the selected samples will not be drawn again in the next batch.
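The framing, filtering, and splitting steps can be sketched as follows; peak normalization and the synthetic recording are assumptions, since the paper does not state which normalization it uses:

```python
import numpy as np

def prepare_frames(signal, frame_len=800, amp_threshold=0.1):
    """Cut a recording into non-overlapping 800-point (0.1 s) frames, drop
    frames whose peak amplitude is below the threshold, and normalize each
    surviving frame (peak normalization is an assumption here)."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    frames = frames[np.abs(frames).max(axis=1) >= amp_threshold]
    return frames / np.abs(frames).max(axis=1, keepdims=True)

def split_frames(frames, test_fraction=0.25, seed=0):
    """Hold out roughly 1/4 of the frames as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(frames))
    n_test = int(len(frames) * test_fraction)
    return frames[idx[n_test:]], frames[idx[:n_test]]

rng = np.random.default_rng(3)
recording = rng.standard_normal(800 * 1000) * 0.5   # synthetic stand-in signal
train, test = split_frames(prepare_frames(recording))
print(train.shape, test.shape)  # (750, 800) (250, 800)
```

With the paper's 7097 real frames, the same 1/4 hold-out yields the 5322/1774 split reported above (up to rounding).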

#### *4.3. Experimental Results*

#### 4.3.1. Discussion of Model Structure

This section reports the experimental results of the model with Doppler-shifted signals. The straight-ahead condition is treated as a signal without a Doppler shift; the other conditions are treated as Doppler-shifted. The experiment uses all four conditions as input data.

The first experiment illustrates the relationship between the recognition rate and the number of residual layers, where the number of residual layers varies over the set {1, 2, 3, 4}. According to the results in Table 1, two residual layers give the best recognition effect, and the recognition rate decreases as the number of residual layers increases.

**Table 1.** Recognition rate for different numbers of residual layers.


**Figure 6.** The pictures and spectrograms of the different vessels. (**a**) the pictures of class I vessel. (**b**) the spectrogram of class I vessel. (**c**) the pictures of class II vessel. (**d**) the spectrogram of class II vessel. (**e**) the pictures of class III vessel. (**f**) the spectrogram of class III vessel. (**g**) the pictures of class IV vessel. (**h**) the spectrogram of class IV vessel.


Two residual layers are appropriate for the number of samples in the experiment, and different numbers of samples match different numbers of layers. If the ResNet network is neither over-fitted nor under-fitted, adding the channel attention mechanism will cause over-fitting and decrease the recognition accuracy. If the ResNet network is under-fitted, adding the channel attention mechanism will compensate for this under-fitting. The number of model parameters needs to match the number of samples, and the number of parameters increases after adding the channel attention mechanism.

The second experiment illustrates the relationship between the recognition rate and the size of the convolutional kernel. The size of the 1D convolutional kernel varies in the set {3, 5, 7, 9, 11, 15, 17, 21, 25, 33, 41, 49, 57, 64, 75, 85, 95}. Table 2 shows that a kernel size of 64 is best for the recognition rate. The receptive field of the convolutional kernels needs to match the scale of the target, because the underwater acoustic target is submerged in background noise, and a large amount of ocean background noise is extracted if they do not match.


**Table 2.** Recognition rate for convolution kernels of different sizes.

#### 4.3.2. Classification Experiment Results

In the experimental data, four vessel classes are used to train different deep-learning network models, and the information of each network model is described below.


activation function. The convolutional operation with a convolutional kernel size of 1 × 64 and a step size of 1 is chosen. The batch method with a batch size of 64 is used for training. The gradient descent method is used for optimization, with a learning rate of 0.001.


A test set was used to evaluate the model's recognition ability. Table 3 shows the recognition rate with straight motion data and with four different working conditions. The recognition rates of camResNet and SE\_ResNet are similar when the data contain only straight motion. The recognition rate of camResNet is higher than that of SE\_ResNet when the data contain four different working conditions. Both camResNet and SE\_ResNet can extract valid feature information when the data contain a single working condition. However, SE\_ResNet is not as effective as camResNet in extracting stabilization features when different working conditions with different Doppler frequencies are included.

**Table 3.** Recognition rate of proposed model and compared models with straight motion data and four different working conditions.


Table 3 shows that the camResNet model has a recognition rate of 98.2%, which is 1.1–15.8% higher than the other networks. The DBN model is a basic neural network model based on probabilistic statistics, and its input is a frequency-domain signal. The GAN model is an adversarial model, which mainly targets small-sample data, and its input is a time-domain signal. The DenseNet model simplifies network complexity and reduces network parameters through its dense blocks, and its input is a frequency-domain signal. The ResNet model uses residual learning to update the network parameters, and its input is a time-domain signal. The U\_Net model uses up-sampling and down-sampling to extract multi-scale features, which can improve the recognition effect, and its input is a time-domain signal.

The DBN model optimizes its parameters with a probabilistic model, unlike the other networks, so its recognition rate is lower. The recognition rate of U\_Net is lower than those of the GAN and DenseNet models because up-sampling and down-sampling can lose some feature information. The SE\_ResNet model performs well because the ResNet backbone balances network depth against the recognition rate on small samples. The camResNet model outperforms the other models in recognition rate because its channel attention mechanism deals with the sparsity and multi-scale characteristics of underwater signals.

To present the recognition experiment results, we use recognition accuracy, recall rate, precision, and F1-score to evaluate the recognition performance of the networks. The formula for each indicator is as follows.

$$Precision = \frac{TP}{TP + FP} \tag{10}$$

$$Recall = \frac{TP}{TP + FN} \tag{11}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$F1\text{-score} = \frac{2TP}{2TP + FP + FN} \tag{13}$$

*TP*, *TN*, *FP*, and *FN* denote true positives, true negatives, false positives, and false negatives, respectively. Table 4 shows the precision, recall rate, F1-score, and accuracy of the test sample, while Table 5 shows the confusion matrix.
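These indicators follow directly from the four counts; a small Python helper implementing Equations (10)–(13):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, accuracy, and F1-score from the
    true/false positive/negative counts (Equations (10)-(13))."""
    precision = tp / (tp + fp)                    # Eq. (10)
    recall = tp / (tp + fn)                       # Eq. (11)
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # Eq. (12)
    f1 = 2 * tp / (2 * tp + fp + fn)              # Eq. (13)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```

For example, `classification_metrics(90, 85, 10, 15)` gives a precision of 0.90 and an accuracy of 0.875 (the counts are illustrative, not the paper's results).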

**Table 4.** Recognition results of camResNet.


**Table 5.** Confusion matrix of camResNet.


Class I of the vessel includes three acceleration signals, three deceleration signals, five straight-ahead signals, and seven turn signals. Class II includes three acceleration signals, three deceleration signals, three straight-ahead signals, and six turn signals. Class III consists of three each of acceleration, deceleration, and straight-ahead signals, and five turn signals. Class IV consists of three each of acceleration, deceleration, straight-ahead, and turn signals. The vessels of the different categories have similar sizes but different materials, and the material of the third category differs significantly from the materials of the other three. In Table 5, the probability of incorrectly recognizing a Class II vessel as Class III is the highest, followed by the probability of incorrectly recognizing a Class III vessel as Class II. This indicates that camResNet extracts shallow physical features and deep category features, which are related to the Doppler effect. Class II and Class III contain the most similar samples in the composition of working conditions, resulting in many samples with similar Doppler shifts. Table 4 shows that the recognition results for Class I and Class IV are better than those for Class II and Class III, which are more easily confused with each other.

The precision of Class I is the highest, and the probability of correctly recognizing a Class I vessel as Class I is the highest, because Class I contains many straight-motion samples and has a prominent, stable spectrum without a Doppler shift. Class IV has the highest recall, which indicates that the samples of the different working conditions in Class IV are more balanced than in the other classes and have more stable Doppler shift characteristics.

#### 4.3.3. Visualization of Energy Distribution by the Architecture of camResNet

*Power Spectral Density*

To further assess the feature extraction capability of the camResNet model, the trained camResNet model was fed with Class IV vessel data; the spectrogram and the power spectral density of the original signal are displayed in Figures 4 and 5. Figure 7 shows the time-frequency diagram and the power spectral density of the output. Figure 7a,c,e show the Class IV vessel signal after the camResNet model, and Figure 4 shows the spectrogram of the original signal from the Class IV vessel. The comparison indicates that the energy of the feature is still concentrated in the low frequencies after the camResNet model. Figure 7b,d,f show the power spectral density of the Class IV vessel after processing by the camResNet model, and Figure 5 shows the power spectral density of the original signal for the Class IV vessel. The comparison indicates that the apparent fundamental frequency signal in the original signal still exists after processing by the camResNet model. In Figure 7, the camResNet model's output contains not only stable signals but also some high-frequency signals, which indicates that the camResNet model can avoid extracting unstable signals that are quickly Doppler shifted and can recover stable signals that are submerged in high frequencies.

**Figure 7.** The spectrogram and power spectrum density of Class IV with camResNet. (**a**,**c**,**e**) The spectrogram of extracted features for the Class IV vessel by camResNet; (**b**,**d**,**f**) The power spectrum density of extracted features for the Class IV vessel by camResNet.

*t-SNE Feature Visualization Graphs*

The above experiment shows that the camResNet model can extract signals of stable frequencies in underwater acoustic signals. To further analyze the feature extraction ability of camResNet, the distances of the original features and the camResNet output features are visualized using the t-SNE method. Figure 8 shows the distance characteristics of the original signal and of the output of the camResNet model when different working conditions are used as the input data. Figure 8a shows the t-SNE of the original underwater acoustic signal, which indicates that the original underwater acoustic signal has weak separability. Figure 8b shows the t-SNE of the output signals with the input of four different working conditions in the camResNet model. Figure 8c–f show the t-SNE of the output signals after putting straight motion, acceleration, deceleration, and turning conditions into the camResNet model, respectively. From the figures, it can be seen that the camResNet model separates the signals of all the working conditions well, and it also separates each kind of working-condition signal well. In particular, the camResNet model can still classify the signals of the acceleration condition well, even though this condition contains a large number of unstable Doppler shift signals. The classification maps show that the camResNet model can extract abstract and stable internal features under different conditions.

**Figure 8.** The t-SNE visualized graphs. (**a**) The t-SNE visualized graphs of the original hydroacoustic signal; (**b**) The t-SNE visualized graphs of the output by camResNet; (**c**) The t-SNE visualized graphs of straight motion by camResNet; (**d**) The t-SNE visualized graphs of accelerating motion by camResNet; (**e**) The t-SNE visualized graphs of decelerating motion by camResNet; (**f**) The t-SNE visualized graphs of turning movement by camResNet.
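A projection like Figure 8 can be reproduced in outline with scikit-learn's t-SNE; the array shapes and parameters below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in features: 60 "samples" of 32-dimensional network outputs.
# In the paper's setting these would be the camResNet feature vectors.
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 32))

# Embed into 2D for plotting; perplexity must be smaller than the
# number of samples.
embedding = TSNE(n_components=2, perplexity=10, init="random",
                 random_state=0).fit_transform(features)
```

The resulting `(60, 2)` array is what each panel of Figure 8 scatters, colored by class or working condition.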


#### 4.3.4. Recognition Results for Different Data of camResNet

Three different network models were used to compare the recognition results of underwater acoustic signals containing four working conditions. The DenseNet and SE\_ResNet models have stronger recognition ability and were used for comparison with the camResNet model. In this experiment, the training data and test data come from the same single working condition, one of the four. The recognition results were averaged over five repeated tests, and the obtained experimental results are shown in Figure 9. The solid blue line is the recognition rate when the data of the straight motion working condition are used as the training and test data. The blue dotted line is the recognition rate when the data of the turn working condition are used as the training and test data. The solid red line is the recognition rate when the data of the deceleration working condition are used as the training and test data. The yellow dashed line is the recognition rate when the data of the acceleration working condition are used as the training and test data.


The network is trained and tested using data under one working condition, which is easy to overfit with a deeper model such as DenseNet. The SE\_ResNet model uses self-coding to compress channel features but does not consider the sparse characteristics of underwater acoustic targets. The camResNet model builds two different channel attention mechanisms, which fully consider the sparsity of the underwater acoustic signal and the continuity of the spectrum, and it achieves better recognition results than the other models.

The distributions of the training and test sets in the above experiments were identical. To further verify the recognition performance of the camResNet model, the three network models were trained using four working conditions and tested under one working condition. The recognition results were averaged over five repeated tests, and the obtained experimental results are shown in Figure 10. The solid blue line is the recognition rate when the data of the straight motion working condition are used as the test data. The blue dotted line is the recognition rate when the data of the turn working condition are used as the test data. The solid red line is the recognition rate when the data of the deceleration working condition are used as the test data. The yellow dashed line is the recognition rate when the data of the acceleration working condition are used as the test data. (1) The maximum recognition rate of camResNet is 0.976; the minimum recognition rate


SE\_ResNet uses compressed information to obtain channel weights and thereby certain stable features, so its recognition ability under different working conditions is better than that of DenseNet. The stable signal under the Doppler shift represents multi-scale information, so extracting information at a single scale loses helpful information. The camResNet model uses convolution operations to extract channel information in two ways. The first part uses convolution kernel superposition to expand the perceptual field and extract features at different scales. The second part extracts features from the local features of all the information. Fusing the two features as the channel weights can comprehensively extract the stable features under the Doppler frequency shift. Hence, the camResNet model has better recognition results for data from different working conditions containing Doppler frequency shift information.
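A rough NumPy sketch of this two-branch idea follows: one branch widens the perceptual field by stacking small averaging kernels, the other summarizes local channel energy, and the two descriptors are fused into per-channel weights. The kernel sizes, pooling choices, and fusion rule are assumptions for illustration; the paper does not specify them here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fused_channel_weights(x, k_small=3):
    """Rescale the channels of x (shape: channels x time) by fused
    attention weights from two descriptor branches (illustrative sketch)."""
    # Branch 1: smooth each channel twice with a small averaging kernel;
    # stacking the kernels widens the effective perceptual field.
    k = np.ones(k_small) / k_small
    smooth = np.array([np.convolve(np.convolve(row, k, mode="same"),
                                   k, mode="same") for row in x])
    branch1 = smooth.mean(axis=1)          # multi-scale context per channel

    # Branch 2: local energy of each channel.
    branch2 = np.sqrt((x ** 2).mean(axis=1))

    # Fuse the two descriptors into per-channel weights in (0, 1).
    w = sigmoid(branch1 + branch2)
    return x * w[:, None]
```

Because the fused weights lie in (0, 1), channels whose descriptors indicate stable structure are kept nearly intact while the others are attenuated, which is the effect attributed to the channel attention mechanism above.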

#### **5. Conclusions**

The camResNet model adds a channel attention mechanism to the ResNet model based on the characteristics of underwater acoustic signals. This channel attention mechanism can enhance the stable spectral features and remove the unstable signals caused by Doppler shifts. The experiments compare the recognition ability of six different deep-learning models under different Doppler shift frequencies. The results show that the recognition rate of the camResNet model is higher than that of the other network models: camResNet achieves a recognition rate of 98.2%, which is 1.1–15.8% higher than the other networks. Precision, recall rate, F1-score, and accuracy are used to demonstrate that the data used in the experiments are balanced between the classes and that the experimental results are valid. The effectiveness of the proposed method was tested with the same and with different distributions for the training and test sets, using the three network models with the best recognition results. With identically distributed training and test sets, the recognition rate of camResNet varies from 0.003 to 0.023 across working conditions; in contrast, the recognition rate of DenseNet varies from 0.015 to 0.019 for different distributions of the training and test sets. The results show that the proposed method is more suitable when the training and test sets are identically distributed. Further, visualization of the features extracted by the camResNet model shows that it can extract the stable multi-group harmonic signals and restore some weak, high-frequency stable signals present in the original signal.

The camResNet model can effectively extract the features of underwater acoustic signals with Doppler shifts. Future work will use the camResNet model to recognize Doppler-shifted underwater acoustic signals from small samples, addressing the data requirements of deep learning for underwater acoustic signals.

**Author Contributions:** Conceptualization, L.X.; methodology, L.X.; software, L.X.; validation, L.X.; formal analysis, L.X.; investigation, L.X.; resources, L.X.; data curation, L.X.; writing—original draft preparation, L.X.; writing—review and editing, X.Z.; visualization, A.J.; supervision, X.Z.; project administration, X.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** National Natural Science Foundation of China: 11774291.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare that they have no known competing financial interests or personal relationships that could appear to have influenced the work reported in this paper.

#### **References**

