2.1. Time-Domain Filter with Convolution Kernel
An FIR filter retains the target signal and suppresses interference. It weights successive samples of the input signal and then accumulates them to obtain the filtered output. An $N$th-order FIR filter performs $N$ multiplications and $N-1$ accumulations per filtering operation. This process is expressed as Equation (1):
$$y(n)=\sum_{k=0}^{N-1} h(k)\,x(n-k) \tag{1}$$
where $y(n)$ and $x(n)$ represent the output and input of the filter, $h(k)$ denotes the filtering coefficients, and $N$ denotes the order of the filter. A one-dimensional convolution with a kernel of odd length can be expressed as Equation (2):
$$y(n)=f\!\left(\sum_{k=0}^{N-1} w(k)\,x(sn+k)+b\right) \tag{2}$$
where $y(n)$ and $x(n)$ denote the output and input of the 1D-CK, $w(k)$ denotes the weights of the 1D-CK, $N$ denotes the kernel size of the 1D-CK, $s$ denotes the stride of the convolutional layer, $b$ denotes the bias, and $f$ denotes the activation function. An FIR filter can therefore be built from a single 1D-CK whose bias and activation function are removed. When the stride and kernel size of the convolution are set to 1 and $N$, the 1D-CK can be treated as an $N$th-order FIR filter with a delay of $(N-1)/2$ samples. For the classification task, the FIR filter implemented by the 1D-CK can act either as an adaptive filter, whose parameters are optimized by gradient descent, or as a fixed-parameter filter that processes the raw signal with fixed optimal coefficients. This module is named TFCK and is shown in
Figure 2.
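The equivalence between a bias-free, stride-1 1D-CK and an FIR filter can be sketched in a few lines of NumPy. The 5-tap moving-average coefficients below are illustrative, not parameters from the paper:

```python
import numpy as np

# Hypothetical 5-tap moving-average FIR filter; h plays the role of the
# 1D-CK weights with stride 1, no bias, and no activation function.
N = 5
h = np.ones(N) / N                      # filter coefficients h(k)

rng = np.random.default_rng(0)
t = np.arange(1024) / 1024.0
x = np.sin(2 * np.pi * 10 * t) + 0.3 * rng.standard_normal(t.size)

def fir_filter(x, h):
    """One filtering operation per sample: N multiplications, N-1 accumulations."""
    N = len(h)
    y = np.zeros(len(x) - N + 1)        # "valid" output, like an unpadded conv
    for n in range(len(y)):
        y[n] = np.dot(h, x[n:n + N])    # sum_k h(k) * x(n+k)
    return y

y = fir_filter(x, h)
# The same result as a "valid" 1D convolution with the flipped kernel
# (i.e., a cross-correlation, which is what conv layers actually compute).
y_conv = np.convolve(x, h[::-1], mode="valid")
print(np.allclose(y, y_conv))           # True
```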
The parameter design method of TFCK is the same as that of a traditional FIR filter. For a specific underwater acoustic target recognition task, TFCK can learn the filter's frequency response by gradient descent; in other words, it automatically searches the data for the frequency range best suited to the current classification task. The adjustment strategy of TFCK comprises two stages: a pre-training stage and a training stage. In the pre-training stage, all parameters of the 1D-CK are learned adaptively for the given classification task. By observing the feedback of the neural network, we can analyze the inner workings and behavior of the model, strengthen the extraction of high-value information, and suppress the network's perception of noise. In the training stage, the parameters of the 1D-CK are adjusted and fixed according to the network's feedback from the pre-training stage, which improves the network's ability to suppress low-value information and makes it easier for the subsequent layers to learn generalized embedding features. The two stages are shown in
Figure 3.
Figure 4 shows the frequency response of TFCK in the classification task of the ShipsEar [
36] dataset.
In the classification task on ShipsEar, TFCK is highly sensitive to low-frequency information at the initial stage of training. The training process gradually amplifies the importance of low-frequency information, and the neural network finds several peaks at similar intervals. In the last stage of training, the perception of the neural network stabilizes within a fixed range, and information that is redundant for the current architecture and classification task is revealed. The high-frequency content of ship-radiated noise is severely attenuated by the underwater acoustic channel, whereas low-frequency content propagates much further in the underwater environment. Consequently, the identifiable information in ship-radiated noise received by a hydrophone is usually concentrated in the low-frequency band of the raw signal. This result means that the knowledge TFCK learns from the data is consistent with expert knowledge in underwater acoustic target recognition and with the underlying physics. According to this result, the discriminative information among categories in the dataset is concentrated mainly below 600 Hz. Therefore, the parameters of TFCK can be optimized to complete the classification task better than before. The parameter of Equation (2) is set to 4, and the 1D-CK is configured as a traditional low-pass FIR filter with a 1500 Hz cutoff frequency.
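Fixing the 1D-CK weights to a classical low-pass design can be done with the window method (a truncated sinc tapered by a window). The sketch below uses the 1500 Hz cutoff mentioned above; the 16 kHz sample rate is an assumption for illustration only, not a dataset parameter:

```python
import numpy as np

def lowpass_fir(num_taps, cutoff_hz, fs_hz):
    """Window-method low-pass FIR design: truncated sinc * Hanning taper."""
    assert num_taps % 2 == 1            # odd length -> integer delay (N-1)/2
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / fs_hz              # normalized cutoff (cycles/sample)
    h = 2 * fc * np.sinc(2 * fc * n)    # ideal low-pass impulse response
    h *= np.hanning(num_taps)           # taper to reduce sidelobe height
    return h / h.sum()                  # normalize to unity gain at DC

# Illustrative values: 1500 Hz cutoff from the text; 16 kHz fs is assumed.
h = lowpass_fir(101, 1500.0, 16000.0)

# Inspect the frequency response: ~1 in the passband, ~0 well above cutoff.
H = np.abs(np.fft.rfft(h, 4096))
f = np.fft.rfftfreq(4096, d=1 / 16000.0)
print(H[f < 500].min(), H[f > 3000].max())
```

These fixed coefficients would then be loaded into the 1D-CK in place of learned weights.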
2.2. Time-Frequency Analysis Module of Attention-Based Neural Network
As a classical time-frequency analysis method, the short-time Fourier transform (STFT) reflects how the frequency content of ship-radiated noise changes over time. TFA-ATNN first uses a set of 1D-CKs to embed the Fourier transform into the neural network, so that the network can extract time-frequency features; it then applies an attention mechanism to improve the network's perception of frequency. The discrete Fourier transform (DFT) decomposes a complex time-domain waveform into its frequency components and is an important tool for signal analysis and processing. In this paper, the DFT is realized with 1D-CKs. The DFT and its inverse are expressed as Equations (3) and (4):
$$X(k)=\sum_{n=0}^{N-1} x(n)\,e^{-j 2\pi k n/N},\quad k=0,1,\dots,N-1 \tag{3}$$
$$x(n)=\frac{1}{N}\sum_{k=0}^{N-1} X(k)\,e^{j 2\pi k n/N},\quad n=0,1,\dots,N-1 \tag{4}$$
where $x(n)$ denotes the time-domain signal sequence, and $X(k)$ and $x(n)$ are the discrete Fourier transform and its inverse transform, respectively. To realize the discrete Fourier transform with a convolutional layer, Equation (3) is written in matrix form as Equation (5):
$$\begin{bmatrix} X(0)\\ X(1)\\ \vdots\\ X(N-1) \end{bmatrix} = \begin{bmatrix} 1 & 1 & \cdots & 1\\ 1 & W_N & \cdots & W_N^{N-1}\\ \vdots & \vdots & & \vdots\\ 1 & W_N^{N-1} & \cdots & W_N^{(N-1)^2} \end{bmatrix} \begin{bmatrix} x(0)\\ x(1)\\ \vdots\\ x(N-1) \end{bmatrix} \tag{5}$$
where $W_N = e^{-j 2\pi/N}$. In addition, according to Euler's formula, Equation (3) can be decomposed into a real component and an imaginary component, as in Equation (6):
$$X(k)=\sum_{n=0}^{N-1} x(n)\cos\!\left(\frac{2\pi k n}{N}\right) - j\sum_{n=0}^{N-1} x(n)\sin\!\left(\frac{2\pi k n}{N}\right) \tag{6}$$
A typical convolutional layer is represented by Equation (7):
$$x_j^{l}=f\!\left(\sum_{i\in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\right) \tag{7}$$
where $x_j^{l}$ represents the output feature, $k_{ij}^{l}$ represents the convolution kernel, $M_j$ represents the feature set, $b_j^{l}$ represents the bias, $j$ denotes the kernel index, $l$ denotes the layer index, and $*$ is the convolution operation. In particular, for a one-dimensional convolutional layer that takes only a single-channel time series as input, when the input signal and the convolution kernel have equal length and the bias, activation function, and padding are removed, Equation (7) reduces to Equation (8):
$$y_j=\sum_{n=0}^{N-1} x(n)\,w_j(n) \tag{8}$$
where $y_j$ represents the output of the $j$th convolution kernel, $x(n)$ represents the input signal, and $w_j(n)$ represents the weights of the kernel. To realize the Fourier transform, we set two groups of convolution kernels to compute the imaginary and real components of the Fourier transform, respectively. The weights of these two groups of kernels are fixed according to the sine and cosine basis functions in Equation (6). The number of convolution kernels is determined by the number of Fourier transform points: for an $N$-point transform, each group contains $N/2+1$ kernels. Therefore, a one-dimensional convolutional layer without bias, activation function, or padding can realize the Fourier transform. Furthermore, STFTs with different time and frequency resolutions can be realized by adjusting the kernel size, stride, and basis functions of the convolution kernels, as shown in
Figure 5. This module is called the basic STFT module based on the convolutional neural network (BSTFT-CNN). In addition, to attenuate sidelobe height and reduce spectral leakage, a Hanning window is applied to the convolution kernels. Specifically, the Hanning window of length $M$ is expressed as Equation (9):
$$w(n)=0.5\left(1-\cos\!\left(\frac{2\pi n}{M-1}\right)\right),\quad 0\le n\le M-1 \tag{9}$$
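Putting these pieces together, the BSTFT-CNN idea can be sketched as an unpadded, bias-free strided convolution whose $N/2+1$ kernel pairs are Hanning-windowed cosine and sine basis functions. The frame length, hop, and signal below are illustrative values:

```python
import numpy as np

# STFT as a bias-free 1D conv layer: kernel size = FFT length M,
# conv stride = hop length, M/2 + 1 fixed kernels per component.
M, hop = 256, 128
K = M // 2 + 1                          # number of kernels per component
n = np.arange(M)
win = 0.5 * (1 - np.cos(2 * np.pi * n / (M - 1)))   # Hanning window, Eq. (9)
k = np.arange(K)[:, None]
cos_kern = win * np.cos(2 * np.pi * k * n / M)      # (K, M) fixed weights
sin_kern = win * np.sin(2 * np.pi * k * n / M)

rng = np.random.default_rng(2)
x = rng.standard_normal(2048)

# A strided, unpadded conv: each frame is a dot product with every kernel.
frames = np.stack([x[t:t + M] for t in range(0, len(x) - M + 1, hop)])
real_part = frames @ cos_kern.T         # (time, freq)
imag_part = -(frames @ sin_kern.T)

# Reference: windowed FFT of the same frames.
ref = np.fft.rfft(frames * win, axis=1)
print(np.allclose(real_part, ref.real), np.allclose(imag_part, ref.imag))  # True True
```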
Since the convolution kernel parameters in the BSTFT-CNN are initialized with standard sine and cosine basis functions, the network after the BSTFT-CNN would otherwise learn the extracted frequency components indiscriminately. However, ship-radiated noise has distinct frequency-domain characteristics. We therefore embed an attention mechanism into the BSTFT-CNN to construct a convolutional time-frequency feature extraction module with attention, called TFA-ATNN, as shown in
Figure 6. The TFA-ATNN adopts a fully connected layer with shared parameters (FCSP), with the aim that the neural network automatically fuses the stable frequency components of the input signal and achieves more stable recognition than before.
The TFA-ATNN comprises three stages. In stage 1, the TFA-ATNN uses two FCSPs to learn the information that is stably present in the imaginary and real components, and combines these components with the phase spectrum into a combined feature. In stage 2, another FCSP fuses the combined feature; the fusion generates two attention maps, which are used to enhance the imaginary and real components, respectively. In the last stage, a learnable factor combines the two enhanced spectra, and the combination result is the output feature sent to the subsequent network for embedding extraction.
Specifically, the size of each component output by the BSTFT-CNN is $T\times F$, where $T$ is the number of time points and $F$ is the number of frequency points. To make it easier for the model to capture the stable frequency distribution, the FCSP is proposed to constrain the relationships between frequency components. Its operation is expressed as Equation (10):
$$U = XW \tag{10}$$
where $X$ is one of the output features of the BSTFT-CNN, which include the real and imaginary components; $W$ is the parameter matrix of the FCSP, whose size is $F\times F$ in stage 1 and depends on the input and output dimensions; and $U$ is the matrix output by the FCSP. The subscript $i$ denotes the imaginary component, the subscript $r$ denotes the real component, and the subscript 1 denotes stage 1. The number of neurons in the single layer of the FCSP equals the number of frequency points.
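The defining property of the FCSP is that one $F\times F$ weight matrix is applied identically to every time frame. A minimal sketch, with illustrative shapes:

```python
import numpy as np

# FCSP sketch: the same F x F weight matrix transforms every time frame,
# constraining the relationships between frequency components.
T, F = 10, 129                          # time points, frequency points
rng = np.random.default_rng(3)
X = rng.standard_normal((T, F))         # one BSTFT-CNN component, size T x F
W1 = rng.standard_normal((F, F)) * 0.01 # stage-1 parameter matrix (trainable)

U = X @ W1                              # Equation (10): shared weights per frame
print(U.shape)                          # (10, 129)

# Parameter sharing means frame t and frame t+1 pass through the
# identical transform, independent of their position in time.
print(np.allclose(U[3], X[3] @ W1))     # True
```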
In stage 2, the TFA-ATNN combines the imaginary component, the real component, and the phase component. This differs from classical attention mechanisms such as SENet [37] and CBAM [38], which obtain attention information from the current layer alone; combining all three ensures the model has sufficient information to extract the relationships between frequency components. The feature matrices are fused by concatenation, and the fused matrix is denoted $U_c$. Once $U_c$ is obtained, it is input into the FCSP of stage 2, as in Equation (11):
$$[A_i,\,A_r]=\sigma(U_c W_2) \tag{11}$$
where $\sigma$ is the gate function, here the sigmoid. $A_i$ is multiplied element-wise with the imaginary component to obtain $E_i$, the enhanced imaginary component, and $A_r$ is multiplied element-wise with the real component to obtain $E_r$, the enhanced real component. Finally, a learnable weight factor $\alpha$ fuses the enhanced real and imaginary spectra, which allows the neural network to initialize the weighting of the attention mechanism adaptively according to the task and parameters. Equation (12) describes the enhancement process:
$$E_i=A_i\odot X_i,\qquad E_r=A_r\odot X_r,\qquad Y=\alpha E_i+(1-\alpha)E_r \tag{12}$$
where $Y$ denotes the features extracted by the TFA-ATNN. Except for the parameters in the BSTFT-CNN, all parameters in the TFA-ATNN are adjusted during the training stage; these trainable parameters are presented in
Table 1.
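The three stages can be summarized in a short end-to-end sketch. The shapes, the weight initializations, and the alpha-blend form below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative three-stage TFA-ATNN forward pass (variable names hypothetical).
T, F = 10, 129
rng = np.random.default_rng(4)
X_i = rng.standard_normal((T, F))       # imaginary component from BSTFT-CNN
X_r = rng.standard_normal((T, F))       # real component
P = np.arctan2(X_i, X_r)                # phase spectrum

# Stage 1: two FCSPs, one per component.
W_i1 = rng.standard_normal((F, F)) * 0.01
W_r1 = rng.standard_normal((F, F)) * 0.01
U_i1, U_r1 = X_i @ W_i1, X_r @ W_r1

# Stage 2: concatenate with the phase spectrum, fuse with one FCSP,
# and split the gated output into two attention maps.
U_c = np.concatenate([U_i1, U_r1, P], axis=1)        # (T, 3F)
W2 = rng.standard_normal((3 * F, 2 * F)) * 0.01
A_i, A_r = np.split(sigmoid(U_c @ W2), 2, axis=1)    # Equation (11)

# Stage 3: enhance each component and blend with a learnable factor alpha.
alpha = 0.5                              # trainable scalar; 0.5 is an arbitrary init
E_i, E_r = A_i * X_i, A_r * X_r          # element-wise enhancement
Y = alpha * E_i + (1 - alpha) * E_r      # Equation (12)
print(Y.shape)                           # (10, 129)
```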