### *3.1. Local Mean Decomposition*

The motor vibration signal is nonlinear and non-stationary. LMD adaptively decomposes the original vibration sequence into multiple product functions (PFs) whose instantaneous frequencies have physical meaning. Each PF component is the product of a pure frequency-modulated signal and an envelope signal, and it expresses the time-frequency distribution of the signal energy on the spatial scale. The vibration signal matrix is then constructed to enhance the original data. The LMD process for vibration signal processing is shown in Figure 3.

**Figure 3.** Local mean decomposition.

The original vibration signal *x*(*t*) is decomposed by LMD as follows. The mean value *mi* of adjacent local extrema is calculated, and the curve is smoothed by the sliding-average method to obtain the mean function *mij*; the envelope function *aij* is calculated in the same way. The mean function is separated from the original vibration signal to obtain *hij*(*t*), which is then demodulated by the envelope to obtain *sij*(*t*). Once *sij*(*t*) is a pure frequency-modulated signal, the PF component *PFi*(*t*) is obtained from it and the instantaneous amplitude function *ai*(*t*), and the residual signal *ui*(*t*) is calculated. If *ui*(*t*) is a monotonic function, the decomposition ends and all PF components have been obtained. The decomposition result is shown in Figure 4, where the original data *X*(*t*) is decomposed into five PF components by LMD.
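
For concreteness, the sifting loop above can be sketched with NumPy. This is a minimal illustration of the steps just described (local means and envelopes from adjacent extrema, sliding-average smoothing, demodulation, and PF extraction), not the authors' implementation; the smoothing window, tolerance, and stopping rules are assumptions.

```python
import numpy as np

def _extrema(x):
    """Indices of local maxima and minima (endpoints included)."""
    d = np.diff(np.sign(np.diff(x)))
    idx = np.where(d != 0)[0] + 1
    return np.unique(np.concatenate(([0], idx, [len(x) - 1])))

def _smooth(y, w):
    """Sliding (moving) average with an odd window of width w."""
    w = max(3, w | 1)
    pad = w // 2
    return np.convolve(np.pad(y, pad, mode="edge"), np.ones(w) / w, mode="valid")

def lmd(x, max_pf=5, max_sift=10, tol=1e-2):
    """Decompose x into product functions (PFs) and a monotonic residual."""
    pfs, u = [], x.astype(float)
    for _ in range(max_pf):
        if len(_extrema(u)) < 4:               # residual is (near-)monotonic: stop
            break
        s, a_i = u.copy(), np.ones_like(u)
        for _ in range(max_sift):
            ext = _extrema(s)
            if len(ext) < 3:
                break
            t_mid = (ext[:-1] + ext[1:]) / 2   # midpoints of adjacent extrema
            m = np.interp(np.arange(len(s)), t_mid, (s[ext][:-1] + s[ext][1:]) / 2)
            a = np.interp(np.arange(len(s)), t_mid, np.abs(s[ext][:-1] - s[ext][1:]) / 2)
            w = int(np.mean(np.diff(ext)))     # smoothing window from extrema spacing
            m, a = _smooth(m, w), _smooth(a, w)
            h = s - m                          # separate the local mean: h_ij(t)
            s = h / np.maximum(a, 1e-12)       # demodulate: s_ij(t)
            a_i *= a                           # accumulate instantaneous amplitude a_i(t)
            if np.max(np.abs(a - 1)) < tol:    # s is (nearly) a pure FM signal
                break
        pf = a_i * s                           # PF_i(t) = a_i(t) * s_i(t)
        pfs.append(pf)
        u = u - pf                             # residual u_i(t)
    return pfs, u
```

Calling `lmd(x, max_pf=5)` on a vibration window returns up to five PF components plus the residual, mirroring the decomposition shown in Figure 4.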

**Figure 4.** LMD motor vibration signal decomposition.

Convolutional neural networks are often used to process two-dimensional image signals, while the motor vibration signal *X*(*t*) is a one-dimensional time-series signal, as follows

$$X(t) = [x\_1, x\_2, x\_3, \cdots, x\_l]. \tag{8}$$

Therefore, the vibration signal is converted into a two-dimensional matrix *X*′(*t*) ∈ R*M*×*N*

$$X'(t) = \begin{bmatrix} x\_{11} & x\_{12} & \cdots & x\_{1n} \\ x\_{21} & x\_{22} & \cdots & x\_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x\_{m1} & x\_{m2} & \cdots & x\_{mn} \end{bmatrix} \tag{9}$$

Each PF component is converted into two-dimensional data, as shown in Figure 5. The PF components are then concatenated with the two-dimensional matrix *X*′(*t*) of the original vibration signal along the channel dimension to obtain the final input matrix of the convolutional neural network. This construction, sketched below, enhances the feature representation of the vibration signal in the spatial dimension.
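
A short sketch of this construction, assuming 1024-sample windows reshaped to 32 × 32 as in Section 4 (the function name is illustrative):

```python
import numpy as np

def to_cnn_input(x, pfs, m=32, n=32):
    """Reshape the raw signal and each PF into m x n matrices and stack them
    as channels; with 5 PFs this yields the 6 x 32 x 32 input of Section 4."""
    channels = [x] + list(pfs)
    return np.stack([np.asarray(c)[: m * n].reshape(m, n) for c in channels])
```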

### *3.2. CNN Module Based on Attention Mechanism*

The convolutional neural network takes the multidimensional matrix of the motor vibration signal as input and adaptively extracts the spatial features of the signal. Different features have different effects on the fault diagnosis results. As shown in Figure 5, the same vibration signal is decomposed into distinct PF components, which leads to large differences between the channels of the input 3D matrix *Xin* ∈ R*c*×*M*×*N*. Different channels also affect the diagnosis results differently for different fault types. Therefore, an attention mechanism is added in the channel dimension so that the model adaptively extracts the features of each channel.

**Figure 5.** Two-dimensional vibration matrix visualization. (**a**) is the original vibration signal. (**b**–**f**) are the PF components.

The structure of the channel attention is shown in Figure 6, where the input matrix *Xin* is convolved to obtain x ∈ R*c*×*m*×*n*, as in Equation (10); the ⊗ in Figure 6 represents the element-by-element multiplication used to rescale the features.

$$x = w\_i \otimes X\_{in} + b\_i \tag{10}$$

**Figure 6.** Channel attention module.

Then the *m* × *n* spatial dimensions are compressed to 1 × 1 by global average pooling, which captures the global feature distribution of the input matrix in the channel dimension and yields the feature map

$$map = \frac{1}{m \times n} \sum\_{i=1}^{m} \sum\_{j=1}^{n} x(i,j) \tag{11}$$

The feature map is adjusted nonlinearly by the fully connected layers (FC), and the module uses the sigmoid function to obtain the attention weights of the channel dimension, *Catte*

$$C\_{atte} = \sigma(w\_s \cdot \text{ReLU}(w\_r \cdot map + b\_r) + b\_s) \tag{12}$$

Finally, the input features *Xin* are multiplied by the channel weights to rescale the features in the channel dimension.
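
A minimal PyTorch sketch of Equations (10)–(12) is given below; the convolution of Equation (10) is assumed to precede this module, and the reduction ratio between the two FC layers is an assumption, since the paper does not state it.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average pooling (Equation (11)), two FC
    layers with ReLU and sigmoid (Equation (12)), then channel rescaling."""
    def __init__(self, channels, reduction=4):   # reduction ratio is assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # w_r, b_r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # w_s, b_s
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, c, m, n)
        b, c = x.shape[:2]
        map_ = x.mean(dim=(2, 3))          # Equation (11): (batch, c)
        c_atte = self.fc(map_).view(b, c, 1, 1)   # Equation (12)
        return x * c_atte                  # rescale the channel features
```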

The channel attention thus completes the rescaling of the original features in the channel dimension. However, as shown in Figure 7, there are also large differences between the vibration data of different fault types within the same channel. The convolutional neural network therefore also needs to consider the influence of features at different positions on the diagnosis results when extracting features. For this reason, this paper uses position attention to make the network focus on the spatial distribution of the vibration signal features.

**Figure 7.** Data visualization of different fault types in the same channel.

The structure of the position attention is shown in Figure 8. The input features *x* ∈ R*c*×*m*×*n* are processed by max pooling and average pooling separately to obtain the feature maps *fmax* ∈ R1×*m*×*n* and *favg* ∈ R1×*m*×*n*. The feature maps are then concatenated in the channel dimension. Finally, the concatenated map is passed through a convolution layer and a sigmoid activation function to obtain the position attention *Patten*

$$P\_{atten} = \sigma(\text{conv}(\text{concat}(f\_{max}, f\_{avg}))). \tag{13}$$

**Figure 8.** Position attention module.
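
Equation (13) admits a similarly compact sketch; the 7 × 7 kernel size is an assumption, as the paper only specifies a convolution followed by a sigmoid.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Position attention: channel-wise max and average pooling, concatenation,
    a convolution, and a sigmoid gate (Equation (13))."""
    def __init__(self, kernel_size=7):     # kernel size is assumed
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                  # x: (batch, c, m, n)
        f_max = x.max(dim=1, keepdim=True).values   # f_max: (batch, 1, m, n)
        f_avg = x.mean(dim=1, keepdim=True)         # f_avg: (batch, 1, m, n)
        p_atten = torch.sigmoid(self.conv(torch.cat([f_max, f_avg], dim=1)))
        return x * p_atten                 # rescale the position features
```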

### **4. Spatiotemporal Feature Fusion Network**

The structure of the spatiotemporal feature fusion network is shown in Figure 9. STNet uses a GRU to extract the temporal features of the one-dimensional vibration signal. The GRU branch introduces an attention mechanism to weigh the contribution of each time step's state over the long sequence. Meanwhile, the original vibration sequence is decomposed by LMD for time-frequency analysis, and the original vibration data and each PF component are converted into a multidimensional matrix as the input of the CNN. The CNN branch adaptively extracts the spatial features of the input matrix by convolutions. Considering the influence of different channels and different positions on the fault features, the CNN branch adds channel attention and position attention to selectively enhance the spatial features of the signal; the attention mechanisms acquire rich contextual information. Finally, the spatial and temporal features of the vibration signal are fused, and the softmax layer classifies the fused features.

**Figure 9.** Spatiotemporal feature fusion network.

STNet is a dual-stream network consisting of a GRU branch and a CNN branch. The specific network layers are shown in Table 1, where Conv-BN denotes a convolution layer followed by a batch normalization layer, and FC is a fully connected layer. The input of the CNN branch is the vibration signal matrix of size 6 × 32 × 32. The network extracts features with 3 × 3 convolution kernels using "SAME" padding, and the resulting features are normalized by the BN layer and activated by ReLU. The CNN branch recalibrates the original features by channel attention and position attention. At each stage, the spatial resolution of the feature map is halved and the number of channels is doubled relative to the previous stage. After three stages of feature extraction, the network obtains a feature map of size 128 × 8 × 8, which is fed into a fully connected layer with 1024 neurons. The input of the GRU branch is the original vibration signal with 1024 sampling points. The network obtains temporal features through a 2-layer GRU attention unit, and these features are fed into a fully connected layer with 128 neurons. The fully connected layers of the CNN branch and the GRU branch are concatenated, giving 1152 neurons. The network is then adjusted nonlinearly by two fully connected layers, and the diagnosis results for eight fault types are finally output by the softmax function.
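
Under one plausible reading of Table 1 (the paper does not state where pooling is applied, the per-stage channel widths, or the width of the two head layers, so these are assumptions), the dual-stream network can be sketched as follows, reusing the ChannelAttention and PositionAttention sketches above and simplifying the GRU attention unit to taking the last hidden state:

```python
import torch
import torch.nn as nn

class STNet(nn.Module):
    """Dual-stream sketch of Figure 9 / Table 1: a CNN branch on the
    6 x 32 x 32 matrix and a GRU branch on the 1024-point sequence."""
    def __init__(self, n_classes=8):
        super().__init__()
        def stage(c_in, c_out, pool):
            layers = [nn.Conv2d(c_in, c_out, 3, padding=1),  # "SAME" padding
                      nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
                      ChannelAttention(c_out), PositionAttention()]
            if pool:
                layers.append(nn.MaxPool2d(2))   # halve the spatial resolution
            return nn.Sequential(*layers)
        self.cnn = nn.Sequential(stage(6, 32, True),     # -> 32 x 16 x 16
                                 stage(32, 64, True),    # -> 64 x 8 x 8
                                 stage(64, 128, False))  # -> 128 x 8 x 8
        self.cnn_fc = nn.Linear(128 * 8 * 8, 1024)
        self.gru = nn.GRU(input_size=1, hidden_size=128, num_layers=2,
                          batch_first=True)
        self.gru_fc = nn.Linear(128, 128)
        self.head = nn.Sequential(nn.Linear(1024 + 128, 256),  # widths assumed
                                  nn.ReLU(inplace=True),
                                  nn.Linear(256, n_classes))

    def forward(self, x_img, x_seq):
        # x_img: (batch, 6, 32, 32); x_seq: (batch, 1024, 1)
        f_cnn = self.cnn_fc(self.cnn(x_img).flatten(1))   # 1024 spatial features
        out, _ = self.gru(x_seq)
        f_gru = self.gru_fc(out[:, -1])                   # 128 temporal features
        return self.head(torch.cat([f_cnn, f_gru], dim=1))  # fused: 1152 -> 8
```

The softmax itself is omitted here because PyTorch's CrossEntropyLoss applies it internally during training.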

When STNet extracts features, there are significant differences between the spatial features extracted by the CNN and the temporal features extracted by the GRU. Therefore, a CNN auxiliary loss function and a GRU auxiliary loss function are added during training. The auxiliary loss functions supervise the spatial and temporal features extracted by the network separately, reducing the generation of invalid information; they not only promote the backpropagation of the network but also enhance the canonical representation of the temporal and spatial features. The final loss function *L*total of the network is as follows

$$L = \frac{1}{N} \sum\_{i} L\_i = -\frac{1}{N} \sum\_{i} \sum\_{c=1}^{M} y\_{ic} \log(p\_{ic}) \tag{14}$$

$$L\_{\text{total}} = \alpha L\_{\text{CNN}} + \beta L\_{\text{GRU}} + L\_{\text{loss}} \tag{15}$$

where *M* is the number of categories; *yic* is an indicator that equals 1 if sample *i* belongs to class *c* and 0 otherwise; *pic* is the predicted probability that sample *i* belongs to class *c*; *α* and *β* are the weights of the auxiliary loss functions *L*CNN and *L*GRU of the two branches; and *L*loss is the cross-entropy loss (14) applied to the fused output.
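
A minimal sketch of Equations (14) and (15), assuming each branch has its own auxiliary classifier producing logits (the paper does not specify its form) and illustrative weights α = β = 0.3:

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()   # cross-entropy of Equation (14), batch-averaged

def total_loss(fused_logits, cnn_logits, gru_logits, y, alpha=0.3, beta=0.3):
    """Equation (15): L_total = alpha * L_CNN + beta * L_GRU + L_loss."""
    return (alpha * ce(cnn_logits, y)      # auxiliary loss of the CNN branch
            + beta * ce(gru_logits, y)     # auxiliary loss of the GRU branch
            + ce(fused_logits, y))         # loss of the fused softmax output
```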


