*5.2. Experiment Analysis*

When the network extracts features with the GRU, only the time-domain features of the vibration signal are captured. However, the vibration signal also contains rich frequency-domain features. Therefore, the original vibration data are decomposed by LMD; the decomposition results for each fault type are shown in Figure 10. When the accelerometer records abnormal vibration, each PF component exposes the amplitude-modulated and frequency-modulated content of that vibration. The decomposition thereby enriches the vibration signal, and the CNN extracts vibration features from both the original vibration sequence and each PF component.
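As a rough illustration of how LMD separates such amplitude- and frequency-modulated content, the sketch below performs a single sifting step on a 1-D signal. The function name, smoothing window, and single-pass simplification are ours; a full LMD iterates this step until the FM part is purely frequency-modulated and repeats it to extract successive PF components.

```python
import numpy as np

def lmd_sift_once(x, smooth=31):
    """One LMD sifting step (illustrative sketch). The local mean and local
    magnitude are estimated from successive extrema, smoothed by a moving
    average, and used to split off an envelope and an FM part."""
    d = np.diff(x)
    ext = np.where(d[:-1] * d[1:] < 0)[0] + 1        # indices of local extrema
    ext = np.concatenate(([0], ext, [len(x) - 1]))
    m = np.empty_like(x)
    a = np.empty_like(x)
    for k in range(len(ext) - 1):
        lo, hi = ext[k], ext[k + 1]
        m[lo:hi + 1] = (x[lo] + x[hi]) / 2.0         # local mean of adjacent extrema
        a[lo:hi + 1] = abs(x[lo] - x[hi]) / 2.0      # local magnitude (envelope)
    kern = np.ones(smooth) / smooth
    m = np.convolve(m, kern, mode="same")            # moving-average smoothing
    a = np.convolve(a, kern, mode="same")
    a = np.maximum(a, 1e-12)                         # guard against division by zero
    return a, (x - m) / a                            # envelope, FM part
```

Applied to an AM-FM test signal, the returned envelope tracks the amplitude modulation while the FM part carries the frequency modulation.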

A convolutional neural network is particularly well suited to two-dimensional image recognition because of its local weight sharing and local receptive fields. The visualization results of each fault signal transformed into a two-dimensional matrix are shown in Figure 11. The original vibration signal contains 1024 sampling points, so the transformed 2D matrix is 32 × 32. Similarly, each PF component is transformed into a two-dimensional matrix and concatenated with the two-dimensional matrix of the original vibration signal along the channel dimension, giving a CNN branch input of size 6 × 32 × 32. The visualizations show that the PF component matrices of different faults differ markedly across dimensions, so the fault features extracted by the CNN should have a positive effect on diagnostic performance.
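The channel stacking described above can be sketched as follows; a row-major reshape is assumed, since the paper does not state the fill order of the 32 × 32 matrix, and the function name is ours.

```python
import numpy as np

def to_cnn_input(signal, pf_components):
    """Stack the raw 1024-point signal and its five PF components into the
    6 x 32 x 32 input of the CNN branch (row-major reshape assumed)."""
    channels = [np.asarray(signal)] + [np.asarray(p) for p in pf_components]
    assert len(channels) == 6 and all(c.size == 1024 for c in channels)
    return np.stack([c.reshape(32, 32) for c in channels])
```

For a 1024-point signal `x` and five PF components `pfs`, `to_cnn_input(x, pfs).shape` is `(6, 32, 32)`.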

**Figure 11.** Two-dimensional matrix visualization of fault data. (**a**) normal; (**b**) 2 turns short circuit; (**c**) 4 turns short circuit; (**d**) 8 turns short circuit; (**e**) air gap eccentricity; (**f**) broken rotor strip; (**g**) bearing seat damage; (**h**) bearing wear.

The input size of the CNN branch is 6 × 32 × 32, and the input sequence length of the GRU branch is 1024. The network is trained for 100 epochs with a batch size of 600, and the model parameters are updated with the Adam optimizer. The learning rate follows the "Poly" decay strategy, with an initial learning rate of 0.001 and a power of 0.9. The loss function is the cross-entropy loss; the auxiliary loss of the CNN branch is weighted by 0.1 and that of the GRU branch by 0.9. The evaluation metric is accuracy. The loss and accuracy curves of the training and test sets versus the number of epochs are shown in Figure 12.
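The "Poly" schedule and the weighted loss can be written compactly as below. The decay formula is the standard one for this strategy; how the two auxiliary losses combine with the main loss is an assumption, since the paper gives only the weights.

```python
def poly_lr(epoch, total_epochs=100, base_lr=1e-3, power=0.9):
    """'Poly' learning-rate decay: lr = base_lr * (1 - epoch/total)^power."""
    return base_lr * (1.0 - epoch / total_epochs) ** power

def total_loss(main, cnn_aux, gru_aux, w_cnn=0.1, w_gru=0.9):
    """Combine the main cross-entropy loss with the weighted auxiliary
    losses of the two branches (summation with an unweighted main loss
    is our assumption)."""
    return main + w_cnn * cnn_aux + w_gru * gru_aux
```

The schedule starts at 0.001 and decays monotonically to zero over the 100 training epochs.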

**Figure 12.** Training process loss and accuracy variation.

The test set loss drops sharply in the first 10 epochs, and the training and test losses then decrease gradually as the iterations continue, indicating that the model is converging and the loss is approaching 0. After 60 epochs, the training loss and test loss nearly overlap; the curves show no large fluctuations and no sign of overfitting.

The trained model is validated on each fault type, and the results are shown in Table 3. There are three misclassified samples for the inter-turn short-circuit fault and three for bearing seat damage; the recognition accuracy for every fault type is above 99%. The model therefore has high recognition accuracy.


**Table 3.** The result of each category of fault identification.

To verify the contribution of each module in the STNet, five ablation experiments are set up; the results are shown in Table 4. The GRU, which extracts temporal features from the vibration signal, reaches an accuracy of 98.58%, while the CNN, which captures spatial features, reaches 98.83%. The CNN + GRU model, which fuses the temporal and spatial features, improves the accuracy over the two single branches by 0.39% and 0.04%, respectively, indicating that both the temporal and spatial features of the vibration signal are indispensable for fault diagnosis. Adding the attention module to the CNN and GRU branches (CNN + GRU + attention) improves the accuracy by a further 0.59%: the attention mechanism weighs the importance of different features and lets the important ones play a dominant role in the network. With the auxiliary loss functions, the full STNet reaches a final accuracy of 99.75%; the auxiliary losses facilitate backpropagation when updating the parameters and strengthen the feature representation of each branch.

**Table 4.** Ablation experiments.


To further investigate the effect of the attention module on network performance, the attention matrices of the GRU branch and the CNN branch are visualized. Figure 13a shows the channel attention of the three feature-extraction stages in the CNN branch, whose channel dimensions are 32, 64, and 128. The shallow layers of the CNN branch must extract the vibration signal thoroughly to preserve as much feature information as possible, so the attention varies only within a narrow range, from 0.48 to 0.51. As the network deepens and the number of channels grows, redundant features accumulate; the network suppresses the redundant channels while enhancing the effective ones, so the range of the channel attention widens. Figure 13b shows the position attention of the three feature-extraction stages in the CNN branch, with dimensions of 16 × 16, 8 × 8, and 8 × 8. The position attention becomes increasingly focused as the convolutional layers extract increasingly local features. Figure 13c shows the attention over the output features of the second GRU in the GRU branch. The GRU module outputs predictions at multiple time steps, and the attention represents the impact of each moment on the diagnostic result. The attention mechanism retains the moments with high relevance, so the GRU attention does not fluctuate greatly.
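A minimal sketch of channel attention in the squeeze-and-excitation style illustrates why near-uniform features yield gates near 0.5, as observed in the shallow layer. The paper's exact attention design is not given; the fully connected bottleneck of a full SE block is omitted here, and the function name is ours.

```python
import numpy as np

def channel_attention(feat):
    """Channel-attention sketch: globally average-pool each channel,
    gate the result with a sigmoid, and rescale the channels."""
    c = feat.shape[0]
    squeeze = feat.reshape(c, -1).mean(axis=1)   # (C,) per-channel descriptor
    gate = 1.0 / (1.0 + np.exp(-squeeze))        # sigmoid weights in (0, 1)
    return gate[:, None, None] * feat, gate

feat = np.random.randn(32, 16, 16)               # e.g., stage-one features
weighted, gate = channel_attention(feat)
```

Channels whose pooled descriptor is close to zero receive a gate near 0.5, while strongly activated or suppressed channels are pushed toward 1 or 0, which matches the widening attention range in the deeper stages.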

To further verify the fault diagnosis capability of the STNet, it is compared with the BP, 1D-CNN, multichannel-CNN, and inception-LSTM models; the experimental results are shown in Table 5. The BP network diagnoses fault types through nonlinear mapping alone, without considering the temporal and spatial features of the signal, so its recognition accuracy is only 96.12%. The 1D-CNN model uses 1D convolution to obtain abstract and local features of the vibration signal, improving the accuracy by 2.12% over the BP network. The multichannel-CNN model weights different receptive fields and captures contextual information at different scales. The inception-LSTM model extracts temporal information under several different receptive fields, with an accuracy of 99.34%. Among all five models, the STNet obtains the highest accuracy of 99.75%. Unlike the BP, 1D-CNN, and multichannel-CNN models, the STNet combines spatial and temporal features rather than relying on a single feature type; unlike the inception-LSTM model, it uses the attention mechanism to select features adaptively. Both the temporal and spatial features of the vibration signal therefore contribute positively to diagnostic performance. The STNet has 9.2876 M parameters and 0.02 G floating-point operations (FLOPs).
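Parameter and FLOP figures of this kind are typically accumulated layer by layer; the sketch below shows the count for a single 2-D convolution layer. One multiply-accumulate is counted as two FLOPs here, but conventions vary and the paper does not state its counting method, so this is an illustration rather than a reproduction of the 9.2876 M / 0.02 G figures.

```python
def conv2d_params(k, c_in, c_out, bias=True):
    """Parameter count of one k x k 2-D convolution layer."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

def conv2d_flops(k, c_in, c_out, h_out, w_out, bias=True):
    """FLOP count of one convolution layer, counting a MAC as two FLOPs."""
    macs = k * k * c_in * c_out * h_out * w_out
    return 2 * macs + (c_out * h_out * w_out if bias else 0)
```

For example, a 3 × 3 convolution mapping the 6-channel input to 32 channels has `(3 * 3 * 6 + 1) * 32 = 1760` parameters.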

**Table 5.** Model comparison experiments.

**Figure 13.** Attention visualization. (**a**) CNN branching channel attention visualization; (**b**) CNN branching position attention visualization; (**c**) GRU branch attention visualization.
