**1. Introduction**

The asynchronous motor is the most widely used mechanical drive in industrial production and has become an important component in fields such as machinery manufacturing [1–3] and intelligent transportation [4,5]. Because of harsh working environments, overload, and complex electromagnetic relationships, the motor is prone to stator winding inter-turn short circuits, broken rotor bars, air gap eccentricity, and bearing wear [6–8]. The failure of an asynchronous motor during operation may cause huge economic losses and casualties. It is therefore very important to evaluate the working state of the motor and detect potential faults in order to prevent mechanical accidents. Motor fault diagnosis also plays an important role in equipment maintenance, where it can improve machine quality and reduce maintenance costs.

**Citation:** Wang, L.; Zhang, C.; Zhu, J.; Xu, F. Fault Diagnosis of Motor Vibration Signals by Fusion of Spatiotemporal Features. *Machines* **2022**, *10*, 246. https://doi.org/10.3390/machines10040246

Academic Editor: Alejandro Gómez Yepes

Received: 4 March 2022 Accepted: 24 March 2022 Published: 30 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The most common approach to motor fault diagnosis is the analysis of vibration signals, which can be collected using acceleration transducers. Abnormal vibration signals can characterize equipment faults such as asymmetry of the shaft system [9], loose connections between components [10], and damaged rotor bearings [11]. The acquisition and analysis of vibration signals has therefore become a common fault diagnosis scheme in the field of rotating machinery [12,13]. Fault diagnosis methods based on vibration signals [14,15] mainly comprise two stages: feature extraction and pattern recognition. The key to asynchronous motor fault diagnosis is extracting feature information from non-stationary vibration signals with time-varying characteristics. In the time domain, some works [15,16] used amplitude, root mean square, and kurtosis to analyze and diagnose vibration signals. However, these statistics are susceptible to environmental noise, which limits the methods built on them. Other works [17,18] used the Fourier transform to convert the signal from the time domain to the frequency domain, but this cannot effectively capture how the frequency characteristics of the vibration signal change over time. Time-frequency analysis based on the wavelet transform [19], the short-time Fourier transform [20,21], and empirical mode decomposition [22,23] extracts both time-domain and frequency-domain features. However, these methods are effective only for specific features and have poor adaptivity and robustness.
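As a rough illustration of the time-domain statistics mentioned above (amplitude, root mean square, and kurtosis), the following minimal sketch computes them for a synthetic signal using only the Python standard library. The function name `time_domain_features`, the synthetic sine, and the added impulses are assumptions made for illustration; they are not taken from the cited works or from the data used in this paper.

```python
import math


def time_domain_features(signal):
    """Return peak amplitude, RMS, and kurtosis of a 1-D sequence of samples."""
    n = len(signal)
    mean = sum(signal) / n
    peak = max(abs(x) for x in signal)
    rms = math.sqrt(sum(x * x for x in signal) / n)
    variance = sum((x - mean) ** 2 for x in signal) / n
    fourth_moment = sum((x - mean) ** 4 for x in signal) / n
    # Non-excess (Pearson) kurtosis: about 3 for Gaussian noise, 1.5 for a pure sine.
    kurtosis = fourth_moment / variance ** 2 if variance > 0 else float("nan")
    return {"peak": peak, "rms": rms, "kurtosis": kurtosis}


# Synthetic stand-in for a smooth vibration signal: 5 full cycles of a unit sine.
n_samples = 1000
healthy = [math.sin(2 * math.pi * 5 * i / n_samples) for i in range(n_samples)]

# Hypothetical impulsive signal: the same sine with a spike every 100 samples.
impulsive = [x + (4.0 if i % 100 == 0 else 0.0) for i, x in enumerate(healthy)]

healthy_features = time_domain_features(healthy)
impulsive_features = time_domain_features(impulsive)
```

In this toy case the kurtosis of the impulsive signal is clearly higher than that of the pure sine, which is why such statistics are attractive as fault indicators. Because they are computed directly from raw samples, additive noise shifts them as well, which is consistent with the noise sensitivity noted above.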

With the rise of deep learning, neural networks have been introduced into the field of fault diagnosis [24–26]. The vibration characteristics of the signal can be obtained adaptively by learning the nonlinear mapping between the hidden layers of the network. Deep learning-based methods are less interpretable [27] but achieve high recognition accuracy, and they overcome the disadvantages of traditional methods, which require manual feature extraction and adapt poorly. Shi et al. [28] used a long short-term memory (LSTM) neural network to extract the temporal features of bearing vibration signals. However, the local information of the signal in the spatial dimension was ignored, and the key information could not be fully retained when the data sequence was too long. Gao et al. [29] combined one-dimensional convolution and adaptive noise cancellation techniques to suppress the strong interference components in the one-dimensional time series of gearboxes. However, the time-series features of the vibration signal were not fully utilized because of the limited receptive field of the convolutional neural network. Zhu et al. [30] reconstructed the one-dimensional time-domain sequence into a two-dimensional data format and used two-dimensional convolution to capture the spatial features of the vibration signal. However, the dependencies between the positions of the spatial features were ignored, so some important features did not play a significant role. Because of the convolutional stride and weight sharing, convolutional neural networks [31,32] cannot accurately obtain the temporal features of the vibration signal. In contrast, recurrent neural networks [33] can handle the temporal features of the signal but do not consider information in the spatial dimension.
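To make the idea of reconstructing a one-dimensional sequence into a two-dimensional format more concrete, the sketch below folds a sample sequence row by row into a matrix that a two-dimensional convolution could then process. The row-major layout, the helper name `fold_to_2d`, and the choice to drop leftover samples are assumptions for illustration; the exact reconstruction used in [30] is not described here.

```python
def fold_to_2d(signal, width):
    """Fold a 1-D sequence into rows of `width` samples, dropping any remainder."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Keep only as many samples as fill complete rows.
    usable = len(signal) - len(signal) % width
    return [list(signal[i:i + width]) for i in range(0, usable, width)]


# Ten consecutive samples folded into rows of four; the last two are discarded.
samples = list(range(10))
image = fold_to_2d(samples, 4)
```

With this layout, samples that are `width` steps apart in time become vertical neighbours, so a small two-dimensional kernel sees both short-range and longer-range structure. The order of samples along the original time axis is only partly preserved, which is one way to read the limitation on temporal features noted above.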

At present, motor fault diagnosis methods analyze either the temporal features or the spatial features of vibration signals, but not both. In this paper, spatial and temporal features are combined to construct a spatiotemporal feature fusion network (STNet) for fault diagnosis of motor vibration signals. The network addresses the accuracy loss caused by excessively long signal sequences and the missing dependencies between positions. The main contributions of this paper are listed as follows.


The rest of this paper is structured as follows. Section 2 presents the attention-based GRU used to capture the temporal features of vibration signals. Section 3 describes data enhancement by local mean decomposition and the extraction of spatial features of vibration signals using a CNN with channel and position attention. Section 4 proposes the spatiotemporal feature fusion network. Section 5 validates the model experimentally.
