1. Introduction
Rolling bearings are among the most common parts of rotating machinery and are widely used in all kinds of mechanical equipment, including high-speed railways, airplanes, and automobiles [1]. With fault diagnosis technology, the health status of rolling bearings can be judged more accurately, which saves on the maintenance costs of mechanical equipment and avoids unnecessary waste [2]. A rolling bearing failure can cause property loss and can even threaten the life and health of staff. Therefore, the fault diagnosis of rolling bearings is of great significance [3].
After years of development, rolling bearing fault diagnosis has gradually moved from traditional methods to intelligent fault diagnosis. Traditional rolling bearing fault diagnosis methods depend heavily on the practitioner’s professional knowledge and work experience, so they can be influenced by the practitioner’s subjective judgment [4]. Conventional methods also require staff to test and maintain equipment regularly, which, for a large organization in continuous operation, requires a large annual investment [5]. Although traditional rolling bearing fault diagnosis technology has played an important role, the continuous progress of science and technology has produced a succession of more intelligent fault diagnosis methods. These methods meet current needs because they improve the reliability and accuracy of fault diagnosis [6].
Intelligent fault diagnosis techniques based on artificial intelligence have gradually emerged over the years. Data acquired by various equipment sensors are analyzed using machine learning, deep learning, and other technologies to perform feature extraction and pattern recognition [7], in order to determine whether a potential failure or an abnormal situation is occurring in industrial equipment and, if so, which type of failure it is [8].
The use of deep learning techniques in the field of fault diagnosis is in line with the broader trend of computing improving people’s lives. Compared to previous fault diagnosis methods, deep learning–based methods can realize an “end-to-end” fault diagnosis process, avoiding the troublesome manual feature extraction process [9]. Within deep learning, neural network models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs) are currently very popular [10]. These network models have also been widely used in the field of fault diagnosis. Liu et al. [11] proposed combining GST and SVD to extract localized damage in rolling bearings and compared the approach with other commonly used methods to verify its feasibility. Pan et al. [12] proposed an improved bearing fault diagnosis method combining a CNN and LSTM, where the input is the raw sampled signal without any preprocessing or conventional feature extraction. Huang et al. [13] used EMD to process bearing vibration signals to reduce noise and then constructed a convolutional recurrent neural network as a rolling bearing fault diagnosis classifier using the envelope processed via EMD. Lu et al. [14] proposed a rolling bearing fault diagnosis method combining LSTM and an autoencoder. Autoencoders can automatically learn useful features from vibration signals, LSTM is suited to processing time series data, and the LSTM network is used as the encoder and decoder of the autoencoder. Experiments showed that the proposed algorithm has good multi-class classification performance. In recent years, the Transformer model has stood out in the field of deep learning. The Transformer is a classic NLP model, valued for its excellent sequence modeling ability and its suitability for parallel computing [15]. Its most distinctive feature is the use of an attention mechanism to calculate its input and output, as well as its balanced processing power. It does not adopt the sequential structure of traditional RNNs and also avoids the limited receptive field of CNNs, enabling it to capture global information [16].
Although current research on deep learning fault diagnosis methods has achieved initial results, most existing studies take one-dimensional vibration signal data as input. Research on deep learning fault diagnosis methods that take two-dimensional data as input is less developed, and there are few studies on the analysis of two-dimensional data in more complex situations [17]. Therefore, the potential advantages of converting the vibration signal into two-dimensional image data and using these images as the input of the deep learning model are worth exploring in depth. Visualizing the vibration signal can not only retain the information contained in the vibration signal with high quality but also optimize the preprocessing of the vibration signal data [18]. At the same time, deep learning models have good recognition and processing characteristics for the converted two-dimensional image data, and the diagnosis accuracy is high.
To sum up, in order to address the problems of incomplete feature extraction, overly complex feature extraction, and external noise interference in the fault diagnosis of rolling bearings, and to avoid both the limited receptive field of CNNs and the step-by-step computation that RNNs require when processing sequence data, this article proposes a rolling bearing fault diagnosis method that combines two-dimensional vibration images with Transformer models. The main contributions of this paper are as follows:
(1) A rolling bearing fault diagnosis method based on SVD-GST combined with the Vision Transformer is proposed. A fault diagnosis experimental platform is built, and the model is verified to have high accuracy and feasibility through experiments.
(2) In the process of using SVD noise reduction, the singular value energy difference spectrum is introduced to determine the order, which solves the problem of how to determine the effective order of the reconstruction matrix after the vibration signal of the rolling bearing is decomposed.
(3) It is verified that the Vision Transformer model can mine more hidden fault information and reduce information loss for the two-dimensional vibration images of rolling bearings obtained using GST.
The rest of this paper is organized as follows:
Section 2 introduces the algorithm principle, including SVD, GST, and the Vision Transformer;
Section 3 introduces the rolling bearing fault diagnosis model;
Section 4 builds a fault diagnosis experimental platform and introduces the process of vibration signal acquisition;
Section 5 analyzes the experimental results in various ways as well as compares them with other models; and
Section 6 is the conclusion of this paper.
2. Principle Introduction
2.1. SVD
Singular value decomposition (SVD) is a very important matrix decomposition technique in numerical analysis and linear algebra. It has applications in image processing, data reduction, and signal noise reduction [19]. By using SVD, we can remove noise from the vibration signal of the rolling bearing so that a cleaner and more accurate vibration signal can be obtained [20]. The open problem in SVD noise reduction is how to determine the effective order of the reconstruction matrix after the vibration signal of the rolling bearing is decomposed. Currently, the effective order is determined using methods such as the threshold method and singular entropy increment [21]. However, these methods rely heavily on user experience, so the noise reduction effect is not obvious for the vibration signal of the rolling bearing. To solve this problem, this paper introduces the singular value energy difference spectrum to determine the order so as to achieve noise reduction.
The vibration signal of a rolling bearing is usually a one-dimensional signal, which cannot be directly subjected to SVD [22], so the one-dimensional vibration signal $x = [x(1), x(2), \ldots, x(N)]$ must be converted into a two-dimensional matrix. Through the Hankel matrix, the signal can be represented as a low-rank approximation. This paper chooses to construct the Hankel matrix. The Hankel matrix $H$ is shown in Formula (1):
$$H = \begin{bmatrix} x(1) & x(2) & \cdots & x(n) \\ x(2) & x(3) & \cdots & x(n+1) \\ \vdots & \vdots & \ddots & \vdots \\ x(N-n+1) & x(N-n+2) & \cdots & x(N) \end{bmatrix} = D + W \tag{1}$$
In Formula (1), $H$ is the constructed Hankel matrix, $H \in \mathbb{R}^{(N-n+1) \times n}$. The noise signal is $W$, and the useful vibration signal is $D$. When $n = N/2$, the Hankel matrix noise reduction effect is generally more obvious. Choosing the size parameter of the Hankel matrix as half the length of the original signal tends to give a better separation effect, especially in noise reduction applications, which suppresses noise to a certain extent while preserving important features of the signal. This choice is motivated by three considerations: capturing the trend and periodic characteristics of the signal, suppressing high-frequency noise, and separating signal from noise.
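The Hankel construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `hankel_matrix`, the argument `n` (number of columns), and the toy signal are assumptions made for the example, and `n` defaults to half the signal length as discussed in the text.

```python
import numpy as np

def hankel_matrix(x, n=None):
    """Return the (N - n + 1) x n Hankel matrix of a 1D signal x."""
    x = np.asarray(x, dtype=float)
    if n is None:
        n = x.size // 2  # half the signal length, the size discussed in the text
    rows = x.size - n + 1
    # Entry (i, j) is x[i + j], so every anti-diagonal is constant.
    idx = np.arange(rows)[:, None] + np.arange(n)[None, :]
    return x[idx]

# Toy signal 1, 2, ..., 8 (illustrative only, not bearing data).
signal = np.arange(1.0, 9.0)
H = hankel_matrix(signal)  # 5 rows, 4 columns
singular_values = np.linalg.svd(H, compute_uv=False)
```

Because each row is the previous row shifted by one sample, the matrix of a signal made of a few oscillating components has only a few dominant singular values, which is what the later order-selection step relies on.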
On the problem of singular value order determination, the order is determined by the singular value energy distribution of the useful signal and the noise signal in the vibration signal of the rolling bearing. The signal energy is shown in Formula (2):
$$E = \sum_{i=1}^{q} \sigma_i^2 \tag{2}$$
In Formula (2), the signal energy is represented by $E$, $\sigma_i$ represents the $i$-th singular value, and $q$ is the total order. The singular value energy difference spectrum is defined and normalized as shown in Formula (3):
$$P_i = \frac{\sigma_i^2 - \sigma_{i+1}^2}{E}, \quad i = 1, 2, \ldots, q-1 \tag{3}$$
In Formula (3), the sequence $P = (P_1, P_2, \ldots, P_{q-1})$ represents the energy difference spectrum. Formula (3) describes how the energy changes between adjacent orders of the singular values. The singular value energy ratio of the useful signal is relatively large, so a large peak is formed in the spectrum. The components after the peak are generated by the noise signal. The singular value corresponding to this peak is located in the energy difference spectrum, and its order is taken as the order of the reconstructed signal, which removes the noise from the vibration signal of the rolling bearing.
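As a hedged sketch of the order-selection idea, the following NumPy code computes singular value energies, forms the normalized difference between adjacent energies, and truncates the SVD at the peak. Several details are assumptions rather than the paper's stated procedure: the difference spectrum is normalized by the total energy, the order is taken at the largest peak, the denoised signal is recovered by averaging anti-diagonals of the truncated matrix, and the test signal is a synthetic noisy sinusoid. The name `svd_denoise` is hypothetical.

```python
import numpy as np

def svd_denoise(x, n=None):
    """Denoise a 1D signal by truncating the SVD of its Hankel matrix."""
    x = np.asarray(x, dtype=float)
    N = x.size
    n = N // 2 if n is None else n
    idx = np.arange(N - n + 1)[:, None] + np.arange(n)[None, :]
    H = x[idx]                                   # Hankel matrix of the signal

    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    energy = np.sum(s**2)                        # total singular value energy
    diff_spectrum = (s[:-1]**2 - s[1:]**2) / energy
    # Assumption: the effective order is the position of the largest peak.
    order = int(np.argmax(diff_spectrum)) + 1

    H_low = (U[:, :order] * s[:order]) @ Vt[:order]
    # Average each anti-diagonal to map the matrix back to a 1D signal.
    total = np.zeros(N)
    count = np.zeros(N)
    np.add.at(total, idx, H_low)
    np.add.at(count, idx, 1.0)
    return total / count, order, diff_spectrum

# Synthetic example (not bearing data): a sinusoid plus white noise.
rng = np.random.default_rng(0)
t = np.arange(512)
clean = np.sin(2 * np.pi * 20 * t / 512)
noisy = clean + 0.3 * rng.standard_normal(512)
denoised, order, spectrum = svd_denoise(noisy)
```

A single sinusoid occupies two nearly equal singular values, so the energy drops sharply after the second one and the peak of the difference spectrum marks where the noise floor begins.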
2.2. GST
The generalized S-transform (GST) is a time–frequency analysis method that combines time-domain and frequency-domain signal analysis [23]. By jointly analyzing the signal in the time and frequency domains, GST provides a more detailed and comprehensive signal characterization and can obtain the instantaneous frequency information of the signal [24]. The principle of GST builds on the ideas of the short-time Fourier transform (STFT) and the continuous wavelet transform (CWT). Its core concept is to perform local spectral analysis of the signal at different time points [25]. The specific principles are as follows:
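$$S(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, w(\tau - t, f)\, e^{-i 2\pi f t}\, dt \tag{4}$$
In Formula (4), $S(\tau, f)$ represents the S-transform, $x(t)$ represents the signal to be analyzed, and the translation amount is represented by $\tau$. $w(\tau - t, f)$ represents the Gaussian window function, $w(t, f) = \frac{1}{\sigma(f)\sqrt{2\pi}} e^{-t^2 / (2\sigma(f)^2)}$, and $\sigma(f) = 1/|f|$. GST is a modification of the S-transform formula. By adding the parameter $m$ to adjust the Gaussian window width, the time–frequency resolution of the S-transform is improved. GST is shown in Formula (5).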
In Formula (5), $m$ is the parameter that adjusts the width of the Gaussian window. GST is performed on the denoised one-dimensional vibration signal of the rolling bearing to obtain a two-dimensional time–frequency image. Imaging the vibration signal preserves the information it contains with high quality, and the deep learning model has good recognition and processing characteristics for the converted two-dimensional image data.
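To make the conversion from a one-dimensional signal to a time–frequency image concrete, here is a minimal frequency-domain sketch of an S-transform with an adjustable window width. It assumes the Gaussian window standard deviation is scaled as m/|f|, which is one common parameterization and may differ from the exact form used in Formula (5); the function name `generalized_s_transform` and the test tone are also illustrative assumptions. Setting `m=1.0` gives the standard S-transform.

```python
import numpy as np

def generalized_s_transform(x, m=1.0):
    """Return a (N//2 + 1, N) complex time-frequency matrix of a 1D signal.

    Assumption: the Gaussian window has standard deviation m / |f|.
    """
    x = np.asarray(x, dtype=float)
    N = x.size
    spectrum = np.fft.fft(x)
    k = np.fft.fftfreq(N) * N            # signed integer frequency offsets
    n_freq = N // 2 + 1
    out = np.empty((n_freq, N), dtype=complex)
    out[0] = x.mean()                    # zero-frequency row is the signal mean
    for row in range(1, n_freq):
        # Fourier transform of the Gaussian window at frequency index `row`.
        gauss = np.exp(-2.0 * np.pi**2 * m**2 * k**2 / row**2)
        shifted = np.roll(spectrum, -row)        # shifted[j] = spectrum[j + row]
        out[row] = np.fft.ifft(shifted * gauss)  # localized spectrum over time
    return out

# Illustrative input: a pure tone at frequency index 50 of a 256-point signal.
tone = np.cos(2 * np.pi * 50 * np.arange(256) / 256)
tf = generalized_s_transform(tone, m=1.0)
image = np.abs(tf)                       # magnitude used as the 2D image
```

Under this assumption, a larger `m` widens the window in time and sharpens the frequency resolution, while a smaller `m` does the opposite. The magnitude matrix `image` is the kind of two-dimensional array that would be rendered as a time–frequency picture for the classifier.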
2.3. Vision Transformer
The Transformer model is a classic NLP model proposed by the Google team in 2017 [26]. The Transformer architecture has revolutionized the field of natural language processing and has become the backbone of many state-of-the-art models for a variety of tasks, including machine translation, text generation, question answering, and sentiment analysis. Unlike other models, it relies entirely on the self-attention mechanism to calculate the input and output. It does not adopt the sequential structure of traditional RNNs and also avoids the problem of the limited receptive field of CNNs. This allows the Transformer to capture global information [27]. The Transformer’s multi-head attention mechanism enables the extraction of richer feature representations from raw data. This is particularly important for fault diagnosis tasks, as effective feature extraction can improve the accuracy and robustness of the model.
Compared with other deep learning models (CNN, LSTM, etc.), the Transformer model has the following advantages: (1) Attention mechanism. Traditional deep learning models mostly perceive local information, and their contextual information is limited to a certain location or time window. For some problems, however, a global understanding of the context is very important. The Transformer can model the entire sequence through the attention mechanism and thus capture global information. (2) Parallel processing. Traditional models (such as LSTM) must process sequential data step by step, and each time step can be computed only after the previous one is finished. This can lead to the underutilization of computing resources. The Transformer’s self-attention mechanism can process the information of all positions at the same time, which improves efficiency. (3) Applicability to relationship modeling at different distances. Traditional recurrent models suffer from problems such as gradient explosion on long sequences. The Transformer relies on the self-attention mechanism to weight information from positions at different distances, effectively dealing with long-distance dependencies.
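The global and parallel character of self-attention can be seen in a small single-head sketch. This follows the standard scaled dot-product formulation; the function name `self_attention`, the matrix names `W_q`, `W_k`, `W_v`, and all sizes are illustrative assumptions rather than values from this paper.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X (T x D)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # every position vs. every other
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
T, D = 6, 8                                        # sequence length, feature size
X = rng.standard_normal((T, D))
W_q, W_k, W_v = (rng.standard_normal((D, D)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
```

All T positions are handled in one matrix product, with no loop over time steps, and each row of `weights` spans the whole sequence, so distant positions can influence each other directly.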
Figure 1 shows the Transformer model structure.
Although the Transformer model performs very well on sequence data, it cannot be applied directly to two-dimensional images. To solve this problem, the Vision Transformer (ViT) model was developed. ViT is an image classification model based on the Transformer architecture proposed by Alexey Dosovitskiy et al. [28] in 2020. The basic idea of ViT is to split the image into a series of small patches, convert these patches into sequence data, and then feed them into the Transformer model for processing. The following are the main principles of ViT:
Embedding module. First, the input image is divided into patches of fixed size. These patches do not overlap in the spatial dimensions, similar to dividing an image into a regular grid. An image of size $H \times W \times C$ is cut into patches of size $P \times P \times C$, and the number of patches is $N$. Specifically, this is shown in Formula (6).
$$N = \frac{HW}{P^2} \tag{6}$$
In Formula (6), the height, depth, and width of the input image are $H$, $C$, and $W$, respectively, and the height and width of each patch after cutting are both $P$. Each patch is mapped to a lower-dimensional vector space by a fully connected layer (often called an embedding layer). The parameters of this embedding layer are learned via model training so that each patch can be effectively represented as a vector. The vector length is $D$. Then, a classification vector $x_{class}$ is added, and a position code $E_{pos}$ containing spatial information is added to form the input of the Transformer encoder layer [29].
$$z_0 = [x_{class};\ x_p^1 E;\ x_p^2 E;\ \cdots;\ x_p^N E] + E_{pos} \tag{7}$$
In Formula (7), the input of the encoder is $z_0$. $x_{class}$ is a category token, and its purpose is to realize the classification task; $E$ is a linear mapping matrix, and $E_{pos}$ is a position code.
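The embedding step can be sketched with NumPy as follows: split an image into non-overlapping patches, flatten each patch, project it with a matrix, prepend a class token, and add a position code. The image size, patch size, embedding length, random initialization, and the function names `split_patches` and `embed` are all illustrative assumptions; in a trained model the projection, class token, and position code are learned parameters.

```python
import numpy as np

def split_patches(image, P):
    """Cut an H x W x C image into N = HW / P^2 flattened P x P x C patches."""
    H, W, C = image.shape
    if H % P or W % P:
        raise ValueError("H and W must be divisible by the patch size P")
    grid = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, P * P * C)   # patches in row-major grid order

def embed(image, P, E, x_class, E_pos):
    """Project patches, prepend the class token, and add the position code."""
    tokens = split_patches(image, P) @ E             # shape (N, D)
    return np.vstack([x_class, tokens]) + E_pos      # shape (N + 1, D)

rng = np.random.default_rng(0)
H, W, C, P, D = 32, 32, 3, 8, 16         # illustrative sizes only
image = rng.random((H, W, C))
N = H * W // P**2                        # number of patches
E = 0.02 * rng.standard_normal((P * P * C, D))
x_class = np.zeros(D)
E_pos = 0.02 * rng.standard_normal((N + 1, D))
z0 = embed(image, P, E, x_class, E_pos)
```

The result has one row per patch plus one extra row for the class token, which is the sequence the encoder consumes.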
Transformer encoder module. The Transformer encoder is composed of multiple self-attention and feed-forward neural network layers, which can learn global and local context dependencies in sequence data [30]. Its structure is shown in Figure 2.
It can be seen from Figure 2 that the data input into the Transformer encoder are first normalized. After the multi-head self-attention mechanism, dropout is applied, and then a residual connection fuses the result with the input data. The processed data are then normalized again and passed to the multi-layer perceptron, after which dropout is applied and a second residual connection fuses the result with the data that entered this stage. The multi-layer perceptron (MLP block) consists of a fully connected layer, a GELU activation function, and a dropout module [31].
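The order of operations in one encoder block (normalize, attend, add residual, normalize, MLP, add residual) can be sketched as below. This is a simplified inference-time illustration under stated assumptions: dropout is omitted because it is the identity at inference, layer normalization has no learnable scale or shift, the MLP uses two fully connected layers around the GELU as in the standard ViT design, and all names and sizes (`encoder_block`, `W_qkv`, `W_o`, `W_1`, `W_2`) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # Widely used tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, W_qkv, W_o, heads):
    T, D = x.shape
    d = D // heads
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)            # each (T, D)
    # Reshape (T, D) into (heads, T, d) so heads attend independently.
    q, k, v = (a.reshape(T, heads, d).transpose(1, 0, 2) for a in (q, k, v))
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    merged = (weights @ v).transpose(1, 0, 2).reshape(T, D)
    return merged @ W_o

def encoder_block(x, params, heads):
    # Normalize, attend, then add the residual connection.
    x = x + multi_head_self_attention(layer_norm(x), params["W_qkv"], params["W_o"], heads)
    # Normalize, apply the MLP, then add the second residual connection.
    hidden = gelu(layer_norm(x) @ params["W_1"])
    return x + hidden @ params["W_2"]

rng = np.random.default_rng(0)
T, D, heads, hidden_dim = 17, 16, 4, 64                  # illustrative sizes
params = {
    "W_qkv": 0.1 * rng.standard_normal((D, 3 * D)),
    "W_o": 0.1 * rng.standard_normal((D, D)),
    "W_1": 0.1 * rng.standard_normal((D, hidden_dim)),
    "W_2": 0.1 * rng.standard_normal((hidden_dim, D)),
}
z = rng.standard_normal((T, D))
z_out = encoder_block(z, params, heads)
```

Because both sub-layers are wrapped in residual connections, the block preserves the sequence shape, so several blocks can be stacked before the classification module reads the class token.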
The last is the classification module. The output sequence of the Transformer encoder is classified and predicted through a fully connected layer, and the classification result of the image is obtained.
4. Fault Diagnosis Platform Construction
The main components of the rolling bearing fault diagnosis experimental platform are as follows: rolling bearings (6406), a magnetic powder brake (FZ-A-12), a three-phase asynchronous motor (YE3-100L2-4), a piezoelectric acceleration sensor (CAYD051V), a frequency converter (G7R5/P011T4), a data acquisition card (YE6231), and a PC. In this experiment, five rolling bearing states were designed: inner ring fault, rolling element fault, cage fracture fault, outer ring fault, and normal. The specific fault forms are shown in Figure 5.
Figure 5a–e show, respectively, the five states: inner ring fault, rolling element fault, cage fracture fault, outer ring fault, and normal rolling bearing. A photograph of the rolling bearing fault diagnosis experimental platform is shown in Figure 6, and the experimental process is shown in Figure 7.
The specific experimental steps are as follows: To ensure safety, an air switch is first added between the power supply and the frequency converter. The frequency converter is connected to the motor. The motor and the gearbox are connected by a belt (the rolling bearing in the gearbox is tested in this experiment). The gearbox and the magnetic powder brake are connected through a coupling. An acceleration sensor is installed on the end cover of the gearbox, the vibration signal is obtained through the sensor, and the signal data are transmitted to the PC using a data acquisition card.
This experiment is a no-load experiment, so the magnetic powder brake is switched off during the experiment. The speed of the three-phase asynchronous motor is 900 r/min, and the sampling frequency is 6 kHz. The experimental data information is shown in Table 2.
In this experiment, each sample consists of 1024 data points. A total of 1000 samples are collected for each state, so there are 5000 samples in total. The training set, validation set, and test set are divided according to a ratio of 7:2:1, that is, 3500 training samples, 1000 validation samples, and 500 test samples. The specific division is shown in Table 3.
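The sample arithmetic and the 7:2:1 division can be checked with a short sketch. The counts, sample length, sampling frequency, and motor speed come from the text; the per-state (stratified) random split, the random seed, and the variable names are assumptions, since the text does not say how samples were assigned to each subset.

```python
import numpy as np

states = ["inner ring", "rolling element", "cage fracture", "outer ring", "normal"]
samples_per_state, points_per_sample = 1000, 1024
fs_hz, motor_speed_rpm = 6000, 900

# Duration of one sample and the motor shaft revolutions it spans.
sample_duration_s = points_per_sample / fs_hz
motor_revs_per_sample = sample_duration_s * motor_speed_rpm / 60

labels = np.repeat(np.arange(len(states)), samples_per_state)
n_train = samples_per_state * 7 // 10
n_val = samples_per_state * 2 // 10

# Assumption: each state is shuffled and split 7:2:1 on its own.
rng = np.random.default_rng(0)
train_idx, val_idx, test_idx = [], [], []
for state in range(len(states)):
    idx = rng.permutation(np.flatnonzero(labels == state))
    train_idx.extend(idx[:n_train])
    val_idx.extend(idx[n_train:n_train + n_val])
    test_idx.extend(idx[n_train + n_val:])
```

Each 1024-point sample lasts about 0.17 s and covers roughly 2.56 motor shaft revolutions, so every sample contains more than two full rotations of the motor.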
6. Conclusions
This paper proposes a rolling bearing fault diagnosis model based on SVD-GST combined with the Vision Transformer. SVD is used for noise reduction because, in the fault diagnosis of rolling bearings, the collected vibration signals contain interference from complex noise and redundant components, which affects subsequent feature extraction and pattern recognition. The generalized S-transform is used to convert the 1D vibration signal into a 2D time–frequency image. This addresses the problem that, in deep learning–based bearing fault diagnosis methods operating on one-dimensional data, the loss of signal information makes the recognition rate difficult to improve further, and it makes full use of the advantages of deep learning in image classification and prediction to reach higher recognition accuracy. At the same time, to avoid the limited receptive field of CNNs and the step-by-step computation that RNNs require when processing sequence data, the Vision Transformer model is adopted. The experimental results show that the average accuracy over multiple experiments of the fault diagnosis model adopted in this paper is 98.52% for different fault states of rolling bearings. Compared with the other models, it effectively improves the fault identification of rolling bearings.
For future research on rolling bearing fault diagnosis, multimodal data fusion can be considered. This article only uses the vibration signal of the rolling bearing. In addition, other sensor data, such as current and temperature, can also be considered to obtain more comprehensive fault diagnosis information. The advantages of the Transformer model can be fully utilized in dealing with multimodal data problems. In conclusion, the research direction of applying the Transformer model to rolling bearing fault diagnosis is worthy of further exploration.