Article

A Siamese Vision Transformer for Bearings Fault Diagnosis

1 School of Mechanical Engineering, Guizhou University, Guiyang 550025, China
2 State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
3 School of Mechanical & Electrical Engineering, Guizhou Normal University, Guiyang 550025, China
* Author to whom correspondence should be addressed.
Micromachines 2022, 13(10), 1656; https://doi.org/10.3390/mi13101656
Submission received: 20 August 2022 / Revised: 16 September 2022 / Accepted: 27 September 2022 / Published: 30 September 2022
(This article belongs to the Special Issue Embedded System for Smart Sensors/Actuators and IoT Applications)

Abstract

Fault diagnosis methods based on deep learning have progressed greatly in recent years. However, limited training data and complex working conditions still restrict the application of these intelligent methods. This paper proposes an intelligent bearing fault diagnosis method, the Siamese Vision Transformer, suited to limited training data and complex working conditions. The Siamese Vision Transformer, which combines a Siamese network and a Vision Transformer, is designed to efficiently extract the feature vectors of input samples in a high-level space and to classify the fault. In addition, a new loss function combining the Kullback-Leibler divergence in both directions is proposed to improve the performance of the proposed model. Furthermore, a new training strategy termed random mask is designed to enhance input data diversity. A comparative test is conducted on the Case Western Reserve University bearing dataset and the Paderborn dataset. Our method achieves reasonably high accuracy with limited data and satisfactory generalization capability for cross-domain tasks.

1. Introduction

Bearings, as core components of rotating mechanisms, are widely applied in industrial fields. Bearing faults are highly likely to damage the entire mechanical system and threaten the safety of employees [1,2,3,4]. Because fault diagnosis plays a crucial role in the maintenance of mechanical equipment, many fault diagnosis methods have been proposed. Traditional signal-based mechanical fault diagnosis methods commonly require manual feature extraction based on knowledge and prior experience [5]. In recent years, deep learning has made progress in many areas, such as computer vision [6,7], natural language processing [8,9] and defect detection [10,11]. Therefore, a large number of fault diagnosis methods based on deep learning have been developed. Zhao et al. [12] designed a novel intelligent fault diagnosis method for diagnosing rolling bearing faults accurately and steadily. Their approach was validated on experimental and practical bearing data. Zhang et al. [13] built a novel neural network that uses raw temporal signals as input. Their method achieved high accuracy under complex working conditions. He et al. [14] proposed a bearing fault diagnosis method based on a sparse auto-encoder with a new weight-assignment strategy. Hu et al. [15] proposed a new method using tensor-aligned invariant subspace learning and convolutional neural networks for cross-domain bearing fault diagnosis. Zhu et al. [16] proposed a new fault diagnosis approach based on principal component analysis and a deep belief network. Time-consuming and unreliable manual feature extraction is gradually being replaced by deep learning methods [5,17,18,19,20].
However, deep learning-based methods usually require a large amount of data for model training. Collecting a considerable amount of data for every type of failure under each working condition poses a considerable challenge in actual industrial application scenarios. Some studies of mechanical fault diagnosis have been conducted using limited data. In [21], Zhang et al. applied the Siamese network to fault diagnosis and designed a Siamese CNN model that reported good performance with limited training samples. A novel method termed meta-learning fault diagnosis framework was proposed by Li et al. [22] and performed excellently under complex working conditions. Li et al. [23] designed a deep balanced domain adaptation neural network that achieved exciting results using limited labeled training data. Hang et al. [24] used principal component analysis and a two-step clustering algorithm to improve performance on a high-dimensional unbalanced training dataset. A new fault diagnosis approach based on a generative adversarial network (GAN) and a stacked denoising auto-encoder (SDAE) was proposed by Fu et al. [25]; the experimental results showed high diagnosis accuracy under various working conditions. The Feature Space Metric-based Meta-learning Model (FSM3) was designed by Wang et al. [26] to address the challenge of limited training samples. Lu et al. [27] proposed a new cross-domain DC series fault detection framework based on Lightweight Transfer Convolutional Neural Networks. A new support vector data description based on machine learning was proposed by Duan et al. [28] for limited data. Huang et al. [29] proposed a novel method for bearing fault diagnosis under actual conditions and reported that their model achieved good performance under limited data with noisy labels. Bai et al. [30] proposed a novel method for bearing fault diagnosis using a multi-channel convolution neural network (MCNN) and a multiscale clipping fusion (MSCF) data augmentation algorithm to address the challenge of limited sensor data.
At the same time, conventional learning-based methods usually assume that training data and testing data are independent and identically distributed. However, it is impractical to collect sufficient training data with the same distribution as test data that come from complex working conditions. Doing so would require the training data to cover all possible operating conditions: different working loads, speeds, noise and so on. Such strict assumptions hinder the application of intelligent fault diagnosis methods in actual industry. From a realistic perspective, the training data are usually collected from specific operating conditions, different but similar equipment, or software fault simulations, so their distribution may differ from that of the test data. Intelligent diagnosis techniques with a strong in-distribution assumption can fail when such differences develop. In recent years, numerous studies have produced a variety of cross-domain diagnosis methods based on transfer learning or domain adaptation, which employ inconsistently distributed data from various source domains to break the identically distributed assumption [31,32]. The fundamental principle of these studies is to build a diagnostic model that performs effectively in the target domain using knowledge from the relevant source domain. Exciting performance enhancements have been made in a variety of cross-domain scenarios, such as across various working conditions [33,34] and across different equipment [19,35]. Zhang et al. [34] proposed a conditional adversarial domain generalization method that aims to extract domain-invariant features from different source domains and generalize to unseen target domains. Li et al. [34] implemented adversarial domain training to extract generalized features learned from different domains that hold in new working scenarios. Zheng et al. [36] combined a priori knowledge and a deep domain generalization network for fault diagnosis.
Although the above methods have achieved exciting results in both research directions, studies that put limited data and domain generalization into a unified framework are rare.
In recent years, Transformer has achieved great success in natural language processing and computer vision. Ding et al. [37] applied Transformer to fault diagnosis of rolling bearings and proposed a novel method termed time–frequency Transformer which achieved satisfactory performance. Weng et al. [38] designed a one-dimensional Vision Transformer with Multiscale Convolution Fusion (MCF-1DViT) combining CNN and Vision Transformer for bearing fault diagnosis. They reported that their method can significantly improve diagnosis accuracy and anti-noise ability. Tang et al. [39] introduced integrated learning into the Vision Transformer model for bearing fault diagnosis and achieved good results. The exciting performance of these methods shows the great potential of Transformer in the field of fault diagnosis.
In the current study, we propose a novel fault diagnosis method that improves the model's generalization ability in the face of two challenges for rolling bearings, i.e., limited training data and domain generalization. First, the time-series signal is converted into a time-frequency graph with the short-time Fourier transform (STFT). Second, a Siamese Vision Transformer (SViT) is designed to extract feature vectors efficiently and implement classification tasks. In addition, we design a new loss function, bidirectional Kullback-Leibler divergence (DKLD), to improve the performance of the proposed model. A new training strategy, i.e., the random mask, is also proposed to reduce the overfitting risk of the model. The contributions of this study include the following.
(1) The proposed SViT, based on a Siamese network and ViT, obtains satisfactory prediction accuracy in limited data and domain generalization tasks.
(2) We obtain a new loss function by combining the KL divergence of the two directions to improve the proposed model's performance.
(3) A novel training strategy, random mask, which focuses on increasing the diversity of the input data distribution, is designed to enhance the generalization ability of the model.
(4) The experimental results show that the proposed method achieves effective accuracy rates and has satisfactory anti-noise and domain generalization ability.
The remainder of this paper is organized as follows. Section 2 details our method, including the Siamese networks, Vision Transformer, the new loss function bidirectional KL divergence and random mask strategy. Section 3 presents the experiments, results and discussion. Finally, conclusions are drawn in Section 4.

2. Siamese Vision Transformer

2.1. The Framework of the Proposed Method

As shown in Figure 1, the proposed method is a Siamese-based neural network using an improved vision transformer as the backbone. The inputs are a pair of time-frequency graphs obtained from raw vibration signals through STFT. First, the time-frequency graphs are divided into 8 × 8 patches. After that, the patches are fed into the random mask layer, which masks the input patches with a random rate p. Second, the 2D patches are flattened into 1D vectors through linear projection. Then the class token (a trainable vector with the same size as a patch) is concatenated in front of the flattened vectors. At the same time, the positional encodings are added to the vectors. Third, the sequence of vectors is fed into the transformer encoder, which is constructed with two transformer encoder layers. At the top of the network, the class token outputs are used to calculate the distance between the two input time-frequency graphs. The details of the layers are shown in Table 1.

2.2. Data Processing

The short-time Fourier transform (STFT) slides a fixed-length nonzero window function along the time axis, truncating the signal into segments of the same length. Assuming that these segments are stationary, the Fourier transform can be used to obtain the local frequency spectrum of each segment. A 2D time-frequency graph is obtained by recombining these local frequency spectra along the time axis. The formula is presented in Equation (1).
$$\mathrm{STFT}(\tau, \omega) = \int_{-\infty}^{\infty} x(t)\, g(t - \tau)\, e^{-j\omega t}\, dt, \tag{1}$$
where $x(t)$ is the original signal and $g(t - \tau)$ is the window function centered at time $\tau$.
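As a rough illustration of Equation (1) in discrete form, the sketch below windows a signal, takes the FFT of each segment and stacks the magnitudes into a time-frequency graph. The Hann window, the window length of 256, the hop of 64 and the synthetic 1500 Hz test tone are assumptions chosen for the example; the paper does not state its STFT parameters.

```python
import numpy as np

def stft_magnitude(x, win_len=256, hop=64):
    """Discrete STFT magnitude: slide a window along x and FFT each segment."""
    window = np.hanning(win_len)                 # g(t - tau); Hann is an assumed choice
    n_frames = 1 + (len(x) - win_len) // hop     # number of window positions tau
    frames = np.stack([x[i * hop: i * hop + win_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)       # local frequency spectrum per segment
    return np.abs(spectrum).T                    # shape: (frequency bins, time frames)

fs = 12_000                                      # 12 kHz sampling rate, as in the 12k CWRU data
t = np.arange(2048) / fs
signal = np.sin(2 * np.pi * 1500 * t)            # synthetic tone, not real bearing data
tf_graph = stft_magnitude(signal)
```

With these settings a 2048-point sample yields a graph with 129 frequency bins and 29 time frames; in practice the window and hop would be tuned to the image size the network expects.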

2.3. Siamese Network

The Siamese network was proposed by Bromley et al. [40,41] in 1994 for detecting forged signatures. A typical Siamese network consists of two twin networks with the same structure and parameters. The two networks receive different inputs and are connected by an energy function that calculates a metric in a high-level feature space. As shown in Figure 2, tying the weights of the two subnetworks ensures that two highly similar inputs are not mapped onto extremely different positions in the feature space by their respective networks. In addition, the network is symmetrical: whenever two different inputs are presented to the twin networks, the top connection layer calculates the same metric as it would if the two inputs were swapped between the twins. By using pairs of same-class or different-class samples as training samples, the Siamese network can make full use of limited training samples to achieve efficient feature extraction.
As shown in Equation (2), $f$ is the hidden layer of the model and $d$ is the distance feature vector of the two inputs. The output layer is a fully connected layer that uses the distance feature vector as input and outputs the probability that the two inputs belong to the same category. This layer is obtained using Equation (3), where $\mathrm{sigm}$ is the sigmoid function and $FC$ represents the fully connected layer.
$$d(x_1^i, x_2^i) = \left| f(x_1^i) - f(x_2^i) \right|, \tag{2}$$
$$P(x_1^i, x_2^i) = \mathrm{sigm}\left(FC\left(d(x_1^i, x_2^i)\right)\right), \tag{3}$$
The network is optimized with an Adam optimizer, which adaptively sets the learning rate for each parameter.
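The distance and output layers of Equations (2) and (3) can be sketched as follows. The single tanh layer standing in for the shared feature extractor, the layer sizes and the random weights are all hypothetical; in the proposed model the feature extractor is the Vision Transformer backbone.

```python
import numpy as np

def encoder(x, w_enc):
    """Stand-in for the shared feature extractor f (hypothetical single tanh layer)."""
    return np.tanh(x @ w_enc)

def siamese_probability(x1, x2, w_enc, w_fc, b_fc):
    """Probability that x1 and x2 belong to the same category."""
    d = np.abs(encoder(x1, w_enc) - encoder(x2, w_enc))  # Equation (2), shared weights
    logit = d @ w_fc + b_fc                              # fully connected layer FC
    return 1.0 / (1.0 + np.exp(-logit))                  # sigmoid, Equation (3)

rng = np.random.default_rng(0)
w_enc = rng.normal(size=(16, 8))     # illustrative sizes only
w_fc = rng.normal(size=8)
b_fc = 0.0
a = rng.normal(size=16)
b = rng.normal(size=16)
p_ab = siamese_probability(a, b, w_enc, w_fc, b_fc)
p_ba = siamese_probability(b, a, w_enc, w_fc, b_fc)
```

Because both branches share `w_enc` and the distance is an absolute difference, swapping the inputs leaves the output unchanged, which is the symmetry property described above.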

2.4. Vision Transformer

A transformer is a neural network model that relies entirely on a self-attention mechanism to model the relationship between input and output [42]. Because of its parallel architecture, which differs from the sequential structure of the traditional recurrent neural network, the transformer can consider global information comprehensively and be trained in parallel. The architecture of the transformer model is depicted in Figure 3 and primarily comprises an encoder, a decoder and a positional embedding layer. To help the transformer address the issue of long-term dependency more effectively, positional embedding is used to add the relative positioning information of the input data to the data processed by the embedding layer. Based on these advantages, the transformer performs well in many time series tasks. However, due to the computational complexity of the self-attention mechanism, it requires more memory and computational power during training and prediction. To reduce this computational complexity by exploiting the information redundancy between adjacent pixels, the vision transformer (ViT) was proposed in [43].
Due to its global information sensing capability, ViT achieves exciting performance in the field of image and vision recognition. The structure of the ViT model consists of a projection of flattened patches, a transformer encoder and a classification head. The input image is first divided into a series of patches. These image patches are then passed through an embedding layer, which outputs vectors of a specific length. To preserve the positional relationship of the input image, position embeddings of the same size as the embedded vectors are added to the embedded patches. The sequence of embedded patches is passed to the transformer encoder, which is mainly composed of a multi-head attention layer and an MLP layer. The multi-head attention layer extracts different levels of self-attention information from the input through each head. The output of the class token is fed to the MLP head to give the classification result.

2.4.1. Patch Embedding Layer

The Patch Embedding Layer transforms a conventional visual problem into a seq2seq problem through image segmentation and linear projection. As shown in Equation (4), suppose the input image is $x \in \mathbb{R}^{h \times w \times c}$, where $h$, $w$ and $c$ represent the image's height, width and number of channels, respectively. $P(\cdot)$ is the dividing operation and $x_p \in \mathbb{R}^{N \times (p \times p \times c)}$ denotes the sequence of image patches, where $N$ and $p$ represent the number of image patches and the width of a patch, respectively. $L(\cdot)$ is the linear projection, after which $x_p \in \mathbb{R}^{N \times D}$ denotes the projected vectors, where $D$ represents the dimension of the vector space. $\mathrm{Concat}(\cdot)$ is the vector concatenation operation and $z \in \mathbb{R}^{(N+1) \times D}$ denotes the input of the transformer encoder, where $\mathrm{cls\_token}$ is a learnable parameter with the same size as a projected vector and the positional codes ($\mathrm{position\_coder}$) of the image patches are added to the vectors.
$$\begin{aligned} x_p &= P(x) \\ x_p &= L(x_p) \\ z &= \mathrm{position\_coder} + \mathrm{Concat}(\mathrm{cls\_token},\, x_p) \end{aligned} \tag{4}$$
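The three steps of Equation (4) can be traced with a short NumPy sketch. The 64 × 64 single-channel input, the patch width of 8, the embedding dimension of 32 and the random parameters are assumptions made for illustration; they are not taken from Table 1.

```python
import numpy as np

def split_patches(x, p):
    """P(x): cut an (h, w, c) image into N = (h/p)*(w/p) flattened p x p patches."""
    h, w, c = x.shape
    grid = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p * p * c)

def patch_embed(x, p, w_proj, cls_token, position_coder):
    x_p = split_patches(x, p)                  # (N, p*p*c)
    x_p = x_p @ w_proj                         # L(x_p): (N, D)
    tokens = np.concatenate([cls_token[None, :], x_p], axis=0)  # (N + 1, D)
    return position_coder + tokens             # z in Equation (4)

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64, 1))           # hypothetical time-frequency graph
p, dim = 8, 32                                 # assumed patch width and embedding size
w_proj = rng.normal(size=(p * p * 1, dim))
cls_token = rng.normal(size=dim)
position_coder = rng.normal(size=(65, dim))    # one row per token, N + 1 = 65
z = patch_embed(image, p, w_proj, cls_token, position_coder)
```

Here `z` has 65 rows: the class token followed by the 64 projected patches, each with its positional code added.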

2.4.2. Transformer Encoder

A transformer encoder is composed of multiple identical stacked layers. Each layer mainly contains two sub-layers, i.e., the multi-head self-attention layer and the MLP feedforward layer. To improve the stability of the model in training, each sub-layer is wrapped with a residual connection and layer normalization.
  • MLP layer
The structure of the MLP is shown in Figure 4, including a fully connected layer, GELU activation function and dropout. In ViT, the Gaussian error linear unit (GELU) activation function is used in the feedforward layer. GELU activation function is expressed as Equation (5).
$$\mathrm{GELU}(x) = x \cdot \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right] \tag{5}$$
  • Multiheaded self-attention layer
The self-attention mechanism enables the network model to extract globally valid features, but the single-head attention mechanism can only learn the feature representation of a single representation space. In order to comprehensively extract remote features from global images, the multi-head self-attention mechanism is used to combine features from different feature subspaces.
The calculation formula of self-attention is written as Equations (6) and (7).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{6}$$
$$(Q, K, V) = XW, \tag{7}$$
where $Q$, $K$ and $V$ are the query matrix, key matrix and value matrix, respectively. These matrices are calculated by multiplying the feature matrix $X$ with the learnable matrices $W$, and $d_k$ denotes the dimension of $Q$, $K$ and $V$. The multi-head self-attention mechanism uses multiple self-attention heads to learn features from different representation subspaces and finally integrates these subspace features through linear mapping. The multi-head self-attention mechanism can be expressed as Equation (8).
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_n)\,W \tag{8}$$
where $\mathrm{Concat}(\cdot)$ is the concatenation operation and $W$ denotes the weight matrix of the projection.
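A compact NumPy sketch of the two sub-layer computations, i.e., the GELU activation of Equation (5) and the attention of Equations (6) to (8), is given below. The token count, embedding size, number of heads and random weights are hypothetical, and the residual connections, layer normalization and dropout of a full encoder layer are left out for brevity.

```python
import math
import numpy as np

def gelu(x):
    """Equation (5): x * 0.5 * (1 + erf(x / sqrt(2)))."""
    erf = np.vectorize(math.erf)
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Equation (6): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    weights = softmax(q @ np.swapaxes(k, -1, -2) / math.sqrt(d_k))
    return weights @ v, weights

def multi_head(x, w_q, w_k, w_v, w_o, n_heads):
    """Equations (7) and (8) for one sequence x of shape (tokens, D)."""
    n, d = x.shape
    d_k = d // n_heads
    def split(m):                             # (tokens, D) -> (heads, tokens, d_k)
        return m.reshape(n, n_heads, d_k).transpose(1, 0, 2)
    heads, _ = attention(split(x @ w_q), split(x @ w_k), split(x @ w_v))
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # Concat(head_1, ..., head_n)
    return concat @ w_o                       # final linear mapping W

rng = np.random.default_rng(0)
tokens, dim, n_heads = 5, 8, 2                # illustrative sizes only
x = rng.normal(size=(tokens, dim))
w_q, w_k, w_v, w_o = (rng.normal(size=(dim, dim)) for _ in range(4))
out = multi_head(x, w_q, w_k, w_v, w_o, n_heads)
_, weights = attention(x, x, x)
```

Each row of `weights` sums to 1, so every output token is a weighted average of the value vectors of all tokens, which is what gives the encoder its global view of the input.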

2.4.3. MLP Head

The MLP head consists of a fully connected layer and an activation function and performs the classification task of diagnosing faults. In this study, the class token vector processed by the transformer encoder is fed to the MLP head and the probability value of each fault category is obtained through the SoftMax function. The final fault category is the one with the maximum probability value.
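The last step, turning the head's outputs into a fault category, amounts to a SoftMax followed by an argmax. The sketch below uses made-up logits for three classes purely to illustrate this step.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the index of the most probable category and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

category, probs = predict([0.1, 2.0, -1.0])   # hypothetical head outputs
```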

2.5. Bidirectional KL Divergence

Kullback-Leibler (KL) divergence measures the similarity of a probability distribution to a reference probability distribution [44,45]. A KL divergence of 0 indicates that the two distributions are the same. For discrete probability distributions P and Q defined in the same probability space, the KL divergence [46] from Q to P is defined as Equation (9):
$$D_{KL}(P \,\|\, Q) = \sum_i P_i \log\frac{P_i}{Q_i}, \tag{9}$$
By contrast, the KL divergence from P to Q is defined as Equation (10):
$$D_{KL}(Q \,\|\, P) = \sum_i Q_i \log\frac{Q_i}{P_i}, \tag{10}$$
Equations (9) and (10) clearly show that the KL divergence is asymmetric. As shown in Equation (9), in the KL divergence from Q to P, when $P_i = 0$, regardless of the value of $Q_i$, $P_i \log\frac{P_i}{Q_i} = 0$. In the two-class problem, the loss function therefore reduces to a single term ($D_{KL}(P \,\|\, Q) = -\log Q_0$ when $P_1 = 0$, or $D_{KL}(P \,\|\, Q) = -\log Q_1$ when $P_0 = 0$). To fully measure the difference between the label and the predicted value, we design a new loss function, called bidirectional KL divergence (DKLD), as shown in Equation (11), where $P_i$ represents the label value and $Q_i$ is the probability predicted by the model.
$$L_{DKLD} = \sum_i P_i \log\frac{P_i}{Q_i} + \sum_i Q_i \log\frac{Q_i}{P_i}, \tag{11}$$
The iteration of gradient descent updates the parameters as shown in Equation (12):
$$W = W - \alpha \frac{\partial L_{DKLD}}{\partial W}, \qquad b = b - \alpha \frac{\partial L_{DKLD}}{\partial b} \tag{12}$$
where W is the model’s weight, b is the bias and α is the learning rate. P is a constant and the gradient can be calculated as Equation (13).
$$\frac{\partial L_{DKLD}}{\partial W} = \sum_i \frac{\partial Q_i}{\partial W}\left(1 + \log\frac{Q_i}{P_i} - \frac{P_i}{Q_i}\right), \tag{13}$$
Compared with the gradient of the cross-entropy loss function, as shown in Equation (14), the gradient of DKLD has an additional coefficient $1 + \log\frac{Q_i}{P_i}$. This coefficient contributes to the gradient regardless of whether P approaches 0 or 1. We expect that this characteristic of DKLD can help to improve the performance of the model in cases with limited training samples. To prevent calculation errors, we limit the value of P to [0.001, 1] during calculations.
$$\frac{\partial L_{cross\_entropy}}{\partial W} = \sum_i \frac{\partial Q_i}{\partial W}\left(-\frac{P_i}{Q_i}\right), \tag{14}$$
The comparison between DKLD and cross-entropy is presented in Table 2.
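A minimal pure-Python sketch of the DKLD loss in Equation (11) is shown below, including the clipping of the label P to [0.001, 1] described above. It assumes that the predicted probabilities Q are strictly positive (as they are after a sigmoid or SoftMax); the example labels and predictions are made up.

```python
import math

def kl(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dkld(p, q, eps=1e-3):
    """Bidirectional KL divergence, Equation (11), with the label clipped to [eps, 1]."""
    p = [min(max(pi, eps), 1.0) for pi in p]   # avoids log(Q_i / 0) in the reverse term
    return kl(p, q) + kl(q, p)

label = [1.0, 0.0]            # one-hot label P for a two-class problem
pred = [0.9, 0.1]             # hypothetical model output Q
forward_only = kl(label, pred)    # reduces to -log Q_0
both_directions = dkld(label, pred)
```

For the one-hot label, the forward term alone only sees $Q_0$, whereas the reverse term also penalizes the probability mass assigned to the wrong class, so the bidirectional loss is larger for the same prediction.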

2.6. Random Mask Strategy

Similar to dropout, the mask strategy randomly deactivates units in each forward propagation with probability p during training. Unlike dropout, which operates on individual neuron units, the mask strategy has a larger operation granularity: the operating object in this paper is a patch. Deactivated neurons in low-level layers affect high-level neurons, so applying the mask strategy directly to the input layer can achieve the effect of data augmentation and ensemble learning at the same time. Applying the mask to the input amounts to feeding the network an input image that is cropped randomly and irregularly.
Masking patches at a single fixed rate is not enough. Motivated by [47,48], we randomly change the mask rate on each forward propagation to obtain a new input image with uncertain features. In this paper, the mask rate is $p \sim \mathrm{Uniform}(0.5, 0.9)$. The visualization of the random mask strategy is illustrated in Figure 4.
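The strategy can be sketched in a few lines of NumPy. The sketch assumes that a masked patch is set to zero and that each patch is masked independently with probability p; the paper does not spell out either detail, and the patch array here is random placeholder data.

```python
import numpy as np

def random_mask(patches, rng, low=0.5, high=0.9):
    """Zero out whole patches, drawing a fresh mask rate for every forward pass."""
    rate = rng.uniform(low, high)                # p ~ Uniform(0.5, 0.9)
    keep = rng.random(patches.shape[0]) >= rate  # True for patches that survive
    return patches * keep[:, None], keep, rate   # masked rows become all zeros

rng = np.random.default_rng(0)
patches = rng.normal(size=(64, 64))              # 64 flattened patches (placeholder data)
masked, keep, rate = random_mask(patches, rng)
```

Calling `random_mask` again on the same patches gives a different rate and a different set of surviving patches, which is the source of the added input diversity. The mask would be applied only during training, as with dropout.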

3. Experiments, Results and Discussion

3.1. Experimental Setup

We set up a series of experiments to verify the prediction accuracy and generalization ability of SViT on the Case Western Reserve University (CWRU) bearing datasets [49,50] and the Paderborn bearing dataset [51]. The test platform runs Ubuntu 18.04, Python 3.7 and PyTorch on an Intel® Core™ i7 CPU and an Nvidia GTX 3060 GPU.

3.2. Comparison Models and Evaluation Metric

As shown in Table 3, the proposed model was compared with WDCNN, the Siamese CNN, PSDAN, FSM3, DeIN and HCAE. WDCNN, whose first layer uses a wide convolution kernel, was proposed in [24]. The Siamese CNN was designed by Zhang et al. [29]. PSDAN, FSM3, DeIN and HCAE were proposed in [26,52,53,54], respectively. The details of the comparison methods are shown in Table 4. The SViT model was proposed by our team and the parameters of the comparison models are listed in Table 1.
Accuracy, precision, recall and F1 score are used to evaluate the performance of the proposed model. They can be obtained by the following equations:
$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + FN + TN},$$
$$\mathrm{precision} = \frac{TP}{TP + FP},$$
$$\mathrm{recall} = \frac{TP}{TP + FN},$$
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$
where $TP$, $FP$, $TN$ and $FN$ represent true positive, false positive, true negative and false negative, respectively.
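The four metrics follow directly from the confusion counts, as the short sketch below shows. The counts are invented for illustration and are not results from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one class treated as "positive" against all others.
accuracy, precision, recall, f1 = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

For a multi-class problem such as the 10-class CWRU task, precision, recall and F1 would be computed per class in this one-vs-rest fashion and then averaged.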

3.3. Case Study 1: CWRU Bearing Datasets

To verify the performance of the proposed method, the 12k drive-end bearing fault data in the CWRU bearing datasets are selected as the original experimental data. Data are collected from vibration signals, as shown in Figure 5. Table 5 shows the four health conditions in these data: normal, ball fault, inner race fault and outer race fault. Each fault has three subtypes by fault diameter: 0.007 inches, 0.014 inches and 0.021 inches. Together with the normal condition, this gives 10 different classes. Each class is recorded under three different loads: 1, 2 and 3 hp (with motor speeds of 1772, 1750 and 1730 RPM, respectively), as shown in Table 6. The data under different working conditions are used as the domain generalization experimental data. Datasets A, B and C correspond to working conditions with loads of 1, 2 and 3 hp. Each dataset contains 6000 training samples and 250 test samples.
We use half of the vibration signals to generate training samples and the remaining signals to generate the test set. As shown in Figure 6, the training samples are generated by a sliding window of 2048 points that advances in steps of 80 points, so consecutive windows overlap. The test samples are generated with sliding windows of the same size but without overlap. As shown in Table 5, the dataset includes 19,800 training samples and 750 test samples. Finally, the training and test samples of the proposed model are obtained through STFT.
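The sample generation can be sketched as below, reading the 80 points as the step between consecutive training windows. The 120,000-point record length is a hypothetical figure used only to make the example concrete.

```python
def window_starts(n_points, window=2048, step=80):
    """Start indices of all full windows that fit inside a signal of n_points."""
    return list(range(0, n_points - window + 1, step))

record_length = 120_000                 # hypothetical length of one vibration record
half = record_length // 2               # first half for training, second half for testing

# Overlapping training windows: step of 80 points.
train_starts = window_starts(half, window=2048, step=80)
# Non-overlapping test windows: step equal to the window size, offset into the second half.
test_starts = [half + s for s in window_starts(half, window=2048, step=2048)]
```

The small step multiplies the number of training samples that can be cut from a short record, while the non-overlapping test windows, taken from a separate half of the signal, share no data points with the training windows.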

3.3.1. Evaluating the Effectiveness of DKLD

We set up a series of comparative experiments by randomly selecting 60, 90, 120, 200, 300, 600, 900, 1500, 6000 and 19,800 samples from datasets A, B and C. Each experiment uses 60% of the samples as the training set and the remaining samples as the validation set. To verify the effectiveness of the proposed DKLD loss function, we train our model separately with DKLD and with cross-entropy at each sample size and then compare the test results. As shown in Figure 7, in the cases with a small number of training samples, DKLD improves the model's performance compared with cross-entropy. For example, when the sample size is 60 and 90, the accuracy rates with DKLD are 1.33% and 0.56% higher than those with cross-entropy, respectively. When the training sample size is increased to 120 and above, the performance of the two loss functions is very close, with both reaching more than 99%.
To better understand the effect of DKLD, we use t-distributed stochastic neighbor embedding (t-SNE) to visualize the output of the last hidden fully connected layer of the models trained with DKLD and with cross-entropy at a sample size of 60. As shown in Figure 8a,b, the features learned with DKLD are more separable than those learned with cross-entropy, particularly for categories 1 and 3. Figure 8c,d shows the confusion matrices of the results.

3.3.2. The Effect of the Number of Transformer Encoder Layers

To observe the effect of the number of transformer encoders, we tested the performance of the proposed model with different numbers of transformer encoders in the cross-domain experiment from dataset C to dataset A (the most difficult cross-domain task [21]). As shown in Figure 9, the proposed model achieved the best performance with two transformer encoders. SViT with two transformer encoders is therefore used in the follow-up experiments.

3.3.3. Ablation Experiments

To verify the effectiveness of the Random mask strategy and the Siamese network structure, we set up ablation experiments on cross-domain tasks with 600 training samples. The Random mask strategy and the Siamese network structure are removed from the proposed method in turn. When the Siamese network structure is removed, the distance layer is replaced by a fully connected classifier.
In Table 7, (w/o) means without. It can be seen that the Random mask strategy and the Siamese network effectively improve the robustness of the model in cross-domain tasks.

3.3.4. Comparison of Results with Different Sample Sizes

Using the same experimental setup as above, we evaluate the performance of the various methods with different numbers of training samples. To reduce the bias of randomly selecting a small training set, we repeat the sample selection process five times for each sample size to generate different training sets. For each random training set, we repeat the training four times to address the randomness of the algorithm. Each series of experiments is thus repeated 20 times. We use one-shot testing for the Siamese CNN and our method.
Figure 10 clearly shows that as the number of training samples increases, the accuracy of all methods increases while their standard deviation decreases. This shows the sensitivity of deep learning-based intelligent fault diagnosis methods to the amount of training data.
Subsequently, we check whether the accuracy of the proposed SViT model is better than that of the other models in the cases with limited training samples (e.g., 60 and 90). In both cases, our model performs better than the other models. At the same time, the experimental results indicate that when the training sample size is increased to 900 and above, the performance of all the algorithms becomes increasingly similar and their accuracy rates are all higher than 97%. This comparison shows that the proposed SViT has clear advantages over the comparison algorithms in cases with limited training samples. Even with 60 training samples, the accuracy rate of the proposed algorithm still reaches 97.56%.

3.3.5. Performance in Noisy Environment

In this experiment, we evaluate the performance of the proposed model in a noisy environment. The model is trained with raw data and then tested with samples to which white Gaussian noise with different signal-to-noise ratios (SNRs) has been added. SNR is defined as the ratio of the signal power to the noise power and is frequently expressed in decibels (dB), as follows: $SNR_{dB} = 10 \log_{10}\left(\frac{P_{signal}}{P_{noise}}\right)$, where $P_{signal}$ denotes the power of the signal and $P_{noise}$ indicates the power of the noise. The SNR ranges from −4 dB to 10 dB. The lower the SNR value, the stronger the noise.
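A noisy test sample at a target SNR can be produced by scaling the noise power from the signal power, as in the following sketch. The sine wave stands in for a vibration signal and the function names are illustrative.

```python
import numpy as np

def add_awgn(signal, target_snr_db, rng):
    """Add white Gaussian noise so that the result has the requested SNR in dB."""
    p_signal = np.mean(signal ** 2)                      # average signal power
    p_noise = p_signal / (10 ** (target_snr_db / 10))    # invert SNR_dB = 10 log10(Ps / Pn)
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise, noise

def measure_snr_db(signal, noise):
    """Empirical SNR in dB from the clean signal and the added noise."""
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 50 * np.arange(100_000) / 12_000)   # placeholder signal
noisy, noise = add_awgn(clean, target_snr_db=-4, rng=rng)
measured = measure_snr_db(clean, noise)
```

At −4 dB the noise power is about 2.5 times the signal power, which is why this setting represents strong interference.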
In Figure 11, we examine the effect of training sample size on the test accuracy of each model in different noisy environments. In Figure 11a,b, SNR = −4 and 0 represent substantial noise interference. By contrast, in Figure 11c,d, SNR = 4 and 8 represent weak noise interference. The anti-noise capability of the proposed model is better than that of the other models. In particular, the advantage is more apparent in cases with intense noise, as shown in Figure 11a,b. Considering that the proposed method is not specifically designed to improve anti-noise ability, and in line with the report in [21], we speculate that this anti-noise ability derives from the twin network structure.

3.3.6. Domain Generalization Experiments

To further verify the domain generalization ability of the proposed model, we conduct a cross-domain experiment in which all models are trained in the source domain and tested in the target domain. It should be noted that the model does not touch the target domain data during training. The experiment was repeated five times for each task. The classification accuracies of the experiment are shown in Table 8, in which A-B refers to training on dataset A and testing on dataset B. The proposed SViT achieved the best performance among all the methods in all the scenarios. Specifically, SViT achieved an accuracy of 92.24% in the C-A task (the most difficult task), which was 13.4%, 31.88%, 12.86%, 2.8%, 12.56% and 11.57% higher than WDCNN, Siamese CNN, PSADAN, FSM3, DeIN and HCAE, respectively. This shows that the proposed method generalizes across domains better than the comparison methods. Table 9, Table 10 and Table 11 present precision, recall and F1 score comparisons for the cross-domain task C-A with 6000 training samples. The results show that the proposed SViT outperformed all of the compared approaches.
To further understand the cross-domain generalization ability, the encoded features of the source domain data and target domain data in different cross-domain tasks are investigated. t-distributed stochastic neighbor embedding (t-SNE) is used to visualize the output of the class token of the model trained with 6600 training samples in the source domain, as shown in Figure 12.

3.4. Case Study 2: Paderborn Dataset

3.4.1. Data Description

As shown in Figure 13, the Paderborn dataset test rig [51] comprises five modules: (1) electric motor, (2) torque-measurement shaft, (3) rolling bearing test module, (4) flywheel and (5) load motor. Bearings are installed in the test module to collect experimental data. The bearing fault types include artificial and real damage.
Three working conditions are selected to obtain datasets from different domains. In dataset D, the test platform runs at n = 1500 rpm with a load torque of M = 0.7 Nm and a radial force on the bearing of F = 1000 N. In dataset E, the load torque changes to M = 0.1 Nm. In dataset F, the radial force changes to F = 400 N. The details of the three datasets are shown in Table 12.
In the experiment, the datasets contain vibration signals obtained from healthy, artificially damaged and naturally damaged bearings. The selected dataset filenames are shown in Table 13. The details of the selected datasets are given in Table 14.
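Table 12 labels each working condition with a setting name such as N15_M07_F10. Judging from the values in that table, the name appears to encode speed in units of 100 rpm, torque in units of 0.1 Nm and radial force in units of 100 N. The sketch below decodes names under that inferred convention; the convention, the class and the function are assumptions made for illustration and are not taken from the dataset documentation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingCondition:
    speed_rpm: int
    torque_nm: float
    radial_force_n: int


# Assumed pattern: N<speed/100>_M<torque*10>_F<force/100>, two digits each.
_SETTING = re.compile(r"^N(\d{2})_M(\d{2})_F(\d{2})$")


def parse_setting(name: str) -> OperatingCondition:
    """Decode a setting name such as 'N15_M07_F10' (inferred convention)."""
    match = _SETTING.match(name)
    if match is None:
        raise ValueError(f"unrecognized setting name: {name!r}")
    speed, torque, force = (int(group) for group in match.groups())
    return OperatingCondition(speed * 100, torque / 10, force * 100)


# Datasets D, E and F as listed in Table 12.
for label, setting in (("D", "N15_M07_F10"), ("E", "N15_M01_F10"), ("F", "N15_M07_F04")):
    print(label, parse_setting(setting))
```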

3.4.2. Results and Analysis

Using the same implementation, Figure 14 shows the cross-domain accuracy of the comparison approaches and our method as the number of training samples increases. The results show that our method outperformed the state-of-the-art methods in all the scenarios.
Table 15 reports the cross-domain accuracy of the different methods with 1800 training samples. In terms of average accuracy, the proposed method outperformed the comparative methods by 0.49 to 5.20 percentage points (Table 15). Table 16, Table 17 and Table 18 compare the methods in terms of precision, recall and F1 score in the cross-domain task E-D with 1800 training samples. The results also show that our method is superior to the alternatives.
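For reference, per-class precision, recall and F1 score of the kind reported in these tables are computed from a confusion matrix as sketched below. The confusion matrix in the example is made up for illustration (three classes with 40 test samples each, matching the split sizes in Table 14) and does not reproduce any result from the paper.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) per class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Hypothetical counts, for illustration only.
example = [
    [38, 1, 1],
    [2, 36, 2],
    [0, 3, 37],
]
for label, (p, r, f1) in enumerate(per_class_metrics(example), start=1):
    print(f"Class {label}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```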

4. Conclusions

In this work, an intelligent bearing fault diagnosis method, SViT, has been proposed to address the challenges of limited data and domain generalization. We have designed a Siamese Vision Transformer (SViT) to extract features efficiently. In addition, a loss function called DKLD has been proposed to improve the model's prediction accuracy and generalization capability. Furthermore, a novel random mask training strategy has been applied to the SViT to reduce the risk of overfitting and improve the model's generalization ability. The experimental results show that our method generalizes better than state-of-the-art approaches in limited-data and cross-domain tasks.
However, the proposed method still has some restrictions. For instance, it is limited to cross-domain tasks on the same equipment. In addition, in the prediction stage of SViT, a small amount of supporting data from the target domain is still required, which limits the application scenarios of the proposed method.

Author Contributions

Conceptualization, Q.H.; Data curation, A.Z.; Formal analysis, Q.H.; Funding acquisition, S.L.; Investigation, J.Y.; Methodology, Q.H.; Project administration, S.L.; Software, Q.H.; Supervision, S.L.; Validation, Q.B., A.Z. and M.S.; Writing—original draft, Q.H.; Writing—review & editing, Q.B., A.Z., J.Y. and M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Key Research and Development Program [grant numbers 2020YFB1713300], the Guizhou Province University talent Training Base Project [grant numbers (2020)009], the Guizhou Province University Integration Research Platform Project [grant numbers (2020)005], the Guizhou Province Natural Science Foundation of Basic Research Program [grant numbers QKHYB-ZK(2022)130].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support this study are available at the websites https://engineering.case.edu/bearingdatacenter/download-data-file and https://mb.uni-paderborn.de/kat/forschung/datacenter/bearing-datacenter, accessed on 10 August 2022.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Abdelkader, R.; Kaddour, A.; Bendiabdellah, A.; Derouiche, Z. Rolling bearing fault diagnosis based on an improved denoising method using the complete ensemble empirical mode decomposition and the optimized thresholding operation. IEEE Sens. J. 2018, 18, 7166–7172.
  2. Qiao, M.; Yan, S.; Tang, X.; Xu, C. Deep Convolutional and LSTM Recurrent Neural Networks for Rolling Bearing Fault Diagnosis Under Strong Noises and Variable Loads. IEEE Access 2020, 8, 66257–66269.
  3. Gao, Z.; Cecati, C.; Ding, S.X. A Survey of Fault Diagnosis and Fault-Tolerant Techniques—Part I: Fault Diagnosis With Model-Based and Signal-Based Approaches. IEEE Trans. Ind. Electron. 2015, 62, 3757–3767.
  4. Zhang, J.; Wang, S.; Zhou, P.; Zhao, L.; Li, S. Novel prescribed performance-tangent barrier Lyapunov function for neural adaptive control of the chaotic PMSM system by backstepping. Int. J. Electr. Power Energy Syst. 2020, 121, 105991.
  5. Liu, R.; Yang, B.; Zio, E.; Chen, X. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mech. Syst. Signal Process. 2018, 108, 33–47.
  6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
  7. Bai, Q.; Li, S.; Yang, J.; Song, Q.; Li, Z.; Zhang, X. Object Detection Recognition and Robot Grasping Based on Machine Learning: A Survey. IEEE Access 2020, 8, 181855–181879.
  8. Socher, R.; Bengio, Y.; Manning, C.D. Deep learning for NLP (without magic). In Proceedings of the Tutorial Abstracts of ACL 2012, Association for Computational Linguistics, Jeju Island, Korea, 8–14 July 2012.
  9. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97.
  10. Yang, J.; Li, S.; Gao, Z.; Wang, Z.; Liu, W. Real-time recognition method for 0.8 cm darning needles and KR22 bearings based on convolution neural networks and data increase. Appl. Sci. 2018, 8, 1857.
  11. Tibaduiza, D.; Torres-Arredondo, M.A.; Vitola, J.; Anaya, M.; Pozo, F. A Damage Classification Approach for Structural Health Monitoring Using Machine Learning. Complexity 2018, 2018, 1–14.
  12. Zhao, K.; Jiang, H.; Li, X.; Wang, R. An optimal deep sparse autoencoder with gated recurrent unit for rolling bearing fault diagnosis. Meas. Sci. Technol. 2020, 31, 015005.
  13. Zhang, W.; Li, C.; Peng, G.; Chen, Y.; Zhang, Z. A deep convolutional neural network with new training methods for bearing fault diagnosis under noisy environment and different working load. Mech. Syst. Signal Process. 2018, 100, 439–453.
  14. He, J.; Ouyang, M.; Yong, C.; Chen, D.; Guo, J.; Zhou, Y. A Novel Intelligent Fault Diagnosis Method for Rolling Bearing Based on Integrated Weight Strategy Features Learning. Sensors 2020, 20, 1774.
  15. Hu, C.; Wang, Y.; Gu, J. Cross-domain intelligent fault classification of bearings based on tensor-aligned invariant subspace learning and two-dimensional convolutional neural networks. Knowl.-Based Syst. 2020, 209, 106214.
  16. Zhu, J.; Hu, T.; Jiang, B.; Yang, X. Intelligent bearing fault diagnosis using PCA-DBN framework. Neural Comput. Appl. 2020, 32, 10773–10781.
  17. Zhao, R.; Yan, R.; Chen, Z.; Mao, K.; Wang, P.; Gao, R.X. Deep learning and its applications to machine health monitoring. Mech. Syst. Signal Process. 2019, 115, 213–237.
  18. Zhiyi, H.; Haidong, S.; Lin, J.; Junsheng, C.; Yu, Y. Transfer fault diagnosis of bearing installed in different machines using enhanced deep auto-encoder. Measurement 2020, 152, 107393.
  19. Yang, B.; Lei, Y.; Jia, F.; Xing, S. An intelligent fault diagnosis approach based on transfer learning from laboratory bearings to locomotive bearings. Mech. Syst. Signal Process. 2019, 122, 692–706.
  20. Wang, R.; Feng, Z.; Huang, S.; Fang, X.; Wang, J. Research on Voltage Waveform Fault Detection of Miniature Vibration Motor Based on Improved WP-LSTM. Micromachines 2020, 11, 753.
  21. Zhang, A.; Li, S.; Cui, Y.; Yang, W.; Dong, R.; Hu, J. Limited Data Rolling Bearing Fault Diagnosis With Few-Shot Learning. IEEE Access 2019, 7, 110895–110904.
  22. Li, C.; Li, S.; Zhang, A.; He, Q. Meta-Learning for Few-Shot Bearing Fault Diagnosis under Complex Working Conditions. Neurocomputing 2021, 439, 197–211.
  23. Li, Q.; Tang, B.; Deng, L.; Wu, Y.; Wang, Y. Deep balanced domain adaptation neural networks for fault diagnosis of planetary gearboxes with limited labeled data. Measurement 2020, 156, 107570.
  24. Hang, Q.; Yang, J.; Xing, L. Diagnosis of Rolling Bearing Based on Classification for High Dimensional Unbalanced Data. IEEE Access 2019, 7, 79159–79172.
  25. Fu, Q.; Wang, H. A Novel Deep Learning System with Data Augmentation for Machine Fault Diagnosis from Vibration Signals. Appl. Sci. 2020, 10, 5765.
  26. Wang, D.; Zhang, M.; Xu, Y.; Lu, W.; Yang, J.; Zhang, T. Metric-based meta-learning model for few-shot fault diagnosis under multiple limited data conditions. Mech. Syst. Signal Process. 2021, 155, 107510.
  27. Lu, S.; Ma, R.; Sirojan, T.; Phung, B.T.; Zhang, D. Lightweight transfer nets and adversarial data augmentation for photovoltaic series arc fault detection with limited fault data. Int. J. Electr. Power Energy Syst. 2021, 130, 107035.
  28. Duan, L.; Xie, M.; Bai, T.; Wang, J. A new support vector data description method for machinery fault diagnosis with unbalanced datasets. Expert Syst. Appl. 2016, 64, 239–246.
  29. Huang, N.; Chen, Q.; Cai, G.; Xu, D.; Zhang, L.; Zhao, W. Fault Diagnosis of Bearing in Wind Turbine Gearbox Under Actual Operating Conditions Driven by Limited Data With Noise Labels. IEEE Trans. Instrum. Meas. 2021, 70, 3502510.
  30. Bai, R.X.; Xu, Q.S.; Meng, Z.; Cao, L.X.; Xing, K.S.; Fan, F.J. Rolling bearing fault diagnosis based on multi-channel convolution neural network and multi-scale clipping fusion data augmentation. Measurement 2021, 184, 109885.
  31. Zheng, H.; Wang, R.; Yang, Y.; Yin, J.; Li, Y.; Li, Y.; Xu, M. Cross-domain fault diagnosis using knowledge transfer strategy: A review. IEEE Access 2019, 7, 129260–129290.
  32. Yan, R.; Shen, F.; Sun, C.; Chen, X. Knowledge transfer for rotary machine fault diagnosis. IEEE Sens. J. 2019, 20, 8374–8393.
  33. Li, J.; Shen, C.; Kong, L.; Wang, D.; Xia, M.; Zhu, Z. A New Adversarial Domain Generalization Network Based on Class Boundary Feature Detection for Bearing Fault Diagnosis. IEEE Trans. Instrum. Meas. 2022, 71, 1–9.
  34. Zhang, Q.; Zhao, Z.; Zhang, X.; Liu, Y.; Sun, C.; Li, M.; Wang, S.; Chen, X. Conditional adversarial domain generalization with a single discriminator for bearing fault diagnosis. IEEE Trans. Instrum. Meas. 2021, 70, 1–15.
  35. Wang, H.; Bai, X.; Tan, J.; Yang, J. Deep prototypical networks based domain adaptation for fault diagnosis. J. Intell. Manuf. 2020, 33, 973–983.
  36. Zheng, H.; Yang, Y.; Yin, J.; Li, Y.; Wang, R.; Xu, M. Deep domain generalization combining a priori diagnosis knowledge toward cross-domain fault diagnosis of rolling bearing. IEEE Trans. Instrum. Meas. 2020, 70, 1–11.
  37. Ding, Y.; Jia, M.; Miao, Q.; Cao, Y. A novel time–frequency Transformer based on self–attention mechanism and its application in fault diagnosis of rolling bearings. Mech. Syst. Signal Process. 2022, 168, 108616.
  38. Weng, C.; Lu, B.; Yao, J. A One-Dimensional Vision Transformer with Multiscale Convolution Fusion for Bearing Fault Diagnosis. In Proceedings of the 2021 Global Reliability and Prognostics and Health Management (PHM-Nanjing), Nanjing, China, 15–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6.
  39. Tang, X.; Xu, Z.; Wang, Z. A Novel Fault Diagnosis Method of Rolling Bearing Based on Integrated Vision Transformer Model. Sensors 2022, 22, 3878.
  40. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a "siamese" time delay neural network. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 30 November–3 December 1993; pp. 737–744.
  41. Chicco, D. Siamese Neural Networks: An Overview. In Artificial Neural Networks. Methods in Molecular Biology; Cartwright, H., Ed.; Springer Protocols; Humana: New York, NY, USA, 2020; Volume 2190, pp. 73–94.
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
  43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  44. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
  45. Kullback, S. Information Theory and Statistics; Courier Corporation: North Chelmsford, MA, USA, 1997.
  46. MacKay, D.J. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
  47. He, Q.; Li, S.; Li, C.; Zhang, J.; Zhang, A.; Zhou, P. A Hybrid Matching Network for Fault Diagnosis under Different Working Conditions with Limited Data. Comput. Intell. Neurosci. 2022, 2022, 3024590.
  48. Xu, H.; Ding, S.; Zhang, X.; Xiong, H.; Tian, Q. Masked Autoencoders are Robust Data Augmentors. arXiv 2022, arXiv:2206.04846.
  49. Lou, X.; Loparo, K.A. Bearing fault diagnosis based on wavelet transform and fuzzy inference. Mech. Syst. Signal Process. 2004, 18, 1077–1095.
  50. Smith, W.A.; Randall, R.B. Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study. Mech. Syst. Signal Process. 2015, 64–65, 100–131.
  51. Lessmeier, C.; Kimotho, J.K.; Zimmer, D.; Sextro, W. Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: A benchmark data set for data-driven classification. In Proceedings of the European Conference of the Prognostics and Health Management Society, Bilbao, Spain, 5–8 July 2016; pp. 05–08.
  52. Qin, Y.; Yao, Q.; Wang, Y.; Mao, Y. Parameter sharing adversarial domain adaptation networks for fault transfer diagnosis of planetary gearboxes. Mech. Syst. Signal Process. 2021, 160, 107936.
  53. Li, S.; Yang, W.; Zhang, A.; Liu, H.; Huang, J.; Li, C.; Hu, J. A Novel Method of Bearing Fault Diagnosis in Time-Frequency Graphs Using InceptionResnet and Deformable Convolution Networks. IEEE Access 2020, 8, 92743–92753.
  54. Wu, X.; Zhang, Y.; Cheng, C.; Peng, Z. A hybrid classification autoencoder for semi-supervised fault diagnosis in rotating machinery. Mech. Syst. Signal Process. 2021, 149, 107327.
  55. Zhang, W.; Peng, G.; Li, C.; Chen, Y.; Zhang, Z. A New Deep Learning Model for Fault Diagnosis with Good Anti-Noise and Domain Adaptation Ability on Raw Vibration Signals. Sensors 2017, 17, 425.
Figure 1. The overall framework of the proposed method.
Figure 2. Typical Siamese network architecture.
Figure 3. The architecture of the transformer.
Figure 4. The Random mask training strategy.
Figure 5. CWRU bearing fault diagnosis test platform.
Figure 6. Sample generation with overlap.
Figure 7. Results of proposed model training with different loss functions.
Figure 8. Feature visualization via t-SNE (a,b) and confusion matrix (c,d).
Figure 9. The accuracy of the proposed model with different numbers of transformer encoders.
Figure 10. Diagnosis results of the proposed method compared with those of the comparison models.
Figure 11. Results of different sample sizes in a noisy environment. (a) SNR = −4; (b) SNR = 0; (c) SNR = 4; (d) SNR = 8.
Figure 12. Feature visualization via t-SNE in cross-domain tasks.
Figure 13. Test rig of Paderborn bearing dataset.
Figure 13. Test rig of Paderborn bearing dataset.
Micromachines 13 01656 g013
Figure 14. The mean accuracy of cross-domain task with the different number of training samples on the Paderborn dataset. (a) D-E; (b) D-F; (c) E-D; (d) E-F; (e) F-D; (f) F-E.
Table 1. Details of the proposed model.

| No. | Layer Type | Input Size (Width × Depth) | Output Size (Width × Depth) |
|---|---|---|---|
| 1 | Input | / | 64 × 64 × 1 |
| 2 | Patch layer | 64 × 64 × 1 | 8 × 8 × 64 |
| 3 | Patch Flatten | 8 × 8 × 64 | 64 × 64 |
| 4 | Fully-connected | 64 × 64 | 32 × 64 |
| 5 | Class token & position encoder | 32 × 64 | 32 × 65 |
| 6 | Transformer Encoder | 32 × 65 | 32 × 65 |
| 7 | Transformer Encoder | 32 × 65 | 32 × 65 |
| 8 | Fully-connected | 32 × 1 | 1 |
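To make the shapes in rows 1 to 5 of Table 1 concrete, the following pure-Python sketch walks a 64 × 64 input through patch extraction, a shared linear projection and a prepended class token. It assumes non-overlapping 8 × 8 patches (inferred from the 64 × 64 × 1 to 8 × 8 × 64 step) and uses random weights; the position encoding is omitted because it does not change the shapes. All function names are illustrative, and this is not the authors' implementation.

```python
import random


def patchify(image, patch=8):
    """Split an H x W image (list of rows) into flattened, non-overlapping tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


def project(tokens, weights):
    """Apply one shared linear map (weights: in_features x out_features) to each token."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(token, col)) for col in columns] for token in tokens]


def prepend_class_token(tokens, class_token):
    """Return a new sequence with the class token placed first."""
    return [list(class_token)] + tokens


rng = random.Random(0)
image = [[rng.random() for _ in range(64)] for _ in range(64)]
tokens = patchify(image)                                   # 64 tokens, 64 values each
weights = [[rng.gauss(0.0, 0.02) for _ in range(32)] for _ in range(64)]
embedded = project(tokens, weights)                        # 64 tokens, 32 values each
sequence = prepend_class_token(embedded, [0.0] * 32)       # 65 tokens, 32 values each
print(len(tokens), len(tokens[0]), len(embedded[0]), len(sequence))
```

The final sequence has 65 tokens of width 32, which corresponds to the 32 × 65 entry in row 5 of the table.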
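Table 2. Comparison between DKLD and cross-entropy.

| | Cross-Entropy | DKLD |
|---|---|---|
| Equation | $\sum_i P_i \log\left(\frac{1}{Q_i}\right)$ | $\sum_i P_i \log\left(\frac{P_i}{Q_i}\right) + \sum_i Q_i \log\left(\frac{Q_i}{P_i}\right)$ |
| Gradient | $\sum_i \frac{\partial Q_i}{\partial W}\left(-\frac{P_i}{Q_i}\right)$ | $\sum_i \frac{\partial Q_i}{\partial W}\left(1 + \log\frac{Q_i}{P_i} - \frac{P_i}{Q_i}\right)$ |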
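The two losses in Table 2 and the per-component gradient factors can be checked numerically. The sketch below implements both formulas for discrete distributions and compares the DKLD factor, 1 + log(Q_i/P_i) − P_i/Q_i, against a central finite difference. It assumes that P and Q are strictly positive (for example, smoothed labels), since the reverse KL term is undefined when P_i = 0; how the original work handles that case is not stated here, and the function names are illustrative.

```python
import math


def cross_entropy(p, q):
    """sum_i P_i * log(1 / Q_i); terms with P_i = 0 contribute nothing."""
    return sum(pi * math.log(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)


def dkld(p, q):
    """KL(P || Q) + KL(Q || P) for strictly positive P and Q."""
    forward = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    backward = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return forward + backward


def cross_entropy_grad_wrt_q(p, q):
    """Factor multiplying dQ_i/dW in the cross-entropy gradient."""
    return [-pi / qi for pi, qi in zip(p, q)]


def dkld_grad_wrt_q(p, q):
    """Factor multiplying dQ_i/dW in the DKLD gradient."""
    return [1.0 + math.log(qi / pi) - pi / qi for pi, qi in zip(p, q)]


def finite_difference(loss, p, q, index, eps=1e-6):
    """Central difference of loss(p, q) with respect to q[index]."""
    upper = list(q)
    lower = list(q)
    upper[index] += eps
    lower[index] -= eps
    return (loss(p, upper) - loss(p, lower)) / (2 * eps)


p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(dkld(p, q), dkld_grad_wrt_q(p, q))
print([finite_difference(dkld, p, q, i) for i in range(3)])
```

Unlike cross-entropy, the DKLD value is symmetric in P and Q and is zero when the two distributions coincide.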
Table 3. The comparison methods.

| Input Type | Method Name | Implementation Details |
|---|---|---|
| Time-based | WDCNN | Details referred to [55]. |
| Time-based | Siamese CNN | Details referred to [21]. |
| Time-based | PSADAN | Implementation details referred to [52]. |
| Time-based | FSM3 | Details referred to [26]. |
| Time-Frequency | DeIN | Details referred to [53]. |
| Time-Frequency | HCAE | Implementation details referred to [54]. |
| Time-Frequency | SViT (ours) | As shown in Table 1. |
Table 4. Details of the comparison methods. Entries give the layer type with (kernel size/stride) where applicable.

| Layer | WDCNN | Siamese CNN | PSADAN | FSM3 | DeIN | HCAE |
|---|---|---|---|---|---|---|
| 1 | Convolution (64 × 16/16) | Convolution (64 × 16/16) | Convolution (128 × 32/1) | Convolution (64 × 1/16) | Convolution (2 × 2 × 64/2) | Convolution (3 × 3 × 16/2) |
| 2 | Pooling (2 × 16/2) | Pooling (2 × 16/2) | Pooling (4 × 32/4) | Pooling (2 × 1/2) | Offset_low (3 × 3) | Convolution (3 × 3 × 32/2) |
| 3 | Convolution (3 × 32/1) | Convolution (3 × 32/1) | Convolution (32 × 64/1) | Convolution (3 × 1/1) | Inception_Resnet 16 | Convolution (3 × 3 × 32/2) |
| 4 | Pooling (2 × 32/1) | Pooling (2 × 32/1) | Pooling (4 × 64/4) | Pooling (2 × 1/2) | Reduction | Convolution (3 × 3 × 32/2) |
| 5 | Convolution (3 × 64/1) | Convolution (3 × 64/1) | Convolution (8 × 128/1) | Convolution (3 × 1/1) | Offset_pooling (3 × 3) | Flatten layer |
| 6 | Pooling (2 × 64/2) | Pooling (2 × 64/2) | Pooling (4 × 128/4) | Pooling (2 × 1/2) | Pooling (3 × 3/1) | Fully-connected (512 × 64) |
| 7 | Convolution (3 × 64/1) | Convolution (3 × 64/1) | Convolution (3 × 128/1) | Convolution (3 × 1/1) | Convolution (1 × 1/1) | Fully-connected (64 × 32) |
| 8 | Pooling (2 × 64/2) | Pooling (2 × 64/2) | Pooling (4 × 128/4) | Pooling (2 × 1/2) | Dropout | Classifier (fully-connected-Softmax) (32 × 10) |
| 9 | Convolution (3 × 64/1) | Convolution (3 × 64/1) | Convolution (3 × 128/1) | Convolution (3 × 1/1) | Offset_top (3 × 3) | Transposed convolution (3 × 3 × 32/2) |
| 10 | Pooling (2 × 64/2) | Pooling (2 × 64/2) | Pooling (4 × 128/4) | Flatten | GlobalMax_Pooling | Transposed convolution (3 × 3 × 32/2) |
| 11 | Flatten-layer | Flatten-layer | Flatten-layer | Fully Connected | Softmax | Transposed convolution (3 × 3 × 32/2) |
| 12 | Fully-connected (192 × 100) | Fully-connected (192 × 100) | Fully-Connected (512 × 256) | Convolution (3 × 1/1) | Inception-resnet8 | Transposed convolution (3 × 3 × 16/2) |
| 13 | Fully-connected (100 × 10) | Distance layer | Fully-Connected (256 × 128) | Convolution (3 × 1/1) | Reduction | Reconstruction |
| 14 | - | Fully-connected (100 × 1) | Fully-Connected (128 × 10) (128 × 2) | Flatten | Inception-resnet4 | - |
| 15 | - | - | - | Fully Connected | Dropout | - |
| 16 | - | - | - | - | Convolution (2 × 2/1) | - |
| 17 | - | - | - | - | Offset_top | - |
| 18 | - | - | - | - | Pooling | - |
| 19 | - | - | - | - | Softmax | - |
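Table 5. Description of CWRU dataset.

| Fault Location | | None | Ball | Ball | Ball | Inner Race | Inner Race | Inner Race | Outer Race | Outer Race | Outer Race | Load (HP) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fault Diameter (inch) | | 0 | 0.007 | 0.014 | 0.021 | 0.007 | 0.014 | 0.021 | 0.007 | 0.014 | 0.021 | |
| Class Label | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| Dataset A | Train | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 1 |
| Dataset A | Test | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 1 |
| Dataset B | Train | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 2 |
| Dataset B | Test | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 2 |
| Dataset C | Train | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 3 |
| Dataset C | Test | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 3 |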
Table 6. Three different working conditions.

| Dataset | Load (HP) | Rotational Speed (rpm) | Damage Size (10⁻³ in.) |
|---|---|---|---|
| A | 1 | 1772 | 7, 14, 21 |
| B | 2 | 1750 | 7, 14, 21 |
| C | 3 | 1730 | 7, 14, 21 |
Table 7. Ablation experiments with 600 training samples.

| Methods | A-B | A-C | B-A | B-C | C-A | C-B | Average |
|---|---|---|---|---|---|---|---|
| SViT | 97.35 | 93.64 | 95.42 | 97.76 | 88.75 | 93.31 | 94.37 |
| (w/o) Random mask | 94.89 | 87.26 | 85.67 | 90.14 | 87.64 | 82.75 | 88.06 |
| (w/o) Siamese network | 95.13 | 91.73 | 92.11 | 95.82 | 86.46 | 92.15 | 92.23 |
| (w/o) Random mask & Siamese network | 92.01 | 82.41 | 81.43 | 87.21 | 78.82 | 80.21 | 83.68 |
Table 8. Mean classification accuracy (%) with 6000 training samples on CWRU.

| Methods | A-B | A-C | B-A | B-C | C-A | C-B | Average |
|---|---|---|---|---|---|---|---|
| WDCNN | 97.08 | 91.48 | 93.00 | 91.80 | 78.84 | 85.88 | 89.68 |
| Siamese CNN | 99.24 | 90.40 | 88.28 | 90.12 | 60.36 | 65.36 | 82.29 |
| PSADAN | 98.10 | 92.67 | 90.67 | 90.86 | 79.38 | 92.37 | 90.68 |
| FSM3 | 98.14 | 91.54 | 93.54 | 97.36 | 89.44 | 96.24 | 94.38 |
| DeIN | 93.14 | 70.76 | 76.33 | 83.17 | 79.68 | 76.56 | 79.94 |
| HCAE | 98.67 | 82.67 | 89.37 | 90.37 | 80.67 | 76.34 | 86.35 |
| SViT (ours) | 99.54 | 93.82 | 94.24 | 99.85 | 92.24 | 98.78 | 96.41 |
Table 9. Precision (%) comparison for cross-domain task C-A with 6000 training samples on CWRU.

| Methods | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| WDCNN | 76.67 | 78.60 | 79.17 | 79.47 | 77.67 | 79.41 | 79.93 | 76.49 | 81.27 | 80.13 |
| Siamese CNN | 58.30 | 58.53 | 58.79 | 61.97 | 58.78 | 64.67 | 63.10 | 59.08 | 59.94 | 60.54 |
| PSADAN | 75.96 | 83.21 | 80.20 | 79.00 | 79.47 | 81.46 | 77.05 | 80.07 | 78.71 | 79.34 |
| FSM3 | 88.16 | 91.90 | 92.39 | 92.18 | 86.82 | 87.42 | 89.84 | 88.45 | 89.93 | 88.06 |
| DeIN | 79.04 | 81.63 | 80.20 | 77.53 | 81.51 | 77.78 | 81.31 | 80.00 | 79.80 | 78.55 |
| HCAE | 78.21 | 82.56 | 79.19 | 80.67 | 82.23 | 78.48 | 85.32 | 82.90 | 78.07 | 79.80 |
| SViT (ours) | 91.30 | 92.47 | 93.40 | 93.16 | 93.46 | 91.45 | 93.29 | 90.20 | 93.96 | 90.07 |
Table 10. Recall (%) comparison for cross-domain task C-A with 6000 training samples on CWRU.

| Methods | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| WDCNN | 76.67 | 78.33 | 82.33 | 80.00 | 77.67 | 81.00 | 79.67 | 77.00 | 76.67 | 79.33 |
| Siamese CNN | 55.00 | 58.33 | 61.33 | 63.00 | 58.00 | 64.67 | 61.00 | 59.67 | 62.33 | 60.33 |
| PSADAN | 79.00 | 77.67 | 78.33 | 79.00 | 80.00 | 82.00 | 78.33 | 77.67 | 81.33 | 80.67 |
| FSM3 | 89.33 | 87.00 | 89.00 | 90.33 | 90.00 | 88.00 | 91.33 | 89.33 | 89.33 | 91.00 |
| DeIN | 76.67 | 80.00 | 78.33 | 81.67 | 79.33 | 79.33 | 78.33 | 80.00 | 80.33 | 83.00 |
| HCAE | 81.33 | 77.33 | 78.67 | 80.67 | 78.67 | 82.67 | 83.33 | 85.67 | 78.33 | 80.33 |
| SViT (ours) | 91.00 | 90.00 | 89.67 | 95.33 | 95.33 | 92.67 | 92.67 | 92.00 | 93.33 | 90.67 |
Table 11. F1 score (%) comparison for cross-domain task C-A with 6000 training samples on CWRU.

| Methods | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Class 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| WDCNN | 76.67 | 78.46 | 80.72 | 79.73 | 77.67 | 80.20 | 79.80 | 76.74 | 78.90 | 79.73 |
| Siamese CNN | 56.60 | 58.43 | 60.03 | 62.48 | 58.39 | 64.67 | 62.03 | 59.37 | 61.11 | 60.43 |
| PSADAN | 77.45 | 80.34 | 79.26 | 79.00 | 79.73 | 81.73 | 77.69 | 78.85 | 80.00 | 80.00 |
| FSM3 | 88.74 | 89.38 | 90.66 | 91.25 | 88.38 | 87.71 | 90.58 | 88.89 | 89.63 | 89.51 |
| DeIN | 77.83 | 80.81 | 79.26 | 79.55 | 80.41 | 78.55 | 79.80 | 80.00 | 80.07 | 80.71 |
| HCAE | 79.74 | 79.86 | 78.93 | 80.67 | 80.41 | 80.52 | 84.32 | 84.26 | 78.20 | 80.07 |
| SViT (ours) | 91.15 | 91.22 | 91.50 | 94.23 | 94.39 | 92.05 | 92.98 | 91.09 | 93.65 | 90.37 |
Table 12. Working conditions of test bearing on Paderborn dataset.

| Dataset | Rotational Speed [rpm] | Load Torque [Nm] | Radial Force [N] | Name of Setting |
|---|---|---|---|---|
| D | 1500 | 0.7 | 1000 | N15_M07_F10 |
| E | 1500 | 0.1 | 1000 | N15_M01_F10 |
| F | 1500 | 0.7 | 400 | N15_M07_F04 |
Table 13. Datasets used for experiments.

| Fault Location | None | Outer Race | Inner Race |
|---|---|---|---|
| File No. | K001 | Artificial (KA01) | Artificial (KI01) |
| | K002 | Real damages (KA04) | Real damages (KI14) |
Table 14. Details of datasets on Paderborn.

| Dataset | Split | None (Class 1) | Inner Race (Class 2) | Outer Race (Class 3) |
|---|---|---|---|---|
| D | Training | 600 | 600 | 600 |
| D | Testing | 40 | 40 | 40 |
| E | Training | 600 | 600 | 600 |
| E | Testing | 40 | 40 | 40 |
| F | Training | 600 | 600 | 600 |
| F | Testing | 40 | 40 | 40 |
Table 15. Mean classification accuracy (%) with 1800 samples on the Paderborn dataset.

| Methods | D-E | D-F | E-D | E-F | F-D | F-E | Average |
|---|---|---|---|---|---|---|---|
| WDCNN | 90.13 | 97.5 | 94.99 | 93.33 | 95.83 | 91.16 | 93.82 |
| Siamese CNN | 88.98 | 95.83 | 95.83 | 92.5 | 96.13 | 88.19 | 92.91 |
| PSADAN | 94.26 | 92.82 | 97.42 | 95.33 | 96.01 | 90.24 | 94.35 |
| FSM3 | 97.57 | 98.04 | 99.45 | 99.14 | 96.89 | 94.68 | 97.62 |
| DeIN | 90.53 | 98.12 | 91.77 | 89.82 | 98.24 | 94.55 | 93.84 |
| HCAE | 95.67 | 96.84 | 99.67 | 96.26 | 95.76 | 93.67 | 96.31 |
| SViT (ours) | 98.03 | 98.06 | 99.83 | 99.33 | 97.06 | 96.34 | 98.11 |
Table 16. Precision (%) comparison for cross-domain task E-D with 1800 training samples per class on the Paderborn dataset.

| Methods | Class 1 | Class 2 | Class 3 |
|---|---|---|---|
| WDCNN | 78.64 | 79.73 | 78.31 |
| Siamese CNN | 59.21 | 61.02 | 61.03 |
| PSADAN | 81.51 | 76.66 | 80.10 |
| FSM3 | 89.68 | 89.86 | 88.80 |
| DeIN | 80.30 | 79.03 | 79.83 |
| HCAE | 80.64 | 81.15 | 80.39 |
| SViT (ours) | 92.84 | 92.36 | 91.65 |
Table 17. Recall (%) comparison for cross-domain task E-D with 1800 training samples per class on the Paderborn dataset.

| Methods | Class 1 | Class 2 | Class 3 |
|---|---|---|---|
| WDCNN | 79.17 | 78.67 | 78.83 |
| Siamese CNN | 62.17 | 57.67 | 61.33 |
| PSADAN | 80.83 | 78.83 | 78.50 |
| FSM3 | 89.83 | 88.67 | 89.83 |
| DeIN | 80.17 | 79.17 | 79.83 |
| HCAE | 79.83 | 79.67 | 82.67 |
| SViT (ours) | 90.83 | 92.67 | 93.33 |
Table 18. F1 score (%) comparison for cross-domain task E-D with 1800 training samples per class on the Paderborn dataset.

| Methods | Class 1 | Class 2 | Class 3 |
|---|---|---|---|
| WDCNN | 78.90 | 79.19 | 78.57 |
| Siamese CNN | 60.65 | 59.30 | 61.18 |
| PSADAN | 81.17 | 77.73 | 79.29 |
| FSM3 | 89.76 | 89.26 | 89.31 |
| DeIN | 80.23 | 79.10 | 79.83 |
| HCAE | 80.23 | 80.40 | 81.51 |
| SViT (ours) | 91.83 | 92.51 | 92.49 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

He, Q.; Li, S.; Bai, Q.; Zhang, A.; Yang, J.; Shen, M. A Siamese Vision Transformer for Bearings Fault Diagnosis. Micromachines 2022, 13, 1656. https://doi.org/10.3390/mi13101656