1. Introduction
Rolling bearings are extensively used components across various industries. Approximately 30% of mechanical failures in rotating machinery can be attributed to issues with the rolling bearings [1]. Detecting bearing faults at an early stage is crucial to prevent severe failures that can result in production losses and potential casualties. Bearing fault diagnosis (BFD) involves analyzing and processing vibration signals from the bearings to extract relevant features [2]. Traditional fault diagnosis methods primarily encompass time-domain analysis, frequency-domain analysis, and time–frequency-domain analysis [3]. Time-domain analysis examines vibration signals along the time axis and includes time-domain statistical analysis [4]. Frequency-domain analysis applies a Fourier transform to vibration signals to obtain the spectrum; commonly used methods include power spectrum analysis [5], cepstrum analysis [6], and envelope spectrum analysis [7]. Time–frequency-domain analysis combines frequency-domain features with transient information in vibration signals; notable methods include the short-time Fourier transform [8], the Wigner–Ville distribution [9], the wavelet transform [10], and empirical mode decomposition [11]. While visually inspecting the frequency-domain characteristics of measured signals often suffices for diagnosing potential faults, many available techniques require significant expertise to apply successfully. Simpler methods are needed so that relatively unskilled operators can make reliable decisions without relying on diagnostic experts to scrutinize the data and identify problems. Thus, there is a need for a reliable, fast, and automated diagnosis procedure [12].
Machine learning (ML) algorithms [13] have revolutionized BFD by enhancing both the efficiency and accuracy of fault diagnosis. Notably, the k-nearest neighbor (KNN) and support vector machine (SVM) algorithms possess strong data processing capabilities and facilitate automatic fault diagnosis. In one study, the KNN algorithm was combined with naive Bayes to improve fault diagnosis accuracy: the preliminary classification results obtained from KNN were fed into naive Bayes for further classification, yielding effective results [14]. ML algorithms also automate the extraction of fault feature information. For instance, a comprehensive diagnosis was achieved by integrating the whale optimization algorithm with SVM after automatic fault feature extraction [15]. Additionally, Chen et al. [16] utilized the wavelet transform to construct time–frequency matrices of signals, which were then classified using a convolutional neural network (CNN) known for its high generalization performance. Jian et al. [17] devised a CAPSO–DAEN fault diagnosis model based on the depth auto-encoder network (DAEN), in which the cloud adaptive particle swarm optimization (CAPSO) algorithm optimized the network structure by leveraging its randomness and stability; this approach facilitated adaptive feature extraction by reducing weight constraints, thereby improving fault diagnosis. The successful application of these algorithms has significantly enhanced the reliability and automation of BFD systems. However, despite the many effective fault diagnosis methods proposed, most require prior extraction of fault features from vibration signals, which considerably delays the diagnosis process.
The CNN has achieved significant success in image and speech recognition [18,19]. In recent years, its application in the field of BFD has also proven fruitful [20]. Zhang et al. [21] proposed an improved CNN model that demonstrated the effectiveness of the CNN in BFD by converting raw vibration signals into two-dimensional images and using them as the CNN input. Fu et al. [22] developed a CNN integrated with the adaptive batch normalization algorithm, employing large-scale convolution kernels and a small multi-dimensional convolution layer, and achieved excellent fault diagnosis performance. Song et al. [23] used wide kernels in the first two convolution layers of a CNN to obtain a larger receptive field, showing favorable results in diagnosis accuracy, noise suppression, and diagnosis speed. He et al. [24] introduced a transfer learning (TL) method based on a one-dimensional CNN (1D-CNN) for BFD, employing correlation alignment to minimize marginal distribution differences between the source and target domains, and demonstrated strong performance in comparative experiments. Related research can be found in [25], and some studies focus on the importance of reinforcement learning [26]. Alternatively, the GRU, a derivative of the recurrent neural network (RNN) that excels at time-series data, has succeeded in various time-series prediction tasks such as mine gas concentration forecasting [27] and wind speed forecasting [28]. Leveraging the GRU network therefore provides a novel and promising solution for BFD.
In view of the excellent performance of the GRU on time-series data and the practical needs of fault diagnosis, this paper presents a BFD method based on Bayesian optimization and a GRU with a self-attention mechanism. The main contributions are as follows:
- (1)
Multiple fault scenarios of rolling bearings are classified using the novel GRU model.
- (2)
The Bayesian optimization algorithm is used to optimize the structure and hyperparameters of the GRU model.
- (3)
The proposed method is compared with other state-of-the-art algorithms (the CNN and long short-term memory (LSTM)).
- (4)
The self-attention mechanism is used to improve the vibration feature extraction performance of the GRU.
2. Materials and Methods
The implementation strategy of this paper is as follows: (1) The vibration signals of the rolling bearing were taken as the input of the GRU network, and an output (representing the state of the bearing) was obtained. (2) The error between the output and the real label was calculated. (3) This error was taken as the prior probability for Bayesian optimization, and the network structure and hyperparameters of the GRU were optimized accordingly. This process was repeated until the error was sufficiently small (i.e., the most suitable GRU structure and hyperparameters were found). The complete implementation process of Bayesian optimization is shown in Figure 1. (4) The self-attention mechanism was then used to further improve the performance of the GRU, significantly enhancing its vibration feature extraction ability.
2.1. Dataset and Sample Setup
The bearing vibration data for various fault scenarios presented in this paper were obtained from the Bearing Data Center of Case Western Reserve University (CWRU). These data were collected using a bearing accelerometer under different fault conditions [29]. The reliability of this dataset has been extensively validated [30], making it a widely adopted and standard dataset for bearing fault diagnosis (BFD). For the experimental setup, rolling bearings were installed within a motor-driven mechanical system, as depicted in Figure 2. While this study utilizes this canonical dataset to establish a performance baseline and facilitate comparison, we acknowledge that validating the model on additional datasets from different machinery or operational environments would further demonstrate its generality. Such validation is a recommended direction for future work. However, the CWRU dataset provides a robust foundation for evaluation due to its controlled yet comprehensive range of fault types (ball, inner ring, outer ring), severities (fault diameters of 0.007, 0.014, and 0.021 inches), and operational conditions (different motor loads/speeds). The demonstrated ability of the proposed model to achieve high accuracy across these varied scenarios within this benchmark strongly suggests its general applicability to similar fault diagnosis tasks.
The bearing type was a deep groove ball bearing (SKF6205-2RS JEM). The sampling frequency of the vibration data recorder was 12 kHz. Four types of vibration signal datasets (normal, ball fault, inner ring fault, and outer ring fault) were obtained from the corresponding bearings, and the vibration test was carried out at a motor speed of 1797 r/min. Electro-discharge machining (EDM) was employed to introduce the faults into the tested ball, inner ring, and outer ring, respectively. Three severity levels (fault diameters of 0.007, 0.014, and 0.021 inches) were set at each fault location (ball, inner ring, and outer ring). Including the normal scenario, a total of 10 scenarios were tested (see Figure 3 for the vibration signals of Fault scenario 1 to Fault scenario 10). The specific bearing states and fault scenarios are shown in Table 1.
Each scenario includes 120,000 sampling points, and every 100 consecutive sampling points were grouped into one sample. Therefore, each bearing scenario includes 1200 samples, for a total of 12,000 samples (1200 × 10 scenarios); of these, 10,000 samples were used for training and 2000 for testing. Each fault scenario was given a corresponding label (the normal scenario was labeled 0, the inner ring fault with a fault diameter of 0.007 inches was labeled 1, and so on). In general, therefore, the vibration signals were the input of the network (GRU), and the labels were its output.
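To make this sample setup concrete, the following minimal Python sketch (not from the original paper; variable and function names are illustrative) segments each scenario's signal into non-overlapping 100-point windows and attaches the scenario label:

```python
import numpy as np

WINDOW = 100        # sampling points per sample
N_POINTS = 120_000  # sampling points per scenario
N_SCENARIOS = 10    # normal + 9 fault scenarios

def segment_scenario(signal: np.ndarray, label: int):
    """Split one scenario's signal into (samples, labels)."""
    assert signal.size == N_POINTS
    samples = signal.reshape(-1, WINDOW)   # (1200, 100)
    labels = np.full(len(samples), label)  # one label per window
    return samples, labels

# Synthetic data stands in here for the CWRU recordings:
signals = [np.random.randn(N_POINTS) for _ in range(N_SCENARIOS)]
X, y = zip(*(segment_scenario(s, lbl) for lbl, s in enumerate(signals)))
X, y = np.concatenate(X), np.concatenate(y)  # (12000, 100), (12000,)
```

Splitting the resulting 12,000 windows into 10,000 training and 2000 testing samples then reproduces the partition described above.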
2.2. Gated Recurrent Unit
The GRU is a newer RNN architecture that, after multiple version updates, performs excellently in time-series prediction and diagnosis tasks. The GRU was first released in 2014 as an optimization of the LSTM model [31]. Compared with the LSTM, the GRU has fewer parameters and is easier to compute and implement in corresponding tasks [32]. Unlike the LSTM, which has two state units, the GRU contains only one hidden unit (h_t), and its basic structure is similar to that of the traditional RNN. Specifically, the GRU has two gate units, a reset gate and an update gate, giving it fewer structural parameters than the three gates of the LSTM. A 2015 study tested more than 10,000 RNN variants [33]; the results show that the GRU can achieve the same or even better performance than the LSTM while converging faster.
The update gate controls the degree to which the state information from the previous time step is retained in the current state, while the reset gate controls the degree to which the current state is combined with the previous information. The basic structure of the GRU is shown in Figure 4, where the arrows indicate the direction of data flow, “×” denotes element-wise multiplication of matrices, “σ” is the sigmoid activation function, “tanh” is the hyperbolic tangent activation function (both activation functions are shown in Figure 5), and “1−” means that the data propagated along that link is 1 − z_t.
As mentioned earlier (Figure 4), the update gate and reset gate are z_t and r_t, respectively; x_t is the input; and h_t is the output of the hidden unit. Following the standard GRU formulation (consistent with Figure 4), h_t is calculated as follows:

$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \tag{1}$$

$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \tag{2}$$

$$\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \tag{3}$$

$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $\tilde{h}_t$ is the candidate hidden state, and the $W$, $U$, and $b$ terms are learnable weights and biases.
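For illustration, a minimal PyTorch sketch of a GRU classifier consistent with Equations (1)–(4) is given below. This is an assumed reference implementation, not the authors' code; the layer count and hidden size are placeholders for the values later found by Bayesian optimization.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Minimal GRU-based fault classifier for 100-point vibration windows."""
    def __init__(self, n_classes: int = 10, hidden: int = 64, layers: int = 2):
        super().__init__()
        # Each 100-point window is treated as a length-100 sequence of scalars.
        self.gru = nn.GRU(input_size=1, hidden_size=hidden,
                          num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 100) raw vibration windows -> (batch, 100, 1)
        out, _ = self.gru(x.unsqueeze(-1))
        return self.head(out[:, -1, :])  # classify from the last time step

model = GRUClassifier()
logits = model(torch.randn(8, 100))  # (8, 10) class scores
```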
2.3. Bayesian Optimization
The GRU network described above presents a challenge: the values of its structural layers, learning rate, number of training epochs, mini-batch size, and number of hidden units are unknown in advance. Manually selecting and fine-tuning these hyperparameters can be arduous and time-consuming. Bayesian optimization offers a solution by automatically searching for the optimal hyperparameters. The algorithm utilizes a continuously updated probability model based on Bayes' theorem (Equation (5)): the posterior probability of Event A given Event B is proportional to the prior probability of A multiplied by the likelihood of observing B under A. In other words, each subsequent trial is informed by all preceding trials. Bayesian optimization thus serves as a promising method for hyperparameter optimization, in which the most probable optimal parameter combination is inferred through iterative attempts at training network models with different structures:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{5}$$

To obtain the optimal parameter combination, the Bayesian optimization algorithm updates the posterior probability of the optimization function through multiple evaluations of the objective function. Based on the prior conditions (represented in this paper by historical records such as the mean relative errors of the tested network models), it generates reference values for the parameters to be optimized in subsequent model attempts. In our approach, we defined search ranges for the hyperparameters, and the Bayesian optimization algorithm automatically sampled values from these ranges, continuously tried network models with different structures, and recorded the corresponding errors. Based on this historical error information, the algorithm inferred the potentially optimal network configuration. The selection of the model parameter combination is expressed in Equation (6):

$$x^{*} = \arg\min_{x \in X} f(x) \tag{6}$$

where $P(A \mid B)$ and $P(A)$ are the posterior and prior probabilities of Event A, respectively, and $P(B)$ is the probability of the observations obtained from the previous events; $f(x)$ is the objective function, i.e., the prediction error $Mre$ (Equation (7)), which measures the fraction of samples mispredicted between the prediction results (PR) and the real labels (GT); $x^{*}$ is the optimal parameter combination; and $X$ is the value range of the parameters:

$$Mre = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(PR_i \neq GT_i\right) \tag{7}$$

where $N$ is the number of evaluated samples and $\mathbb{1}(\cdot)$ is the indicator function.
The Bayesian optimization algorithm constructs a probability model for the objective function by utilizing historical evaluation results. When determining the next set of parameter combinations, the algorithm leverages the prior evaluation information to expedite the parameter search. As a result, the obtained parameters are more likely to be optimal. In this study, the GRU architecture and the following hyperparameters were optimized: the number of GRU layers, the learning rate, the number of epochs, the mini-batch size, and the number of hidden units. The search ranges of these five parameters were set to [1–3], [10⁻⁴–1], [20–200], [20–80], and [0–100], respectively.
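As a hedged illustration of this search: the paper itself used MATLAB's bayesopt, but a comparable Gaussian-process-based loop can be sketched with scikit-optimize's gp_minimize. Here, train_and_evaluate is a hypothetical stand-in for training the GRU of Section 2.2 and returning its Mre (Equation (7)) on held-out data.

```python
import random
from skopt import gp_minimize
from skopt.space import Integer, Real

# Search ranges follow Section 2.3; the lower bound for hidden units is set
# to 1 here, since a network with 0 hidden units would be degenerate.
space = [
    Integer(1, 3, name="num_layers"),
    Real(1e-4, 1.0, prior="log-uniform", name="learning_rate"),
    Integer(20, 200, name="max_epochs"),
    Integer(20, 80, name="mini_batch_size"),
    Integer(1, 100, name="hidden_units"),
]

def train_and_evaluate(layers, lr, epochs, batch, hidden) -> float:
    # Hypothetical placeholder: train the GRU with these hyperparameters
    # and return the validation Mre. Replaced here by a random stub so
    # that the sketch runs end to end.
    return random.random()

def objective(params):
    layers, lr, epochs, batch, hidden = params
    return train_and_evaluate(layers, lr, epochs, batch, hidden)

# 100 iterations, mirroring the optimization budget reported in Section 3.
result = gp_minimize(objective, space, n_calls=100, random_state=0)
print("best Mre:", result.fun, "best params:", result.x)
```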
2.4. Self-Attention Mechanism
In deep learning, the attention mechanism is a technique that allows a model to focus selectively on parts of its input, which is especially valuable for long sequences or complex data. In practice, it involves calculating attention scores that reflect the relevance of each input element to the current model step; these scores are used to compute a weighted sum that forms the model output. Various attention mechanisms exist, e.g., additive, multiplicative, and self-attention. Self-attention, also known as Transformer attention, has gained popularity owing to its capacity to model long-range dependencies in sequential data, and it enhances deep learning model performance across applications such as natural language processing and computer vision. It is thus a promising tool for accurately capturing structural characteristics in vibration signals.
The attention mechanism used in this paper is shown in Figure 6. The input X is linearly mapped to obtain the Q, K, and V matrices (Equations (8)–(10)). The similarity is calculated as the dot product of the Q and K matrices, scaled by the square root of D_k to prevent the computed values from becoming too large, and then normalized with the softmax function; the final output is obtained as the weighted sum of V (Equation (11)):

$$Q = X W_Q \tag{8}$$

$$K = X W_K \tag{9}$$

$$V = X W_V \tag{10}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{D_k}}\right) V \tag{11}$$

where $W_Q$, $W_K$, and $W_V$ are learnable projection matrices and $D_k$ is the dimension of the key vectors. In this paper, the self-attention mechanism was used to improve the feature extraction performance of the GRU for bearing vibration signals.
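A minimal single-head PyTorch sketch of this scaled dot-product self-attention, mirroring Equations (8)–(11), is shown below; the layer dimensions are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention."""
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)  # Q = X W_Q
        self.w_k = nn.Linear(d_model, d_k, bias=False)  # K = X W_K
        self.w_v = nn.Linear(d_model, d_k, bias=False)  # V = X W_V

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # Scaled similarity, softmax normalization, weighted sum of V.
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return torch.softmax(scores, dim=-1) @ v

attn = SelfAttention(d_model=64, d_k=64)
out = attn(torch.randn(8, 100, 64))  # (batch, time, features) -> same shape
```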
The optimal placement of the self-attention layer is often task-dependent. For vibration signal analysis, the initial layers of a network learn to detect low-level temporal features and transient patterns that are highly discriminative for fault diagnosis. Positioning the self-attention mechanism at these early stages (e.g., L1 or L2) allows the model to directly emphasize crucial time-points in the raw input signal, such as the impulsive excitations caused by a fault striking a surface. This early, dynamic filtering of information helps to suppress noise and irrelevant background vibrations before further processing, effectively enhancing the signal-to-noise ratio for subsequent GRU layers. When placed in deeper layers, the self-attention mechanism operates on higher-level, more abstract features, which may have already lost the precise temporal resolution needed to highlight short-duration fault impacts, leading to a potential decrease in performance.
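To illustrate these placement choices, the following sketch (an illustrative assumption, reusing the SelfAttention module from the previous block) exposes the attention position as a parameter, where position 0 corresponds to the early placement applied to the raw input features.

```python
import torch
import torch.nn as nn

class GRUWithAttention(nn.Module):
    """GRU stack with a self-attention layer at a configurable depth:
    position 0 applies it to the projected input (early placement);
    position k applies it after the k-th GRU layer (deeper placement).
    Sizes are illustrative placeholders."""
    def __init__(self, n_classes=10, hidden=64, layers=2, attn_pos=0):
        super().__init__()
        self.attn_pos = attn_pos
        self.proj = nn.Linear(1, hidden)  # lift scalar samples to d_model
        self.grus = nn.ModuleList(
            [nn.GRU(hidden, hidden, batch_first=True) for _ in range(layers)])
        self.attn = SelfAttention(d_model=hidden, d_k=hidden)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h = self.proj(x.unsqueeze(-1))  # (batch, time, hidden)
        if self.attn_pos == 0:
            h = self.attn(h)            # early placement on raw features
        for i, gru in enumerate(self.grus, start=1):
            h, _ = gru(h)
            if self.attn_pos == i:
                h = self.attn(h)        # deeper placement
        return self.head(h[:, -1, :])

model = GRUWithAttention(attn_pos=0)   # early placement, as favored above
scores = model(torch.randn(8, 100))    # (8, 10) class scores
```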
3. Results
This section comprises three parts:
- (1)
The diagnosis results based on the GRU and Bayesian optimization, together with an analysis of the optimized GRU structure and hyperparameters.
- (2)
Bayesian optimization applied to the CNN and LSTM, with their diagnosis results compared against those of the GRU.
- (3)
Application of the proposed method to fault diagnosis of the motor at different rotation speeds.
3.1. Bayesian Optimization and Diagnosis Results
The process of Bayesian optimization is shown in Figure 7. The results show that as the number of optimization iterations increases, the detection error of the GRU gradually decreases; the final diagnosis error was only about 0.021.
This paper then analyzes the optimized parameters (the number of layers, learning rate, mini-batch size, and maximum epochs of the GRU). The results are shown in Figure 8: (1) The error was small when the GRU had one or two layers, reaching its minimum with two layers, and grew larger with three layers. (2) A small diagnosis error was obtained when the learning rate was close to 0, with the best diagnosis performance at a learning rate of 0.0436. (3) The influence of the mini-batch size on the diagnosis error showed no obvious pattern; the best diagnosis performance was obtained with a mini-batch size of 25. (4) The lowest diagnosis errors were observed when the maximum number of epochs was set to values around 20; the Bayesian optimization algorithm identified the exact optimal value as 21. The optimal network structure is shown in Figure 9.
The best network structure of the GRU and the training process under its optimized hyperparameters are shown in Figure 10a. After nine epochs of training, the network reached a relatively stable state, and the detection accuracy on the final testing samples was 97.9% (Figure 10b).
3.2. Comparative Study
To demonstrate that the combination of Bayesian optimization and the GRU has excellent detection performance, this paper compares it with the combinations of Bayesian optimization + LSTM and Bayesian optimization + CNN. The optimization processes of the three methods are shown in Figure 11. The results show that Bayesian optimization + CNN had the fastest error reduction at the beginning and maintained a relatively stable error in the later stage; Bayesian optimization + LSTM ranked second, with a slight downward error trend in the later stage; and although the error of Bayesian optimization + GRU began to decline later, it maintained a steady downward trend in the later stage and ultimately achieved the best detection performance.
The best network structure of the LSTM and the training process under its optimized hyperparameters are shown in Figure 12a. After 11 epochs of training, the network reached a relatively stable state, and the detection accuracy on the final testing samples was 95.1% (Figure 12b).
The best network structure of the CNN and the training process under its optimized hyperparameters are shown in Figure 13a. Even after 1250 epochs of training, the network had still not reached a relatively stable state, and the detection accuracy on the final testing samples was only 90.0% (Figure 13b).
Figure 14 shows the confusion matrices of the best detection results of Bayesian optimization + GRU, Bayesian optimization + LSTM, and Bayesian optimization + CNN. The results show that: (1) The GRU has an excellent diagnosis effect on all fault states; its worst performance was on State 1, with 12 of 200 samples misclassified. (2) The diagnosis performance of the LSTM on States 3, 7, and 9 was not ideal, with 14 samples misclassified in State 3, 24 in State 7, and 17 in State 9. (3) The CNN's diagnosis performance on States 4, 5, 7, and 9 was not ideal, with 40 samples misclassified in State 4, 50 in State 5, 28 in State 7, and 52 in State 9.
Figure 15 compares the fault diagnosis accuracy and inference time of the three methods. The results show that (1) the GRU has the highest accuracy (97.9%), which is 2.8% and 7.9% higher than the LSTM and CNN, respectively; and (2) the GRU has the shortest inference time (0.27 s), which is 0.02 s and 0.1 s lower than the LSTM and CNN, respectively. In general, the GRU offers high diagnosis accuracy at low diagnosis cost.
While Bayesian optimization finds optimal hyperparameters with far fewer evaluations than methods such as grid or random search, it is still a computationally intensive process. The expense arises primarily from the repeated training and validation of the GRU model for each set of hyperparameters proposed by the Bayesian algorithm.
The optimization process for the GRU model in this study, which involved searching across five hyperparameters over 100 iterations, required approximately 1.5 h to complete. This was performed on a workstation equipped with an NVIDIA GeForce RTX 3080 GPU, an Intel Core i7-12700K CPU, and 32 GB of RAM, using MATLAB R2024a with its built-in Bayesian optimization framework (the bayesopt function).
Although this initial investment in computational time is significant, it is a one-time cost. The resulting optimally configured model then provides superior performance and can be deployed for inference very rapidly (as shown in Figure 15, the inference time was only 0.27 s). This trade-off is often favorable in industrial and research settings where high diagnostic accuracy and reliability are paramount. Furthermore, this process is vastly more efficient than a manual hyperparameter search, which is often intractable and unlikely to find a globally optimal configuration.
3.3. Influence of Self-Attention Layer on Diagnosis Results
The location of the self-attention layer within a deep learning model can lead to variations in the results. This paper therefore analyzes GRU networks based on self-attention mechanisms, aiming to determine the optimal GRU network for diagnosing bearing faults. Self-attention layers were added at different locations to compare diagnosis performance, as illustrated in Figure 16. The diagnosis results, shown in Table 2, indicate that as the self-attention layer is placed deeper within the GRU, the accuracy gradually decreases (to a minimum of 98%), yet remains higher than the diagnosis accuracy of the original GRU (97.9%). This also confirms that when the self-attention layer is placed closer to the input layer of the raw data, it can capture more vibration signal features, thereby improving the fault diagnosis accuracy of the GRU.
This performance trend can be attributed to the role of early-layer attention in directly highlighting primary fault features from the input signal. Visualization of the attention weights for the L1 placement confirmed that the mechanism successfully learned to assign high weights to transient peaks corresponding to fault-induced impacts. As the self-attention layer was moved deeper, the focus became more dispersed over time, leading to a gradual dilution of feature selectivity and a corresponding drop in accuracy. This ablation study confirms that for vibration-based fault diagnosis, applying self-attention to low-level features is most effective for capturing the short-duration, high-impact patterns characteristic of bearing faults.
3.4. Scenario Promotion
To validate the detection performance of the proposed method across a variety of scenarios, the diagnosis task was also carried out at motor speeds of 1772, 1750, and 1730 r/min. The diagnosis accuracy of the GRU was 97.3%, 97.9%, and 98.3%, respectively (Figure 17), which is highly encouraging. The detailed diagnosis results of the GRU, LSTM, and CNN at different speeds are shown in Table 3. In general, the GRU obtains more accurate diagnosis results than the other algorithms.
3.5. Analysis of Computational Cost
A critical consideration for the practical deployment of any intelligent fault diagnosis system is its computational efficiency (as shown in Table 4). While the proposed Bayesian-optimized GRU model demonstrates superior inference speed (0.27 s), the computational cost associated with the Bayesian optimization (BO) process itself must be acknowledged. The BO process for the GRU model, which involved 100 iterations of training and validation, required approximately 1.5 h to complete on our hardware setup. This represents a significant, but one-time, offline investment in model development.
This cost is a trade-off for automating the search for an optimal model configuration that would be intractable to find manually. The result of this process is a highly efficient model whose low-latency inference makes it ideally suited for real-time industrial applications. Future work will explore techniques such as transfer learning to reduce this initial optimization overhead, thereby enhancing the overall feasibility of the approach for rapid deployment across industrial assets.
3.6. Analysis of Misclassifications
A deeper analysis of the confusion matrices in Figure 14 reveals specific patterns that illuminate the models' limitations. For the proposed GRU model, the highest number of misclassifications occurred in State 1 (inner ring fault, 0.007-inch diameter). This is likely because a small, incipient fault generates a very weak signal that is easily contaminated by background noise, making its features difficult to distinguish from the normal state (State 0) or other low-severity faults.
The LSTM model showed pronounced difficulties with States 3, 7, and 9. State 3 (Inner Ring fault, 0.021-inch diameter) and State 9 (Outer Ring fault, 0.021-inch diameter) both represent large-diameter faults. It is possible that the LSTM’s more complex gating mechanism made it prone to overfitting on certain features of these pronounced faults, reducing its generalization ability to the test set. Its poor performance on State 7 (Outer Ring fault, 0.007-inch diameter), a small-diameter fault, further suggests a potential lack of robustness in handling features across vastly different fault severities.
The CNN’s errors were heavily concentrated in fault states related to the ball (States 4, 5) and the outer ring (States 7, 9). This may indicate a limitation of the CNN architecture in effectively modeling the long-range temporal dependencies and specific modulation patterns characteristic of these fault types, which are more readily captured by recurrent architectures like GRU and LSTM.
These observations suggest that while the proposed GRU model is superior overall, its performance could be further improved by incorporating techniques to enhance feature learning from low-signal, noisy samples (e.g., advanced data augmentation, denoising autoencoders).
4. Discussion
This study proposes a novel BFD method integrating a GRU network, Bayesian hyperparameter optimization, and a self-attention mechanism. The results demonstrated its superior performance. The key findings can be interpreted as follows: Firstly, the GRU model, after Bayesian optimization, achieved an accuracy of 97.9%, outperforming similarly optimized LSTM (95.1%) and CNN (90.0%) models. This superiority can be attributed to the GRU’s inherent efficiency in capturing temporal dependencies in vibration signal data with a simpler structure than LSTM, making it less prone to overfitting on this specific task and dataset scale. Secondly, the integration of the self-attention mechanism further boosted the accuracy to 99.6%. The mechanism allows the model to dynamically weigh the importance of different time steps in the vibration signal, effectively highlighting crucial fault-related features and mitigating the influence of less informative noise. The finding that placing the attention layer closer to the input yielded the best performance suggests its primary role is in enhancing low-level feature extraction. Thirdly, the model maintained high accuracy (>97%) across different motor speeds (1772 rpm, 1750 rpm, and 1730 rpm), demonstrating its robustness to variations in operational conditions—a critical factor for practical applications. The method presents a strong balance of high accuracy, computational efficiency in inference, and robustness.
5. Conclusions
This paper presents a BFD method based on the GRU and Bayesian optimization. Bayesian optimization was used to optimize the network structure and hyperparameters of the GRU until the most suitable configuration was obtained, which significantly improved the accuracy of the GRU in BFD. Comparison with other strong detection algorithms (LSTM and CNN) highlights the excellent performance of the GRU in BFD. Subsequently, the self-attention mechanism was used to further improve the performance of the GRU. Finally, the method was applied to other scenarios (bearings at different rotation speeds).
According to the above research results, the following conclusions are drawn:
Bayesian optimization can significantly improve the diagnosis accuracy of GRU, and the optimal accuracy was 97.9%;
The diagnosis accuracy of GRU was 2.8% higher than that of LSTM and 7.9% higher than that of the CNN;
With the self-attention mechanism, the diagnostic accuracy of GRU has reached 99.6%, and the optimal position of the self-attention layer has been determined.
Under different bearing rotation speeds, the GRU with self-attention mechanisms still has excellent diagnosis performance, and the diagnosis accuracy of all scenarios was higher than 97%.
Despite the promising results, this study has certain limitations. The primary limitation is that the validation was conducted primarily on a widely used but laboratory-based benchmark dataset (CWRU). Its performance on data from more diverse industrial environments, with higher noise levels and different bearing types, needs further investigation.
Based on these conclusions and limitations, future work will focus on the following:
Validating the proposed method on additional real-world industrial datasets to comprehensively assess its generalizability and robustness.
Extending the application to more complex fault scenarios, such as compound faults and gradually worsening faults.
Exploring knowledge transfer or meta-learning techniques to reduce the computational cost of the Bayesian optimization process for new tasks.
Incorporating explainable AI (XAI) techniques to interpret the model’s decision-making process, thereby increasing trust and facilitating its adoption by domain experts.