1. Introduction
Rolling bearings are extensively used components across various industries. Approximately 30% of mechanical failures in rotating machinery can be attributed to issues with the rolling bearings [1]. Detecting bearing faults at an early stage is crucial to prevent severe failures that can result in production losses and potential casualties. Bearing fault diagnosis (BFD) involves analyzing and processing vibration signals from the bearings to extract relevant features [2]. Traditional fault diagnosis methods primarily encompass time-domain analysis, frequency-domain analysis, and time–frequency-domain analysis [3]. Time-domain analysis examines vibration signals along the time axis and includes time-domain statistical analysis [4]. Frequency-domain analysis applies a Fourier transform to vibration signals to obtain the spectrum; commonly used methods include power spectrum analysis [5], cepstrum analysis [6], and envelope spectrum analysis [7]. Time–frequency-domain analysis combines frequency-domain features with transient information in vibration signals; notable methods include the short-time Fourier transform [8], the Wigner–Ville distribution [9], the wavelet transform [10], and empirical mode decomposition [11]. While visually inspecting the frequency-domain characteristics of measured signals often suffices for diagnosing potential faults, many available techniques require significant expertise to apply successfully. Simpler methods are needed so that relatively unskilled operators can make reliable decisions without relying on diagnostic experts to scrutinize the data and identify problems. Thus, there is a need for a reliable, fast, and automated diagnosis procedure [12].
Machine learning (ML) algorithms [13] have revolutionized BFD by enhancing both the efficiency and accuracy of fault diagnosis. Notably, the k-nearest neighbor (KNN) and support vector machine (SVM) algorithms possess strong data processing capabilities and facilitate automatic fault diagnosis. In one study, the KNN algorithm was combined with naive Bayes to improve fault diagnosis accuracy: the preliminary classification results obtained from KNN were fed into naive Bayes for further classification, yielding effective results [14]. ML algorithms also automate the extraction of fault feature information. For instance, a comprehensive diagnosis was achieved by integrating the whale optimization algorithm with SVM after automatic fault feature extraction [15]. Additionally, Chen et al. [16] utilized the wavelet transform to construct time–frequency matrices of signals, which were then classified using a convolutional neural network (CNN) known for its high generalization performance. Jian et al. [17] devised a CAPSO–DAEN fault diagnosis model based on the depth auto-encoder network (DAEN), in which the cloud adaptive particle swarm optimization (CAPSO) algorithm optimized the network structure by leveraging its randomness and stability; this approach facilitated adaptive feature extraction by reducing weight constraints, thereby improving fault diagnosis. The successful application of these algorithms has significantly enhanced the reliability and automation of BFD systems. However, despite the many effective fault diagnosis methods proposed, most require prior extraction of fault features from vibration signals, which considerably delays the diagnosis process.
The CNN has achieved significant success in image and speech recognition [18,19]. In recent years, its application in the field of BFD has also proven fruitful [20]. Zhang et al. [21] proposed an improved CNN model that demonstrated the effectiveness of the CNN in BFD by converting raw vibration signals into two-dimensional images and using them as the CNN input. Fu et al. [22] developed a CNN integrated with the adaptive batch normalization algorithm, employing large-scale convolution kernels and a small multi-dimensional convolution layer, and achieved excellent fault diagnosis performance. Song et al. [23] used wide kernels in the first two convolution layers of a CNN to obtain a larger receptive field, showing favorable results in diagnosis accuracy, noise suppression, and diagnosis speed. He et al. [24] introduced a transfer learning (TL) method based on a one-dimensional CNN (1D-CNN) for BFD, employing correlation alignment to minimize marginal distribution differences between the source and target domains, and demonstrated strong performance in comparative experiments. Related research can be found in [25], and some studies focus on the importance of reinforcement learning [26]. Alternatively, the GRU, a derivative of the recurrent neural network (RNN) that excels at time-series data, has succeeded in various time-series prediction tasks such as mine gas concentration forecasting [27] and wind speed forecasting [28]. Leveraging the GRU network therefore provides a novel and promising solution for BFD.
In view of the excellent performance of the GRU on time-series data and the practical needs of fault diagnosis, this paper presents a BFD method based on Bayesian optimization and a GRU with a self-attention mechanism. The main contributions are as follows:
- (1)
Multiple fault scenarios of rolling bearings are classified using the novel GRU model.
- (2)
The Bayesian optimization algorithm is used to optimize the structure and hyperparameters of the GRU model.
- (3)
The proposed method is compared with other state-of-the-art algorithms (the CNN and long short-term memory (LSTM)).
- (4)
The self-attention mechanism is used to improve the vibration feature extraction performance of the GRU.
2. Materials and Methods
The implementation strategy of this paper is as follows: (1) The vibration signals of the rolling bearing were taken as the input of the GRU network, and an output (representing the state of the bearing) was obtained. (2) The error between the output and the real label was calculated. (3) This error was taken as the prior probability for Bayesian optimization, and the network structure and hyperparameters of the GRU were optimized accordingly. This process was repeated until the error was sufficiently small (i.e., the most suitable GRU structure and hyperparameters were found). The complete implementation process of Bayesian optimization is shown in Figure 1. (4) The self-attention mechanism was then used to further improve the performance of the GRU, significantly enhancing its vibration feature extraction ability.
2.1. Dataset and Sample Setup
The bearing vibration data for various fault scenarios presented in this paper were obtained from the Bearing Data Center of Case Western Reserve University (CWRU). These data were collected using a bearing accelerometer under different fault conditions [29]. The reliability of this dataset has been extensively validated [30], making it a widely adopted and standard dataset for bearing fault diagnosis (BFD). For the experimental setup, rolling bearings were installed within a motor-driven mechanical system, as depicted in Figure 2. While this study utilizes this canonical dataset to establish a performance baseline and facilitate comparison, we acknowledge that validating the model on additional datasets from different machinery or operational environments would further demonstrate its generality. Such validation is a recommended direction for future work. However, the CWRU dataset provides a robust foundation for evaluation due to its controlled yet comprehensive range of fault types (ball, inner ring, outer ring), severities (fault diameters of 0.007, 0.014, and 0.021 inches), and operational conditions (different motor loads/speeds). The demonstrated ability of the proposed model to achieve high accuracy across these varied scenarios within this benchmark strongly suggests its general applicability to similar fault diagnosis tasks.
The bearing type was a deep groove ball bearing (SKF6205-2RS JEM). The sampling frequency of the vibration data recorder was 12 kHz. Four types of vibration signal datasets (normal, ball fault, inner ring fault, and outer ring fault) were obtained from the corresponding bearings, and the vibration test was carried out at a motor speed of 1797 r/min. Electro-discharge machining (EDM) was employed to introduce the faults into the tested ball, inner ring, and outer ring, respectively. Three severity levels (fault diameters of 0.007, 0.014, and 0.021 inches) were set at each fault location (ball, inner ring, and outer ring). Including the normal scenario, a total of 10 scenarios were tested (see Figure 3 for the vibration signals of Fault scenario 1 to Fault scenario 10). The specific bearing states and fault scenarios are shown in Table 1.
Each scenario includes 120,000 sampling points, and every 100 consecutive sampling points were grouped into one sample. Therefore, each bearing scenario includes 1200 samples, for a total of 12,000 samples (1200 × 10 scenarios); of these, 10,000 samples were used for training and 2000 for testing. Each fault scenario was given a corresponding label (the normal scenario was labeled 0, the inner ring fault with a fault diameter of 0.007 inches was labeled 1, and so on). In general, therefore, the vibration signals were the input of the network (GRU), and the labels were its output.
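To make this sample setup concrete, the following minimal Python sketch (not from the original paper; variable and function names are illustrative) segments each scenario's signal into non-overlapping 100-point windows and attaches the scenario label:

```python
import numpy as np

WINDOW = 100        # sampling points per sample
N_POINTS = 120_000  # sampling points per scenario
N_SCENARIOS = 10    # normal + 9 fault scenarios

def segment_scenario(signal: np.ndarray, label: int):
    """Split one scenario's signal into (samples, labels)."""
    assert signal.size == N_POINTS
    samples = signal.reshape(-1, WINDOW)   # (1200, 100)
    labels = np.full(len(samples), label)  # one label per window
    return samples, labels

# Synthetic data stands in here for the CWRU recordings:
signals = [np.random.randn(N_POINTS) for _ in range(N_SCENARIOS)]
X, y = zip(*(segment_scenario(s, lbl) for lbl, s in enumerate(signals)))
X, y = np.concatenate(X), np.concatenate(y)  # (12000, 100), (12000,)
```

Splitting the resulting 12,000 windows into 10,000 training and 2000 testing samples then reproduces the partition described above.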
2.2. Gated Recurrent Unit
The GRU is a newer RNN architecture that, after multiple version updates, performs excellently in time-series prediction and diagnosis tasks. The GRU was first released in 2014 as an optimization of the LSTM model [31]. Compared with the LSTM, the GRU has fewer parameters and is easier to compute and implement in corresponding tasks [32]. Unlike the LSTM, which has two state units, the GRU contains only one hidden unit (h_t), and its basic structure is similar to that of the traditional RNN. Specifically, the GRU has two gate units, a reset gate and an update gate, giving it fewer structural parameters than the three gates of the LSTM. A 2015 study tested more than 10,000 RNN variants [33]; the results show that the GRU can achieve the same or even better performance than the LSTM while converging faster.
The update gate controls the degree to which the state information from the previous time step is retained in the current state, while the reset gate controls the degree to which the current state is combined with the previous information. The basic structure of the GRU is shown in Figure 4, where the arrows indicate the direction of data flow, “×” denotes element-wise multiplication of matrices, “σ” is the sigmoid activation function, “tanh” is the hyperbolic tangent activation function (both activation functions are shown in Figure 5), and “1−” means that the data propagated along that link is 1 − z_t.
As mentioned earlier (Figure 4), the update gate and reset gate are z_t and r_t, respectively; x_t is the input; and h_t is the output of the hidden unit. Following the standard GRU formulation (consistent with Figure 4), h_t is calculated as follows:

$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \tag{1}$$

$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \tag{2}$$

$$\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \tag{3}$$

$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $\tilde{h}_t$ is the candidate hidden state, and the $W$, $U$, and $b$ terms are learnable weights and biases.
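For illustration, a minimal PyTorch sketch of a GRU classifier consistent with Equations (1)–(4) is given below. This is an assumed reference implementation, not the authors' code; the layer count and hidden size are placeholders for the values later found by Bayesian optimization.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Minimal GRU-based fault classifier for 100-point vibration windows."""
    def __init__(self, n_classes: int = 10, hidden: int = 64, layers: int = 2):
        super().__init__()
        # Each 100-point window is treated as a length-100 sequence of scalars.
        self.gru = nn.GRU(input_size=1, hidden_size=hidden,
                          num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 100) raw vibration windows -> (batch, 100, 1)
        out, _ = self.gru(x.unsqueeze(-1))
        return self.head(out[:, -1, :])  # classify from the last time step

model = GRUClassifier()
logits = model(torch.randn(8, 100))  # (8, 10) class scores
```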
2.3. Bayesian Optimization
The GRU network described above presents a challenge: the values of its structural layers, learning rate, number of training epochs, mini-batch size, and number of hidden units are unknown in advance. Manually selecting and fine-tuning these hyperparameters can be arduous and time-consuming. Bayesian optimization offers a solution by automatically searching for the optimal hyperparameters. The algorithm utilizes a continuously updated probability model based on Bayes' theorem (Equation (5)): the posterior probability of Event A given Event B is proportional to the prior probability of A multiplied by the likelihood of observing B under A. In other words, each subsequent trial is informed by all preceding trials. Bayesian optimization thus serves as a promising method for hyperparameter optimization, in which the most probable optimal parameter combination is inferred through iterative attempts at training network models with different structures:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{5}$$

To obtain the optimal parameter combination, the Bayesian optimization algorithm updates the posterior probability of the optimization function through multiple evaluations of the objective function. Based on the prior conditions (represented in this paper by historical records such as the mean relative errors of the tested network models), it generates reference values for the parameters to be optimized in subsequent model attempts. In our approach, we defined search ranges for the hyperparameters, and the Bayesian optimization algorithm automatically sampled values from these ranges, continuously tried network models with different structures, and recorded the corresponding errors. Based on this historical error information, the algorithm inferred the potentially optimal network configuration. The selection of the model parameter combination is expressed in Equation (6):

$$x^{*} = \arg\min_{x \in X} f(x) \tag{6}$$

where $P(A \mid B)$ and $P(A)$ are the posterior and prior probabilities of Event A, respectively, and $P(B)$ is the probability of the observations obtained from the previous events; $f(x)$ is the objective function, i.e., the prediction error $Mre$ (Equation (7)), which measures the fraction of samples mispredicted between the prediction results (PR) and the real labels (GT); $x^{*}$ is the optimal parameter combination; and $X$ is the value range of the parameters:

$$Mre = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(PR_i \neq GT_i\right) \tag{7}$$

where $N$ is the number of evaluated samples and $\mathbb{1}(\cdot)$ is the indicator function.
The Bayesian optimization algorithm constructs a probability model for the objective function by utilizing historical evaluation results. When determining the next set of parameter combinations, the algorithm leverages the prior evaluation information to expedite the parameter search. As a result, the obtained parameters are more likely to be optimal. In this study, the GRU architecture and the following hyperparameters were optimized: the number of GRU layers, the learning rate, the number of epochs, the mini-batch size, and the number of hidden units. The search ranges of these five parameters were set to [1–3], [10⁻⁴–1], [20–200], [20–80], and [0–100], respectively.
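As a hedged illustration of this search: the paper itself used MATLAB's bayesopt, but a comparable Gaussian-process-based loop can be sketched with scikit-optimize's gp_minimize. Here, train_and_evaluate is a hypothetical stand-in for training the GRU of Section 2.2 and returning its Mre (Equation (7)) on held-out data.

```python
import random
from skopt import gp_minimize
from skopt.space import Integer, Real

# Search ranges follow Section 2.3; the lower bound for hidden units is set
# to 1 here, since a network with 0 hidden units would be degenerate.
space = [
    Integer(1, 3, name="num_layers"),
    Real(1e-4, 1.0, prior="log-uniform", name="learning_rate"),
    Integer(20, 200, name="max_epochs"),
    Integer(20, 80, name="mini_batch_size"),
    Integer(1, 100, name="hidden_units"),
]

def train_and_evaluate(layers, lr, epochs, batch, hidden) -> float:
    # Hypothetical placeholder: train the GRU with these hyperparameters
    # and return the validation Mre. Replaced here by a random stub so
    # that the sketch runs end to end.
    return random.random()

def objective(params):
    layers, lr, epochs, batch, hidden = params
    return train_and_evaluate(layers, lr, epochs, batch, hidden)

# 100 iterations, mirroring the optimization budget reported in Section 3.
result = gp_minimize(objective, space, n_calls=100, random_state=0)
print("best Mre:", result.fun, "best params:", result.x)
```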
2.4. Self-Attention Mechanism
In deep learning, the attention mechanism is a technique that allows a model to focus selectively on parts of its input, which is especially valuable for long sequences or complex data. In practice, it involves calculating attention scores that reflect the relevance of each input element to the current model step; these scores are used to compute a weighted sum that forms the model output. Various attention mechanisms exist, e.g., additive, multiplicative, and self-attention. Self-attention, also known as Transformer attention, has gained popularity owing to its capacity to model long-range dependencies in sequential data, and it enhances deep learning model performance across applications such as natural language processing and computer vision. It is thus a promising tool for accurately capturing structural characteristics in vibration signals.
The attention mechanism used in this paper is shown in Figure 6. The input X is linearly mapped to obtain the Q, K, and V matrices (Equations (8)–(10)). The similarity is calculated as the dot product of the Q and K matrices, scaled by the square root of D_k to prevent the computed values from becoming too large, and then normalized with the softmax function; the final output is obtained as the weighted sum of V (Equation (11)):

$$Q = X W_Q \tag{8}$$

$$K = X W_K \tag{9}$$

$$V = X W_V \tag{10}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{D_k}}\right) V \tag{11}$$

where $W_Q$, $W_K$, and $W_V$ are learnable projection matrices and $D_k$ is the dimension of the key vectors. In this paper, the self-attention mechanism was used to improve the feature extraction performance of the GRU for bearing vibration signals.
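A minimal single-head PyTorch sketch of this scaled dot-product self-attention, mirroring Equations (8)–(11), is shown below; the layer dimensions are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention."""
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)  # Q = X W_Q
        self.w_k = nn.Linear(d_model, d_k, bias=False)  # K = X W_K
        self.w_v = nn.Linear(d_model, d_k, bias=False)  # V = X W_V

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # Scaled similarity, softmax normalization, weighted sum of V.
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return torch.softmax(scores, dim=-1) @ v

attn = SelfAttention(d_model=64, d_k=64)
out = attn(torch.randn(8, 100, 64))  # (batch, time, features) -> same shape
```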
The optimal placement of the self-attention layer is often task-dependent. For vibration signal analysis, the initial layers of a network learn to detect low-level temporal features and transient patterns that are highly discriminative for fault diagnosis. Positioning the self-attention mechanism at these early stages (e.g., L1 or L2) allows the model to directly emphasize crucial time-points in the raw input signal, such as the impulsive excitations caused by a fault striking a surface. This early, dynamic filtering of information helps to suppress noise and irrelevant background vibrations before further processing, effectively enhancing the signal-to-noise ratio for subsequent GRU layers. When placed in deeper layers, the self-attention mechanism operates on higher-level, more abstract features, which may have already lost the precise temporal resolution needed to highlight short-duration fault impacts, leading to a potential decrease in performance.
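To illustrate these placement choices, the following sketch (an illustrative assumption, reusing the SelfAttention module from the previous block) exposes the attention position as a parameter, where position 0 corresponds to the early placement applied to the raw input features.

```python
import torch
import torch.nn as nn

class GRUWithAttention(nn.Module):
    """GRU stack with a self-attention layer at a configurable depth:
    position 0 applies it to the projected input (early placement);
    position k applies it after the k-th GRU layer (deeper placement).
    Sizes are illustrative placeholders."""
    def __init__(self, n_classes=10, hidden=64, layers=2, attn_pos=0):
        super().__init__()
        self.attn_pos = attn_pos
        self.proj = nn.Linear(1, hidden)  # lift scalar samples to d_model
        self.grus = nn.ModuleList(
            [nn.GRU(hidden, hidden, batch_first=True) for _ in range(layers)])
        self.attn = SelfAttention(d_model=hidden, d_k=hidden)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        h = self.proj(x.unsqueeze(-1))  # (batch, time, hidden)
        if self.attn_pos == 0:
            h = self.attn(h)            # early placement on raw features
        for i, gru in enumerate(self.grus, start=1):
            h, _ = gru(h)
            if self.attn_pos == i:
                h = self.attn(h)        # deeper placement
        return self.head(h[:, -1, :])

model = GRUWithAttention(attn_pos=0)   # early placement, as favored above
scores = model(torch.randn(8, 100))    # (8, 10) class scores
```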
3. Results
This section comprises three parts:
- (1)
The diagnosis results based on the GRU and Bayesian optimization, together with an analysis of the optimized GRU structure and hyperparameters.
- (2)
Bayesian optimization applied to the CNN and LSTM, with their diagnosis results compared against those of the GRU.
- (3)
Application of the proposed method to fault diagnosis of the motor at different rotation speeds.
3.1. Bayesian Optimization and Diagnosis Results
The process of Bayesian optimization is shown in Figure 7. The results show that as the number of optimization iterations increases, the detection error of the GRU gradually decreases; the final diagnosis error was only about 0.021.
This paper then analyzes the optimized parameters (the number of layers, learning rate, mini-batch size, and maximum epochs of the GRU). The results are shown in Figure 8: (1) The error was small when the GRU had one or two layers, reaching its minimum with two layers, and grew larger with three layers. (2) A small diagnosis error was obtained when the learning rate was close to 0, with the best diagnosis performance at a learning rate of 0.0436. (3) The influence of the mini-batch size on the diagnosis error showed no obvious pattern; the best diagnosis performance was obtained with a mini-batch size of 25. (4) The lowest diagnosis errors were observed when the maximum number of epochs was set to values around 20; the Bayesian optimization algorithm identified the exact optimal value as 21. The optimal network structure is shown in Figure 9.
The best network structure of the GRU and the training process under its optimized hyperparameters are shown in Figure 10a. After nine epochs of training, the network reached a relatively stable state, and the detection accuracy on the final testing samples was 97.9% (Figure 10b).
3.2. Comparative Study
To demonstrate that the combination of Bayesian optimization and the GRU has excellent detection performance, this paper compares it with the combinations of Bayesian optimization + LSTM and Bayesian optimization + CNN. The optimization processes of the three methods are shown in Figure 11. The results show that Bayesian optimization + CNN had the fastest error reduction at the beginning and maintained a relatively stable error in the later stage; Bayesian optimization + LSTM ranked second, with a slight downward error trend in the later stage; and although the error of Bayesian optimization + GRU began to decline later, it maintained a steady downward trend in the later stage and ultimately achieved the best detection performance.
The best network structure of the LSTM and the training process under its optimized hyperparameters are shown in Figure 12a. After 11 epochs of training, the network reached a relatively stable state, and the detection accuracy on the final testing samples was 95.1% (Figure 12b).
The best network structure of the CNN and the training process under its optimized hyperparameters are shown in Figure 13a. Even after 1250 epochs of training, the network had still not reached a relatively stable state, and the detection accuracy on the final testing samples was only 90.0% (Figure 13b).
Figure 14 shows the confusion matrices of the best detection results of Bayesian optimization + GRU, Bayesian optimization + LSTM, and Bayesian optimization + CNN. The results show that: (1) The GRU has an excellent diagnosis effect on all fault states; its worst performance was on State 1, with 12 of 200 samples misclassified. (2) The diagnosis performance of the LSTM on States 3, 7, and 9 was not ideal, with 14 samples misclassified in State 3, 24 in State 7, and 17 in State 9. (3) The CNN's diagnosis performance on States 4, 5, 7, and 9 was not ideal, with 40 samples misclassified in State 4, 50 in State 5, 28 in State 7, and 52 in State 9.
Figure 15 compares the fault diagnosis accuracy and inference time of the three methods. The results show that (1) the GRU has the highest accuracy (97.9%), which is 2.8% and 7.9% higher than the LSTM and CNN, respectively; and (2) the GRU has the shortest inference time (0.27 s), which is 0.02 s and 0.1 s lower than the LSTM and CNN, respectively. In general, the GRU offers high diagnosis accuracy at low diagnosis cost.
While Bayesian optimization finds optimal hyperparameters with far fewer evaluations than methods such as grid or random search, it is still a computationally intensive process. The expense arises primarily from the repeated training and validation of the GRU model for each set of hyperparameters proposed by the Bayesian algorithm.
The optimization process for the GRU model in this study, which involved searching across five hyperparameters over 100 iterations, required approximately 1.5 h to complete. This was performed on a workstation equipped with an NVIDIA GeForce RTX 3080 GPU, an Intel Core i7-12700K CPU, and 32 GB of RAM, using MATLAB R2024a with its built-in Bayesian optimization framework (the bayesopt function).
Although this initial investment in computational time is significant, it is a one-time cost. The resulting optimally configured model then provides superior performance and can be deployed for inference very rapidly (as shown in Figure 15, the inference time was only 0.27 s). This trade-off is often favorable in industrial and research settings where high diagnostic accuracy and reliability are paramount. Furthermore, this process is vastly more efficient than a manual hyperparameter search, which is often intractable and unlikely to find a globally optimal configuration.
3.3. Influence of Self-Attention Layer on Diagnosis Results
The location of the self-attention layer within a deep learning model can lead to variations in the results. This paper therefore analyzes GRU networks based on self-attention mechanisms, aiming to determine the optimal GRU network for diagnosing bearing faults. Self-attention layers were added at different locations to compare diagnosis performance, as illustrated in Figure 16. The diagnosis results, shown in Table 2, indicate that as the self-attention layer is placed deeper within the GRU, the accuracy gradually decreases (to a minimum of 98%), yet remains higher than the diagnosis accuracy of the original GRU (97.9%). This also confirms that when the self-attention layer is placed closer to the input layer of the raw data, it can capture more vibration signal features, thereby improving the fault diagnosis accuracy of the GRU.
This performance trend can be attributed to the role of early-layer attention in directly highlighting primary fault features from the input signal. Visualization of the attention weights for the L1 placement confirmed that the mechanism successfully learned to assign high weights to transient peaks corresponding to fault-induced impacts. As the self-attention layer was moved deeper, the focus became more dispersed over time, leading to a gradual dilution of feature selectivity and a corresponding drop in accuracy. This ablation study confirms that for vibration-based fault diagnosis, applying self-attention to low-level features is most effective for capturing the short-duration, high-impact patterns characteristic of bearing faults.
3.4. Scenario Promotion
To validate the detection performance of the proposed method across a variety of scenarios, the diagnosis task was also carried out at motor speeds of 1772, 1750, and 1730 r/min. The diagnosis accuracy of the GRU was 97.3%, 97.9%, and 98.3%, respectively (Figure 17), which is highly encouraging. The detailed diagnosis results of the GRU, LSTM, and CNN at different speeds are shown in Table 3. In general, the GRU obtains more accurate diagnosis results than the other algorithms.
3.5. Analysis of Computational Cost
A critical consideration for the practical deployment of any intelligent fault diagnosis system is its computational efficiency (as shown in Table 4). While the proposed Bayesian-optimized GRU model demonstrates superior inference speed (0.27 s), the computational cost associated with the Bayesian optimization (BO) process itself must be acknowledged. The BO process for the GRU model, which involved 100 iterations of training and validation, required approximately 1.5 h to complete on our hardware setup. This represents a significant, but one-time, offline investment in model development.
This cost is a trade-off for automating the search for an optimal model configuration that would be intractable to find manually. The result of this process is a highly efficient model whose low-latency inference makes it ideally suited for real-time industrial applications. Future work will explore techniques such as transfer learning to reduce this initial optimization overhead, thereby enhancing the overall feasibility of the approach for rapid deployment across industrial assets.
3.6. Analysis of Misclassifications
A deeper analysis of the confusion matrices in Figure 14 reveals specific patterns that illuminate the models' limitations. For the proposed GRU model, the highest number of misclassifications occurred in State 1 (inner ring fault, 0.007-inch diameter). This is likely because a small, incipient fault generates a very weak signal that is easily contaminated by background noise, making its features difficult to distinguish from the normal state (State 0) or other low-severity faults.
The LSTM model showed pronounced difficulties with States 3, 7, and 9. State 3 (Inner Ring fault, 0.021-inch diameter) and State 9 (Outer Ring fault, 0.021-inch diameter) both represent large-diameter faults. It is possible that the LSTM’s more complex gating mechanism made it prone to overfitting on certain features of these pronounced faults, reducing its generalization ability to the test set. Its poor performance on State 7 (Outer Ring fault, 0.007-inch diameter), a small-diameter fault, further suggests a potential lack of robustness in handling features across vastly different fault severities.
The CNN’s errors were heavily concentrated in fault states related to the ball (States 4, 5) and the outer ring (States 7, 9). This may indicate a limitation of the CNN architecture in effectively modeling the long-range temporal dependencies and specific modulation patterns characteristic of these fault types, which are more readily captured by recurrent architectures like GRU and LSTM.
These observations suggest that while the proposed GRU model is superior overall, its performance could be further improved by incorporating techniques to enhance feature learning from low-signal, noisy samples (e.g., advanced data augmentation, denoising autoencoders).
4. Discussion
This study proposes a novel BFD method integrating a GRU network, Bayesian hyperparameter optimization, and a self-attention mechanism. The results demonstrated its superior performance. The key findings can be interpreted as follows: Firstly, the GRU model, after Bayesian optimization, achieved an accuracy of 97.9%, outperforming similarly optimized LSTM (95.1%) and CNN (90.0%) models. This superiority can be attributed to the GRU’s inherent efficiency in capturing temporal dependencies in vibration signal data with a simpler structure than LSTM, making it less prone to overfitting on this specific task and dataset scale. Secondly, the integration of the self-attention mechanism further boosted the accuracy to 99.6%. The mechanism allows the model to dynamically weigh the importance of different time steps in the vibration signal, effectively highlighting crucial fault-related features and mitigating the influence of less informative noise. The finding that placing the attention layer closer to the input yielded the best performance suggests its primary role is in enhancing low-level feature extraction. Thirdly, the model maintained high accuracy (>97%) across different motor speeds (1772 rpm, 1750 rpm, and 1730 rpm), demonstrating its robustness to variations in operational conditions—a critical factor for practical applications. The method presents a strong balance of high accuracy, computational efficiency in inference, and robustness.
5. Conclusions
This paper presents a BFD method based on the GRU and Bayesian optimization. Bayesian optimization was used to optimize the network structure and hyperparameters of the GRU until the most suitable configuration was obtained, which significantly improved the accuracy of the GRU in BFD. Comparison with other strong detection algorithms (LSTM and CNN) highlights the excellent performance of the GRU in BFD. Subsequently, the self-attention mechanism was used to further improve the performance of the GRU. Finally, the method was applied to other scenarios (bearings at different rotation speeds).
According to the above research results, the following conclusions are drawn:
Bayesian optimization can significantly improve the diagnosis accuracy of GRU, and the optimal accuracy was 97.9%;
The diagnosis accuracy of GRU was 2.8% higher than that of LSTM and 7.9% higher than that of the CNN;
With the self-attention mechanism, the diagnostic accuracy of GRU has reached 99.6%, and the optimal position of the self-attention layer has been determined.
Under different bearing rotation speeds, the GRU with self-attention mechanisms still has excellent diagnosis performance, and the diagnosis accuracy of all scenarios was higher than 97%.
Despite the promising results, this study has certain limitations. The primary limitation is that the validation was conducted primarily on a widely used but laboratory-based benchmark dataset (CWRU). Its performance on data from more diverse industrial environments, with higher noise levels and different bearing types, needs further investigation.
Based on these conclusions and limitations, future work will focus on the following:
Validating the proposed method on additional real-world industrial datasets to comprehensively assess its generalizability and robustness.
Extending the application to more complex fault scenarios, such as compound faults and gradually worsening faults.
Exploring knowledge transfer or meta-learning techniques to reduce the computational cost of the Bayesian optimization process for new tasks.
Incorporating explainable AI (XAI) techniques to interpret the model’s decision-making process, thereby increasing trust and facilitating its adoption by domain experts.