1. Introduction
Fault diagnosis is a critical technology for the maintenance of industrial equipment, particularly in the context of smart manufacturing and Industry 4.0, where timely and accurate fault detection can substantially reduce downtime and maintenance costs [1]. Traditional fault diagnosis methods typically rely on large volumes of labeled data; however, in practical applications, obtaining sufficient fault data is often challenging, especially for specific fault conditions in critical equipment. This is due to the high cost and complexity of data acquisition, leading to the problem of sample scarcity.
In industrial environments, operating conditions such as speed, load, and temperature often differ significantly between source and target domains, leading to substantial distribution shifts in vibration signals. These differences make it challenging to collect sufficient labeled data for each specific condition, exacerbating sample scarcity. Traditional fault diagnosis methods, which rely on large amounts of labeled data for each condition, are often inadequate in such scenarios. This highlights the need for robust transfer learning approaches that can leverage data from one condition to improve diagnostic performance in another, even when labeled data are limited.
Conventional machine-learning-based fault diagnosis typically involves three key stages: data collection, feature extraction, and fault classification [2,3]. However, the manual nature of feature extraction often leads to the inclusion of irrelevant or redundant features, which can degrade the accuracy of the classification models [4]. Common classification algorithms include decision trees, support vector machines (SVM), and neural networks [5]. For example, Amarnath et al. [6] proposed a decision tree-based method for bearing fault diagnosis, while Konar and Chattopadhyay [7] applied SVM to analyze vibration signals from asynchronous motor bearings. Similarly, Tian et al. [8] integrated feature extraction with k-nearest neighbor (k-NN) distance analysis to achieve accurate motor bearing fault detection. Other notable approaches include the combination of empirical mode decomposition (EMD) energy entropy with neural networks for rolling bearing fault diagnosis [9], and the integration of particle swarm optimization (PSO) with hidden Markov models (HMM) for automatic bearing fault classification [10]. Despite their effectiveness, traditional machine learning methods often require labor-intensive manual feature selection, which becomes increasingly inefficient when dealing with large-scale data.
To reduce the cost of manual feature extraction, researchers have begun exploring deep learning methods, which can automatically learn features from data and efficiently process complex motor data for precise fault diagnosis [4,11]. Common deep learning models include convolutional neural networks, recurrent neural networks, restricted Boltzmann machines, autoencoders, and deep belief networks. Chen et al. [12] investigated a comparative diagnosis approach for motor faults using convolutional neural networks (CNN) and long short-term memory (LSTM) networks, assessing their effectiveness in fault diagnosis. Shao et al. [13] introduced a novel fault diagnosis method for electric locomotive bearings based on a convolutional deep belief network (CDBN). Chen et al. [14] proposed a data fusion model for bearing fault diagnosis, combining a sparse autoencoder with a deep belief network (DBN); the experimental results demonstrated the model’s high accuracy and robustness. Chu et al. [15] developed a bearing fault diagnosis method utilizing a Gaussian restricted Boltzmann machine (GRBM), which effectively captures latent information from fault signals, improving both accuracy and stability. Zhang et al. [16] presented a recurrent Kalman variational autoencoder (RKVAE) for monitoring complex dynamic processes, enhancing fault detection effectiveness. Shao et al. [17] introduced a novel feature learning approach based on a deep autoencoder for fault diagnosis in rotating machinery. However, the outstanding performance of deep learning typically hinges on large amounts of labeled training data, which are often scarce in real industrial environments, directly affecting the robustness and generalization ability of deep learning models.
Furthermore, deep learning models often assume that the training and testing data share the same distribution. However, in real-world applications, variations in operating conditions, noise, and fault severity frequently violate this assumption, making it challenging to maintain model performance when data are scarce or conditions change [18]. Consequently, addressing the problem of sample scarcity under variable conditions has become a crucial research focus.
To overcome this challenge, transfer learning has emerged as a promising solution, enabling the transfer of knowledge from a source domain with abundant data to a target domain with limited data. By aligning the feature distributions between source and target conditions, transfer learning can mitigate the impact of distribution shifts caused by cross-condition differences, thereby improving diagnostic accuracy even when labeled data are scarce. The primary strategies in transfer learning include fine-tuning, statistical methods, and adversarial approaches [5].
Fine-tuning approaches utilize a diagnostic model from the source domain to transfer learned models or parameters to new target scenarios. When retraining the model for the target task, a relatively small learning rate is typically employed, instead of training from scratch. Chen et al. [19] utilized open-source datasets to pre-train deep learning models, followed by fine-tuning on available data collected from real industrial scenarios for the target task. Zhao et al. [20] proposed a transfer learning framework based on a deep multi-scale convolutional neural network (CNN), optimizing the fault diagnosis model through dilated convolutions and global average pooling; the model achieved efficient transfer across different tasks via fine-tuning, enhancing its ability to detect rolling bearing faults under complex conditions. Shao et al. [21] explored the application of non-mechanical datasets, such as ImageNet, to pre-train transfer learning models; the top layers of the pre-trained model were replaced to match the number of target labels, and the model was fine-tuned using bearing fault data for fault diagnosis. In bearing fault diagnosis, it is common for the source and target domains to have different fault labels. To address this label inconsistency, Zhiyi et al. [22] applied a fine-tuning approach that replaces the output layer of the pre-trained model with a new output layer matching the dimensions of the target labels.
The basic idea behind statistics-based transfer learning is to learn domain-invariant representations by minimizing the distributional differences between the source and target domains. Guo et al. [23] proposed a neural-network-based transfer learning method for fault diagnosis, demonstrating that statistics-based transfer learning methods can improve classification accuracy and reduce training time when only a small amount of target data are available. Zhao et al. [24] introduced a novel transfer learning method using bidirectional gated recurrent units (BiGRU) and manifold-embedded distribution alignment, which proved effective with only a limited amount of labeled data. Li et al. [25] presented a two-stage knowledge transfer scheme to address knowledge transfer between different machines. Zhou et al. [26] suggested that, during domain adaptation, conditional distributions should be considered in addition to marginal distributions.
Adversarial-based methods, inspired by generative adversarial networks (GANs), aim to extract domain-invariant features by designing classifiers that learn from both source and target domain data [27,28]. Li et al. [29] proposed a deep-learning-based transfer learning method, where adversarial domain training is used to transfer diagnostic knowledge from the supervised data of multiple rotating machines to the target device, thus improving fault diagnosis performance in cases of insufficient training data. Han et al. [30] introduced a multi-domain discriminator to enhance domain-invariant feature extraction, further improving fault diagnosis performance.
While transfer learning exhibits significant potential in various scenarios, direct transfer across different operating conditions often encounters challenges, due to considerable differences in data distribution. Simply transferring a source condition model to a target condition may lead to a significant performance degradation or even negative transfer. Moreover, in cases of extreme sample scarcity, models may become prone to overfitting, hindering effective generalization. Thus, the ability to leverage rich data from source conditions for efficient transfer diagnosis under sample scarcity remains a crucial research topic in fault diagnosis.
To address these challenges, this paper proposes a novel transfer learning framework, the TTLN, which focuses on transferring fault diagnosis models between different operating conditions. This framework leverages the Transformer deep learning model, which has achieved remarkable success in natural language processing. Its core component, the self-attention mechanism, effectively captures long-range dependencies within sequences, demonstrating outstanding performance on sequential data. The attention mechanism’s ability to focus on all input parts simultaneously is particularly advantageous for bearing fault diagnosis. Unlike CNNs, which rely on local receptive fields [31], or RNNs, which process data sequentially, the self-attention mechanism in Transformers can directly model long-range dependencies and capture global context in vibration signals. This is crucial for identifying fault-related patterns that may span distant time steps or manifest in different frequency ranges. The framework employs statistical strategies for model transfer, aligning both marginal and conditional distributions, ensuring that even with scarce samples in the target condition, the fault knowledge from the source condition can be fully utilized to maintain high diagnostic accuracy. This method not only mitigates the negative impact of data distribution differences but also alleviates the limitations posed by sample scarcity in model training. The effectiveness of the proposed TTLN model is validated through bearing fault diagnosis examples, highlighting its advantages in scenarios with limited samples.
The main contributions of this study are as follows:
1. A novel TTLN framework is proposed for efficient transfer learning in bearing fault diagnosis, specifically addressing scenarios with limited samples.
2. Experimental validation demonstrated the effectiveness of the TTLN model in multiple bearing conditions, particularly in maintaining high diagnostic precision when samples of target conditions are scarce.
3. This research provides a new perspective on the management of cross-condition fault diagnosis and sample scarcity issues, with significant theoretical and practical application value.
The remainder of this paper is organized as follows. Section 2 introduces the design and implementation of the Transformer-based bearing fault diagnosis model. Section 3 presents the TTLN model, detailing its architecture, transfer strategies, and implementation details. Section 4 provides experimental results, demonstrating the efficacy of the TTLN model in bearing fault diagnosis. Finally, Section 5 concludes the paper and discusses potential future research directions.
2. Fault Diagnosis of Motor Bearings Under Single Operating Conditions Based on Transformer Models
In motor bearing fault diagnosis, traditional methods typically rely on manually designed features and machine learning models for classification, whereas Transformer models demonstrate significant advantages in processing time series data, owing to their powerful attention mechanisms [32]. The core of the Transformer architecture lies in the self-attention mechanism [33], which allows simultaneous focus on all parts of the input sequence, effectively capturing long-range dependencies within the signal. Unlike traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), Transformers do not require sequential processing, thus avoiding issues such as vanishing gradients over long time steps. Furthermore, Transformers do not depend on extensive feature engineering and can directly extract useful features from raw vibration signals through model learning.
Transformers, through their self-attention mechanism, can learn patterns directly from raw sequential data, eliminating the need for manual feature extraction. This stands in contrast to traditional methods, which often involve pre-processing steps such as signal decomposition or handcrafted feature extraction (e.g., frequency components or statistical features). The self-attention mechanism allows Transformers to automatically learn hierarchical representations, making them particularly effective for complex time-series data, such as vibration signals in fault diagnosis tasks. The TTLN algorithm integrates a Transformer-based fault diagnosis model with transfer learning techniques to address the challenges of data scarcity and domain shifts. It employs domain adaptation strategies, such as marginal and conditional distribution alignment, to ensure robust performance even when target condition data are limited.
Bearing fault diagnosis is fundamentally a multiclass classification problem. For a single-operating-condition bearing fault diagnosis task, let the given input dataset be $X = \{x_i\}_{i=1}^{n}$, where $x_i$ denotes the $i$th sample, and let the corresponding label space be $Y = \{y_i\}_{i=1}^{n}$, where $y_i$ represents the label associated with the $i$th sample.
The application of Transformer models in motor bearing fault diagnosis primarily involves the processing and feature extraction of vibration signals. A Transformer-based model for diagnosing motor bearing faults in a single operating condition is illustrated in Figure 1. This model first segments the vibration signal into multiple time-step input sequences and then converts each time-step signal into high-dimensional vectors that the model can process through an embedding layer. In this process, positional encoding is added to the input sequences to retain temporal information.
Subsequently, the vibration signals pass through multiple layers of self-attention, where each layer focuses on different time-step features according to varying weight distributions. This enables the model to effectively capture the characteristics of vibration signals under different fault modes. The self-attention mechanism enables the model to assign attention weights to all time steps in the input signal simultaneously, allowing it to focus on the most informative parts of the signal, while filtering out irrelevant noise. This capability is particularly important for bearing fault diagnosis, where fault characteristics (e.g., inner race faults, outer race faults) may exhibit complex patterns across different time scales and frequency ranges. For instance, the vibration characteristics of inner race and outer race faults are typically distributed across different frequency ranges, which the Transformer can capture simultaneously through its multi-head attention mechanism.
After processing through the self-attention mechanism, the signals are forwarded to a feedforward network (FFN), also known as a multilayer perceptron (MLP). The MLP consists of two linear layers and a GeLU activation function; the first layer expands the input dimensions, while the second layer reduces them back to the original dimensions. The primary role of the MLP is to further combine and refine the features extracted from the self-attention mechanism through nonlinear transformations, enhancing the model’s classification capability.
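To make this architecture concrete, the following is a minimal PyTorch sketch of a single-condition diagnosis model of this kind; the layer sizes, segment lengths, and the use of `nn.TransformerEncoder` are illustrative assumptions rather than the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class VibrationTransformer(nn.Module):
    """Minimal sketch of a Transformer classifier for segmented vibration signals.

    Assumed input shape: (batch, n_steps, step_len), e.g., a 1024-point
    sample split into 16 time steps of 64 points each (illustrative choice).
    """
    def __init__(self, step_len=64, n_steps=16, d_model=128,
                 n_heads=4, n_layers=4, n_classes=10, dropout=0.1):
        super().__init__()
        # Embedding layer: project each time-step segment to d_model dimensions.
        self.embed = nn.Linear(step_len, d_model)
        # Learnable positional encoding to retain temporal order.
        self.pos = nn.Parameter(torch.zeros(1, n_steps, d_model))
        # Encoder layers: self-attention followed by a GeLU feedforward
        # network that expands to 4*d_model and reduces back to d_model.
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def features(self, x):                 # x: (batch, n_steps, step_len)
        h = self.embed(x) + self.pos       # add positional information
        return self.encoder(h).mean(dim=1) # pool over time steps

    def forward(self, x):
        return self.head(self.features(x)) # class logits
```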
To accommodate the specific scenarios of motor bearing fault diagnosis, the model’s loss function employs cross-entropy loss combined with L2 regularization to prevent overfitting. Additionally, adjustments to the learning rate and batch size are made to optimize the model’s training process, ensuring that it can effectively learn vibration signal features under different fault modes.
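As a sketch, this loss and optimizer configuration might look as follows in PyTorch, with L2 regularization expressed through the optimizer’s weight-decay term; the specific learning rate and decay values are illustrative assumptions.

```python
import torch

model = VibrationTransformer()             # from the sketch above
criterion = torch.nn.CrossEntropyLoss()    # cross-entropy classification loss
# L2 regularization enters through weight_decay; the values are illustrative.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-4)
```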
In a single operating condition, the Transformer model achieves satisfactory fault diagnosis results. In this controlled environment, the patterns of vibration signals are relatively stable, making fault-type characteristics more apparent. The Transformer quickly and accurately captures key information in fault signals through its self-attention mechanism, thereby reducing the rate of misclassification.
However, in practical applications, variations in operating conditions (such as load, speed, and temperature) may lead to changes in the vibration signal patterns, which are not accounted for in a single operating condition model. Consequently, while the Transformer performs excellently under controlled conditions, its applicability remains limited. To address this issue, subsequent chapters will explore a transfer learning-based approach for multi-condition fault diagnosis, to enhance the model’s generalization capability and adapt to more complex industrial scenarios.
3. Fault Diagnosis of Motor Bearings Under Multiple Operating Conditions Based on Transfer Learning
Fault diagnosis of motor bearings under multiple operating conditions presents greater challenges, due to the significant variations in operating conditions, such as changes in speed, load, and temperature, in practical industrial applications. These factors cause alterations in the distribution and pattern of vibration signals across different operating conditions, making it difficult for models trained under a single condition to generalize effectively. Traditional machine learning and deep learning methods typically require large amounts of data across multiple conditions for training, but in many real-world scenarios, data collection is costly, and data from different conditions are often scarce.
To address this issue, this paper proposes a transfer-learning-based multi-condition fault diagnosis method, utilizing a Transformer model pre-trained under a single operating condition. The TTLN model is designed to adapt to new operating conditions.
In the context of transfer learning, given a feature space $X$ and a marginal distribution $P(x)$, a domain can be expressed as $D = \{X, P(x)\}$. Correspondingly, a task consists of two components: a label space $Y$ and a conditional probability distribution $P(y|x)$. The labeled source domain and the unlabeled target domain datasets are denoted as $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and $D_t = \{x_j^t\}_{j=1}^{n_t}$, respectively. Additionally, the output labels for the source domain task and the target domain task must be consistent to ensure that domain knowledge can be transferred across samples between domains.
The core of transfer learning lies in transferring knowledge learned from the source condition to the target condition [18], improving the model’s generalization capability across different conditions. In this study, domain adaptation techniques and fine-tuning strategies were employed to ensure the high accuracy and robustness of the model under multiple operating conditions.
The key to statistics-based transfer strategies is domain adaptation (DA), which emphasizes reducing the gap between the target and source domains while learning domain-invariant features, to achieve distribution alignment and knowledge transfer between the two domains [34]. The domain adaptation process is implemented through the design of the transfer model’s loss function, which aims to reduce the distribution discrepancies between source and target domain data, including both marginal and conditional distribution discrepancies [35]. The goal of domain adaptation is to minimize the differences between the marginal and conditional distributions while learning feature transformations [36].
In this study, we employ marginal and conditional distribution alignment strategies, specifically using maximum mean discrepancy (MMD) and correlation alignment (CORAL), to bridge the distribution gap between the source and target domains in the motor bearing fault diagnosis task. MMD and CORAL are relatively simple to implement and computationally efficient, making them ideal for real-world industrial applications where fast computation and ease of implementation are critical. While adversarial methods (e.g., GAN-based approaches) and manifold alignment techniques have been successful in domain adaptation tasks, they often require intricate training procedures, are computationally expensive, and may be more difficult to stabilize. For example, adversarial methods involve a generator and discriminator, which can be challenging to optimize, particularly when working with small datasets. Similarly, manifold alignment approaches can be highly sensitive to the choice of alignment functions and may not always scale well to high-dimensional data. In contrast, MMD and CORAL provide a more straightforward and computationally feasible alternative for aligning distributions, particularly in fault diagnosis scenarios, where the focus is on robustness and ease of application. While adversarial methods could potentially improve performance, they would also require careful tuning and may not provide significant benefits in the context of our fault diagnosis task, where the alignment of marginal and conditional distributions is sufficient.
(1) Marginal Distribution Alignment (MDA): In transfer learning, the emphasis is on the proximity of features between the source and target domain data. The aim is to adjust the marginal distributions in the feature space to make the marginal distributions of the source and target domains as similar as possible, thereby improving model performance in the target domain. In this section, MMD and CORAL are combined to create a new marginal distribution discrepancy (MDD) metric, as shown in Equation (1):
$$ \mathrm{MDD}(X_s, X_t) = \mathrm{MMD}(X_s, X_t) + \mathrm{CORAL}(X_s, X_t), \tag{1} $$
where $X_s$ is the feature space of the source domain, and $X_t$ is the feature space of the target domain.
MMD is the most commonly used distribution distance metric in transfer learning tasks for measuring marginal distribution alignment, defined as shown in Equation (2):
$$ \mathrm{MMD}(X_s, X_t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(x_j^t) \right\|_{\mathcal{H}}^2, \tag{2} $$
where $n_s$ and $n_t$ denote the batch sizes of the source and target domain samples, respectively; $\mathcal{H}$ represents the reproducing kernel Hilbert space (RKHS); and $\phi(\cdot)$ denotes the mapping into the Hilbert space, typically determined by a kernel function.
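A biased empirical estimate of Equation (2) can be written compactly in PyTorch; the Gaussian (RBF) kernel and the bandwidth below are assumptions, since the kernel choice is not specified here.

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    """RBF kernel matrix between two batches of feature vectors."""
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd(xs, xt, sigma=1.0):
    """Biased empirical estimate of Equation (2): squared MMD in the RKHS
    induced by a Gaussian kernel (an assumed kernel choice)."""
    k_ss = gaussian_kernel(xs, xs, sigma).mean()   # source-source term
    k_tt = gaussian_kernel(xt, xt, sigma).mean()   # target-target term
    k_st = gaussian_kernel(xs, xt, sigma).mean()   # cross term
    return k_ss + k_tt - 2 * k_st
```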
CORAL measures the covariance alignment by calculating the Frobenius-norm difference between the covariance matrices of the source and target distributions, as shown in Equation (3):
$$ \mathrm{CORAL}(X_s, X_t) = \frac{1}{4d^2}\left\| C_s - C_t \right\|_F^2, \tag{3} $$
where $C_s$ and $C_t$ are the covariance matrices of the source and target domain features, and $d$ is the feature dimension.
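Equation (3) translates directly into code; this sketch follows the common CORAL normalization by $4d^2$, which is an assumption about the exact scaling used.

```python
def coral(xs, xt):
    """CORAL loss per Equation (3): Frobenius-norm difference between the
    source and target feature covariance matrices."""
    d = xs.size(1)
    cs = torch.cov(xs.T)                  # (d, d) source covariance
    ct = torch.cov(xt.T)                  # (d, d) target covariance
    return ((cs - ct) ** 2).sum() / (4 * d * d)
```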
(2) Conditional Distribution Alignment (CDA): This addresses class-level discrepancies between the source and target domains, aiming to resolve the mismatch in data distribution between the two domains, as shown in Equation (4):
$$ \mathrm{CDD}(X_s, X_t) = \frac{1}{C}\sum_{c=1}^{C}\left[ \mathrm{MMD}_c(X_s^c, X_t^c) + \mathrm{CORAL}_c(X_s^c, X_t^c) \right], \tag{4} $$
where $X_s^c$ and $X_t^c$ are feature vectors in the source and target domain feature spaces belonging to class $c$; $\mathrm{MMD}_c$ and $\mathrm{CORAL}_c$ represent the MMD and CORAL values for class $c$; and $C$ is the total number of classes. CDA aligns the conditional distributions between the source and target domains within the same class during the training process, reducing domain discrepancies and enhancing the model’s generalization ability in the target domain.
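A minimal sketch of the class-wise alignment in Equation (4), reusing the `mmd` and `coral` helpers above; following the pseudo-labeling described later, the target labels are taken to be model predictions, and skipping under-populated classes within a mini-batch is an implementation assumption.

```python
def cda(xs, ys, xt, yt_pseudo, n_classes):
    """Class-wise alignment per Equation (4): average of per-class MMD and
    CORAL terms; target labels are pseudo-labels, as in the TTLN setup."""
    loss, used = 0.0, 0
    for c in range(n_classes):
        xs_c, xt_c = xs[ys == c], xt[yt_pseudo == c]
        if len(xs_c) < 2 or len(xt_c) < 2:   # skip classes absent in a batch
            continue
        loss = loss + mmd(xs_c, xt_c) + coral(xs_c, xt_c)
        used += 1
    return loss / max(used, 1)
```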
(3) Integrated Loss Function: The optimization objective in transfer-learning-based multi-condition fault diagnosis models is the integrated loss function of the transfer learning network, which consists of several components: the marginal distribution discrepancy, the conditional distribution discrepancy, the classification loss for the source domain, and the classification loss for the target domain [37]. The integrated loss function is a linear combination of these components, where the marginal distribution discrepancy and source domain classification loss are typically assigned higher weights, while the conditional distribution discrepancy and target domain classification loss are given lower weights, to better regulate the model’s performance and generalization, as shown in Equation (5):
$$ \mathcal{L} = \alpha \mathcal{L}_s + \beta \, \mathrm{MDD} + \gamma \, \mathrm{CDD} + \delta \mathcal{L}_t. \tag{5} $$
The integrated loss function is designed to balance multiple objectives, ensuring effective domain adaptation and robust fault diagnosis. Since the source domain contains a complete dataset and label space, it serves as the foundation for feature transfer, and thus the source domain classification loss ($\mathcal{L}_s$) is assigned a higher weight ($\alpha$) to prioritize learning robust features from the source domain. MDA, computed using MMD and CORAL, is the core of domain adaptation: the term $\beta \, \mathrm{MDD}$ minimizes the discrepancy between the marginal distributions of the source and target domains, promoting feature proximity and enabling effective knowledge transfer. CDD, represented by $\gamma \, \mathrm{CDD}$, ensures class-level discrimination by aligning the conditional distributions of the source and target domains within each class; this prevents misclassification due to overly small class distances and enhances the model’s ability to generalize across domains. In the target domain, pseudo-labeling techniques are employed to compute the classification loss ($\mathcal{L}_t$). However, due to potential errors in pseudo-labels compared to actual labels, the weight of this term ($\delta$) is reduced to enhance the model’s robustness. By carefully balancing these components, the integrated loss function ensures that the model prioritizes domain adaptation while maintaining high classification accuracy, even in scenarios with limited target domain data.
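Putting the pieces together, Equation (5) can be sketched as a weighted sum of the four terms; the specific values of $\alpha$, $\beta$, $\gamma$, and $\delta$ below are illustrative, chosen only to reflect the weighting priorities described above.

```python
import torch.nn.functional as F

def ttln_loss(logits_s, ys, logits_t, yt_pseudo,
              feats_s, feats_t, n_classes,
              alpha=1.0, beta=1.0, gamma=0.5, delta=0.1):
    """Equation (5) as a weighted sum; the weights are illustrative, with
    the source loss and MDA terms weighted above the pseudo-labeled
    target loss, as described in the text."""
    l_src = F.cross_entropy(logits_s, ys)                     # source CE loss
    l_tgt = F.cross_entropy(logits_t, yt_pseudo)              # pseudo-label CE loss
    l_mda = mmd(feats_s, feats_t) + coral(feats_s, feats_t)   # Equation (1)
    l_cda = cda(feats_s, ys, feats_t, yt_pseudo, n_classes)   # Equation (4)
    return alpha * l_src + beta * l_mda + gamma * l_cda + delta * l_tgt
```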
The complete TTLN model is shown in Figure 2. The Transformer is used as the feature extractor in the TTLN model. After pre-training the model with source domain data, the TTLN model is fine-tuned with a small amount of data from the target condition. During fine-tuning, only a portion of the Transformer model’s parameters (the self-attention and MLP layers) are updated, while other layers remain unchanged. This approach reduces dependency on the target condition data and improves the model’s generalization capability, enabling it to perform well under new conditions. To prevent overfitting, appropriate regularization methods (L2 regularization and dropout) are applied during fine-tuning.
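A sketch of this partial fine-tuning in PyTorch, assuming the parameter naming of the `nn.TransformerEncoder`-based model above (where `self_attn`, `linear1`, and `linear2` are the self-attention and MLP sub-layers):

```python
# Freeze everything, then unfreeze only the self-attention and MLP
# (feedforward) sub-layers for fine-tuning on the target condition.
for name, p in model.named_parameters():
    p.requires_grad = any(k in name for k in ("self_attn", "linear1", "linear2"))

# Rebuild the optimizer over the trainable subset; L2 regularization is
# again applied via weight_decay (values are illustrative).
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-4, weight_decay=1e-4)
```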
To optimize the performance of the TTLN model, we performed a grid search of key hyperparameters, including the number of epochs, learning rate, and batch size. The search space for each hyperparameter was determined based on preliminary experiments and prior knowledge. The optimal hyperparameters were selected based on validation accuracy and training stability.
Regarding the discussion of negative transfer, the TTLN indirectly reduces the risk of negative transfer through the following mechanisms. By simultaneously aligning both marginal and conditional distributions using MMD and CORAL, the TTLN ensures that the feature distributions between the source and target domains are well-matched, reducing the risk of a misalignment that could lead to negative transfer. Additionally, the TTLN dynamically adjusts the weights of marginal and conditional distribution alignment losses during training, prioritizing the alignment of more critical features and reducing the impact of irrelevant or misleading features. Furthermore, the Transformer-based feature extractor in the TTLN is inherently robust to distribution shifts due to its self-attention mechanism, which focuses on the most informative parts of the signal, while filtering out noise and irrelevant variations.
To illustrate the training and optimization workflow, Algorithm 1 presents the pseudocode for the TTLN model. This algorithm outlines the key steps, including model initialization; transfer learning strategies; and the iterative process of distribution alignment, loss computation, and parameter updates. The detailed procedure is as follows:
Algorithm 1 TTLN Model for Fault Diagnosis Transfer Learning
1: Input:
2:   $D_s$: Source domain dataset (features and labels)
3:   $D_t$: Target domain dataset (features only)
4:   $N$: Number of training epochs
5:   $\eta$: Learning rate
6:   $\lambda$: Regularization parameter
7: Output:
8:   Trained model for fault diagnosis on the target domain
9:
10: Initialize the source domain model (pre-trained)
11: Initialize the target domain model with random weights
12: for epoch = 1 to $N$ do
13:   for each mini-batch $(x_s, y_s)$ in $D_s$ do
14:     Forward pass: compute source predictions $\hat{y}_s$
15:     Compute the source classification loss $\mathcal{L}_s$
16:     Backward pass: update the parameters $\theta$ using $\eta$ and $\lambda$
17:   end for
18:   for each mini-batch $x_t$ in $D_t$ do
19:     Forward pass: compute target predictions $\hat{y}_t$
20:     Apply the statistical transfer strategy: align the marginal and conditional distributions of $D_s$ and $D_t$
21:     Compute the transfer loss $\mathcal{L}$ using cross-domain alignment (Equation (5))
22:     Update $\theta$ with $\eta$ and $\lambda$
23:   end for
24: end for
25: Return: trained model for target domain fault diagnosis
The pseudocode demonstrates the TTLN model’s iterative training process, which is divided into two distinct phases: source domain pre-training, and target domain fine-tuning. During each epoch, both the source and target domain data are processed. In the first phase, supervised learning is performed on the source domain to learn general fault diagnosis features. In the second phase, the pre-trained model is fine-tuned on the target domain data, with only a portion of the model’s parameters (e.g., self-attention and MLP layers) updated to adapt to the target condition. During fine-tuning, a statistical transfer strategy is applied to align the source and target domain distributions. This strategy involves aligning the marginal and conditional distributions between $D_s$ and $D_t$ using a combination of MMD and CORAL. MMD minimizes the distance between the overall feature distributions of $D_s$ and $D_t$, while CORAL aligns the second-order statistics (covariance) of $D_s$ and $D_t$ within each class. These techniques are integrated into the loss function to ensure robust domain adaptation. By aligning the marginal and conditional distributions using MMD and CORAL, the model reduces the impact of domain shifts and improves the diagnostic accuracy in cross-condition scenarios, even with limited labeled data.
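The two-phase loop can be sketched as follows, reusing the model and loss helpers from the earlier snippets; the `DataLoader` setup and the batch-wise pairing of source and target data via `zip` are implementation assumptions.

```python
import torch.nn.functional as F

def train_ttln(model, source_loader, target_loader, optimizer,
               n_classes=10, n_epochs=300):
    """Sketch of Algorithm 1; loaders yielding (signal, label) batches for
    the source and unlabeled signal batches for the target are assumed."""
    for epoch in range(n_epochs):
        # Phase 1: supervised learning on the labeled source domain.
        for xs, ys in source_loader:
            loss = F.cross_entropy(model(xs), ys)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        # Phase 2: cross-domain alignment with pseudo-labeled target data.
        for (xs, ys), xt in zip(source_loader, target_loader):
            feats_s, feats_t = model.features(xs), model.features(xt)
            logits_s, logits_t = model.head(feats_s), model.head(feats_t)
            yt_pseudo = logits_t.argmax(dim=1).detach()   # pseudo-labels
            loss = ttln_loss(logits_s, ys, logits_t, yt_pseudo,
                             feats_s, feats_t, n_classes)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return model
```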
4. Experimental Study
This section validates the proposed model using the CWRU bearing dataset [38] and the PU bearing dataset [39] under various operating conditions, demonstrating the effectiveness of the proposed method.
4.1. Motor Bearing Fault Diagnosis Under Single Operating Condition
Under a single operating condition, the Transformer-based fault diagnosis model was trained on two bearing datasets, with specific details as follows:
(1) CWRU Dataset: The CWRU bearing dataset is a standard dataset widely used in the field of fault diagnosis, provided by the Department of Electrical Engineering and Computer Science at Case Western Reserve University. It was specifically designed for motor bearing fault research and has been extensively utilized to evaluate the performance of machine learning and deep learning models in fault diagnosis. The dataset primarily consists of vibration signals from a set of motor bearings, recorded using accelerometers under various operating conditions. During the experiments, the motor operated at different speeds and under different load conditions, generating vibration signals for both healthy and faulty bearings. The fault signals included nine types of data corresponding to different locations and diameters of faults. The tests simulated four operating conditions generated by running at four different loads: 0 hp, 1 hp, 2 hp, and 3 hp, with the accelerometer sampling frequency set at 12,000 Hz.
(2) PU Dataset: The PU dataset, provided by Christian Lessmeier and others, is aimed at data-driven bearing fault diagnosis. It includes artificially induced fault bearings, real fault bearings produced by accelerated life testing, and healthy bearings, all of the 6203 deep groove ball bearing type. Data collection synchronously captures high-frequency signals at 64 kHz, recording motor current and vibration signals under four different speeds and loads, as shown in Table 1. The dataset contains data for 26 different bearing damage states and 6 healthy states, with high-frequency synchronous collection of motor current and vibration signals across the specified conditions.
For the CWRU dataset, each category consists of 1000 samples, totaling 10 categories and 10,000 samples overall. The original data were segmented using a sliding window technique to augment the fault samples, with overlapping portions between samples; each sample contained 1024 data points. The dataset was split into training, validation, and testing sets in a ratio of 7:2:1, and the model was trained, validated, and tested under the aforementioned four operating conditions.
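For illustration, the sliding-window segmentation described above might be implemented as follows; the 50% overlap (a step of 512 points) is an assumption, since the exact stride is not stated.

```python
import numpy as np

def sliding_window(signal, length=1024, step=512):
    """Segment a 1-D vibration signal into overlapping 1024-point samples."""
    n = (len(signal) - length) // step + 1
    return np.stack([signal[i * step : i * step + length] for i in range(n)])

# Usage: samples = sliding_window(raw_signal)  ->  array of shape (n, 1024)
```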
For the PU dataset, each category has 5000 samples, resulting in 4 categories and a total of 20,000 samples. Unlike the CWRU dataset, the PU dataset has sufficient data length, so the sliding window technique was not used, and there was no overlap between samples; each sample also contained 1024 data points. The dataset was divided into training, validation, and testing sets in a ratio of 7:2:1, and the model was trained, validated, and tested across the four different operating conditions.
Figure 3 illustrates the model training process using the 0 hp condition data from the CWRU dataset. It can be seen that, after the 50th epoch, the loss rapidly decreased to nearly 0, with the accuracy approaching 100%.
Common models in the fault diagnosis field, such as CNN, LSTM, and SVM, were selected as comparison models, and experiments were conducted under various conditions for both the CWRU and PU datasets.
For each dataset, the experiments with the main model and the three comparison models were repeated ten times, with the performance of each model summarized in Table 2. In the four operating conditions, the Transformer-based fault diagnosis model achieved average accuracies of approximately 99.58%, 99.55%, 99.99%, and 99.99% on the CWRU test set, and 99.77%, 98.01%, 99.93%, and 99.79% on the PU test set. These results exceeded those of the CNN, LSTM, and SVM comparison models, with the accuracy deviations generally smaller than those of the other models, indicating that the Transformer-based single-condition fault diagnosis model was highly accurate and stable, approaching 100%. Notably, the CNN model displayed relatively higher accuracy and stability, leading to its selection as the comparison model for subsequent experiments.
4.2. Motor Bearing Fault Diagnosis Under Multiple Operating Conditions
In the context of multiple operating conditions, the TTLN model was utilized to transfer the data from Condition 1 for each dataset, validating the model’s capability to migrate from a single condition to other conditions and assessing its generalization ability under varying operating scenarios. Initially, data from Condition 1 were set as the source domain dataset and migrated to the target domain datasets of Conditions 2, 3, and 4.
To ensure the optimal performance of the TTLN model, we conducted a systematic hyperparameter search using a grid search approach, focusing on key parameters such as the number of epochs, learning rate, and batch size. After evaluating various combinations, we selected the optimal settings for the TTLN model: 300 epochs, a learning rate of 0.0005, and a batch size of 128. These settings were chosen based on validation performance, balancing training stability and convergence speed. For instance, a learning rate of 0.0005 was found to prevent overshooting, while ensuring efficient convergence, and a batch size of 128 provided a good balance between computational efficiency and gradient estimation accuracy.
We evaluated the impact of key hyperparameters on the performance of the TTLN model. For instance, increasing the number of epochs beyond 300 did not significantly improve the accuracy but increased the training time, while reducing the number of epochs below 300 led to underfitting. Similarly, a learning rate higher than 0.0005 caused training instability, while a lower learning rate slowed down the convergence. These findings guided our selection of optimal hyperparameters.
The TTLN model was employed to conduct migration training from Condition 1 to Condition 2, tracking the changes in the various losses (overall loss, domain adaptation loss, category alignment loss, source domain loss, and target domain loss) and classification accuracy throughout the training process, as illustrated in Figure 4.
From Figure 4, it is evident that, after a series of iterations, the losses of the TTLN model exhibited a convergence trend, while the testing accuracy for the target domain data consistently improved. After approximately 300 iterations, the testing accuracy stabilized at around 99.4%.
To further analyze the model’s performance, we provide confusion matrices for the CWRU dataset (see Figure 5). The confusion matrices reveal that the model achieved high accuracy for most fault types, with only minor misclassifications observed in certain classes, such as ball faults. These misclassifications may be attributed to the similarity in vibration patterns between certain fault types under specific operating conditions.
To evaluate the impact of the weighting factors on model performance, we conducted experiments with different weight configurations. The results, as shown in Table 3, indicate that setting $\alpha$ higher than $\delta$ led to better generalization, as the model prioritized learning from the source domain while reducing the influence of potentially noisy pseudo-labels in the target domain. Additionally, a balanced setting of $\beta$ and $\gamma$ ensured effective domain adaptation. Specifically, $\beta$ (the weight for marginal distribution alignment) played a central role in domain adaptation, as it minimized the discrepancy between the source and target domains using MMD and CORAL, promoting feature proximity and enabling effective knowledge transfer. On the other hand, $\gamma$ (the weight for conditional distribution alignment) served as an auxiliary term, ensuring class-level discrimination and preventing misclassification due to overly small class distances. This balanced approach allowed the model to achieve robust domain adaptation while maintaining high classification accuracy.
To simulate a sample-scarce environment, the sample size for the target condition was reduced, and the designed TTLN model was utilized for transfer learning from Condition 1 to the other target conditions. Additionally, the direct training results of the Transformer model on the corresponding target conditions, as well as the results of a Transformer model pre-trained on the source domain data and fine-tuned on the target condition data, were compared with the transfer results of the TTLN model, to evaluate the effectiveness of the transfer learning strategy in data-scarce scenarios.
Figure 6 illustrates the change in accuracy of the different models during the sample size reduction process. The Transformer model, without transfer learning, maintained an accuracy above 90% when the dataset sample size exceeded 1000, but this was still lower than that of the TTLN model. When the sample size fell below 1000, the accuracy of the Transformer model rapidly dropped to below 50%. Similarly, the fine-tuned Transformer model (pre-trained on the source domain and fine-tuned on the target domain) showed an improved performance compared to the non-transfer learning Transformer model, achieving an accuracy of around 95% when the sample size exceeded 1000. However, its performance also degraded significantly when the sample size was reduced below 1000, dropping to around 65% accuracy. In contrast, the TTLN model demonstrated remarkable resilience, with only a marginal decline in performance, consistently achieving an accuracy above 95%, even with substantially reduced sample sizes.
This clearly demonstrates that the TTLN model showed strong adaptability in data-scarce situations, achieving a significantly higher accuracy than the directly trained Transformer model and the fine-tuned Transformer model. The Transformer model struggled to adequately learn the fault patterns in the target conditions due to an insufficient sample size, resulting in poor generalization ability. Therefore, the TTLN model effectively identified faults in the target conditions, while minimizing the reliance on large amounts of new condition data.
The training time for the TTLN model varied depending on the dataset and hardware configuration. For the CWRU dataset with 7000 samples, the model typically took approximately 15 min to complete 300 epochs using a single NVIDIA RTX 3090 GPU. To handle large-scale training data, we employed mini-batch gradient descent with a batch size of 64, which balanced computational efficiency and gradient estimation accuracy. Additionally, data augmentation techniques, such as sliding window segmentation, were used to increase the diversity of the training data, without significantly increasing the computational overhead [40].
5. Conclusions
In this study, we propose a transformer transfer learning network (TTLN) for motor bearing fault diagnosis, addressing key challenges such as limited labeled data and varying operating conditions. The main contributions of this work are as follows:
1. Methodological Contribution: We introduced the TTLN, a model that combines domain adaptation and transfer learning to enhance diagnostic accuracy across diverse operational conditions. By aligning both marginal and conditional distributions, the TTLN effectively adapts to both source and target domains.
2. Experimental Validation: Comprehensive experiments on the CWRU and PU datasets demonstrated the effectiveness of the TTLN, achieving an average accuracy of 99.58% and 99.77%, respectively. The TTLN outperformed baseline models, such as CNN and LSTM, particularly in scenarios with limited training data, maintaining over 95% accuracy when the sample sizes fell below 1000, while the baseline models exhibited significant performance degradation.
3. Practical Implications: The TTLN’s robustness in data-scarce environments and across multiple operating conditions underscores its potential for real-world applications, particularly in fault detection systems where obtaining labeled data is costly or impractical. This approach offers a reliable solution for enhancing the performance of diagnostic systems in dynamic and resource-constrained industrial settings.
While this study demonstrates the effectiveness of the TTLN in handling varying operating conditions and limited data, it is important to acknowledge that real-world vibration signals are often contaminated by noise [41,42], which can impact model performance. Noise in raw sensor data can lead to misclassifications, particularly in challenging conditions with scarce labeled samples. Future work will explore strategies to improve the TTLN’s robustness to noise, such as incorporating noise-resilient preprocessing techniques or noise-robust training approaches, to further enhance the diagnostic accuracy in industrial applications.
Additionally, it is worth noting that while load variations are an important factor in industrial applications, they were not the primary focus of this study. Instead, our research addressed the broader challenge of cross-condition fault diagnosis, where operating conditions (including load, speed, and temperature) may vary significantly. The proposed TTLN model demonstrated robustness to these variations through domain adaptation strategies, achieving high diagnostic accuracy across different conditions. While this study focused on cross-condition fault diagnosis, the impact of specific factors such as load variations warrants further investigation in future work.
To further improve the computational efficiency, compressed sensing (CS) techniques, such as adaptive step size forward-backward pursuit (ASFBP), could be integrated into our framework. These techniques reduce the dimensionality of the input data while preserving critical fault-related features, thereby reducing training time and memory requirements. For example, ASFBP has been successfully applied in the acoustic emission (AE)-based health state assessment of high-speed train bearings, demonstrating its effectiveness in handling large-scale datasets [43,44,45,46]. Future work could explore the integration of ASFBP with the TTLN model to enhance its scalability and efficiency, particularly in scenarios with large-scale data and limited computational resources.
While the CWRU and PU datasets provided a solid foundation for evaluating the TTLN model, they may not fully capture the complexities and noise present in real industrial environments. In future work, we plan to test the TTLN model on more realistic datasets that incorporate noise, varying operational conditions, and real-life fault patterns, further validating its practical applicability in industrial settings.
While the proposed TTLN demonstrated superior performance in motor-bearing fault diagnosis, its deployment in real-world industrial environments must consider several practical constraints, such as memory usage, inference speed, and the potential need for hardware acceleration. Transformers typically demand more memory than traditional models like CNNs or RNNs, due to their self-attention mechanism, which computes pairwise interactions between all time steps. While this can be a limitation in resource-constrained environments, the use of modern hardware (e.g., GPUs) can mitigate this issue. Additionally, although Transformers may have a higher computational complexity, their parallelizable nature allows for faster inference on GPUs or TPUs, making them suitable for real-time fault diagnosis in industrial settings. However, deploying Transformers may require hardware acceleration, which is increasingly feasible given the growing availability of affordable GPU-based solutions.
In conclusion, this work advances both the theory of transfer learning and domain adaptation, while providing a practical framework for predictive maintenance in industrial systems. The exploration of noise resilience in future work will further strengthen the model’s applicability in real-world environments.