1. Introduction
Gears are core power transmission components in rotating machinery and are widely used in the aerospace, rail transit, wind energy, and ship industries, as well as in other fields. Their reliability directly affects the performance and service life of equipment [1, 2]. Under complex working conditions and load impacts, gears are prone to various faults, such as cracks, pitting, wear, spalling, and broken teeth, which can result in mechanical structure failure, equipment shutdown, casualties, and other consequences [3, 4, 5]. Therefore, research on gear fault diagnosis (GFD) technology in complex environments is of great significance for equipment condition monitoring, maintenance reduction, and reliability improvement [6, 7].
Over the past few decades, vibration-based fault diagnosis has become a widely applied and increasingly favored approach to gear fault detection, and a number of feasible GFD methods have been put forward. Cheng et al. [8] built a Ramanujan Fourier decomposition method with excellent anti-noise properties and precise feature extraction capability. Zheng [9] put forward a novel empirical reconstruction Gaussian decomposition method for GFD. Based on a comparison of two fault feature detection models, namely variational mode decomposition (VMD) and empirical mode decomposition (EMD), Zhang [10] presented a novel bearing fault diagnosis method based on VMD. However, these methods often require significant human intervention and expert experience, and are not suitable for analyzing and processing large datasets. As vibration signals become more complex, efficient extraction of gear fault features becomes key to understanding gear damage status, which has made machine learning the mainstream approach to gear fault diagnosis. To extract gear fault features from multi-channel compressed signals under ultra-low compression conditions, Liu [11] proposed a new weighted and distributed empirical compressed sensing (CS) method. Integrating VMD and particle swarm optimization (PSO), Liu [12] presented a bearing fault diagnosis approach to overcome the difficulty of fault feature extraction under strong noise conditions.
With the rapid development of modern industrialization and artificial intelligence, deep neural networks are being applied in the field of intelligent fault diagnosis [13]. To better extract gear fault features, Refs. [14, 15] encoded raw 1D vibration time series into 2D images and introduced corresponding machinery fault diagnosis and classification approaches. Xing [16] proposed an MLDGCNN network model in order to obtain cutting top line information. To accurately extract fault features from insufficient samples, Shi [17] proposed a GFD method combining a Gramian angular summation field (GASF) with a multi-scale channel attention DenseNet. Combining a recurrence plot and a convolutional neural network (CNN), Wang [18] and Liu [19] presented intelligent fault diagnosis and classification schemes for planetary gears and bearings, respectively. To achieve an end-to-end diagnostic mechanism, Chen [20] presented a rotating machinery fault diagnosis method involving a novel continuous wavelet transform. Combining deep transfer learning and CNNs, Liu et al. [21, 22] developed fault diagnosis methods for different types of rotating parts (such as bearings, gears, and blades). To enhance the adaptability and precision of GFD, Lin [23] put forward an intelligent gear diagnosis approach based on a CS-improved VMD and a probabilistic neural network. Chen [24] proposed a bearing fault diagnosis approach based on joint denoising with the improved empirical wavelet transform and compressive sensing and on a lead convolution neural network. To improve the precision of rotating machinery fault diagnosis under noise, Weng [25] put forward a multi-scale kernel-based network with an attention mechanism, which improved both precision and efficiency. For bearings, Jin [26] presented a novel anti-noise multi-scale CNN to enable coupling fault diagnosis at different noise levels. Zhao [27] proposed a new multi-scale inverted residual CNN approach for bearing fault diagnosis under variable loads. In Ref. [28], a bearing fault diagnosis approach based on the multi-scale feature fusion of a parallel CNN was described. To overcome the limitations of small samples and strong noise interference, Wang [29] designed a fault diagnosis approach based on a lightweight multi-scale CNN. Yue [30] developed a novel multiscale wavelet prototypical network, which enabled cross-component fault diagnosis. Wang [31] proposed a multi-layer fusion CNN and a multi-layer fusion module–relational knowledge distillation module with an attention mechanism to improve robustness in different noisy environments. To extract fault characteristics effectively from vibration data, Zhang [32] and Zhong [33] presented different mechanical fault diagnosis methods, which achieved synergistic effects between networks to improve deep learning capability. To overcome the limitations of complex background noise and a limited number of fault samples, Refs. [34, 35, 36] each put forward a fault diagnosis method with an improved attention mechanism, all of which showed excellent robustness and generalization compared with other GFD approaches under noisy conditions.
A review of the above research shows that many existing gear fault diagnosis methods still suffer from poor anti-noise performance and insufficient generalization ability. To address these challenges, a novel compressive sensing lightweight attention multi-scale residual network (CS-LAMRNet) is proposed to improve the adaptability and accuracy of GFD. Firstly, a multi-scale feature extraction (MSFE) module with strong anti-noise ability is created, which can capture image features at different scales. Then, a lightweight attention (LATT) module is designed, which can allocate weights adaptively and reduce the number of parameters to be computed. Finally, an improved depth residual attention (IDRA) module is constructed, which can effectively extract global and local fault information.
The contributions and highlights of this paper are as follows:
- (1) The CS method is used to reconstruct the vibration data, which are then transformed into images through the Gramian angular difference field (GADF) so as to extract more comprehensive feature information.
- (2) An MSFE module with strong anti-noise ability is constructed; image features at different scales are captured by parallel convolution layers with different kernel sizes, and multi-scale features are obtained by feature fusion.
- (3) An IDRA module is created. The designed lightweight attention module is embedded in the IDRA module, which ensures the full extraction of fault features and improves computational efficiency.
- (4) The effectiveness of the proposed method is verified on the NEU and SEU datasets, and the advantages of CS-LAMRNet and the effectiveness of the proposed modules are verified by noise comparison experiments and ablation experiments.
The layout of this paper is as follows: Section 2 introduces the related theories of CS, GADF, and depthwise separable convolution (DSC). In Section 3, the architecture of the CS-LAMRNet model is outlined and the overall structure of the network is discussed in detail. Section 4 demonstrates the effectiveness and superiority of CS-LAMRNet compared with other fault diagnosis methods on the NEU and SEU datasets. Finally, the main conclusions of this research are summarized in Section 5.
3. CS-LAMRNet
Since gears mostly operate in harsh environments, collected gear vibration data are easily contaminated by strong, non-stationary noise, which makes the data markedly nonlinear and can even cause weak gear fault characteristics to be overlooked. Thus, extracting effective features from noisy gear data remains a challenge in intelligent gear fault diagnosis. To overcome these shortcomings, a novel compressive sensing lightweight attention multi-scale residual network (CS-LAMRNet) for gear fault diagnosis is proposed, which comprises three parts: a multi-scale feature extraction (MSFE) module, a lightweight attention (LATT) module, and an improved depth residual attention (IDRA) module.
3.1. Multi-Scale Feature Extraction Module
The limitations of relying on a single small-scale convolution kernel for global information extraction and the inability of large convolution kernels to precisely capture local features can result in critical information loss and an elevated risk of overfitting, thereby constraining a model’s capacity for feature representation. Moreover, the presence of noise and fault-related features distributed across various frequency bands in vibration data imparts a multi-scale characteristic to collected gear fault datasets. To effectively address this multi-scale nature, this section introduces a multi-scale feature extraction (MSFE) module predicated on multi-scale feature learning. This module leverages diverse convolution kernel sizes to extract features at multiple scales and levels, with the aim of improving the comprehensiveness and precision of feature extraction.
Combined with depthwise separable convolution, a high-efficiency multi-scale feature extraction architecture is shown in Figure 3. The MSFE module has three parallel branches, whose convolution kernel sizes are 3 × 3, 5 × 5, and 7 × 7, respectively. The padding of the three branches is set to 1, 2, and 3, respectively, which ensures that the features of different scales fed into the fusion layer have the same spatial dimensions. Finally, a batch normalization (BN) layer is added at the end of each branch to reduce the risk of overfitting. The main function of the BN layer is to normalize each batch of input data, which makes network training more stable and faster. The output of each branch is given by:
$$S_i = B\left(\mathrm{DwConv2d}_i(I)\right), \quad i = 1, 2, 3$$
where $I$ represents the input; $\mathrm{DwConv2d}_i$ ($i = 1, 2, 3$) is the $i$th depthwise separable convolution layer; $B$ is the BN layer; and $S_i$ ($i = 1, 2, 3$) denotes the output of the $i$th branch. Finally, the information extracted from the three branches is fused to obtain comprehensive features, which are input into the neural network:
$$F = \mathrm{Concat}\left(S_1, S_2, S_3;\ \mathrm{dim} = 1\right)$$
where $F$ represents the output of the MSFE module and dim = 1 indicates concatenation along the channel dimension.
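The claim that paddings of 1, 2, and 3 keep the three branch outputs the same size can be checked with the standard convolution output-size formula. The sketch below is only an arithmetic illustration: it assumes a stride of 1 (the text does not state the stride), and `channels_per_branch = 16` is a hypothetical value chosen for the example, not a parameter taken from the paper.

```python
def conv_output_size(size: int, kernel: int, padding: int, stride: int = 1) -> int:
    """Spatial output size of a 2D convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel size, padding) of the three MSFE branches, as stated in the text.
BRANCHES = [(3, 1), (5, 2), (7, 3)]


def msfe_output_shape(in_size: int, channels_per_branch: int) -> tuple:
    """Shape (channels, height, width) after concatenating the branches on dim = 1.

    Assumes stride 1 and the same (hypothetical) channel count on every branch.
    """
    sizes = {conv_output_size(in_size, k, p) for k, p in BRANCHES}
    if len(sizes) != 1:
        raise ValueError("branch outputs differ in size and cannot be concatenated")
    side = sizes.pop()
    return (channels_per_branch * len(BRANCHES), side, side)


shape = msfe_output_shape(in_size=224, channels_per_branch=16)
print(shape)  # (48, 224, 224)
```

With stride 1, each kernel/padding pair satisfies padding = (kernel − 1) / 2, so the spatial size is preserved and only the channel count grows at the fusion step.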
3.2. Lightweight Attention Module
To overcome the noise pollution in the collected gear fault dataset and enhance the noise resistance of the model, a lightweight attention (LATT) module was designed. It adaptively assigns weights so that the model focuses on important features related to gear failure, reduces the impact of redundant information, extracts useful fault features, and effectively improves the robustness of GFD. During the calculation of LATT, for a given input $X \in \mathbb{R}^{n \times d}$, the queries, keys, and values are $Q \in \mathbb{R}^{n \times d_q}$, $K \in \mathbb{R}^{n \times d_k}$, and $V \in \mathbb{R}^{n \times d_v}$, respectively. Specifically, $n$ is the number of patches; $d$ represents the dimension of the input tensor; and $d_q$, $d_k$, and $d_v$ denote the feature dimensions of the query, key, and value vectors. Then, to compute attention efficiently, a faster, lighter scaled dot-product attention is adopted, which builds on the following expression:
$$\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
To reduce the computational cost of attention, two 3 × 3 depthwise separable convolution layers are used to decrease the spatial size of $K$ and $V$ before the attention operation, giving the reduced matrices $K'$ and $V'$. The framework of the proposed LATT module is shown in Figure 4. In addition, a relative position bias $B$ is added to each self-attention module, which yields a lightweight multi-head attention mechanism, and the formula is as follows:
$$\mathrm{LightAttn}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK'^{T}}{\sqrt{d_k}} + B\right)V'$$
Each of the $h$ heads of the LATT module produces an output sequence of size $n \times d/h$. The output sequences of all heads are then concatenated to form a single sequence of size $n \times d$.
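The effect of shrinking the keys and values before attention can be illustrated with a small, dependency-free sketch of scaled dot-product attention with an optional additive bias. This is not the authors' implementation: the function names are hypothetical, the toy matrices are made up, and `reduce_tokens` uses simple averaging as a stand-in for the depthwise separable convolutions described in the text.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q, K, B=None):
    """Row-wise softmax(Q K^T / sqrt(d_k) + B) for list-of-lists matrices."""
    d_k = len(K[0])
    weights = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if B is not None:
            scores = [s + b for s, b in zip(scores, B[i])]
        weights.append(softmax(scores))
    return weights


def attend(Q, K, V, B=None):
    """Weighted sum of the value rows: len(Q) output rows, d_v columns."""
    W = attention_weights(Q, K, B)
    d_v = len(V[0])
    return [[sum(w * v[c] for w, v in zip(row, V)) for c in range(d_v)] for row in W]


def reduce_tokens(M, factor):
    """Stand-in reduction (assumption): average every `factor` consecutive rows.

    The paper reduces K and V with depthwise separable convolutions instead.
    """
    return [
        [sum(col) / factor for col in zip(*M[i:i + factor])]
        for i in range(0, len(M) - factor + 1, factor)
    ]


# Toy example: n = 4 tokens with d_k = d_v = 2 (made-up values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

K_red, V_red = reduce_tokens(K, 2), reduce_tokens(V, 2)
weights = attention_weights(Q, K_red)  # 4 x 2 score matrix instead of 4 x 4
output = attend(Q, K_red, V_red)       # still one output row per query
```

The number of output rows is unchanged because the queries are not reduced; only the width of the score matrix, and hence the cost of the softmax and the weighted sum, shrinks with the reduction factor.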
3.3. Improved Depth Residual Attention Module
To improve gear fault feature extraction and accurately capture local and global feature information, an improved depth residual attention (IDRA) module was developed, which consists of an improved residual module and a LATT module. The IDRA module uses a depthwise separable convolution layer with a 5 × 5 kernel to improve the capability and efficiency of gear feature extraction, and adopts the inverted bottleneck design used in Transformers, moving the depthwise convolution layer up one layer in order to accommodate a larger convolution kernel. In addition, GELU was selected as the activation function. It combines the advantages of ReLU and Dropout, its core features are nonlinearity and smoothness, and it has a certain regularization effect, which helps to alleviate vanishing and exploding gradients. The approximate calculation formula of GELU is expressed as follows:
$$\mathrm{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^{3}\right)\right]\right)$$
Figure 5 shows the framework of the IDRA module. To dynamically adjust the strength of the residual connection, a layer scale was introduced after the IDRA module. As shown in the figure, the input gear features pass through the residual block and then enter the lightweight attention module for further feature extraction.
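As a quick numerical check of the GELU approximation used in this section, the sketch below compares the commonly used tanh form with the exact definition GELU(x) = x·Φ(x), where Φ is the standard normal CDF. The function names are illustrative only, and the evaluation grid (−5 to 5 in steps of 0.1) is an arbitrary choice.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest gap between the two forms on a grid from -5.0 to 5.0 in steps of 0.1.
grid = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on the grid: {max_gap:.2e}")
```

Both forms are smooth, pass through zero at x = 0, and approach the identity for large positive x, which is the ReLU-like behavior with a soft transition that the text refers to.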
3.4. Architecture of CS-LAMRNet
The architecture of the proposed compressive sensing lightweight attention multi-scale residual network is displayed in Figure 6. The input of CS-LAMRNet is a two-dimensional image obtained by converting the 1D vibration signal with GADF. Firstly, feature information is extracted by the three MSFE branches, each with a different convolution kernel size and a BN layer, and the extracted features are then fused as the input of the backbone network. The backbone consists of four stages of stacked IDRA modules, containing 1, 2, 2, and 1 modules, respectively. After repeated tests, the channel dimensions (dim) of the four stages were set to 96, 192, 384, and 768, respectively, to further process the feature information. A downsampling module, which consists of a BN layer and a 2 × 2 convolution with a stride of 2, is placed between adjacent stages to reduce the spatial size of the feature maps and thus the complexity of the model. Finally, the learned features are fed into a global average pooling (GAP) layer, which computes the average value of each feature map and thereby significantly reduces the number of parameters in the fully connected layer, and the fault types are classified by Softmax. The specific parameters of each network layer of CS-LAMRNet are shown in
Table 1.
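To give a feel for why depthwise separable convolution keeps the backbone light, the sketch below compares weight counts for a standard 5 × 5 convolution and a depthwise separable one at the four stage widths quoted above. It is a simplified illustration under stated assumptions: equal input and output channels, no bias terms, and no inverted-bottleneck expansion, so the figures are not the actual layer sizes listed in Table 1.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


STAGE_DIMS = [96, 192, 384, 768]  # stage widths quoted in the text
KERNEL = 5                        # kernel size used in the IDRA module

for dim in STAGE_DIMS:
    full = standard_conv_params(dim, dim, KERNEL)
    light = depthwise_separable_params(dim, dim, KERNEL)
    print(f"dim={dim}: standard={full}, separable={light}, ratio={light / full:.4f}")
```

The ratio equals 1/c_out + 1/k², so for a 5 × 5 kernel it approaches 4% as the channel count grows.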
3.5. Flow Chart of the Proposed Fault Diagnosis Method
Figure 7 presents the flowchart of the gear fault diagnosis method for complex operating conditions based on the proposed CS-LAMRNet. The framework contains four steps.
Step 1: Data acquisition. The vibration data of gears under various working conditions are acquired by an acceleration sensor mounted on the gearbox and a data acquisition system.
Step 2: Signal processing. The denoised vibration signal is converted into 2D images by the Gramian angular difference field (GADF); the images are then divided into a training set and a test set in a fixed proportion.
Step 3: Model training. The training set is input into CS-LAMRNet for iterative training. The loss function is optimized by the Adam algorithm, and a Softmax classifier is used for classification.
Step 4: Fault diagnosis. The test set is input into the trained CS-LAMRNet to identify the faults, and the accuracy on the entire test set and the identification accuracy for each fault type are output.
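The signal-to-image conversion in Step 2 can be sketched with the standard definition of the Gramian angular difference field: the series is rescaled to [−1, 1], each value is mapped to an angle φ = arccos(x), and pixel (i, j) is sin(φ_i − φ_j). The code below is a minimal illustration of that definition, not the authors' pipeline; the function names are hypothetical, and the resizing of the result to the network input size is omitted.

```python
import math


def rescale(series):
    """Min-max rescale a series to [-1, 1] (the series must not be constant)."""
    lo, hi = min(series), max(series)
    return [(2 * x - hi - lo) / (hi - lo) for x in series]


def gadf(series):
    """Gramian angular difference field: entry (i, j) is sin(phi_i - phi_j)."""
    # Clamp to guard against rounding slightly outside the acos domain.
    scaled = [max(-1.0, min(1.0, x)) for x in rescale(series)]
    phi = [math.acos(x) for x in scaled]
    return [[math.sin(a - b) for b in phi] for a in phi]


# Made-up four-point series, just to show the shape and symmetry of the result.
image = gadf([0.0, 0.5, 1.0, 0.25])
for row in image:
    print([round(v, 3) for v in row])
```

The resulting matrix has a zero diagonal and is antisymmetric, so a window of L samples yields an L × L image that encodes pairwise temporal relations.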
4. Experimental Results and Discussion
In this section, the effectiveness and generalization ability of the CS-LAMRNet method are validated using experimental data from the Northeastern University gearbox dataset (NEU dataset) and the Southeast University gearbox dataset (SEU dataset). In addition, to verify the advantages of the present model, comparative tests against several popular models are carried out. Finally, an ablation experiment is performed to evaluate the effectiveness of the various components of the present model. The batch size, learning rate, and number of training epochs are set to 16, 0.0005, and 100, respectively.
4.1. Dataset Description
- (1) NEU dataset
To obtain gear status and fault data, a CL-100 gear wear testbed was set up, as shown in Figure 8. The testbed consists of a gearbox, a DH5922D dynamic signal test system, a loading device, an acceleration sensor, etc. The gear vibration data were sampled at 20 kHz for 20 s, and the gear speed was set to 1450 r/min. The main parameters of the gears are listed in Table 2. Five gear states were tested, namely normal gear (NO), tooth break (MT), tooth pitting (CT), tooth crack (RF), and tooth wear (SF), which are displayed in Figure 9.
To balance computational efficiency against information retention and to ensure that the generated GADF images provide sufficient information, a sliding window of 1000 data points was used to segment the data of each gear state. GADF was then used to convert the data of each state into 1000 images of size 224 × 224, which provides sufficient spatial resolution to capture the fault information. In total, 5000 images were obtained for the five gear states. Figure 10a shows the GADF images of the NEU dataset. The dataset was then divided into a training set and a test set at a ratio of 8:2. The labels of the different gear fault types are listed in Table 3.
- (2) SEU dataset
The SEU gearbox dataset was acquired from the dynamic drivetrain simulator (DDS) presented in Figure 11, which consists of a motor, a motor controller, a parallel shaft gearbox, a load, and a load controller. Two operating conditions (1200 r/min at 0 Nm and 1800 r/min at 7.32 Nm) were investigated [41]. The SEU dataset contains multiple types of data, and the x-axis vibration signal of the gearbox under the first condition was selected. The dataset contains four common gear faults and the normal state: tooth pitting (CT), tooth break (MT), root fault (RF), tooth wear (SF), and normal state (NO). A sliding window of 1000 data points was used for each gear state, and the data were converted into 5000 images by GADF. Figure 10b shows the GADF images of the SEU dataset. Each gear state contains 800 training samples and 200 test samples. A detailed description of the different types of gearbox faults is given in Table 4. To better compare the differences and connections between the two datasets, they are summarized in
Table 5.
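The sample counts in this subsection can be cross-checked with simple sliding-window arithmetic. The sketch below uses the NEU acquisition settings quoted above (20 kHz for 20 s, 1000-point windows); the window step is not stated in the text, so the step values are assumptions explored for illustration, and the helper names are hypothetical.

```python
def count_windows(n_samples: int, window: int, step: int) -> int:
    """Number of full windows obtained by sliding `window` with stride `step`."""
    if n_samples < window:
        return 0
    return (n_samples - window) // step + 1


def split_counts(n_images: int, train_parts: int = 8, test_parts: int = 2):
    """Train/test sizes for a train_parts:test_parts split (integer arithmetic)."""
    n_train = n_images * train_parts // (train_parts + test_parts)
    return n_train, n_images - n_train


n_samples = 20_000 * 20  # 20 kHz sampling for 20 s -> 400,000 points per record

# Non-overlapping windows (step = window length) give fewer than 1000 segments,
# so some overlap is needed to reach 1000 images from a single record.
print(count_windows(n_samples, window=1000, step=1000))  # 400
print(count_windows(n_samples, window=1000, step=399))   # 1001

print(split_counts(1000))  # (800, 200)
```

Under the single-record assumption, a step of at most 399 points is needed to obtain 1000 windows per gear state, and the 8:2 split then gives the 800 training and 200 test samples per state reported for the SEU dataset.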
4.2. Experimental Results and Visual Analysis
- (1) Experimental verification on the NEU dataset
The accuracy and loss curves obtained with CS-LAMRNet are presented in Figure 12. During training, the identification accuracy gradually improved and the loss decreased accordingly as the number of iterations increased. The proposed CS-LAMRNet converges and stabilizes rapidly: after 20 epochs, the accuracies on the training set and test set reached 99.80%, and the loss reached its minimum and stabilized at 0.2, which indicates that the model had essentially converged and that nearly all training and test samples were identified accurately. To further investigate the feature learning and fault classification capabilities of CS-LAMRNet, the test results on the NEU dataset were visualized with a confusion matrix and t-SNE, which show the proportions of misclassified samples for the different gear fault types. As can be seen in Figure 13a, 1.0% of the tooth wear test samples are misclassified as tooth cracks, and all other test samples are correctly classified. The fast convergence and high precision of the training process demonstrate the good performance of the proposed CS-LAMRNet.
- (2) Experimental verification on the SEU dataset
To further verify the superiority of the proposed CS-LAMRNet model, the accuracy and loss for the training and test samples were studied, as shown in Figure 14. After the first 20 epochs, the training and test recognition accuracies almost overlapped, reached 100%, and stabilized; meanwhile, the corresponding losses fell rapidly and eventually stabilized at 0.001. Furthermore, to obtain a more comprehensive picture of the performance of CS-LAMRNet on the SEU gear dataset, the confusion matrix and t-SNE were employed to obtain more detailed diagnostic information, as shown in Figure 15, which indicates that the proposed GFD method has an excellent classification ability.
4.3. Comparative Experiments
- (1) Comparative experiments based on the NEU dataset
To verify the effectiveness and feasibility of CS-LAMRNet, the experimental results were compared with those of seven other methods, namely ResNet18 [42], VGG11 [43], MTF-ResNet [44], GADF-CNN [45], MobileNet V3 [38], ConvNeXt-T [39], and DRSN-CW [46]. To keep the comparison controlled, all methods used the same dataset (NEU dataset) during training and testing. To reduce the influence of randomness and improve the reliability of the evaluation, the average values of the accuracy and F1-macro over ten runs of each fault diagnosis method were chosen as the model evaluation indexes. The expressions are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}, \qquad \mathrm{F1\text{-}macro} = \frac{1}{N}\sum_{i=1}^{N}\frac{2P_iR_i}{P_i + R_i}$$
where TP/FP and TN/FN represent true/false positive examples and true/false negative examples, respectively; $N$ is the number of classes; and the subscript $i$ denotes the quantities computed for the $i$th class.
Table 6 summarizes the average accuracies of the different methods, which are 87.45%, 97.72%, 96.91%, 96.90%, 98.96%, 99.30%, 99.18%, and 99.58%, respectively. VGG11 has the lowest average accuracy, 87.45%, which may be attributed to its small (3 × 3) convolution kernels: although VGG11 increases the network depth, it may lose a large amount of feature and spatial information in some cases, which limits its feature expression ability and results in overfitting to a certain extent. The average accuracy of each of the other fault diagnosis methods is at least 96.90%. Of these, DRSN-CW yields the best result among the comparison methods, with an accuracy of 99.30% and an F1-score of 99.30%. However, the highest accuracy and F1-score, 99.58% and 99.83%, were obtained by the CS-LAMRNet proposed in this research. Compared with VGG11 and DRSN-CW, the accuracy increased by 12.13 and 0.28 percentage points and the F1-score by 12.32 and 0.53 percentage points, respectively, which indicates that CS-LAMRNet has excellent diagnostic capability.
- (2) Comparative experiments based on the SEU dataset
In this section, the SEU dataset is used to evaluate the advantages of the proposed method; the evaluation strategy and the compared fault diagnosis methods are the same as in the preceding subsection. The accuracies and F1-scores are compared in Table 7. The proposed CS-LAMRNet model has the highest accuracy and F1-score among the compared methods, both reaching 100%. Therefore, from the above comparison results, it can be concluded that the proposed model exhibits excellent fault classification performance and a robust diagnostic effect.
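For readers who want to reproduce the two evaluation indexes used in this subsection, the sketch below computes accuracy and macro-averaged F1 from a multi-class confusion matrix (rows are true labels, columns are predicted labels). The three-class matrix is made-up illustrative data, not a result from the paper, and the function names are hypothetical.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def f1_macro(cm):
    """Unweighted mean of the per-class F1 scores."""
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # true class i, predicted otherwise
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted class i, true otherwise
        denom = 2 * tp + fp + fn
        # 2TP / (2TP + FP + FN) is algebraically equal to 2PR / (P + R).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Made-up confusion matrix for three classes with 50 test samples each.
cm = [
    [50, 0, 0],
    [0, 48, 2],
    [0, 1, 49],
]
print(f"accuracy = {accuracy(cm):.4f}")
print(f"F1-macro = {f1_macro(cm):.4f}")
```

Because F1-macro weights every class equally, it can differ from accuracy even on balanced test sets when errors are concentrated in particular classes.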
4.4. Comparative Experiment in Noisy Environments
Although the proposed CS-LAMRNet has an excellent fault diagnosis ability on the laboratory gear datasets (NEU dataset and SEU dataset), gear systems in actual industrial applications operate under complicated and volatile conditions, and the acquired gear vibration data are frequently contaminated by noise. Hence, to evaluate the anti-noise ability of the proposed CS-LAMRNet, Gaussian noise is injected into the SEU gear dataset to simulate the noise in real working conditions. The signal-to-noise ratio (SNR) is used to quantify the Gaussian noise intensity, and it can be expressed as follows:
$$\mathrm{SNR} = 10\log_{10}\left(\frac{P_S}{P_N}\right)\ \mathrm{dB}$$
Here, $P_S$ represents the power of the signal and $P_N$ denotes the power of the noise.
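The noise-injection step can be sketched directly from the SNR definition: the target SNR fixes the noise power relative to the signal power, and zero-mean Gaussian noise with that variance is added sample by sample. The code below is a minimal illustration, not the authors' script; the function names are hypothetical, and a synthetic sine wave stands in for a real vibration record.

```python
import math
import random


def power(signal):
    """Mean squared value of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise so that 10*log10(P_S / P_N) is about snr_db."""
    noise_power = power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from the clean signal and the noise actually added."""
    noise = [b - a for a, b in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(noise))


rng = random.Random(0)  # fixed seed for repeatability
# Synthetic stand-in signal (assumption): 50 Hz sine sampled at 20 kHz for 1 s.
clean = [math.sin(2 * math.pi * 50 * t / 20_000) for t in range(20_000)]
noisy = add_gaussian_noise(clean, snr_db=6.0, rng=rng)
print(f"measured SNR: {measured_snr_db(clean, noisy):.2f} dB")
```

The measured value fluctuates slightly around the target because the noise power is estimated from a finite number of samples.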
To ensure the consistency of the analysis, the diagnostic methods mentioned above were used for comparison. The comparison experiments were carried out on the SEU dataset at different SNRs (4 dB, 6 dB, 8 dB, 10 dB, 12 dB, and 14 dB), and the results are shown in Figure 16. The line chart in Figure 16 shows that the proposed CS-LAMRNet achieves higher accuracy than the seven other approaches at the examined SNRs. When SNR = 4 dB, the accuracies of VGG11, MobileNet V3, ResNet18, MTF-ResNet, and GADF-CNN are all below 86%, while the accuracies of ConvNeXt-T, DRSN-CW, and CS-LAMRNet remain above 92.5%. As the SNR increases, the proposed CS-LAMRNet model reaches its highest diagnostic accuracy of 100% when SNR ≥ 12 dB. Therefore, the results show that the proposed CS-LAMRNet model possesses excellent noise resistance.
To further evaluate the diagnostic performance of CS-LAMRNet in a noisy environment, Figure 17 displays the confusion matrices obtained by the aforementioned diagnostic methods on the SEU gear dataset at SNR = 6 dB. As shown in Figure 17a, for VGG11, 14%, 6%, and 6% of the Label 2 samples are identified as Label 1, Label 3, and Label 4, respectively; 30% and 7% of the Label 3 samples are identified as Label 2 and Label 4, respectively; and 20% and 13% of the Label 4 samples are identified as Label 2 and Label 3, respectively. This indicates that VGG11 cannot precisely identify specific gear faults in strongly noisy environments. The other diagnostic methods (ResNet18, MTF-ResNet, GADF-CNN, MobileNet V3, DRSN-CW, and ConvNeXt-T) also fail to classify the specific gear faults accurately. In contrast, the diagnostic accuracy of CS-LAMRNet is 100% for Labels 0 and 2 and 99% for Labels 1 and 3. The lowest accuracy, for Label 4, still reaches 94%, which indicates that CS-LAMRNet is able to correctly identify the five gear states. Therefore, compared with the other diagnostic methods, the proposed CS-LAMRNet offers more accurate classification, especially for Labels 0, 1, 2, and 3, and shows better generalization ability.
To better understand the advantages and reliability of CS-LAMRNet for GFD, the t-SNE algorithm was used for visual analysis. Figure 18 shows the t-SNE visualizations of the different models at SNR = 6 dB. The results of the first seven diagnostic methods show obvious overlap between classes, making it difficult to accurately distinguish the different types of gear faults. In Figure 18h, the classification using CS-LAMRNet shows a remarkable clustering effect: samples of the same class are clustered together in the feature space and form distinct regions with clear boundaries. Hence, it can be concluded that CS-LAMRNet displays good distinguishability and achieves accurate predictions on the SEU dataset.
4.5. Ablation Study
To evaluate the contribution of each component of the proposed framework, ablation experiments were carried out on the SEU dataset by removing or replacing specific structures within the CS-LAMRNet model. The four methods compared are described as follows.
Method 1: Compressed sensing is not applied to the noise-added signal, which shows how the model performs without denoising.
Method 2: The 3 × 3, 5 × 5, and 7 × 7 convolution kernels in the MSFE module are all replaced with regular 3 × 3 convolutions, in order to investigate the effect of convolution kernel size on feature extraction.
Method 3: The LATT module is removed from the IDRA module, in order to evaluate the significance of the LATT module in the complete method.
Method 4: The complete CS-LAMRNet model proposed in this research, which serves as the reference for the three methods above.
As shown in Table 8, the ablation experiments indicate that compressed sensing and the MSFE module have a remarkable impact on diagnostic accuracy. Compressed sensing provides excellent denoising ability, which effectively improves the usability of signals in complex environments, while the multi-scale convolution kernels enhance the adequacy and diversity of feature extraction by capturing information at different scales. The importance of the LATT module cannot be overlooked either: the inclusion of the attention mechanism in CS-LAMRNet enables the model to adaptively focus on key fault information, alleviates the interference of noise, and improves diagnostic accuracy and reliability. Therefore, the ablation experiments demonstrate the effectiveness and necessity of each component of the proposed CS-LAMRNet.