1. Introduction
As a general-purpose component, rolling bearings are widely used in various rotating mechanical equipment. Defects may develop in a bearing during operation and ultimately damage the equipment [1,2]. Fault diagnosis of rolling bearings can therefore help prevent safety accidents and economic losses; statistics show, for instance, that bearing failure accounts for more than 21% of all failures in electrical machines [3]. In the past, the fault diagnosis of rolling bearings was often realized with physical models [4]. However, the oversimplification and low accuracy of physical models make them unsuitable for increasingly complex modern industrial systems. With continuous improvements in computer processors and sensor technologies (vibration sensors, acoustic sensors, etc.), researchers have converged on two diagnosis schemes, one based on vibration and one based on acoustics. Either scheme can be implemented alone or combined with deep learning methods trained on historical data, which provides a new direction for accurate bearing fault diagnosis [5,6] and accelerates the use and development of fault diagnosis tools [7].
The distinct advantages of deep learning over other machine learning methods include its greater learning capacity, more powerful feature extraction, and faster data processing [8]. With these advantages, deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have achieved excellent performance in image processing, natural language processing, and related fields [9]. Researchers have also applied deep learning methods to achieve high-accuracy diagnosis of bearing faults. Eren et al. [10] used a one-dimensional CNN on raw vibration signals for bearing fault diagnosis, Liu et al. [11] reported a lightweight CNN that performs bearing fault diagnosis under variable operating conditions, and Luo et al. [12] employed a semi-supervised autoencoder (AE) to address diagnosis when labeled samples are insufficient.
Although previous research has improved diagnostic accuracy to a certain extent, the information and features in historical data have not been fully exploited, hindering further improvement of these methods. In particular, deep-learning-based rolling bearing fault diagnosis models face the following challenges. (1) The global features of vibration signals are difficult to extract. A vibration signal is one-dimensional time-series data, often several thousand points long in the time domain. Compared with image data, this structure leaves a CNN with a much smaller receptive field (RF). In image processing, stacking ten layers of 5 × 5 convolution kernels yields an output-layer receptive field of 1681 pixels; stacking ten layers of 1 × 5 kernels on a vibration signal yields an RF of only 41 samples. It is therefore difficult to capture the global features of a vibration signal with small convolution kernels. (2) Large convolution kernels and deep network structures can enhance global feature extraction, but they also increase the number of parameters and computations and thus the risk of overfitting. Moreover, deep CNNs are laborious to train due to gradient explosion or vanishing. (3) The fault diagnosis task is unitary. In image processing, diversified learning tasks such as classification, object detection, semantic segmentation, and text annotation can assist each other [13], and additional learning tasks can serve as regularization terms or pre-training methods [14]. In contrast, the diagnosis field usually performs only one classification task (fault diagnosis) or one regression task (life prediction).
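The receptive-field gap in challenge (1) follows from simple arithmetic: for stride-1 convolutions, each layer grows the receptive field by (kernel size − 1) along each axis. A minimal sketch (plain Python, illustrative only) reproduces the 41-vs-1681 comparison:

```python
def receptive_field_1d(kernel_size: int, num_layers: int) -> int:
    """Receptive field of stacked stride-1 convolutions along one axis."""
    return 1 + num_layers * (kernel_size - 1)

# Ten stacked 5 x 5 kernels on an image: 41 positions per axis, 41 * 41 = 1681 pixels.
rf_axis = receptive_field_1d(5, 10)
print(rf_axis, rf_axis * rf_axis)   # 41 1681

# Ten stacked 1 x 5 kernels on a 1-D vibration signal: only 41 samples.
print(receptive_field_1d(5, 10))    # 41
```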
Considering the above challenges, this study aims to develop a fault diagnosis method that can extract global and local features without substantially increasing the number of learnable parameters or the risk of overfitting. In addition, multi-task learning methods are considered for effective training.
Multi-scale feature learning and multi-task learning are not new in image processing. Multi-scale convolution models such as Inception [15] have more powerful feature extraction and generalization capabilities than single-scale models. Dilated convolution enlarges the receptive field of a convolution kernel while keeping the number of parameters unchanged [16]. Generative adversarial networks (GANs) are a deep learning framework proposed by Goodfellow et al. in 2014 [17]; a GAN performs unsupervised learning through a two-player game to obtain a generative model. Auxiliary classifier generative adversarial networks (ACGANs) are a variant of GANs [18]. An ACGAN consists of two components, the discriminator and the generator. The generator samples random noise and classification labels to generate new fake samples, while the discriminator performs source discrimination and classification discrimination on both real and fake samples. Both components effectively improve during the game. Other techniques, such as attention mechanisms and residual structures, are also often used to improve performance: the attention mechanism can improve the representational ability of a CNN and help visualize the learning process [19], and the residual structure effectively resolves the training difficulties of deep networks [20].
Beyond image processing, the above techniques, including multi-scale convolution, residual connections, and ACGAN, have also been applied to fault diagnosis, but limitations remain. Huang et al. [21] employed multi-scale convolution kernels to extract bearing fault features; however, that study did not use dilated convolution to reduce the number of parameters and computations, and a large number of training samples was needed to ensure diagnostic accuracy. Li et al. [22] investigated a residual model to accelerate learning but did not exploit multi-scale features or unsupervised learning to improve performance. Shao et al. [23] employed ACGAN for data augmentation while ignoring the discriminator for diagnosis tasks. Huang et al. [24] and Wang et al. [25] adopted attention-based models to extract fault features, but these studies did not reveal the importance of multi-scale features through the attention mechanism.
Considering the limitations of current methods, this study proposes a new deep-learning-based method for bearing fault diagnosis. Unlike previous methods that use handcrafted features as input, the proposed method designs multi-scale dilated convolution kernels to extract features adaptively from the raw signal, and unsupervised learning is used to improve model performance.
The main contributions of the proposed method can be highlighted as follows:
- (1) Multi-scale and multi-dilation-rate convolution kernels are used to extract both the global and local features of the raw signals. The receptive field is enlarged without significantly increasing the number of parameters or causing serious overfitting.
- (2) The channel attention mechanism is employed to illustrate the importance of features at different scales. Features at different scales contribute differently to the diagnostic task; the channel attention mechanism adaptively learns the importance of the various features, revealing the role of features at various levels and scales. The proposed method therefore enhances the interpretability of the fault diagnosis model.
- (3) A multi-task learning model suitable for bearing fault diagnosis is established. The new method uses an unsupervised learning task to strengthen the feature extraction capability of the diagnostic model. Essentially, the proposed method performs implicit data augmentation.
The remaining parts of this paper are arranged as follows. Section 2 provides the preliminary details of this study. In Section 3, the proposed method and model are demonstrated in detail. In Section 4, the experimental arrangement and data processing are described. In Section 5, experimental results on two datasets illustrate the effectiveness of the proposed method. Finally, the conclusion is drawn in Section 6.
3. Proposed Method
Firstly, this study constructs a convolution module based on dilated convolution, squeeze-and-excitation (SE), and the residual structure to extract features on various time scales. Then, a classifier generative adversarial network (CGAN) based on the idea of ACGAN is designed for fault diagnosis. In CGAN, the generator (G) and discriminator (D) use the constructed convolution module or its internal sub-modules. Finally, two learning stages are employed to improve the diagnostic accuracy of the proposed model: the first stage is multi-task learning based on CGAN, and the second stage is fine-tuning based on supervised learning.
3.1. Multi-Scale Dilated Convolution SE Residual Module
As shown in Figure 4, the proposed convolution module embeds the multi-scale dilated convolution (MDC) layer and the SE module into a residual module; it is therefore called the multi-scale dilated convolution squeeze-and-excitation residual block (MDC-SE-ResBlock). The MDC layer comprises eight groups of dilated convolution kernels with different scales and dilation rates, called multi-scale dilated convolution groups (MDCGs); the detailed parameters are shown in Table 1. Compared with the original SE module, the SE module in this paper uses both global average pooling (GAP) and global max pooling (GMP) to form the global pooling layer. In MDC-SE-ResBlock, Conv-Layer1 and Conv-Layer4 are added to keep the feature size and the number of channels consistent.
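To illustrate the modified squeeze step (a rough NumPy sketch under our own assumptions; the layer shapes and weight names below are illustrative, not taken from the paper), each channel can be pooled with both GAP and GMP before the excitation step re-weights the channels:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_recalibrate(x, w1, w2):
    """Squeeze-and-excitation with a GAP + GMP squeeze (illustrative).

    x  : feature map, shape (channels, length)
    w1 : excitation weights, shape (channels, hidden)
    w2 : excitation weights, shape (hidden, channels)
    """
    squeeze = x.mean(axis=1) + x.max(axis=1)  # GAP + GMP, shape (channels,)
    hidden = np.maximum(squeeze @ w1, 0.0)    # ReLU bottleneck
    scale = sigmoid(hidden @ w2)              # per-channel weights in (0, 1)
    return x * scale[:, None]                 # re-weight each channel

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32))                  # 8 channels, 32 time steps
y = se_recalibrate(x, rng.normal(size=(8, 4)), rng.normal(size=(4, 8)))
print(y.shape)                                # (8, 32)
```

Because the sigmoid keeps each channel weight in (0, 1), the block can only attenuate channels, which is what makes the learned scale values usable as an importance measure for visualization.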
As shown in Table 1, the convolution ranges of the MDCGs span from 3 to 111. Since dilated convolution incurs a certain degree of information loss [29], the new model sets the stride of all MDCGs to 1 to reduce this loss, while the stride of Conv-Layer1 and Conv-Layer4 is set to 2 to achieve down-sampling and information compression. In MDC-SE-ResBlock, the Leaky Rectified Linear Unit (LeakyReLU) [30] is selected as the activation function, and Batch Normalization (BN) [31] and Dropout [32] are used to stabilize the learning process and strengthen regularization, respectively.
The MDC, SE, and residual structures that make up MDC-SE-ResBlock each have distinct advantages. Dilated convolution reduces network parameters and prevents serious overfitting while maintaining a large RF, making it particularly suitable for one-dimensional time-series signals. For example, a basic convolution kernel with the same convolution range as MDCG8 would contain 111 parameters, whereas MDCG8 contains only 23, a parameter reduction of 79.28%. The SE module adaptively learns the importance of features at different scales, and its activation values can also be used for visualization. The residual structure introduced by MDC-SE-ResBlock effectively alleviates gradient explosion or vanishing.
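The quoted parameter saving can be reproduced with the standard dilated-convolution span formula, span = dilation × (kernel size − 1) + 1. A kernel of size 23 with dilation 5 is our inferred reading of MDCG8 (it is the configuration consistent with the 23-parameter, 111-sample figures above, not a value stated in Table 1):

```python
def dilated_span(kernel_size: int, dilation: int) -> int:
    """Input range covered by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1

k, d = 23, 5                        # inferred MDCG8 configuration (assumption)
span = dilated_span(k, d)
saving = 100.0 * (span - k) / span  # parameter reduction vs. a dense kernel
print(span, round(saving, 2))       # 111 79.28
```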
3.2. Classifier Generative Adversarial Net (CGAN)
The goal of ACGAN is to obtain a high-quality G; in this study, however, we want to learn a discriminator D that will be used as the classification network in later tasks. The classification task should therefore be the main task and the source identification task the auxiliary task. In this case, we call the ACGAN variant CGAN, and the goal of CGAN is to obtain a stable and reliable fault classification model D. The specific differences between CGAN and ACGAN are shown in Table 2.
Figure 5 illustrates the framework of CGAN. The classification task is achieved by maximizing the classification log-likelihood L_C, while the source identification task is achieved by maximizing the source log-likelihood L_S.
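Assuming the standard ACGAN-style objectives (our reading; the paper's exact formulas are not reproduced in this excerpt), L_S is the log-likelihood of correct real/fake decisions and L_C the log-likelihood of the correct fault class; D maximizes L_S + L_C while G maximizes L_C − L_S. A toy computation in plain Python with hypothetical probabilities:

```python
import math

def log_likelihood_source(p_real_on_real, p_fake_on_fake):
    """L_S: log-likelihood of correct real/fake source decisions."""
    return math.log(p_real_on_real) + math.log(p_fake_on_fake)

def log_likelihood_class(p_class_on_real, p_class_on_fake):
    """L_C: log-likelihood of the correct fault class on real and fake samples."""
    return math.log(p_class_on_real) + math.log(p_class_on_fake)

# Hypothetical discriminator outputs for one batch (illustrative numbers only).
L_S = log_likelihood_source(0.9, 0.8)
L_C = log_likelihood_class(0.95, 0.7)
d_objective = L_S + L_C  # D wants both tasks right
g_objective = L_C - L_S  # G wants correct classes but fooled source decisions
print(round(d_objective, 3), round(g_objective, 3))  # -0.736 -0.079
```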
In CGAN, the structures of the discriminator D and the generator G are shown in Figure 6. D cascades four MDC-SE-ResBlocks, and its output layer has an RF of about 500 samples with respect to the input layer, which allows global features to be extracted effectively. MDC-SE-ResBlock is not used directly in G; two MDC-SE structures in series are used instead, for two reasons. On the one hand, the learning effect of G mainly depends on D, so the residual structure is not necessary for G. On the other hand, after many attempts in this study, directly using MDC-SE-ResBlock proved less effective than the current structure. In both D and G, LeakyReLU, Dropout, and Batch Normalization are still used.
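The claim that the output layer of D sees roughly 500 input samples can be sanity-checked with the usual receptive-field recursion for strided layers, rf += (k − 1) · jump, jump *= stride. The layer list below is purely hypothetical (the actual per-block kernels are given in Table 1 and Figure 6); it is chosen only to show how four down-sampling blocks reach a receptive field of this magnitude:

```python
def receptive_field(layers):
    """Receptive field of a stack of 1-D conv layers.

    layers: list of (kernel_span, stride) pairs, input side first.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer adds (k - 1) input-sample steps
        jump *= s             # striding widens the step between output samples
    return rf

# Hypothetical block: one span-31 conv (stride 1) + one stride-2 conv (k = 3).
blocks = [(31, 1), (3, 2)] * 4
print(receptive_field(blocks))  # 481, i.e. on the order of 500
```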
3.3. Learning Strategy
This paper proposes a two-stage learning strategy to train the fault diagnosis model. Only labeled samples are used in the entire learning process. In Stage 1, the CGAN learning strategy is used for semi-supervised learning. In Stage 2, D again uses the labeled samples for supervised fine-tuning.
3.3.1. The First Stage of Learning
The CGAN learning strategy is used to pre-train the diagnostic model D so as to initialize its parameters well. Directly learning a supervised task can easily fall into a poor local optimum, which leads to poor generalization; an appropriate pre-training method can constrain the parameters to the vicinity of the global optimum. The CGAN learning strategy has the following advantages: (1) after unsupervised learning, D obtains a good representation of the raw data; (2) when the Nash equilibrium is reached, having D classify the samples generated by G amounts to implicit data augmentation; and (3) the supervised learning carried out by D overcomes the blindness of unsupervised learning. The first stage is therefore both multi-task learning and semi-supervised learning, and its learning process cyclically updates the parameters of D and G. The specific procedure is shown in Algorithm 1. In this paper, the Adam optimizer is used for gradient descent; stochastic gradient descent (SGD) or the RMSProp algorithm could be used as alternatives.
Algorithm 1 Training process of the first learning stage
Initialization: lr, delta, n, EPOCH, ITERATION. lr is the initial learning rate, delta is the corresponding decay rate, n is the mini-batch size, EPOCH is the total number of passes over the data set, and ITERATION is the number of training iterations per epoch.
1: for i in range(0, EPOCH) do:
2:  for j in range(0, ITERATION) do:
3:   Sample n noise vectors z from the noise prior and n labels c from the label prior.
4:   Generate fake samples X_fake = G(z, c).
5:   Sample n labeled real samples from the training set.
6:   Optimize L_S + L_C by updating D.
7:   Optimize L_C − L_S by updating G.
8:  end for
9:  lr ← lr × delta
10: end for
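The epoch-level learning-rate schedule implied by Algorithm 1 (lr multiplied by the decay rate delta after each epoch; this multiplicative update is our reading of the initialization line, which only names lr and delta) can be sketched as:

```python
def decayed_lr(lr0: float, delta: float, epoch: int) -> float:
    """Learning rate after `epoch` multiplicative decays."""
    lr = lr0
    for _ in range(epoch):
        lr *= delta
    return lr

# e.g. lr0 = 1e-3, delta = 0.95: after 20 epochs the rate is lr0 * 0.95**20.
print(decayed_lr(1e-3, 0.95, 20))
```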
3.3.2. The Second Stage of Learning
To further improve performance on the supervised task, supervised fine-tuning of D is adopted in Stage 2. In theory, one or more of the trained D models that finally reach the Nash equilibrium can be selected and trained again on the training data. In this paper, we simply choose a D with a smaller loss in the equilibrium stage according to the loss curves of D and G. A lower learning rate is employed in this stage so as not to discard the previous learning. Only labeled samples are used, which differs from many semi-supervised methods that require large numbers of unlabeled samples [33].