1. Introduction
With the development of modern industry, machine health monitoring has become increasingly important for maintaining the safe operation of modern mechanical equipment [1,2]. Bearings are an indispensable part of modern machinery, especially rotating machinery. Rolling bearing faults account for a large proportion of mechanical equipment failures, so it is necessary to ensure the normal operation of bearings [3,4,5]. In practical engineering, because of complex and changeable operating conditions, a fault, once it occurs, often affects other parts, leading to a compound fault [6,7,8]. Compared with single faults, the vibration signals of compound faults are affected by mutual coupling and interference among multiple fault features, so the difficulty of feature extraction and fault diagnosis is greatly increased [9]. Effective and reliable compound fault diagnosis of rolling bearings is therefore of great significance for guaranteeing safe operation.
In recent years, the compound fault diagnosis of rolling bearings has received increasing attention from researchers. The main research approaches include compound fault mechanism analysis [10,11], blind source separation algorithms [12,13,14,15], signal decomposition algorithms [16,17], and artificial intelligence algorithms [18,19]. Methods based on the compound fault mechanism take specific machines as research objects, which limits the applicability and portability of the resulting models. Methods based on blind source separation impose high requirements on the number of channels of the raw signal; the number of sensors must meet the requirements of the algorithm, which increases the cost of fault diagnosis. The modal decomposition algorithm is a typical signal decomposition method suitable for non-stationary signal processing.
However, there are some issues, such as mode aliasing and the end effect, that can directly affect the accuracy of compound fault diagnosis. With the rapid development of artificial intelligence techniques, machine learning methods, including support vector machines (SVMs), Bayesian classifiers, artificial neural networks (ANNs), and convolutional neural networks (CNNs), have been applied as potential tools for fault diagnosis, especially deep learning approaches. Ref. [20] proposed a deep inception net with atrous convolution and applied it to bearing fault diagnosis; the model overcomes the problem caused by different feature distributions between two data sets and achieves high accuracy. Ref. [21] proposed a CNN-based model, named WDCNN, for bearing fault diagnosis that achieves high accuracy directly on raw vibration signals from the Case Western Reserve University (CWRU) bearing data set. Ref. [22] utilized an improved CNN for fault diagnosis, in which a multiscale cascaded layer is added to the CNN to enhance the classification information of the input. Ref. [23] constructed a CNN with feature alignment that addresses the finite-shift-invariance problem and can extract robust fault features. Ref. [24] adopted a multi-task CNN with information fusion for fault diagnosis on two bearing data sets; the experimental results showed that the proposed model improved diagnosis accuracy. Ref. [25] implemented a lightweight CNN combining transfer learning and self-attention, which achieves higher diagnosis accuracy than traditional CNN models. Ref. [26] proposed a novel multiscale CNN model that incorporates multiscale learning into the feature extraction process to diagnose wind turbine gearbox faults. A novel multiscale residual attention CNN model was proposed in [27], which utilizes multiscale features, an attention mechanism, and residual learning to enhance feature extraction ability; experimental validation on two bearing datasets demonstrated that the algorithm achieved higher accuracy. Building on a CNN model, [28] designed a multi-task CNN that uses speed identification and load identification as auxiliary tasks to improve the performance of the fault diagnosis task; the experimental results showed that multi-task learning can enhance the model's fault diagnosis performance. In [29], an end-to-end fault diagnosis model combining a CNN with LSTM was designed, which can realize bearing fault diagnosis in a short time. Ref. [30] presented an improved one-dimensional multiscale model that combines extended convolutional kernels with varying dilation rates; its superiority was validated on the CWRU and PU datasets. Ref. [31] fused vibration signals and sound signals through a one-dimensional CNN and validated its higher diagnostic accuracy. Ref. [32] used the short-time Fourier transform to convert vibration signals into spectrograms and adopted a CNN-based model for feature extraction and health status classification. Ref. [33] proposed a lightweight CNN combined with data augmentation for bearing fault diagnosis. A novel hybrid CNN-MLP model was proposed in [34], which combines mixed inputs to achieve rolling bearing diagnosis. In [35], a lightweight CNN model with fixed feature map dimensions was constructed by down-sampling vibration signals into spectral graphs, which achieves high classification accuracy on low-dimensional input data. Ref. [36] put forward a model based on optimized-parameter maximum-correlated-kurtosis deconvolution and CNN for bearing compound fault diagnosis and verified its effectiveness.
The above models demonstrate that deep learning approaches have significantly improved the effectiveness of fault diagnosis. However, the relationship between compound faults and single faults is one-to-many or many-to-many; a compound fault is not a simple superposition of single fault signals. Compared with single faults, feature extraction and localization for compound faults are more difficult and challenging, which brings great difficulties to compound fault diagnosis. Currently, intelligent diagnosis methods based on deep learning mainly focus on single faults; some models that are suitable for single fault diagnosis are not effective for compound fault diagnosis. The research and application of deep learning approaches to bearing compound fault diagnosis are still in their infancy. Thus, in this study, a novel deep convolutional neural network combining global feature extraction with detailed feature extraction (GDDCNN) is proposed.
The proposed GDDCNN model is a deep convolutional neural network that incorporates both global feature extraction and detailed feature extraction, where G denotes global feature extraction, D denotes detailed feature extraction, and DCNN denotes a deep convolutional neural network. A DCNN is a feature-progressive learning algorithm in which the deep layers continue to learn higher-level fault features on top of shallow features. By designing the two feature extraction modules, G and D, the DCNN achieves better learning ability. Therefore, richer fault features can be extracted through GDDCNN, improving fault diagnosis performance. The contributions of this study are summarized as follows:
A novel deep convolutional neural network combining global feature extraction with detailed feature extraction (GDDCNN) is proposed to extract features adaptively from a raw signal.
The modified activation, concatenated ReLU (CReLU), is applied in the shallow layers of GDDCNN to improve the performance of global feature extraction.
The global max pooling (GMP) strategy is designed to replace the traditional fully connected layer; it extracts shift-invariant features and reduces model parameters, mitigating overfitting during training.
The rest of this study is organized as follows. Section 2 reviews the basic theory of CNN. Section 3 describes the proposed model in detail. Section 4 presents the experimental validation on two bearing datasets. Finally, Section 5 draws the conclusion.
2. Theoretical Background
A CNN can be built with multiple layers, comprising convolution layers, pooling layers, activation layers, and a classification layer. The convolution, pooling, and activation layers extract features from the input signals; one-dimensional CNNs have been widely applied to 1-D vibration signal processing due to their powerful feature extraction ability. The classification layer uses the extracted features to perform classification.
The convolution layer is the core layer of the CNN structure; it convolves the input data with filter kernels. The network makes each filter learn to activate when it encounters certain features, thereby realizing feature extraction. The mathematical form can be described as follows:

$$y_i^l = k_i^l(x^l) = w_i^l \ast x^l + b_i^l$$

where $y_i^l$ denotes the output of the $l$-th layer; $k_i^l$ is the $i$-th convolution kernel of the $l$-th layer; $x^l$ denotes the input of the $l$-th layer; $\ast$ represents the convolution operation; $w_i^l$ denotes the weights of the convolution kernel; and $b_i^l$ is the offset.
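As a concrete illustration, the convolution of a 1-D input with a single kernel can be sketched in plain Python (a minimal, framework-free sketch; real models would use a deep learning library):

```python
def conv1d(x, w, b):
    """'Valid' 1-D convolution (cross-correlation, as used in CNNs) of
    input x with kernel weights w and scalar bias b."""
    k = len(w)
    return [sum(w[j] * x[t + j] for j in range(k)) + b
            for t in range(len(x) - k + 1)]

# A length-5 input convolved with a length-3 kernel yields 5 - 3 + 1 = 3 outputs.
y = conv1d([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 0.0, -1.0], 0.5)  # -> [-1.5, -1.5, -1.5]
```

The differencing kernel `[1, 0, -1]` used here illustrates how a learned filter responds to local signal structure; in a trained CNN the weights and bias are learned.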
The activation layer usually follows the convolution layer and is essential. The activation function defines the relationship between the output and input of a neuron and is usually nonlinear; it enables the network to learn nonlinear features from the input vibration signal, improving its feature extraction capability. The rectified linear unit (ReLU) [37] is commonly used as the activation function in CNNs and is defined as follows:

$$\mathrm{ReLU}(x) = \max(0, x)$$

which means the activation value is 0 on the negative half-axis.
Batch normalization (BN) [38] is applied in deep neural networks to reduce internal covariate shift and improve the accuracy of the trained model. In addition, BN forces the learned activations into a standard distribution with a mean of 0 and a variance of 1, which accelerates model training. The BN transform is described as follows:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

where $m$ represents the mini-batch size, $\mu_B$ expresses the mini-batch mean, and $\sigma_B^2$ represents the mini-batch variance. $\gamma$ and $\beta$ are learnable parameters of the network.
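A minimal numerical sketch of the BN transform over one mini-batch (here γ and β are fixed numbers; in a network they are learned, and running statistics would be tracked for inference):

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch to zero mean / unit variance, then scale and shift."""
    m = len(xs)
    mu = sum(xs) / m                           # mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m   # mini-batch variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

ys = batch_norm([1.0, 2.0, 3.0, 4.0])
# The output has (approximately) zero mean and unit variance.
```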
The pooling layer performs a down-sampling operation that removes redundant features and extracts deeper features. Max pooling and average pooling are the most common pooling operations. Max pooling generally outperforms average pooling for time series classification tasks and is expressed as follows:

$$p_i^l(j) = \max_{(j-1)s < t \le js} a_i^l(t)$$

where $p_i^l$ represents the output features of the $l$-th layer; $\max$ is the max pooling operation; $a_i^l(t)$ denotes the output value of the $t$-th neuron in the $i$-th channel of the $l$-th layer; and $s$ denotes the stride of the pooling.
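The max pooling operation over one channel can be sketched as follows (a plain-Python illustration with an explicit window size and stride):

```python
def max_pool1d(a, size, stride):
    """Max pooling: take the maximum over each window of `size` values,
    stepping by `stride` along a single channel."""
    return [max(a[t:t + size]) for t in range(0, len(a) - size + 1, stride)]

p = max_pool1d([1.0, 3.0, 2.0, 5.0, 4.0, 0.0], size=2, stride=2)  # -> [3.0, 5.0, 4.0]
```

With `size == stride` (non-overlapping windows, as above), the feature length is halved, which is the down-sampling effect described in the text.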
In the classification layer, the Softmax function is applied to normalize the output into a probability for each category. Softmax in the neural network is defined as

$$\mathrm{Softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

where $K$ is the number of categories and $z_j$ represents the logit of the $j$-th output neuron.
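The Softmax normalization can be sketched in plain Python (subtracting the maximum logit first is a standard trick for numerical stability and does not change the result):

```python
import math

def softmax(z):
    """Numerically stable Softmax: exponentiate shifted logits and normalize."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sums to 1, and the largest logit gets the largest probability.
```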
3. Proposed GDDCNN
To overcome the problems of compound fault feature extraction, a novel deep CNN is proposed. The structure of the proposed model, GDDCNN, is shown in Figure 1; it is composed of an input module, a feature extractor module, a GMP layer, and a Softmax classifier. Global feature extraction and detailed feature extraction constitute the feature extractor module. In addition, two strategies, a modified activation operation and the GMP strategy, are used during the feature extraction process to enhance the feature extraction ability of GDDCNN. Finally, compound fault diagnosis is performed. More details are given in the subsequent parts.
3.1. GDDCNN Architecture Design
Different kernel sizes for convolution: In CNN, convolutional kernel size plays an important role in the convolutional layer because kernels of different sizes can obtain different features. Generally, wider kernels pay more attention to global information during convolution operations, thereby extracting more global features, while smaller kernels can capture more detailed features. To obtain more robust features from a raw vibration signal, global feature extraction and detailed feature extraction are combined and applied to the feature extractor module. This study designed different kernel sizes for convolution; wide kernel sizes are applied in the shallow (first and second) convolution layers to extract global features, while the deep convolutional kernels are small, which help to obtain detailed features. Multi-convolution layers that adopt small convolutional kernels make the CNN networks deeper, which can improve the performance of compound fault feature extraction. Finally, the size of the convolution kernel for different layers is set to be [64, 16, 3].
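As a rough sanity check on this kernel configuration, the feature-map length after each convolution can be computed as L_out = L_in - k + 1. This sketch assumes stride-1 "valid" convolutions and ignores pooling layers (which would further shorten the maps), so the exact numbers are illustrative only:

```python
def valid_conv_lengths(input_len, kernel_sizes):
    """Feature-map length after each stride-1 'valid' convolution in a stack."""
    lengths = []
    for k in kernel_sizes:
        input_len = input_len - k + 1
        lengths.append(input_len)
    return lengths

# With input samples of length 2048 and kernel sizes [64, 16, 3]:
lens = valid_conv_lengths(2048, [64, 16, 3])  # -> [1985, 1970, 1968]
```

The wide 64-point kernel in the first layer spans a long stretch of the raw signal (global context), while the 3-point kernels in deep layers respond to fine local detail, matching the design rationale above.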
Modified activation operation: In addition to convolution, the activation operation also affects the performance of a CNN. Owing to its simple computation and freedom from the vanishing gradient problem, ReLU is widely applied in CNNs as the activation unit. However, when the inputs are negative, the corresponding neurons are always inactive; such dead neurons may never activate, which stops learning and impairs the learning ability of the network. In [39], it was found that in the shallow layers of CNNs, the parameter distribution of the network exhibits a strong negative correlation, and this negative correlation gradually weakens as the network deepens. CReLU is a concatenated ReLU, which activates negative inputs by inverting the feature map, helping features propagate backward better. CReLU is defined as follows:

$$\mathrm{CReLU}(x) = \left[\mathrm{ReLU}(x),\ \mathrm{ReLU}(-x)\right]$$

where ReLU represents the ReLU activation function. The modified activation CReLU is used in the shallow (first and second) convolution layers, which improves the performance of global feature extraction.
GMP strategy: The fully connected layer is generally applied after the last convolutional or pooling layer to integrate the class-discriminative local features extracted by the CNN. Each neuron in the fully connected layer is connected to all neurons in the previous layer. Because of this full connectivity, the fully connected layer has numerous parameters and an extremely high computational cost. The right part of Figure 2 shows the GMP process: the maximum value of each channel is taken as the new feature vector. The GMP strategy clearly reduces the dimension of the feature vector, which helps avoid overfitting. Another advantage is that it is more native to the convolution structure, enforcing correspondences between feature maps and categories. Furthermore, this strategy retains spatial information and preserves shift-invariance, resulting in robust extracted features.
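The GMP operation itself is simple enough to sketch directly: each channel's entire feature map collapses to its single maximum, so the output length equals the channel count, independent of the feature-map length:

```python
def global_max_pooling(feature_maps):
    """Reduce each channel's feature map to its single maximum value,
    yielding one feature per channel instead of a large flattened vector."""
    return [max(channel) for channel in feature_maps]

# Three channels of length 4 collapse to a 3-element feature vector.
v = global_max_pooling([[0.1, 0.9, 0.3, 0.2],
                        [1.5, 0.0, 0.7, 1.1],
                        [0.2, 0.2, 0.8, 0.4]])  # -> [0.9, 1.5, 0.8]
```

Because only the maximum per channel is kept, shifting a feature within its channel does not change the output, which is the shift-invariance property noted above; GMP also introduces no trainable parameters, unlike a fully connected layer.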
3.2. Training of GDDCNN
The architecture of GDDCNN is designed to take advantage of one-dimensional CNNs. The cross-entropy loss function is used to estimate the consistency between the Softmax output probability distribution and the target class probability distribution. Suppose that $p(x)$ and $q(x)$ represent the target distribution and the estimated distribution, respectively. The loss function can be expressed as follows:

$$L = -\sum_{x} p(x) \log q(x)$$
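A minimal sketch of the cross-entropy loss between a one-hot target and a Softmax output (the small `eps` guards against log(0)):

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between target distribution p (often one-hot)
    and estimated distribution q from Softmax."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

# For a one-hot target, the loss reduces to -log of the predicted
# probability of the true class.
loss = cross_entropy([0.0, 1.0, 0.0], [0.2, 0.7, 0.1])
```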
The gradient of the loss $L$ and the gradients of the parameters related to BN are back-propagated in GDDCNN during training; for the learnable BN parameters, the standard rules are

$$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}\,\hat{x}_i, \qquad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}$$
Adam is an adaptive-learning-rate optimization algorithm that combines the Adagrad and RMSProp algorithms, allowing a model to allocate larger updates to rarely occurring features and thereby aiding convergence. By bringing together the adaptive learning rates of Adagrad and the stability of RMSProp, the Adam algorithm provides an effective way to optimize the proposed model, enabling it to converge quickly and efficiently while avoiding common optimization pitfalls. The choice of optimizer plays a crucial role in the success of the training process.
In order to minimize the loss function, the Adam optimization algorithm is applied to update the weights and obtain the optimal weights. The Adam optimizer is chosen for the following reasons. Bearing vibration signals typically contain a large number of data points due to their time-series nature; the efficient convergence of the Adam algorithm allows the model to learn from a large volume of data more quickly, which in turn accelerates the assessment of the bearing's health status and saves training time. In addition, features in bearing vibration signals may vary in importance over different time intervals; Adam's adaptive learning rate automatically adjusts the learning rate based on the gradient of each parameter, helping the model adapt to dynamic variations within a signal. When the Adam optimizer is initialized, it is necessary to set the learning rate $\alpha$, the exponential decay rates of the moment estimates $\beta_1$ and $\beta_2$, and the constant $\epsilon$. In this experiment, we set $\alpha$ to 0.001 and $\beta_1$ and $\beta_2$ to their defaults of 0.9 and 0.999, and $\epsilon$ was set to a small constant to prevent numerical instability during the division operation. The detailed process of the Adam algorithm is shown in Table 1. More details on the Adam algorithm can be found in [40].
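A single Adam update can be sketched for one scalar parameter as follows (a simplified illustration using the hyperparameter values quoted above; the value 1e-8 for ε is an assumption for this sketch, not a value stated in the text):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter.
    m, v: running first/second moment estimates; t: 1-based step counter."""
    m = b1 * m + (1 - b1) * grad           # biased first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # biased second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=0.5, grad=2.0, m=0.0, v=0.0, t=1)
# On the first step, the bias-corrected update has magnitude approximately lr.
```

Note how the bias correction matters: without it, the first-step moments (initialized at zero) would shrink the update far below the intended learning rate.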
3.3. Diagnosis Procedure
(1) The raw vibration signals of rolling bearings are collected by a data acquisition device.
(2) The obtained vibration signals are sliced into samples of length 2048, standardized, and then used as the network input.
(3) Fault features are extracted by combining global feature extraction with detailed feature extraction; the GMP layer is used to integrate the features, and Softmax is then employed as the classifier.
(4) Testing samples are fed into the network to realize fault diagnosis and validate the performance of the proposed model, GDDCNN.
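The data preparation step above (slicing into length-2048 samples and standardizing each one) can be sketched as follows. Per-sample zero-mean/unit-variance standardization and non-overlapping slicing are assumptions about the exact scheme; they are a common choice, not details stated in the text:

```python
import math

def prepare_samples(signal, length=2048):
    """Slice a raw vibration signal into non-overlapping segments of
    `length` points and standardize each to zero mean / unit variance."""
    samples = []
    for start in range(0, len(signal) - length + 1, length):
        seg = signal[start:start + length]
        mu = sum(seg) / length
        std = math.sqrt(sum((x - mu) ** 2 for x in seg) / length) or 1.0
        samples.append([(x - mu) / std for x in seg])
    return samples

# A synthetic signal of 5000 points yields 2 full samples of length 2048.
samples = prepare_samples([math.sin(0.01 * i) for i in range(5000)])
```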