*Article* **Prediction of Remaining Service Life of Rolling Bearings Based on Convolutional and Bidirectional Long- and Short-Term Memory Neural Networks**

**Zhidan Zhong \* , Yao Zhao, Aoyu Yang, Haobo Zhang and Zhihui Zhang**

School of Mechatronics Engineering, Henan University of Science and Technology, Luoyang 471003, China; zhaoyao2020115@163.com (Y.Z.); scotthw919@163.com (A.Y.); zhb156909@163.com (H.Z.); zzhui2022@163.com (Z.Z.)

**\*** Correspondence: zzd@haust.edu.cn

**Abstract:** Predicting the remaining useful life (RUL) of a bearing can prevent sudden downtime of rotating machinery, thereby improving economic efficiency and protecting human safety. Two important steps in RUL prediction are the construction of a health indicator (HI) and the prediction of life. Traditional methods simply use the time-series characteristics of the vibration signal, for example, using root mean square (RMS) as HI, but this HI does not reflect the true degradation of the bearing. Meanwhile, existing prediction models often cannot consider both the time and space characteristics of the signal, thus limiting prediction accuracy. To address the above problems, in this study, wavelet packet transform (DWPT) and kernel principal component analysis (KPCA) were combined to extract HI from the original vibration signal. Then, a CNN-BiLSTM (convolutional and bidirectional long- and short-term memory) prediction network with root mean square as input and HI as output was constructed by combining convolutional neural network (CNN) and bi-directional long- and short-term memory neural network (BiLSTM). The network improved prediction accuracy by considering the temporal and spatial characteristics of the input signal. Experimental results on the PHM2012 dataset showed that the method proposed in this paper outperformed existing methods.

**Keywords:** wavelet packet transform; kernel principal component analysis; remaining service life of rolling bearings; convolutional neural network; bidirectional long- and short-term memory neural network

#### **1. Introduction**

Bearing is a key component in rotating machinery, known as the joint of machinery, and its failure may lead to downtime of industrial production or even cause casualties [1]. According to a survey, rolling bearing failure is one of the most important factors of rotating machinery failure, accounting for 45–55% of cases [2]. A reasonable and effective bearing remaining useful life prediction (RUL) method can help technicians develop maintenance plans for predictive maintenance [3]. Therefore, it is important to predict the remaining service life of bearings to avoid accidents and reduce economic losses [4].

Generally speaking, methods for RUL prediction of bearings fall into two main categories: model-based (physical/mathematical) methods [5,6] and data-driven methods [7]. Wang et al. [8] proposed a mechanical state prediction method based on a probabilistic model with particle filters, which was successfully used for the state prediction of wind power bearings. El-Tawil et al. [9] developed a method based on a nonlinear damage law to determine the RUL of the system. Ma et al. [5] analyzed the interaction between various parts of the bearing by modeling the angle of relative sliding velocity between the rolling element and the bearing raceway and the bearing dynamics. However, model-based methods require complex physical or mathematical models, which require researchers

**Citation:** Zhong, Z.; Zhao, Y.; Yang, A.; Zhang, H.; Zhang, Z. Prediction of Remaining Service Life of Rolling Bearings Based on Convolutional and Bidirectional Long- and Short-Term Memory Neural Networks. *Lubricants* **2022**, *10*, 170. https://doi.org/10.3390/ lubricants10080170

Received: 31 May 2022 Accepted: 19 July 2022 Published: 26 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

with extensive knowledge base and are often difficult to develop due to complex working conditions [10].

With the development of sensor technology and computer technology, data-driven methods based on data have been developed [11]. As a data-driven method, deep learning can learn the bearing degradation trend spontaneously from sensor data and establish the mapping relationship between data and bearing health status with remarkable application [12]. Deep-learning-based RUL prediction methods are mainly divided into steps such as data acquisition, health factor (HI) construction, and remaining service life prediction [13]. Liu et al. [14] proposed a rolling bearing RUL prediction method based on regularized LSTM networks and verified the advantages of the method with the dataset of PRONOSTIA platform [15]. Ning et al. [16] first performed feature screening of signals and then predicted the remaining service life of bearings using RNN models. Network models such as RNN and LSTM tend to ignore spatial features, although they can learn the degradation trend and temporal characteristics of bearings from the data [17]. Wang et al. [18] used 1d-CNN to process fused signals and learn fault features using the powerful feature extraction capability of the network. However, a single CNN network tends to ignore the temporal features of the data and is unable to learn signal features at multiple scales [19]. HI can reflect the degradation trend of bearings, and it is critical to obtain excellent HI labels for training prediction models [20]. For example, Zhang et al. [21] used the timedomain feature RMS of the vibration signal as the main performance degradation indicator. Singleton et al. [22] used the variance of the vibration signal as HI. Zhang et al. [23] used the kurtosis of the vibration signal after band-pass filtering as HI. In the literature [24], the ratio of current life to total life of the bearing is used as HI of the bearing. However, HI constructed by the above methods cannot fully describe the bearing degradation trend. Because the bearing signal is nonlinear, we can pay attention to the transient changes of the signal by analyzing the signal with different resolutions in time–frequency domain. Time–frequency analysis technology DWPT is often used in the analysis of bearing vibration signals [25].

In response to the above problems, a new method for HI and RUL prediction of rolling bearings is proposed in this paper. Firstly, discrete wavelet packet transform (DWPT) was performed on the time-domain vibration signal to extract RMS features from the obtained sub-bands, and the HI was then obtained by fusing the RMS of each sub-band through kernel principal component analysis (KPCA). Based on this, the convolutional bidirectional long- and short-term memory neural network (CNN-BiLSTM) was proposed for lifetime prediction. Finally, the feasibility of the method was verified by bearing experimental data. The main contributions are as follows.


The remainder of this paper is organized as follows. Section 2 provides the theoretical background. Section 3 introduces the method proposed in this paper. Section 4 describes the experimental procedure and the analysis of the results in detail. Section 5 concludes the paper.

#### **2. Theoretical Background**

#### *2.1. Discrete Wavelet Packet Transform (DWPT)*

The discrete wavelet transform can describe the local characteristics of vibration signals in the time and frequency domains and is a very effective signal analysis method, which is often used for signal preprocessing for bearing fault diagnosis and life prediction [26]. − +=+

In this study, the bearing vibration signal was preprocessed based on DWPT in order to construct HI. The algorithms for wavelet packet decomposition and reconstruction are shown in Equations (1) and (2), respectively. − −

$$\begin{cases} d\_l^{j,2n} = \sum\_k p\_{k-2l} d\_k^{j-1,n} \\ d\_l^{j,2n+1} = \sum\_k q\_{k-2l} d\_k^{j-1,n} \end{cases} \tag{1}$$

+

=

=

−

 −

−

−

$$d\_l^{j-1,n} = \sum\_k p\_{l-2k} d\_k^{j,2n} + \sum\_k q\_{l-2k} d\_k^{j,2n+1} \tag{2}$$

where *p* and *q* are filter coefficients; *d* is the wavelet packet decomposition coefficient; *k* and *l* are the number of decomposition layers; and *j* and *n* are wavelet packet node numbers.

Figure 1 is a schematic diagram of the three-layer wavelet packet decomposition structure, where S0 is the original signal, S10 is the low-frequency part of the original signal, S11 is the high-frequency part of the original signal, and so on. As can be seen from the figure, the wavelet packet transform can decompose both the low- and high-frequency parts of the signal uniformly and has higher time–frequency resolution than the wavelet transform, making it more effective in analyzing nonsmooth signals (e.g., bearing vibration signals) [20]. In this study, the db4 mother wavelet was used to decompose the original vibration signal into three levels of wavelet packets to obtain eight sub-bands. –

**Figure 1.** Structural scheme of DWPT.

#### *2.2. Kernel Principal Component Analysis (KPCA)*

Kernel principal component analysis (KPCA) [27] is a nonlinear feature extraction method that is often used for feature extraction and fusion of bearing signals [28]. The kernel function was first introduced to map the original data space to a high-dimensional feature space, and PCA was then performed to reduce the dimensionality of the analysis. The quality of the nonlinear features thus extracted was much better.

Let the data set with *M* samples be {*x*1, *x*2, · · · , *xi*}(*i* = 1, 2, . . . , *M*), *x<sup>i</sup>* ∈ *R <sup>N</sup>* and the sample dimension be *N*. Normalize the high-dimensional spatial data so that it satisfies the following:

$$\frac{1}{M} \sum\_{i=1}^{M} \varphi(\mathbf{x}\_i) = \mathbf{0} \tag{3}$$

=

=

 :

 = where *ϕ* is a nonlinear mapping function that enables the mapping of the low-dimensional spatial feature *x<sup>i</sup>* to the higher dimensional space *F* : *ϕ*(*xi*).

The covariance matrix of *F* space is expressed as follows:

$$\mathcal{C} = \frac{1}{M} \sum\_{i=1}^{M} \varphi(\mathbf{x}\_i) \varphi(\mathbf{x}\_i)^T \tag{4}$$

The eigenvalues of the covariance matrix are *λ*, and the eigenvectors are *v*, both of which satisfy the following:

$$
\mathcal{C}v = \lambda v \tag{5}
$$

After transforming each sample into *ϕ*(*x<sup>k</sup>* ), make inner product with Equation (5):

$$
\varphi(\mathbf{x}\_k)\mathbf{C}v = \lambda\varphi(\mathbf{x}\_k)v\tag{6}
$$

The linear representation of the feature vector is as follows:

$$w = \sum\_{i=1}^{M} a\_i \varphi(\mathbf{x}\_i) \tag{7}$$

Simultaneous Formulas (4)–(7) can be obtained:

$$\frac{1}{M} \sum\_{i=1}^{M} \alpha\_i \sum\_{j=1}^{M} \left[ \boldsymbol{\varrho}(\mathbf{x}\_k) \, \boldsymbol{\varrho}(\mathbf{x}\_j) \right] \left[ \boldsymbol{\varrho}(\mathbf{x}\_j) \, \boldsymbol{\varrho}(\mathbf{x}\_i) \right] = \lambda \sum\_{i=1}^{M} \alpha\_i \left[ \boldsymbol{\varrho}(\mathbf{x}\_k) \, \boldsymbol{\varrho}(\mathbf{x}\_i) \right] \tag{8}$$

Nonlinear mapping from input space to high-dimensional feature space can be realized by kernel function inner product operation. The kernel function selected in this study is a Gaussian radial basis function, whose expression is as follows:

$$k(\mathbf{x}\_i, \mathbf{x}\_j) = \exp(-\gamma \|\mathbf{x}\_i - \mathbf{x}\_j\|^2) \tag{9}$$

where parameter *γ* is used to control the range of action of the kernel function.

Define the *M* × *M* dimensional matrix *K*, where the elements can be represented using the following kernel function:

$$K = \begin{bmatrix} k(\mathbf{x}\_1, \mathbf{x}\_1) & \cdots & k(\mathbf{x}\_1, \mathbf{x}\_m) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}\_m, \mathbf{x}\_1) & \cdots & k(\mathbf{x}\_m, \mathbf{x}\_m) \end{bmatrix} \tag{10}$$

The kernel matrix is used to represent Equation (8), which can be simplified as follows: *Kα* = *Mλα*. The eigenvalues and eigenvectors of the kernel matrix can be derived from the simplified Equation (8), which in turn leads to the normalized eigenvector *v k* (*k* = 1, 2, . . . , *M*) of the covariance matrix. Then, the *k*-th linear principal element of the sample *x* can be obtained as follows:

$$h\_k = v^k \varphi(\mathbf{x}) = \sum\_{i=1}^{M} a\_i^k K(\mathbf{x}\_{i\prime} \mathbf{x}) \tag{11}$$

The cumulative contribution of features is calculated and the principal element is selected as follows:

$$\sum\_{k=1}^{p} \lambda\_k / \sum\_{i=1}^{m} \lambda\_i \ge 0.90 \tag{12}$$

where *λ*<sup>1</sup> ≥ *λ*<sup>2</sup> ≥ *λ*<sup>3</sup> . . . ≥ *λ<sup>m</sup>* is the eigenvalue of the kernel matrix.

#### *2.3. Bidirectional Long Short-Term Memory Neural Network (BiLSTM)*

Long short-term memory neural network (LSTM) [29] is an improvement on the recurrent neural network (RNN). It solves the RNN gradient disappearance and gradient −1

ℎ ⃗⃗⃗

explosion problems by introducing forgetting gates and can efficiently learn the nonlinear features of time series. 

As a deep learning neural network model, each neuron of LSTM is a memory cell with three gates, which are forgetting gate *ft* , input gate *it* , and output gate *ot* . The whole update process is shown in Equations (13)–(18). – 

The forgetting gate *ft* determines what information is discarded and is determined by both the current input and the output of the previous sequence. = ( ∙ (ℎ−1 , ) + )

$$f\_t = \sigma\left(\mathcal{W}\_f \cdot (h\_{t-1}, \mathbf{x}\_t) + b\_f\right) \tag{13}$$

where *σ* is the sigmoid activation function; *W<sup>f</sup>* is the weight vector; *b<sup>f</sup>* is the base vector; and *Ct*−<sup>1</sup> denotes the cell state, which is used to store the memory information of the previous moment. 

Update gate *i<sup>t</sup>* determines what information is stored and updates the cell state as follows: = ( ∙ (ℎ−1 , ) +

$$\mathbf{i}\_t = \sigma(\mathcal{W}\_{\mathbf{i}} \cdot (h\_{t-1}, \mathbf{x}\_t) + b\_{\mathbf{c}} \tag{14}$$

$$\bar{\mathbf{C}}\_{t} = \tanh(\mathcal{W}\_{\mathbf{c}} \cdot (\mathbf{h}\_{t-1}, \mathbf{x}\_{t}) + \mathbf{b}\_{\mathbf{c}} \tag{15}$$

$$\mathbf{C}\_{t} = f\_{l} \cdot \mathbf{C}\_{t-1} + i\_{l} \cdot \bar{\mathbf{C}}\_{t} \tag{16}$$

where *tanh* is the activation function, *C* e *t* is the candidate vector for the current new state information; *ft* ·*Ct*−<sup>1</sup> denotes the information to be forgotten; *i<sup>t</sup>* ·*C* e *t* is the information to be retained; and *Ct* is the current cell state. ∙ −1 ∙ ̃ = ( ∙ (ℎ−1 , ) + )

$$\rho\_t = \sigma(\mathcal{W}\_o \cdot (h\_{t-1}, \mathfrak{x}\_t) + b\_o) \tag{17}$$

$$h\_l = o\_l \cdot \tanh(\mathbf{C}\_l) \tag{18}$$

where *ot* represents the output of information from the output gate, and *ht* is the output of the memory cell, which will also be input in the next LSTM cell.

The BiLSTM [30,31] consists of two LSTMs that pass information from the forward and reverse directions, respectively, compared to the LSTM and can associate both past and future states. The Bi-LSTM structure is shown in Figure 2, and its output is as follows [32].

$$h\_t = \stackrel{\rightarrow}{[h\_t}\stackrel{\leftarrow}{h\_t}]\tag{19}$$

where → *ht* is the result of forward propagation, and ← *ht* is the result of backward propagation.

**Figure 2.** BiLSTM network architecture.

#### *2.4. Convolutional Neural Network (CNN)*

Convolutional neural networks [33] have the characteristics of local connectivity and weight sharing. One-dimensional convolutional neural networks can perform feature extraction on time-domain signals and are commonly used in the field of bearing fault diagnosis [34].

Convolutional neural networks usually consist of three types of network layers: convolutional layer, pooling layer, and fully connected layer. The convolutional layer can implement convolutional operations for feature extraction, the pooling layer can reduce the feature dimensionality and prevent overfitting, and the fully connected layer can perform nonlinear combination of the extracted features.

The formula for one-dimensional convolution is as follows:

$$Z^{l+1} = \left[Z^l \ast w^{l+1}\right] + b = \sum\_{\mathbf{x}=1}^{f} \left[Z\_k^l(\mathbf{s}\_0 + \mathbf{x}) w\_k^{l+1}(\mathbf{x})\right] + b \tag{20}$$

The maximum pooling equation is as follows:

$$A\_i^{l+1}(j) = \max\_{(j-1)\mathcal{W}+1 \le j\mathcal{W}} \left\{ F\_i^l(t) \right\} \tag{21}$$

where *Z l* is the convolutional input of layer *l* + 1, and *Z l*+1 is the output of layer *l* + 1; *b* is the amount of variance; *w l*+1 *k* is the weight of layer *l* + 1; *f* is the convolutional kernel size; s<sup>0</sup> is the convolutional step size; *F l i* (*t*) is the value of the *t*-th neuron in the *i*-th feature of layer *l*; *W* is the pooling region; and *A l*+1 *i* is the output of the neuron of layer *l* + 1.

#### **3. The Proposed Framework**

The overall block diagram of the proposed method is shown in Figure 3. Firstly, the original vibration signal was subjected to discrete wavelet packet transform (DWPT) to obtain eight sub-bands and extract the RMS values of different sub-bands. Then, KPCA was used to downscale the multidimensional RMS to obtain HI. Finally, the remaining lifetime prediction was performed by CNN-BiLSTM network.


**Figure 3.** A flowchart of the proposed method.

#### **4. Experiments and Results**

#### *4.1. Data Description*

To verify the effectiveness of the proposed method, the PHM 2012 Challenge dataset was used in this study. The data was collected and obtained from the PRONOSTIA testbed, as shown in Figure 4.

**Figure 4.** Pronostia bearing testbed.

The acquisition device was used for 17 full-life cycle experiments of the bearing under three operating conditions, and a total of 17 sets of data were collected in the horizontal and vertical directions of the bearing using an accelerometer. The sampling frequency of the accelerometer sensor was 25.6 kHZ, and the experimental setup recorded the vibration signals at 10 s intervals with a sampling time of 0.1 s [36]. Under working condition I, bearings 1-1 to 1-7 were tested with motor speed of 1800 rpm and load of 4000 N. Under working condition II, bearings 2-1 to 2-7 were tested with motor speed of 1650 rpm and load of 4200 N. Under working condition III, bearings 3-1 to 3-3 were tested with motor speed of 1500 rpm and load of 5000 N. Details of the data are shown in Table 1.


**Table 1.** Operating condition of the PHM2012 dataset.

#### *4.2. Construction of Health Indicators*

In this section, we describe the process of HI construction in detail. Figure 5 shows the original vibration signal of bearing 1-1. As can be seen, the vibration signal amplitude of bearing 1-1 initially fluctuated roughly smoothly. The vibration signal showed a gradual upward trend with increase in time and increased sharply at the later stage.

**Figure 5.** Original signal of bearing 1-1.

In the original time-domain signal shown in Figure 5, every 2560 consecutive points constitute a sample. These samples were processed by fast Fourier transform to obtain the corresponding frequency-domain samples. In Figure 6, 10 frequency-domain samples are shown. In the figure, the time corresponding to these samples increases with the sample number. For example, sample 1 corresponds to the beginning of the bearing life cycle, and sample 10 corresponds to the end of the bearing life cycle.

**Figure 6.** Bearing 1-1 partial sample frequency-domain signal.

As can be seen from Figure 6, the bearings had different degrees of amplitude increase in different frequency sections. The low-frequency vibration was due to rotation frequency, rolling body, and internal and external fault frequency of rolling. The high-frequency vibration was caused by the inherent frequency of each component of the bearing. When the bearing failed, the shape and quality of the component changed, affecting the highfrequency vibration.

To construct the HI, the original vibration signal was first decomposed into wavelet packets using db4 wavelets for three-level decomposition [20]. Reconstruction was performed according to the reconstruction coefficients to obtain eight sub-bands. The reconstructed sub-bands are shown in Figure 7. It can be seen that the eight sub-bands exhibited different trends, and each contained degradation characteristics of different frequency bands.

**Figure 7.** The eight sub-bands of vibration acceleration signal of bearing 1-1.

The RMS was extracted for each of the eight sub-bands, and the RMS values for each sub-band were obtained, as shown in Figure 8.

**Figure 8.** RMS trends extracted from eight sub-bands of the vibration acceleration signals for bearing 1-1.

As can be seen from Figure 8, the RMS extracted from the eight sub-bands had different trends. During the whole life cycle of the bearing, the RMS of some sub-bands showed an increasing trend, while the RMS of other sub-bands showed significant fluctuations. The RMS of all sub-bands showed a steep upward trend in the last part of the life cycle, while the RMS of some sub-bands showed sensitivity at the beginning of the wear. This indicates that different sub-bands carry different degradation information.

In order to fuse the most important degradation information exhibited by all subbands, the RMS of the eight sub-bands were feature fused using the KPCA algorithm. First, the eight sub-band RMS sequences shown in Figure 8 were selected to construct an eight-dimensional high-dimensional feature set. Then, the KPCA algorithm was used to reduce the dimensionality of the feature set. Finally, the first principal element (contribution rate >90%) was selected as the final HI. Table 2 shows the contribution rates of the principal elements. The final construction results are shown in Figure 9.

**Table 2.** Contribution rate of partial principal components.


Figure 9 shows a comparison of the values of RMS extracted from the sub-bands and RMS extracted using the method proposed in this study. The sub-band RMS fluctuated a lot, and the curves were messy. The method proposed in this paper could fuse the important characteristics of each sub-band, such as the sudden increase in RMS of subband (3,5) near sample 2500, which reflected the sensitivity at the early stage of wear; the gentle fluctuation of RMS of sub-band (3,7) at sample 2600 and the decrease in HI, which reflected the sensitivity at the recovery period of wear; and the sharp increase in RMS of sub-band (3,1) at the last part of the life cycle, which reflected the sensitivity of the late wear period. The proposed HI thus contained more comprehensive information on bearing degradation and the curve was smoother, reflecting its superiority as an HI. To further illustrate the superiority of the proposed HI, the HI of bearings 2-1, 2-6, 3-2, and 3-3 under different operating conditions were extracted, and the results are shown in Figure 10.

**Figure 9.** Prominent feature accumulation process of RMS values of different sub-bands of bearing 1-1.

**Figure 10.** HI of different bearings.

 −  

**icity o**

=

 0

 0 −

The health factor, which reflects the degradation trend of the bearing, needs to have a strong correlation with the degradation trend of the bearing. The physical degradation process of the bearing is irreversible, so the health factor should also have a similar monotonic change trend [37]. Monotonicity indexes are widely used in the evaluation of health factor performance. Yang et al. [38] obtained the optimal HI by optimizing the monotonicity index of HI. Lin et al. [39] used ensemble stacked autoencoders to construct health factors for bearings and evaluated their performance using monotonicity metrics.

We compared the performance of the proposed HI with the original RMS using the metric of monotonicity according to the following equation:

$$Mon(X) = \frac{1}{K - 1} |No.ofd/d\_X > 0 - No.ofd/d\_X < 0| \tag{22}$$

where *X* denotes the feature sequence; *K* denotes the total number of features; and *No*.*o f d*/*d<sup>x</sup>* > 0 and *No*.*o f d*/*d<sup>x</sup>* < 0 denote the number of positive and negative variances, respectively. The higher the *Mon* score, the better the monotonicity and the better the index performance. The results are shown in Table 3.

**Table 3.** HI performance analysis using monotonicity.


From the table, we can see that the proposed HI had better monotonicity, which proves the superiority of the proposed method.

#### *4.3. RUL Prediction*

In this section, we outline the development of a CNN-BiLSTM prediction model with RMS input and HI labels and discuss the effect of model parameters on prediction accuracy. Bearing 1-1 was used as an example to develop a detailed description.

#### 4.3.1. Input Selection

It is convenient to process raw vibration signals of bearings in the time domain, and features such as Rms, Peak2Peak, Kurtosis, Impulse Factor, Var, and Clearance Factor are commonly used in the analysis of the remaining service life of bearings and as inputs to prediction networks [4,40].

We can extract many features, such as time domain and frequency domain, from vibration signals. These features have different representation abilities for vibration signals. Some features are not helpful or even cause interference for characterizing signals. Therefore, it is very important to select appropriate features [41]. The degradation process is an

accumulation of random fatigue failure processes, so it should have a certain overall increasing or decreasing trend on the time axis, i.e., the characteristic quantity should have certain monotonicity [37]. Zhang et al. [42] extracted signal time-domain, frequency-domain, and time–frequency-domain correlation features and defined metrics such as monotonicity for feature selection based on the trend and residuals of the features. Tian et al. extracted 10 features of bearing vibration signals and used the monotonicity index to screen good features as input to the neural network [43].

Different time-domain features have different characterization capabilities for the original signal, and monotonicity continues to be used to assess the characterization capabilities of time-domain features.

The monotonicity score was calculated for the time-domain features, and the results are shown in Figure 11.

**Figure 11.** Monotonicity index of six vibration signal characteristics.

As can be seen from Figure 11, the Rms monotonicity of the original vibration signal was the best, and Rms was chosen here as input to the prediction network.

#### 4.3.2. Training and Test of CNN-BiLSTM Model

 = = Before training and testing a prediction model, it is necessary to construct the dataset and determine the correspondence between the input Rms and output HI labels. Suppose the Rms sequence is [*X*1, *X*2, *X*3, *X*4] and the HI sequence is [*Y*1,*Y*2,*Y*3,*Y*4], then the prediction relationship of the network is *F*([*X*1, *X*2]) = *Y*3, *F*([*X*2, *X*3]) = *Y*4. A sliding window was used to take the value of the Rms sequence with a window width of 64 and a sliding step of 1. The specific correspondence is shown in Figure 12. The final sample format (number of samples, time step, and dimension) obtained was (2739, 64, and 1).

For the CNN-BiLSTM model, Adam optimizer with an initial learning rate of 0.001 was used, and in order to maximize the optimization of the network parameters, a decreasing learning strategy was used to reduce the learning rate by 10 <sup>−</sup><sup>6</sup> per round. The training results of the model are shown in Figure 13, and the prediction results of the training set were almost identical to the real HI labels.

−

 

 =  =  

−

**Figure 12.** Sliding window sampling method.

Signal processing was performed as described in the previous section for HI construction, and finally the trained CNN-BiLSTM model was used for prediction of bearing 1-5. The prediction results are shown in Figure 14. As can be seen, the prediction results are basically consistent with the real HI labels, which verifies the effectiveness of the method.

**Figure 14.** Testing of the model.

#### 4.3.3. Selection of Hyper Parameters

In prediction models, the width of the sliding window and the size of the batch size are key hyperparameters that affect the performance of the model. This section discusses the effects of both parameters on the model performance. ˆ = = − 

= = −

ˆ

The 2 n facilitated the computer processor for optimization, and window widths of 8, 16, 32, 64, and 128 were used to shape the samples. The prediction model was then trained and tested. The models were evaluated using mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE). The errors on the training and test sets are shown in Figure 15. ˆ = = − ˆ

**Figure 15.** The prediction error for window width.

The above metrics can be described as follows.

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum\_{i=1}^{N} \left( y\_i - \hat{y}\_i \right)^2} \tag{23}$$

$$\text{MSE} = \frac{1}{N} \sum\_{i=1}^{N} \left( y\_i - \mathcal{y}\_i \right)^2 \tag{24}$$

$$\text{MAE} = \frac{1}{N} \sum\_{i=1}^{N} |y\_i - \mathcal{Y}\_i| \tag{25}$$

where *y<sup>i</sup>* is the true label; *y* ˆ *i* is the predicted label; *y<sup>i</sup>* is the mean of the actual labels; and *N* denotes the total number of samples.

As can be seen from Figure 15, on the training set, with the increase in the window width, each error showed a decreasing–increasing trend and the smallest error was at the window width of 64. On the test set, with the increase in the window width, the error showed irregular fluctuations and the smallest error was at the window width of 64. Therefore, the window width of 64 was chosen.

The batch size represents the number of data samples crawled in one training session. The batch size affects the training speed and model optimization. Batch sizes of 8, 16, 32, 64, 128, and 256 were selected, and the relationship between the three types of errors and batch size was observed. The results are shown in Figure 16.

From Figure 16, it can be seen that the three errors on the training set basically showed a decreasing trend with the increase of Batch size. However, in the test set, the three errors first decreased and then increased as the batch size increased. The errors increased significantly when the batch size exceeded 64, so 64 was chosen as the batch size.

#### 4.3.4. Results of Different Models

This section discusses the prediction effects of CNN, LSTM, and BiLSTM models on the PHM2012 dataset and compares them with the method proposed in this paper.

Prediction experiments were conducted on the dataset using each of the three models mentioned above, and the prediction results for bearings 1-1, 1-3, and 1-5 were visualized. The results are shown in Figure 17.

– –

– –

**Figure 17.** *Cont*.

– – – – **Figure 17.** Prediction effects of different models. (**a**–**c**) is the prediction result of our method. (**d**–**f**) is the prediction result of CNN model. (**g**–**i**) is the prediction result of LSTM model. (**j**–**l**) is the predicted result of BiLSTM.

From Figure 17, it can be seen that the proposed method could accurately predict the degradation trend of the bearing, and the rest of the models had different degrees of problems. For example, both the LSTM and CNN models could not well predict the rapid degradation stage of bearing 1-1. The prediction effect of BiLSTM was better, but the fluctuation trend at the beginning of degradation could not be well predicted. Both the LSTM and CNN models could not well predict the rapid degradation stage of bearing 1-1. The prediction effect of BiLSTM was better, but the fluctuation trend at the beginning of degradation could not be well predicted. The BiLSTM model could not predict the rapid degradation trend of the bearing at the end of cycle 1-5. The CNN could roughly predict the degradation trend of the bearing from 1-3, but the accuracy was insufficient.

As shown in Table 4, the proposed model achieved the smallest prediction error on almost all bearings using MSE as an indicator, thereby showing the superiority of the proposed method.


**Table 4.** MSE for different prediction models.

#### **5. Conclusions**

The remaining service life prediction of bearings is a research focus. In this study, we established a suitable health indicator (HI) and proposed a prediction network combining CNN and BiLSTM. First, wavelet packet transform was performed on the original vibration

signal of the bearing to obtain eight sub-bands, and the RMS of each of the eight sub-bands were extracted. The KPCA algorithm was then used to fuse the features of the extracted RMS of the eight sub-bands to obtain the HI of the bearing life cycle. The CNN-BiLSTM prediction network was then developed, which can extract the spatiotemporal features of the signal at the same time to improve prediction accuracy. The prediction network uses the RMS of the original signal at the current moment as the network input and the HI of the future moment as the network output. Experiments were conducted on the PHM2012 dataset to verify the effectiveness of the proposed prediction model.

**Author Contributions:** Data curation, A.Y., H.Z. and Z.Z. (Zhihui Zhang); funding acquisition, Z.Z. (Zhidan Zhong); investigation, Y.Z.; project administration, Z.Z. (Zhidan Zhong); software, Y.Z.; supervision, Z.Z. (Zhidan Zhong); writing—original draft, Y.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the National Natural Science Foundation of China (No.: 52105182) and the Major Science and Technology Project of Henan Province (project name: Research and industrialization of key technologies of high-end bearings for major equipment).

**Data Availability Statement:** Public datasets used in our paper: https://github.com/wkzs111/phmieee-2012-data-challenge-dataset (accessed on 7 January 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

