*Article* **Hard Negative Samples Contrastive Learning for Remaining Useful-Life Prediction of Bearings**

**Juan Xu <sup>1</sup>, Lei Qian <sup>1</sup>, Weiwei Chen <sup>2</sup> and Xu Ding <sup>3,\*</sup>**

	- **\*** Correspondence: dingxu@hfut.edu.cn

**Abstract:** In recent years, deep learning has become prevalent in the Remaining Useful-Life (RUL) prediction of bearings. Current deep-learning-based RUL methods tend to extract high-dimensional features from the original vibration data to construct Health Indicators (HIs), and then use the HIs to predict the remaining life of the bearings. These approaches ignore the sequential relationship of the original vibration data, which seriously affects the prediction accuracy. To tackle this problem, we propose a hard negative sample contrastive learning prediction model (HNCPM) with an encoder module, GRU regression module and decoder module, used for feature embedding, RUL regression prediction and vibration data reconstruction, respectively. We introduce self-supervised contrastive learning by constructing positive and negative samples of vibration data rather than constructing any health indicators. Furthermore, to prevent the subtle variability of the vibration data in the health stage from hampering the learning of degradation features, we select, via cosine similarity, the negative samples most similar to the positive sample as hard negative samples. Meanwhile, a novel InfoNCE- and MSE-based loss function is derived and applied to the HNCPM to simultaneously optimize a lower bound on the mutual information of the positive and negative samples over the life cycle, as well as the discrepancy between the true and predicted values of the vibration data, such that the model can learn fine-grained degradation representations by predicting the future without any HIs as labels. The HNCPM is validated on the IEEE PHM Challenge 2012 dataset. The results demonstrate that the prediction performance of our model is superior to that of the state-of-the-art methods.

**Citation:** Xu, J.; Qian, L.; Chen, W.; Ding, X. Hard Negative Samples Contrastive Learning for Remaining Useful-Life Prediction of Bearings. *Lubricants* **2022**, *10*, 102. https:// doi.org/10.3390/lubricants10050102

Received: 15 April 2022 Accepted: 18 May 2022 Published: 21 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

**Keywords:** positive and negative samples; contrastive learning; gated recurrent unit; remaining useful-life prediction

#### **1. Introduction**

Bearings undergo an irreversible degradation process during use that eventually leads to bearing breakdown. Via predicting the Remaining Useful Life (RUL) of bearings, one can avoid missing the best time for maintenance [1] and remanufacturing [2]. However, in realistic industrial scenarios, due to noise, variation in life cycle and other prediction uncertainties [3], RUL prediction is a challenging issue [4].

Generally, RUL prediction of bearings is implemented in two different strategies [5]: physically based approaches [6–8] and data-driven approaches [9–12]. Compared with the former, data-driven RUL methods use historical data directly to build prediction models, which avoids the difficulties of physical modeling and has become the prevalent approach in recent years [13–15]. For instance, Xia et al. used a denoising autoencoder to classify the original signal into different degradation stages and extracted representative features directly from the original signal using a DNN; they then obtained RUL values for each stage using deep regression models [16]. Guo et al. proposed a recurrent neural-network-based health indicator (RNN-HI) with fairly high monotonicity and correlation values, which is beneficial to bearing RUL prediction [17].

Among the data-driven methods, deep-learning-based methods have obtained considerable results in bearing RUL prediction; they do not require manual feature design and build an end-to-end deep neural network to map the relationship between the degradation process and the original sensory data [18].

The existing deep-learning-based approaches usually construct health indicators (HI) that can represent the degradation trend of bearing performance, then predict the degradation trend using deep-learning regression models [18]. The effectiveness of these approaches largely relies on these HIs accurately modeling the degradation trend of bearings. However, in real-world industrial applications, the degradation processes of different bearings are diverse. The existing HIs usually describe the overall trend of the whole life cycle, but fail to represent the changes of local details. Using these inappropriate HIs as model input will directly restrict the prediction performance of the model.

Note that the original time-domain vibration data of bearings contain the sequential relationship with respect to the degradation trend of bearings; thus, predicting the original vibration data, instead of any designed HI, can more accurately evaluate the model performance. However, the time-domain vibration data change only slightly in the health stage, and the existing well-developed methods are less capable of capturing the latent features of the bearing data at this stage.

To circumvent the aforementioned problems, we propose an end-to-end hard negative sample contrastive learning prediction model, termed HNCPM. We construct a three-layer, one-dimensional convolutional encoder module to map the high-dimensional original vibration data to a low-dimensional feature space to facilitate the model's computation. Next, we adopt a Gated Recurrent Unit with a decoder module in this feature space to learn the sequence relationship of the vibration data for RUL prediction. Importantly, the HNCPM introduces self-supervised contrastive learning to construct positive and negative samples of vibration data, rather than supervised learning via any HI labels. Since there is no significant variability between the positive and negative samples in the health stage, we select the negative sample most similar to the positive sample as the hard negative sample via cosine similarity, to improve the fine-grained feature identification of the model. Finally, we design a novel loss function for the model combining the Mean Square Error (MSE) with InfoNCE [19] for self-supervised training from the original vibration data.

The main contributions of this work are summarized as follows:


#### **2. Related Works**

#### *2.1. Deep-Learning-Based Approaches for RUL Prediction*

The existing deep-learning-based approaches usually include two steps: constructing the bearing HIs for model training that can represent the degradation trend of bearing performance, and designing the deep learning regression models to predict the degradation trend (usually the HIs).

In terms of model design, it generally includes CNN [20], RNN [21] and AE [22]. Xu et al. used the SAE to extract features of bearing data to construct the HI [23]. Wang et al. [24] applied deep separable convolutional networks to learn high-level representations from the original signal and then predict the RUL. Wu et al. [25] proposed Deep Long Short-Term Memory (DLLSTM) networks, and Han et al. proposed a Transferable Convolutional Neural Network (TCNN) to accurately predict the bearing RUL under different failure behaviors [26].

The common model structures can no longer fulfill researchers' demand for superior prediction, and many recent studies have combined the advantages of different models to propose model variants. For instance, Luo et al. proposed a novel convolution-based attention mechanism bidirectional long short-term memory (CABLSTM) network to achieve the end-to-end lifetime prediction of rotating machinery [27]. Meng et al. proposed the CLSTM, which conducts convolutional operations on both the input-to-state and state-to-state transitions of the LSTM to learn high-level features in the time-frequency domain for RUL prediction [28].

With respect to the HI construction, i.e., the RUL labels, since the damage extent of bearings cannot be directly observed, the RUL labels are almost unavailable in real-world scenarios. Hence, there is no uniform criterion of HI construction for RUL prediction models. The existing studies usually extracted the different fault characteristics from the original sensory signal. She et al. [29] proposed a health indicator construction method based on a Sparse Auto-encoder with a Regularization (SAEwR) Model for rolling bearings. Zhang et al. used the summation of the mean maximum radius of the different datasets divided by the k-means clustering algorithm as the health indicator, and then used the local outlier coefficient algorithm to eliminate the outliers' influence [30]. Li et al. designed the generative adversarial network to learn the data distribution in the health states of machines, using the output of the discriminator as HI [31].

In summary, the performance of the existing HI-based deep-learning approaches heavily relies on whether the HIs are constructed properly. Nevertheless, the HIs usually capture the overall degradation trend of the whole life cycle of bearings but fail to represent the changes of local details, which restricts the prediction performance of the models.

Therefore, in this paper we introduce self-supervised contrastive learning, directly using the original vibration data, rather than any HIs, to learn the sequential relationship with respect to the degradation trend of bearings.

#### *2.2. Contrastive Learning*

Self-supervised learning aims to avoid manually annotating large-scale datasets by setting up pretext tasks to learn data representations [32–34]. The learning process is unsupervised, and the trained model can be used for multiple downstream tasks. Among self-supervised learning methods, contrastive learning has attracted the most attention. It uses discriminative modeling to learn useful representations of unlabeled data. Several contrastive learning models, e.g., MoCo [35], SimCLR [36] and BERT [37], provide competitive results in comparison with supervised learning models on pretext tasks such as image classification or next-sentence prediction.

Specifically, contrastive learning is a discriminative method that aims to pull similar samples together and push dissimilar samples apart [38]. The models include similarity and dissimilarity distributions for sampling positive and negative samples of the query, one or more encoders for each data pattern, and contrastive loss functions for evaluating a batch of positive and negative pairs.

Given a set *X* = {*x<sub>i</sub>* | *i* = 1, 2, . . . , *n*} of *n* samples, for any input sample *x*, let *q* = *f*(*x*) be its encoded representation, and let *k*<sup>+</sup> and *k*<sup>−</sup> denote a positive and a negative sample of *x*, respectively. A well-trained encoder should satisfy

$$s(q, k^+) \gg s(q, k^-) \tag{1}$$

where *s* measures the similarity between the embedded vectors (or, equivalently, a distance between them).

To the best of our knowledge, there are limited studies that introduce contrastive learning into RUL prediction. Ragab et al. [39] proposed a contrastive adversarial domain adaptation (CADA) method for cross-domain RUL prediction, which transfers knowledge for RUL prediction from one condition (distribution/domain) to another; this differs from the problem addressed in this paper. Our paper aims to discuss an unsupervised RUL prediction approach that learns degradation representations from the original vibration data using contrastive learning.

Furthermore, in existing contrastive learning approaches, the positive samples are generated by data augmentation of the original data (e.g., rotating, segmenting), while samples not generated from the same view are negative samples of each other. Such approaches to constructing samples ignore the role of negative samples in model training. In fact, negative samples can teach learning models to correct their errors faster in representation learning. More importantly, the information-rich negative samples are intuitively those that are truly dissimilar to the positive sample yet are embedded closer to it than other negative samples are [40].

Inspired by this idea, in this paper we construct positive and negative samples based on the different temporal relationships of the vibration data, enabling the model to learn sequence features by discriminating whether the temporal relationship of the vibration data is correct or not. Moreover, we construct hard negative samples, i.e., those most similar to the positive sample, to improve the fine-grained feature representation of the model.

#### **3. Proposed Method**

As shown in Figure 1, the complete framework of the proposed method includes construction of the positive sample and hard negative samples, the hard negative sample contrastive learning prediction model and corresponding optimization function.

#### *3.1. Positive Sample and Hard Negative Sample Construction*

Without loss of generality, we denote the original vibration data of a bearing as *D* = {*b<sub>i</sub>*}<sup>*M*</sup><sub>*i*=1</sub>, and the positive sample dataset as *D<sub>P</sub>* = {*X<sub>P</sub>*, *Y<sub>P</sub>*}, where *X<sub>P</sub>* = {*x*<sub>1</sub>, *x*<sub>2</sub>, . . . , *x<sub>N</sub>*} and *Y<sub>P</sub>* = {*y*<sub>1</sub>, *y*<sub>2</sub>, . . . , *y<sub>N</sub>*}, with *x<sub>i</sub>* = [*b<sub>i</sub>*, . . . , *b*<sub>*i*+*timestep*−1</sub>] and *y<sub>i</sub>* = *b*<sub>*i*+*timestep*</sub>. The *timestep* is the length of the sliding window over the vibration sequence; in this paper, it is set to six.
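As an illustration of this windowing scheme, the following minimal NumPy sketch (function and variable names are ours, not from the paper) builds the positive pairs (*x<sub>i</sub>*, *y<sub>i</sub>*) from a raw sequence:

```python
import numpy as np

def build_positive_samples(d, timestep=6):
    """Slide a window of length `timestep` over the vibration sequence
    d = [b_1, ..., b_M]; each input x_i is the window
    [b_i, ..., b_{i+timestep-1}] and its label y_i is the next point
    b_{i+timestep}."""
    xs, ys = [], []
    for i in range(len(d) - timestep):
        xs.append(d[i:i + timestep])
        ys.append(d[i + timestep])
    return np.array(xs), np.array(ys)

# toy scalar sequence; in the paper each b_i is a 2560-dim vibration snapshot
d = np.arange(10)
x_p, y_p = build_positive_samples(d, timestep=6)
print(x_p.shape, y_p.shape)  # (4, 6) (4,)
```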

A random *y<sub>j</sub>* (*j* ≠ *i*) can form a negative sample with *y<sub>i</sub>*. We use the cosine similarity to find the negative sample *y<sub>h</sub>* with the highest similarity to *y<sub>i</sub>*:

$$w(y_i, y_h) = \max_{y_j \in Y_P,\, j \neq i} \frac{y_i \cdot y_j}{\lVert y_i \rVert \cdot \lVert y_j \rVert} \tag{2}$$

When *y<sub>j</sub>* = *y<sub>h</sub>*, *w* is maximal, at which point *x<sub>i</sub>* and *y<sub>h</sub>* form a hard negative sample, while *x<sub>i</sub>* and *y<sub>i</sub>* form a positive sample. We use the pairs (*x<sub>i</sub>*, *y<sub>h</sub>*) to construct the hard negative sample dataset *D<sub>N</sub>* = {*X<sub>N</sub>*, *Y<sub>N</sub>*}, where *X<sub>N</sub>* = {*x*′<sub>1</sub>, *x*′<sub>2</sub>, . . . , *x*′<sub>*N*</sub>} and *Y<sub>N</sub>* = {*y*′<sub>1</sub>, *y*′<sub>2</sub>, . . . , *y*′<sub>*N*</sub>}.

Given a dataset *C* = {*c*<sub>1</sub>, *c*<sub>2</sub>, . . . , *c<sub>k</sub>*} of *k* random samples containing one sample from *X<sub>P</sub>* and *k* − 1 samples from *X<sub>N</sub>*, the corresponding prediction labels are *P* = {*p*<sub>1</sub>, *p*<sub>2</sub>, . . . , *p<sub>k</sub>*}. The positive and hard negative sample label set *T* = {*t*<sub>1</sub>, *t*<sub>2</sub>, . . . , *t<sub>k</sub>*} is initialized as

$$t\_i = \begin{cases} 1, & c\_i \in D\_P \\ 0, & c\_i \in D\_N \end{cases} \tag{3}$$
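The hard negative selection of Equation (2) can be sketched as follows. This is an illustrative NumPy implementation with hypothetical names, using toy 2-D vectors in place of the 2560-dimensional vibration snapshots:

```python
import numpy as np

def hardest_negative(y_p, i):
    """Among all y_j (j != i), return the index of the one with the
    highest cosine similarity to y_i (Equation (2))."""
    yi = y_p[i]
    sims = y_p @ yi / (np.linalg.norm(y_p, axis=1) * np.linalg.norm(yi) + 1e-12)
    sims[i] = -np.inf          # exclude the positive sample itself
    return int(np.argmax(sims))

y_p = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0]])
h = hardest_negative(y_p, 0)   # y_1 is almost parallel to y_0
print(h)  # 1
```

The returned pair (*x<sub>i</sub>*, *y<sub>h</sub>*) would then be labeled *t<sub>i</sub>* = 0 per Equation (3).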

**Figure 1.** The proposed method.

#### *3.2. Hard Negative Contrastive Prediction Model (HNCPM)*

After the positive and negative sample pairs are generated from the original vibration data of the bearing *D* = {*b<sub>i</sub>*}<sup>*M*</sup><sub>*i*=1</sub>, the feature sequences are extracted by an encoder composed of three one-dimensional convolutional layers, which compresses the original vibration data into a low-dimensional feature space. Then, we use a gated recurrent unit as a regression module to learn the sequence relationship of the feature sequences. Finally, a decoder composed of three fully connected layers predicts the vibration data in future time periods (i.e., the RUL).

Unlike the existing supervised learning model to predict the HI labels for RUL, the HNCPM directly predicts the vibration data in future time-periods with the purpose of allowing the model to better fit the time domain graph curve of the original vibration data, so the potential representation of the context is captured in the gated recurrent units and the final prediction results are obtained through a decoder. The model structure is shown in Table 1.

The feature encoder module composed of three convolutional layers is formulated as follows:

$$e_i = \mathrm{ReLU}(W_c \otimes x_i) \tag{4}$$

where ⊗ is the valid cross-correlation operator and *W<sub>c</sub>* is the weight.
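For concreteness, a single channel of Equation (4) can be sketched in NumPy as follows. This is an illustrative single-kernel version of our own; the actual encoder stacks three multi-channel convolutional layers as listed in Table 1:

```python
import numpy as np

def conv1d_relu(x, w):
    """One encoder step of Equation (4): valid cross-correlation of the
    input x with kernel w, followed by ReLU."""
    k = len(w)
    out = np.array([np.dot(x[j:j + k], w) for j in range(len(x) - k + 1)])
    return np.maximum(out, 0.0)

x = np.array([1.0, -2.0, 3.0, 0.5])
w = np.array([1.0, 1.0])
print(conv1d_relu(x, w))  # ReLU([-1, 1, 3.5]) -> [0, 1, 3.5]
```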


**Table 1.** Model structural parameters.

The regression module uses gated recurrent units, which can effectively suppress gradient vanishing or explosion when capturing long-sequence data associations. The GRU performs better than the traditional RNN and has lower computational complexity than the LSTM.

$$r_i = \sigma(W_r \cdot [h_{i-1}, e_i]) \tag{5}$$

$$z_i = \sigma(W_z \cdot [h_{i-1}, e_i]) \tag{6}$$

$$\tilde{h}_i = \tanh(W \cdot [r_i \odot h_{i-1}, e_i]) \tag{7}$$

$$h_i = (1 - z_i) \odot h_{i-1} + z_i \odot \tilde{h}_i \tag{8}$$

where *h<sub>i</sub>* is the hidden state at time *i*; *r<sub>i</sub>*, *z<sub>i</sub>* and *h̃<sub>i</sub>* are the reset gate, update gate and candidate hidden state, respectively; and *W<sub>r</sub>*, *W<sub>z</sub>* and *W* are their corresponding weights.
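Equations (5)–(8) can be checked with a direct NumPy transcription of one GRU step. This is illustrative only; the weight shapes and names are our assumptions, and biases are omitted as in the equations above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(h_prev, e, W_r, W_z, W):
    """One GRU step following Equations (5)-(8); each weight matrix acts
    on a concatenation of the previous hidden state and the input feature."""
    cat = np.concatenate([h_prev, e])
    r = sigmoid(W_r @ cat)                                   # reset gate, Eq. (5)
    z = sigmoid(W_z @ cat)                                   # update gate, Eq. (6)
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, e]))   # candidate, Eq. (7)
    return (1.0 - z) * h_prev + z * h_tilde                  # hidden state, Eq. (8)

rng = np.random.default_rng(0)
h_dim, e_dim = 4, 3
h = gru_cell(np.zeros(h_dim), rng.standard_normal(e_dim),
             rng.standard_normal((h_dim, h_dim + e_dim)),
             rng.standard_normal((h_dim, h_dim + e_dim)),
             rng.standard_normal((h_dim, h_dim + e_dim)))
print(h.shape)  # (4,)
```

Since the gates are bounded by the sigmoid and tanh, the hidden state stays in (−1, 1) when started from zero, which is the property that keeps gradients well behaved over long sequences.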

After the regression module, a decoder consisting of three fully connected layers is used to restore the dimensionality of *h<sub>i</sub>*, with the following output:

$$\hat{y}_i = W_l \cdot h_i + b \tag{9}$$

where *W<sup>l</sup>* is the weight of the fully connected layers.

#### *3.3. Optimization Function*

The loss function of the model in this paper consists of two parts: (1) the regression loss function *L*1; (2) the contrastive prediction loss function *L*2.

The regression loss function *L*<sub>1</sub> uses the mean square error for predicting the bearing data of the next time period from the vibration data of the previous time segment. The dataset {*X<sub>P</sub>*, *Y<sub>P</sub>*} is the input for model training. The loss function is as follows:

$$L_1 = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \tag{10}$$

where N is the total number of samples.

The contrastive prediction loss function *L*<sub>2</sub> is the InfoNCE loss [19], which is commonly used in contrastive learning. Feeding the set *C* into the HNCPM, the output is *P̂* = {*p̂*<sub>1</sub>, *p̂*<sub>2</sub>, . . . , *p̂<sub>k</sub>*}.

According to a previous study [19], the higher the density ratio *f*(*p̂<sub>i</sub>*, *p<sub>i</sub>*), the larger the mutual information between *p̂<sub>i</sub>* and *p<sub>i</sub>*, and the better *p̂<sub>i</sub>* predicts *p<sub>i</sub>*. The formula is as follows:

$$f(\hat{p}_i, p_i) = \exp(s(\hat{p}_i, p_i)) \tag{11}$$

where *s*(*p̂<sub>i</sub>*, *p<sub>i</sub>*) measures the degree of similarity between the prediction *p̂<sub>i</sub>* = *F*(*c<sub>i</sub>*) and *p<sub>i</sub>*:

$$s(\hat{p}_i, p_i) = \hat{p}_i \odot p_i \tag{12}$$

where ⊙ is the Hadamard product.

$$L\_2 = -\mathbb{E}\_P\left[\log\frac{f(\hat{p}\_i, p\_i)}{\sum\_{\hat{p}\_i \in P} f(\hat{p}\_i, p\_i)}\right] \tag{13}$$
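A vectorized NumPy sketch of Equations (11)–(13) (ours, not the authors' code): the score matrix holds the pairwise similarities, its diagonal plays the role of the matched pairs, and the log-sum-exp over each row gives the denominator of Equation (13):

```python
import numpy as np

def info_nce(p_hat, p):
    """InfoNCE loss of Equations (11)-(13): the density ratio is the
    exponential of the Hadamard-product similarity (summed over feature
    dims), contrasted against all other pairs in the batch."""
    scores = p_hat @ p.T                 # scores[i, j] = s(p_hat_i, p_j)
    log_ratio = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
    return -np.mean(np.diag(log_ratio))  # -E[log f(p_hat_i, p_i) / sum_j f(...)]

rng = np.random.default_rng(1)
p = rng.standard_normal((5, 32))
loss_matched = info_nce(p, p)                          # perfectly aligned pairs
loss_random = info_nce(rng.standard_normal((5, 32)), p)  # unrelated predictions
print(loss_matched < loss_random)
```

Minimizing this loss therefore pushes each prediction toward its own target and away from the hard negatives in the batch.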

Finally, the total loss function of HNCPM is as shown:

$$L = L_1 + \alpha L_2, \qquad \alpha \in (0,1) \tag{14}$$

where *α* is the hyperparameter.

The pseudocode of the training algorithm is shown in Algorithm 1.


**Algorithm 1** HNCPM training.

**Input:** original bearing samples *D* = {*b*<sub>1</sub>, *b*<sub>2</sub>, . . . , *b<sub>M</sub>*}; positive samples *X<sub>P</sub>* = {*x*<sub>1</sub>, . . . , *x<sub>N</sub>*} with *x<sub>i</sub>* = [*b<sub>i</sub>*, . . . , *b*<sub>*i*+*timestep*−1</sub>], and *Y<sub>P</sub>* = {*y*<sub>1</sub>, . . . , *y<sub>N</sub>*} with *y<sub>i</sub>* = *b*<sub>*i*+*timestep*</sub>; hard negative samples *X<sub>N</sub>* = {*x*′<sub>1</sub>, . . . , *x*′<sub>*N*</sub>} and *Y<sub>N</sub>* = {*y*′<sub>1</sub>, . . . , *y*′<sub>*N*</sub>}, where *Y<sub>N</sub>* is selected by (2); *F* consists of the encoder, gated recurrent unit and decoder; *θ* is the set of model parameters; *x<sub>t</sub>* is the test bearing data.

1: **while** the number of iterations is not reached **do**
2: &emsp;**for** *i* = 1 to *N* **do**
3: &emsp;&emsp;*ŷ<sub>i</sub>* = *F*(*x<sub>i</sub>*)
4: &emsp;&emsp;Compute the regression loss *L*<sub>1</sub> by (10)
5: &emsp;&emsp;Construct *C* and *T* by (3)
6: &emsp;&emsp;*p̂<sub>i</sub>* = *F*(*c<sub>i</sub>*)
7: &emsp;&emsp;Calculate the density ratio *f*(*p̂<sub>i</sub>*, *p<sub>i</sub>*) by (11), (12)
8: &emsp;&emsp;Compute the InfoNCE loss *L*<sub>2</sub> by (13)
9: &emsp;&emsp;Update *θ* by *L* = *L*<sub>1</sub> + *αL*<sub>2</sub>, *α* ∈ (0, 1)
10: &emsp;**end for**
11: **end while**
12: save *F*

**Output:** the model prediction *F*(*x<sub>t</sub>*)

#### **4. Experimental Section**

In this section, we use the IEEE PHM Challenge 2012 bearing dataset to validate the effectiveness of the proposed method. The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are selected as indicators to evaluate the model performance. The smaller the MAE and RMSE values, the better the RUL model performs. The MAE and RMSE are expressed as follows.

$$MAE = \frac{1}{N} \sum\_{i=1}^{N} |\ p\_i - y\_i| \tag{15}$$

$$RMSE = \sqrt{\frac{1}{N} \sum\_{i=1}^{N} (p\_i - y\_i)^2} \tag{16}$$

where *y<sup>i</sup>* denotes the true value of the sample, *p<sup>i</sup>* denotes the predicted value, and *N* is the total number of samples.
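Equations (15) and (16) translate directly into NumPy (an illustrative sketch with our own function names):

```python
import numpy as np

def mae(p, y):
    """Mean Absolute Error, Equation (15)."""
    return np.mean(np.abs(p - y))

def rmse(p, y):
    """Root Mean Square Error, Equation (16)."""
    return np.sqrt(np.mean((p - y) ** 2))

y = np.array([1.0, 2.0, 3.0])   # true values
p = np.array([1.5, 2.0, 2.0])   # predicted values
print(mae(p, y))   # 0.5
print(rmse(p, y))  # ~0.6455
```

Note that the RMSE penalizes large deviations more heavily than the MAE, which is why both are reported.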

#### *4.1. Data Description*

As shown in Figure 2, the PRONOSTIA test platform contains a rotating part, a load part and a data collection part. The motor power is 250 W, and the power is transferred to the bearing by the rotation axis. The accelerated degradation experiments were conducted on this platform to generate the run-to-failure vibration signals. The acceleration sensors are placed in the horizontal and vertical directions to collect vibration signals under three working conditions. The sampling frequency is 25.6 kHz; the vibration signal is recorded every 10 s, and each acquisition lasts 0.1 s. For example, under the first working condition, the vibration signals of six bearings were collected, namely bearing 1\_2, bearing 1\_3, bearing 1\_4, bearing 1\_5, bearing 1\_6 and bearing 1\_7. The motor's rotation speed is 1800 rpm, and the load is 4000 N.

**Figure 2.** PRONOSTIA test platform.

In the following different experiments, the vibration signals of bearing 1\_2 are the training set, while bearing 1\_4, bearing 1\_5, bearing 1\_6, bearing 1\_7, bearing 2\_3, bearing 2\_4, bearing 2\_5, and bearing 2\_6 are used as the testing set, respectively, as shown in Table 2.

#### *4.2. Weighting Factor Analysis*

In Equation (14), the weight *α* of the loss function is empirically determined. Hence, we perform several experiments to discuss its influence on the model performance.

Herein, *α* is set to 0.1, 0.3, 0.5, 0.7, 0.9 and 1, respectively. Bearing 1\_4, bearing 1\_5, bearing 2\_4 and bearing 2\_6 are used as the test set, and RMSE is used as the model predictive ability index. For each series of experiments, we repeat the experiment ten times and report the average performance. Figure 3 depicts the box plot of RUL prediction results for the four bearings at different *α*.

When *α* = 0.1, the weight of *L*<sub>2</sub> in the total loss *L* is too small and plays a minor optimization role in the model, resulting in poor prediction results on bearing 1\_4, bearing 2\_4 and bearing 2\_6. In contrast, when *α* = 1, the model is overly concerned with the fine-grained recognition of vibration data, resulting in overfitting; thus, the model generalizes unsatisfactorily on the test bearings 1\_5, 2\_4 and 2\_6. When *α* = 0.7, our method performs best on bearing 2\_6, and the maximum and minimum RMSE values are lower than those of the other weights. It shows the same superiority on bearing 1\_5 and bearing 2\_4 compared with the other weights. Considered comprehensively, the prediction accuracy of the model is satisfactory when *α* = 0.7.


**Table 2.** The details of PHM2012 dataset.

**Figure 3.** RUL prediction results with different weighting factor *α*.


#### *4.3. Ablation Experiments*

In this section, we perform ablation experiments to verify the contribution of each module in our model. We keep the model structure unchanged but remove contrastive learning (termed no-contrast), and use a GRU model without the encoder and decoder modules (termed GRU), to compare with the HNCPM. MAE is used as the predictive ability index of the models. The experimental results are demonstrated in Figure 4.


**Figure 4.** Ablation experiment results.

HNCPM performs best on bearing 2\_6, where its MAE is 0.05 lower than the no-contrast model and 0.07 lower than the GRU model. The model has its largest MAE value on bearing 1\_4, which is still 0.03 and 0.22 lower than the no-contrast and GRU models, respectively.

Moreover, compared to the GRU model, the no-contrast model has a better prediction performance because it contains the feature extraction layer to extract high-dimensional degradation features from the original vibration data, facilitating the RUL regression prediction. However, no-contrast ignores latent sequence features during the training, while HNCPM introduces contrast learning to improve the fine-grained model training and improve the prediction efficiency compared to no-contrast. Overall, the prediction results on all bearing data indicate that the proposed HNCPM method is significantly superior to the other two models.

#### *4.4. Comparison with State-of-the-Art Methods*

In this section, our proposed HNCPM is compared with state-of-the-art rolling bearing health-prediction methods based on CNN and BiLSTM models [41], and BiLSTM with an attention mechanism [42]. In addition, we compare general encoder and regression model combinations (i.e., SAE+GRU, CNN+LSTM) to evaluate the model prediction performance. MAE is used as the predictive ability index of the models. The overall RUL prediction results using the aforementioned models are depicted in Figure 5.

It can be seen that the SAE-based model has the worst performance. It is well known that the SAE has a strong signal-denoising ability, but its feature extraction ability on time-series data is not as good as that of the CNN. Attention increases the weight of important features, but it also ignores the bearing sequence information that is masked by noise, which results in an unstable prediction performance on multiple bearings. Therefore, the performances of the CNN-based models are superior to those of the other models.


**Figure 5.** RUL prediction results of different comparison models.

In addition, the MAE value of CNN+BiLSTM is, on average, 0.2 lower than that of CNN+LSTM, as the BiLSTM improves the prediction ability by backward prediction of past bearing data from future data. However, the BiLSTM has one more backward hidden layer than the LSTM, which boosts the number of model parameters and increases the model overhead, and its prediction performance is still not as good as that of HNCPM.

Compared with the other models, HNCPM has the smallest MAE value on bearing 2\_6, 0.06 lower than the second-best model (i.e., CNN+BiLSTM). Meanwhile, HNCPM has its largest MAE value on bearing 1\_4, i.e., 0.361, but this is still at least 0.01 lower than the other models. The experiments demonstrate that HNCPM has a superior performance in RUL prediction.

#### *4.5. Prediction Results Visualization*

In this section, unlike the existing RUL prediction techniques, which predict the degradation index (DI) [41] or the root mean square (RMS) [42] from the vibration data, we use manifold learning to visualize the fitting curve between the predictions and the original data to evaluate the prediction performance of the model more intuitively. We use the vibration data of the first 1800 time points to predict the remaining vibration data of bearing 1\_3 and bearing 2\_3, respectively. Figure 6a,b show the original time-domain vibration data of bearing 1\_3 and bearing 2\_3, respectively. The vibration data at each time point are 2560-dimensional. To illustrate the prediction performance more intuitively, TSNE is used to transform the 2560-dimensional data of each time point into one dimension for bearing 1\_3 and bearing 2\_3, as in Figure 6c,d. The blue curve is the real vibration data; the orange curve is the prediction data.

From Figure 6c, some of the predictions deviate from the real data curve in the rapid degradation stage, because the vibration data of bearing 1\_3 change drastically at the end of life, which is difficult for the model to predict. Meanwhile, Figure 6d shows that the vibration data of bearing 2\_3 change slowly over the whole life cycle; thus, the prediction curve fits the real data curve more closely.

**Figure 6.** (**a**) Bearing1\_3 data set; (**b**) bearing2\_3 data set; (**c**,**d**) are the prediction fitting curves of TSNE corresponding to (**a**,**b**), respectively. The blue represents the real data; the orange represents the prediction.

#### **5. Conclusions**

In this article, we proposed a novel RUL prediction approach for bearings, termed HNCPM, which introduces self-supervised contrastive learning to directly use the original vibration data for model training, rather than any HIs as RUL labels, improving the model's ability to extract sequential relationships from the original data. Meanwhile, we construct hard negative samples and further design a novel loss function combining the MSE with the InfoNCE loss, to address the difficulty of learning fine-grained degradation features caused by the insignificant variation between positive and negative samples in the bearing health stage. Our experimental results demonstrate the superiority of the proposed HNCPM for RUL prediction and, via manifold learning, the close fit of the model predictions to the true vibration data.

**Author Contributions:** Conceptualization, J.X.; writing—original draft preparation, L.Q.; formal analysis, W.C.; validation, X.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Key Research And Development Plan under Grant 2018YFB2000505, in part by the Key Research and Development Plan of Anhui Province under Grant 202104a04020003, and in part by the Fundamental Research Funds for the Central Universities under Grant PA2021KCPY0045.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Public datasets used in our paper: https://github.com/wkzs111/phmieee-2012-data-challenge-dataset (accessed on 10 April 2022).

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**

