In this section, a detailed analysis of the experimental conditions and results is provided. First, we describe the datasets used, as well as the pre-processing steps and the configurations employed to construct each prediction model (CNN, LSTM, TSBP, DAPROG). Subsequently, we delve into the methodologies used in the experiments and discuss the obtained results.
3.1. CMAPSS Dataset
CMAPSS stands for the Commercial Modular Aero-Propulsion System Simulation, a software package that generates virtual sensor measurements of a 90,000 lb thrust turbofan engine undergoing different operating conditions and health degradations [
27]. It consists of the following five parts: the fan, low-pressure compressor (LPC), high-pressure compressor (HPC), high-pressure turbine (HPT), and low-pressure turbine (LPT), as shown in
Figure 4. The operating conditions are given by three variables (altitude, Mach number, and throttle resolver angle). Thirteen health parameters are introduced to simulate the engine performance degradation, which deteriorates progressively as a function of the cycle. The outputs are the 21 sensor measurements, such as the temperature, pressure, and speed at the sections of the five parts [
28]. Given a certain operating condition and set of degradation parameters, the sensor measurements are generated by CMAPSS once per cycle, starting from the normal state until failure.
While there are four datasets with different scenarios of health degradation and operating conditions, we concentrate on dataset FD001, as it embodies a single degradation type (HPC failure) under a single operating condition. This choice allows us to focus solely on the performance under data deficiency while excluding other influences, such as multiple failure modes or operating conditions. Dataset FD001 contains 100 training units and 100 test units. The training data are run-to-failure (RTF) records of the 21 sensor measurements that reach a pre-defined failure threshold, whereas the test data are truncated at a certain cycle before failure. The test data are used to predict the RUL with the prognosis algorithm built from the training data, and the prediction accuracy is evaluated by comparing it with the actual RUL. Among the 100 training units, the plots of the sensor measurements are given in
Figure 5 for engine #2 as an example where the end of life (EOL) is 287 cycles. Also, the EOLs of all engines in the training dataset are plotted in
Figure 6, showing a large variance, ranging from a minimum of 128 cycles (engine 39) to a maximum of 362 cycles (engine 69).
To enhance the training accuracy of all the RUL prediction methods, the number of input sensors is reduced by selecting only those exhibiting a high Spearman rank correlation, which measures the monotonic relationship of the sensor data with respect to the cycle. Out of the original 21, only 7 sensors (indices 2, 3, 4, 7, 11, 12, 15) are utilized for further analysis.
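For illustration, the selection step can be sketched in Python as follows; the column names, the data layout, and the correlation threshold are assumptions made for this sketch, not the exact values used in this study.

import pandas as pd
from scipy.stats import spearmanr

def select_sensors(df, threshold=0.5):
    """Keep sensors whose absolute Spearman correlation with the cycle exceeds the threshold."""
    selected = []
    for col in [c for c in df.columns if c.startswith("s")]:   # sensor columns assumed to be s1..s21
        rho, _ = spearmanr(df["cycle"], df[col])
        if abs(rho) >= threshold:
            selected.append(col)
    return selected

# Example (hypothetical): select_sensors(train_df) -> ['s2', 's3', 's4', 's7', 's11', 's12', 's15']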
3.2. Neural Network-Based Approaches: CNN, LSTM
To enhance the performance of both the CNN [
7] and LSTM [
8] models in RUL prediction, several forms of pre-processing are conducted prior to the training procedure. The selected sensor values undergo normalization, which differs slightly between the two models. For the CNN model, the sensor data are normalized to fall within a range of −1 to 1. In contrast, the LSTM model utilizes standard z-score normalization.
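A minimal sketch of the two normalization schemes is given below; per-sensor statistics and the function names are assumptions for illustration.

import numpy as np

def minmax_normalize(x):
    """Scale a sensor's values into [-1, 1], as used for the CNN input."""
    x_min, x_max = x.min(), x.max()
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

def zscore_normalize(x):
    """Standard z-score normalization, as used for the LSTM input."""
    return (x - x.mean()) / x.std()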
For the CNN model, a time window processing technique is incorporated, as implemented in Li’s research [
7]. This approach is particularly effective in multi-variate time series problems, like RUL estimation, where temporal sequence data often provide more insights than multi-variate data points sampled at a single time step. At each time step, sensor data within the time window are collected to form a high-dimensional feature vector, serving as the network’s input. Consequently, for each time step in the CNN model, a normalized subset is prepared, consisting of a 7 × 30 matrix from the seven selected sensors and a window size of 30 time steps. This matrix represents data within a 30-time-step window for a single engine unit in the training sub-dataset. On the other hand, the LSTM model utilizes the signals from the seven sensors over the entire period for training at once, providing a comprehensive dataset for more nuanced temporal analysis.
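The time-window processing can be sketched as follows, assuming the normalized data for one engine are stored as a (number of cycles) × 7 array; the helper name is hypothetical.

import numpy as np

def make_windows(unit_data, window=30):
    """Cut one engine's (num_cycles, 7) array into (7, 30) samples, one per time step."""
    samples = []
    for t in range(window, unit_data.shape[0] + 1):
        # transpose so each sample is (sensors, time) = (7, 30)
        samples.append(unit_data[t - window:t, :].T)
    return np.stack(samples)           # shape: (num_samples, 7, 30)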
To more accurately estimate the RUL, a piece-wise linear function is employed, as defined in the studies [
7,
8] and illustrated in
Figure 7. The RUL is held constant in the early stage and then decreases linearly after a certain cycle. The reason is that RUL estimates at the early cycles are usually both inaccurate and unimportant. In this study, a constant value of 130 cycles is taken, which is the rounded value of the minimum EOL (128 cycles) in the training set.
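A minimal sketch of the piece-wise linear RUL target is given below; the function name is illustrative.

def piecewise_rul(cycle, eol, max_rul=130):
    """RUL label capped at max_rul early in life, decreasing linearly toward zero afterwards."""
    return min(max_rul, eol - cycle)

# Example: for an engine with EOL = 287 cycles, piecewise_rul(50, 287) -> 130,
#          while piecewise_rul(250, 287) -> 37.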
3.2.1. RUL Prediction Using CNN
The CNN model architecture is constructed based on Li’s work [
7]. It comprises five convolutional layers in the hidden section: four of these have a filter length of 10 with 10 channels, and the last has a filter length of 3 with a single channel. Following the convolutional layers are a flatten layer, a dropout layer with a dropout rate of 0.5, and a fully connected layer consisting of 100 nodes, as illustrated in
Figure 8. All layers employ the tanh activation function.
For optimization, the Adam algorithm is utilized. The training process involves randomly dividing the samples into multiple mini-batches, each containing 512 samples. The data split comprises 70% for training and 30% for validation. To enhance the efficiency of the training, an early stopping strategy is applied that halts the training process if there is no observable improvement on the validation dataset, ensuring that the model does not overfit to the training data while maximizing its predictive accuracy.
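A minimal Keras sketch of this architecture and training setup is given below; the padding scheme, the number of epochs, the early-stopping patience, and the single-node output layer are assumptions made for illustration, not the exact settings of Li's implementation.

from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(window=30, n_sensors=7):
    model = keras.Sequential([
        keras.Input(shape=(window, n_sensors)),
        layers.Conv1D(10, 10, padding="same", activation="tanh"),  # four layers: 10 filters of length 10
        layers.Conv1D(10, 10, padding="same", activation="tanh"),
        layers.Conv1D(10, 10, padding="same", activation="tanh"),
        layers.Conv1D(10, 10, padding="same", activation="tanh"),
        layers.Conv1D(1, 3, padding="same", activation="tanh"),    # one layer: single filter of length 3
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(100, activation="tanh"),
        layers.Dense(1),                                           # RUL output (assumed)
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Training sketch: mini-batches of 512, 70/30 train/validation split, early stopping, e.g.:
# model.fit(x_train, y_train, batch_size=512, epochs=250, validation_split=0.3,
#           callbacks=[keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)])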
3.2.2. RUL Prediction Using LSTM Network
The LSTM network model in this study follows the design in Zheng’s work [
8]. This model’s architecture includes two LSTM layers with 32 and 64 nodes, respectively, followed by two fully connected layers, each comprising eight nodes, as illustrated in
Figure 9. Additionally, a dropout layer with a dropout rate of 0.5 is included among the hidden layers. The RMSprop optimizer is used for training with mini-batches of 20 samples each. Similar to the CNN model, the data split for the LSTM model is 70% for training and 30% for validation, and an early stopping strategy is also employed during the training process.
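A corresponding Keras sketch of the LSTM network is given below; the activation of the fully connected layers, the handling of variable-length sequences, and the output layer are assumptions for illustration.

from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(n_sensors=7):
    model = keras.Sequential([
        keras.Input(shape=(None, n_sensors)),        # variable-length sensor sequences (assumed)
        layers.LSTM(32, return_sequences=True),
        layers.LSTM(64),
        layers.Dropout(0.5),
        layers.Dense(8, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dense(1),                             # RUL output (assumed)
    ])
    model.compile(optimizer="rmsprop", loss="mse")
    return model

# Training sketch: mini-batches of 20 samples, 70/30 split, and early stopping as for the CNN.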
3.3. Similarity-Based Approaches: TSBP, DAPROG
As noted earlier, among the four methods in this study, the CNN and LSTM employ the pre-processed sensor data and the RUL directly in their training, whereas the other two, TSBP and DAPROG, introduce the HI into the training process. To this end, the HI is defined as follows [
25]:

$\mathrm{HI} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$  (1)

where $x_1, \ldots, x_n$ are the selected sensor measurements and $\theta_0, \ldots, \theta_n$ are the coefficients to be determined via regression. The HI values are defined as 1 and 0 during properly defined initial and near-failure periods, respectively. Applying the sensor data in these two periods together with the corresponding HI values (1 or 0), one obtains the regression coefficients. Once obtained, the HI can be estimated for any sensor data $x_1, \ldots, x_n$. The HI data obtained in this way, however, exhibit significant oscillations over the cycles. To rectify this, the data are fitted to an exponential function, which we call the degradation trajectory, as follows:

$\mathrm{HI}(t) = a\,e^{bt}$  (2)

where $a$ and $b$ are the coefficients and $t$ is the cycle. To illustrate this, the HI data and their fitted trajectories are plotted for the four selected training units in
Figure 10.
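The HI construction of Equations (1) and (2) can be sketched as follows; the least-squares formulation, the initial guesses of the exponential fit, and the function names are assumptions made for illustration.

import numpy as np
from scipy.optimize import curve_fit

def fit_hi_coefficients(X_init, X_fail):
    """X_init, X_fail: (n_samples, 7) sensor matrices from the initial / near-failure periods."""
    X = np.vstack([X_init, X_fail])
    X = np.hstack([np.ones((X.shape[0], 1)), X])          # intercept term theta_0
    y = np.concatenate([np.ones(len(X_init)), np.zeros(len(X_fail))])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def estimate_hi(X, theta):
    """Apply Equation (1) to any (n_samples, 7) sensor matrix."""
    return np.hstack([np.ones((X.shape[0], 1)), X]) @ theta

def fit_trajectory(cycles, hi):
    """Fit the degradation trajectory HI(t) = a * exp(b * t) of Equation (2) to the noisy HI data."""
    expo = lambda t, a, b: a * np.exp(b * t)
    (a, b), _ = curve_fit(expo, cycles, hi, p0=(1.0, -0.005), maxfev=10000)
    return a, b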
As a result, the degradation trajectories of all the 100 training units are plotted in
Figure 11a. The HI values in the initial cycles range as widely as 0.7 to 1.3, which agrees with the notion that each engine starts with a different degree of initial wear and manufacturing variation [
29]. For the test units, the HI data are obtained from the sensor data using Equation (1) until their current cycles as shown in
Figure 11b. During HI construction, the same seven sensors are utilized as in the CNN and LSTM models. Note that the test units cannot be fitted to full degradation trajectories like those in
Figure 11a since they have not yet reached the failure level.
3.3.1. RUL Prediction Using TSBP
In the TSBP, the RUL is predicted from the degradation trajectory of each training unit by moving the test data along the time axis and locating the point that matches with the highest similarity. This is repeated for all the training units to obtain the same number of RUL candidates, which are integrated into a single RUL using a weighted sum. However, during the TSBP execution, a problem may sometimes occur in which no trajectories match the test data with a high similarity. This takes place frequently when the number of training data is too small, so that the trajectories suitable for the match are missing. An example is given in
Figure 12 for a training ratio of 0.1, in which the red curves and the blue asterisks represent the 10 training trajectories and the test data of unit #15, respectively. In the figure, the training curves and the test data barely or never overlap as the test data are moved along the time axis. In this case, an RUL with a large error is yielded.
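A simplified sketch of the TSBP matching and fusion steps is given below; the sliding Euclidean search and the exponential weighting are a plain reading of the method, not the authors' exact implementation.

import numpy as np

def match_one_unit(test_hi, train_traj):
    """Return (best_distance, rul_candidate) for one training trajectory."""
    n, m = len(test_hi), len(train_traj)
    best_d, best_off = np.inf, None
    for off in range(m - n + 1):                      # slide the test data along the time axis
        d = np.linalg.norm(train_traj[off:off + n] - test_hi)
        if d < best_d:
            best_d, best_off = d, off
    if best_off is None:                              # training trajectory shorter than the test data
        return np.inf, None
    return best_d, m - (best_off + n)                 # remaining cycles of the matched trajectory

def tsbp_rul(test_hi, train_trajs):
    results = [match_one_unit(test_hi, tr) for tr in train_trajs]
    dists = np.array([d for d, r in results if r is not None])
    ruls = np.array([r for d, r in results if r is not None])
    weights = np.exp(-dists / (dists.min() + 1e-12))  # higher weight for more similar trajectories
    return float(np.sum(weights * ruls) / np.sum(weights))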
3.3.2. RUL Prediction Using DAPROG
The DAPROG takes the same first step as the TSBP, which is to uncover the portion of the trajectory curve that matches the test data with a high similarity by moving the test data along the time axis. The DAPROG, however, enables the RUL estimation even when it fails to reveal a matching curve in the training set. An example is given in
Figure 13 for illustration, in which
Figure 13a is the ordinary case, where we can discover the corresponding (thick) portion in the training (red solid) curve that matches the test data (blue asterisks), which we call the DTW coverage. Then, the mapping by DTW applies only to this portion of the curve, from which the regression and virtual mapping paths are made, as shown in (c), and are extrapolated into future cycles. Next, the virtual paths are mapped to the test data, as shown in (e), from which the RULs are predicted, defined as the cycles between the current cycle and the EOL of the virtual curves. Note that the EOLs predicted by the virtual paths (x-intercepts of the many red curves) are not too different from the true EOL. When the matching curve is not found, as shown in (b), applying the DTW algorithm results in the trivial solution, with the DTW coverage involving the same cycles as the test unit (thick portion of the curve) and the warping path becoming simply a straight line with a coefficient of determination of 1, as shown in (d). The resulting virtual RTF curve becomes that of (f), where the EOL simply becomes that of the training curve. This is a similar situation to the one addressed in the TSBP.
To solve this problem in the DAPROG, an additional procedure is carried out to uncover the DTW coverage; this procedure finds the matching portion in the training curve by moving the test data not only along the cycle axis (horizontally) but also along the HI axis (vertically). In effect, it is a 2D search for the location of the test data on the training curve. The principle is that the smaller the DTW distance and the higher the coefficient of determination, the better the DTW mapping result. For this objective, an optimization is performed to minimize the ratio of the DTW distance to the coefficient of determination, as represented by Equation (3).
Then, the location of the test data and the corresponding DTW coverage (thick portion of the curve) become those in
Figure 14a, from which the warping path and virtual RTF curves are given by (b) and (c), respectively. In
Figure 14c, the RULs are estimated as the distance between the EOL and the end cycle of the test data.
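A simplified sketch of the 2D search behind Equation (3) is given below; the grid of horizontal and vertical shifts and the use of the squared correlation as the coefficient of determination are assumptions made for illustration, not the authors' exact formulation.

import numpy as np

def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def search_best_shift(test_hi, train_traj, h_shifts, v_shifts):
    """2D search over horizontal (cycle) and vertical (HI) shifts of the test data."""
    best_score, best_shift = np.inf, None
    n = len(test_hi)
    for h in h_shifts:
        if h + n > len(train_traj):
            continue
        segment = train_traj[h:h + n]                     # candidate DTW coverage
        for v in v_shifts:
            shifted = test_hi + v
            d = dtw_distance(shifted, segment)
            # squared correlation as a stand-in for the coefficient of determination
            r2 = np.corrcoef(shifted, segment)[0, 1] ** 2
            score = d / max(r2, 1e-6)                     # objective: DTW distance / R^2
            if score < best_score:
                best_score, best_shift = score, (h, v)
    return best_score, best_shift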
3.4. RUL Prediction under Data Deficiency
To investigate the impact of data deficiency, the size of the training dataset is intentionally reduced to a fraction of its original size (the number of training units, 100). This process is depicted in
Figure 15, where the ratio of the used training data is progressively increased from 0.1 to 0.9 in increments of 0.1. A smaller ratio signifies a more severe degree of data deficiency. Upon completion of training, these models are used to predict the RUL of the test data, with the accuracy evaluated against the actual values. While the training dataset is reduced, the number of test units remains constant at the original number of 100. Given that the reduced training data are randomly selected, the prediction results may vary with each attempt. To account for this variability, the prediction process is repeated 300 times, allowing the mean and standard deviation to be calculated and used as the performance metrics. The process consists of the following steps:
Data Selection: A subset of the training data is randomly selected in the amount determined by the ratio.
Data Normalization and Sensor Selection: Sensor data are normalized, and sensors with high correlation to the cycles are selected to enhance training efficiency.
Model Training: The selected training data are used to train the model. For TSBP and DAPROG, this involves constructing the HI degradation trajectories of the training units using Equations (1) and (2) and estimating the HI for the test units. In the ANN methods, model parameters such as weights and biases are determined using the training data, with the sensor data as inputs and the RUL as output.
RUL Prediction: The RUL for each test unit is predicted using the trained model. TSBP obtains the RUL by applying similarity measures to the HI trajectory of training units, resulting in as many RUL estimates as there are training data. DAPROG generates virtual training curves for each test unit using DTW, yielding multiple RUL estimates. In both methods, a weighted sum approach is used to aggregate these RUL estimates, assigning higher weights to more similar trajectories. For ANN models, the RUL is directly calculated by applying the sensor data of the test unit to the trained model.
Performance Evaluation: The RMSE is computed from the 100 predicted and actual RULs to assess the performance of each method.
Since this process is repeated 300 times for each ratio and method, a set of 300 RMSEs is obtained, representing the performance variance due to the random selection of training data.
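The repeated-experiment procedure can be sketched as follows; train_model and predict_rul are placeholders standing in for any of the four methods and are not actual functions from this study.

import numpy as np

def run_experiment(train_units, test_units, true_ruls, ratio, repeats=300, seed=0):
    """Re-sample the training units `repeats` times at the given ratio and collect RMSEs."""
    rng = np.random.default_rng(seed)
    n_select = max(1, int(round(ratio * len(train_units))))
    rmses = []
    for _ in range(repeats):
        subset = rng.choice(len(train_units), size=n_select, replace=False)
        model = train_model([train_units[i] for i in subset])            # placeholder
        preds = np.array([predict_rul(model, u) for u in test_units])    # placeholder
        rmses.append(np.sqrt(np.mean((preds - np.array(true_ruls)) ** 2)))
    return np.mean(rmses), np.std(rmses)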
3.5. Comparative Discussion: Overall Performance
As outlined earlier, our experiments, depicted in
Figure 16, demonstrate the RMSE of the RUL prediction across every scenario for each method. The box plots at each ratio illustrate the median and the first and third quartiles, with the whiskers indicating the maximum and minimum values, excluding outliers marked by red crosses. These results clearly show that an increase in the training data ratio leads to a decrease in RMSE, thereby improving the accuracy of the RUL prediction for all four methods, as anticipated.
In view of the mean values of the RMSE results, the CNN and LSTM models appear to perform much better and more stably than TSBP and DAPROG, with most values being below 20. The mean of TSBP is the worst at the small ratios but quickly decreases to less than 20 as the ratio increases, whereas the mean of DAPROG stays near 25. However, the variance of the RMSE is much greater for the CNN and LSTM across the entire range of ratios, not only at the small ratios (10% to 30%) but even near the original dataset size. The reason may be the inherent randomness of the neural network in every training attempt. On the other hand, the variances of TSBP and DAPROG decrease as the ratio increases and seemingly converge to a single value.
Overall, it can be concluded that DAPROG is the most consistent method across various scenarios in view of the mean and variance. This consistency is further highlighted in
Table 1, which numerically details the performance in data-limited scenarios (ratios of 10% to 30%). Bold letters indicate the best performance among the four methods for each ratio. DAPROG maintains a smaller variance than the other methods throughout the entire ratio range, indicating its robustness and lower susceptibility to fluctuations caused by the selection of training data.
Figure 17a,b present the 300 RUL estimates for two engines, #2 and #10, chosen arbitrarily from the test dataset of 100 engines. The consistency of DAPROG is more evident in this figure, showing a smaller variance than the CNN and LSTM for all the ratios, as indicated by the box plots. The green horizontal lines show the true RULs for engines #2 and #10, respectively.
As previously mentioned, the RUL prediction for each test unit (or engine) is conducted across 2700 scenarios, consisting of nine ratios multiplied by 300 repetitions. Therefore, we obtain 2700 RUL estimates per unit, from which the variance can be calculated. This is used as a measure of how consistent each method is in RUL estimation. Since we have 100 test units, we obtain 100 variance values.
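This consistency measure amounts to the following simple computation, assuming the 2700 estimates per unit are collected into a 100 × 2700 array; the array layout is an assumption for illustration.

import numpy as np

def consistency_metric(estimates):
    """estimates: (100, 2700) array of RUL predictions (100 test units x 2700 scenarios)."""
    per_unit_variance = estimates.var(axis=1)     # one variance per test unit (100 values)
    return per_unit_variance, per_unit_variance.mean()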
Figure 18 presents a histogram of the variance for the 100 test units for the four methods. The numbers above each figure and the red vertical line indicate the average variance of the estimates. A lower average variance suggests a more consistent estimation. Notably, DAPROG achieved the lowest values, underscoring its superior consistency in RUL estimation.
When it comes to scenarios with limited data, methods like DAPROG, although slightly less accurate than ANN methods, demonstrate competitive potential. This is particularly noteworthy considering the challenges associated with the inherent randomness and the non-intuitive nature of ANN methods. Approaches like TSBP and DAPROG, with their more deterministic and interpretable frameworks, offer compelling alternatives in these contexts.
On the other hand, the critical drawback of the ANN methods is their randomness, manifesting in aspects like weight initialization and the sequence of data presentation during the training process. This randomness can lead to variability in model performance, even under similar conditions. Moreover, the decision-making process within ANN methods, especially in deep learning models, is often unclear. The multiple layers of computations in these models obscure the transformation of input data into predictions, contributing to their ‘black box’ nature. This lack of transparency can be a significant hindrance in fields where understanding the rationale behind decisions is crucial. Another challenge in implementing ANN methods is the complexity involved in designing their architecture. Selecting the appropriate architecture is not only crucial for optimal performance but also requires substantial time and expertise. There is no universal solution; the determination of the number of layers, types of layers, number of neurons, optimization methods, and mini-batch sizes demands considerable experimentation and domain-specific knowledge. While our experiments successfully implemented architectures based on Li and Zheng’s models, it is worth noting that different architectures might not yield equally effective results.
However, it is observed that, when the data ratio exceeds 0.4, DAPROG does not show significant performance improvements, in contrast to TSBP, which improves rapidly and eventually outperforms DAPROG. Meanwhile, the CNN and LSTM models consistently demonstrate strong and progressively improving performance. The reasons for the differences between TSBP and DAPROG, and their implications in various scenarios, are explained in further detail in the following section.
3.6. Comparative Discussion: TSBP vs. DAPROG
Both TSBP and DAPROG methods share a foundational approach to RUL prediction, focusing on identifying similar trajectories within the training dataset. However, they diverge significantly in their specific methodologies. While DAPROG creates multiple virtual RTF curves to predict various RULs, TSBP predicts a single RUL value.
Figure 19a illustrates TSBP’s prediction method using full training data (Ratio = 1). Here, test unit #96 (blue asterisks) is compared against similar segments of training data trajectories (red curves). Trajectories that fail to align with the test data, particularly those ending at 400 cycles, are excluded from the final RUL prediction. For instance, TSBP’s RUL estimate of 120.7, closely approximating the actual value of 137, underscores its effectiveness with ample data. However, as seen in
Figure 19b with a limited data ratio (0.1), the absence of matching trajectories in the training data significantly affects the accuracy of RUL predictions, as reflected by the high RMSE and variance for training ratios below 0.3 (
Figure 16).
In contrast, DAPROG demonstrates a consistent performance regardless of the training data ratio.
Figure 20a,b showcase the virtual RTF curves for full (Ratio = 1) and reduced (Ratio = 0.1) data ratios. DAPROG’s consistent RUL distributions, irrespective of data availability, are attributed to its methodology of vertically adjusting test data to enhance DTW (Dynamic Time Warping) coverage and extrapolating multiple virtual paths. This approach results in uniform RUL predictions with lower variance, which is evident in
Figure 16.
DAPROG’s performance may deteriorate with lower-quality test data, particularly if the data are too short or only reflect the early stages of degradation. This is linked to the differences in how training trajectories are matched to test data—TSBP using Euclidean distance and DAPROG employing DTW distance. While Euclidean distance is more intuitive, relying on physical proximity for similarity, DTW can distort similarities if data series patterns vary drastically. In cases where test data are limited or early in the degradation phase, the resulting warped path might be too short for precise future extrapolation. This limitation, potentially affecting some of the 100 test datasets, can impact DAPROG’s overall effectiveness.
When data deficiency is severe (low ratio), DAPROG’s strength lies in its ability to conduct additional searches for matching trajectories and generate virtual RTF data. Conversely, with abundant data (high ratio), TSBP might outperform due to the increased probability of finding suitable training trajectories.