The outcomes of Experiment 2’s four cases are now examined to evaluate whether TL has effectively enhanced the model’s generalizability across different variants of a particular widebody aircraft.
6.2.1. Experiment 2.1
The optimal model trained on Domain 1 (i.e., Variant 1 and 2 data) is a feedforward neural network whose configuration, shown in Table 7, comprises three dense layers with 256, 128, and 1 unit, respectively. The dropout rate is set to 16.3%, and the model is trained using the Adam optimizer with a learning rate of 0.00033 and a batch size of 32 for 78 epochs. These optimal parameters were identified with HyperOpt configured to perform 100 evaluations. It is anticipated that aspects of a model developed on Domain 1 will prove beneficial when adapting it to Domain 2, specifically Variant 3.
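To make this configuration concrete, the following is a minimal Keras sketch of the Table 7 architecture. Only the layer widths, dropout rate, optimizer, learning rate, batch size, and epoch count come from the text; the activation functions, dropout placement, and input handling are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_domain1_model(n_features: int) -> tf.keras.Model:
    """Feedforward network matching the Table 7 description."""
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.163),   # 16.3% dropout rate (placement assumed)
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.163),
        layers.Dense(1),         # linear output: percent brake wear per flight
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.00033),
        loss="mse",
        metrics=["mae"],
    )
    return model

# Training budget reported in the text:
# model.fit(X_train, y_train, batch_size=32, epochs=78)
```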
Next, a comparative analysis was conducted using three distinct models to understand the advantages of TL in predictive modeling for aircraft brake wear. First, the optimal neural network model described in Table 7, identified through the HyperOpt optimization process, was trained and tested exclusively on data from Domain 1, comprising the datasets for Variants 1 and 2; this model is referred to as the standalone model for Domain 1. The same architectural framework was then applied to train and assess a model on Domain 2 data (i.e., Variant 3), hereafter referred to as the standalone model for Domain 2. The results of the standalone model for Domain 2 establish the baseline against which the efficacy of the TL approach is gauged.
The performance metrics for the standalone models trained on the different domains, as reported in Table 8, provide intriguing insights into the models’ predictive abilities across the data segments corresponding to specific variants. The model trained and tested on Domain 1 (i.e., Variants 1 and 2) data achieves an MAE of 0.0066, an MSE of 0.0004, an RMSE of 0.0197, and an R² of 0.6044, reflecting moderate predictive performance and variance explanation. In stark contrast, the standalone model tailored to Domain 2 (i.e., Variant 3) data exhibits a significant deterioration in performance, indicated by a pronounced increase in the error metrics: MAE increases by 335.38%, MSE by 304.99%, and RMSE by 101.25%. Most notably, the R² metric plunges to −5.95, signifying that the model’s predictions are severely out of alignment with the actual data, performing worse than simply predicting the mean. This drastic drop in R² is accompanied by substantial reductions in training time (by 95.21%) and prediction time (by 78.24%), reflecting the smaller dataset’s reduced size and complexity.
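For reference, the reported error metrics and the domain-to-domain percentage changes can be computed as in the following sketch, using standard scikit-learn conventions (variable and function names are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and R² as reported in Tables 8, 9, 11, and later."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),  # negative when worse than predicting the mean
    }

def pct_change(baseline, new):
    """Percentage change of a metric relative to a baseline value."""
    return 100.0 * (new - baseline) / abs(baseline)
```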
The percentage changes between the two domains underscore the challenges of model generalization when transitioning from a larger, composite dataset to a smaller, distinct one. They highlight the need for sophisticated approaches, such as TL, to adapt models trained on abundant and varied data to limited or specific datasets without compromising prediction quality and reliability. As such, the final phase takes the standalone model of Domain 1 and subjects it to the TL process, which consists of determining the optimal number of layers to freeze (zero to two in this case), thereby retaining learned features from Domain 1, and the number of new layers to introduce (up to 10), along with their respective units, to fine-tune the model for the specifics of Domain 2.
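A hedged sketch of this search, assuming a Keras implementation: the first k layers of the Domain 1 model are frozen, m new hidden layers of u units are appended ahead of a fresh output head, and the configuration with the lowest Domain 2 MSE is retained. The helper names and retraining details below are illustrative, not the authors’ code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def transfer_model(base, n_frozen, n_new, units):
    """Freeze the first n_frozen layers of `base`, then append n_new
    hidden layers of `units` units and a fresh regression head."""
    feature_layers = base.layers[:-1]      # drop the original output layer
    for i, layer in enumerate(feature_layers):
        layer.trainable = i >= n_frozen    # frozen layers keep Domain 1 features
    new_layers = [layers.Dense(units, activation="relu") for _ in range(n_new)]
    model = models.Sequential(feature_layers + new_layers + [layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
    return model

def search_tl_config(base, fit_and_score_mse,
                     freeze_range=range(3),   # 0 to 2 frozen layers (this experiment)
                     add_range=range(11),     # up to 10 new layers
                     unit_opts=(32, 64, 128)):
    """Exhaustive search; `fit_and_score_mse` trains a candidate on
    Domain 2 and returns its test MSE (user-supplied). A real
    implementation would clone `base` per candidate so that successive
    trials do not fine-tune already retrained weights."""
    scores = {
        (k, m, u): fit_and_score_mse(transfer_model(base, k, m, u))
        for k in freeze_range for m in add_range for u in unit_opts
    }
    return min(scores.items(), key=lambda kv: kv[1])
```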
Figure 9 below displays the MSE score (in percent squared, as the prediction target is percent brake wear per flight) as a function of the number of layers frozen and new layers added, with color indicating the number of units per added layer. Note that the same number of units (32, 64, or 128) is used for every added layer. The minimum MSE found is 0.0002, obtained by freezing only the first layer and adding three new layers with 32 units each.
Upon completion of this TL adaptation, the resulting model’s performance is rigorously compared against the baseline standalone model for Domain 2 (i.e., Variant 3) to highlight the impact of TL in enhancing model performance and generalizability across different aircraft variants. The corresponding results are shown in Table 9 below.
Table 9 reveals a substantial improvement in all predictive performance metrics when TL is employed. Specifically, the TL model decreases MAE by 65.41%, MSE by 86.76%, and RMSE by 63.62% compared to the standalone model trained exclusively on Domain 2 data. Most critically, the R² score shifts from negative to positive, demonstrating that the TL model explains a degree of variance in the data that the standalone model failed to capture. Additionally, the TL model’s training time is slightly reduced (by 3.78%), and the prediction time sees a more noticeable reduction of 21.23%. This efficiency in training and prediction, alongside the significant leap in accuracy and consistency of predictions, showcases the efficacy of TL in tuning the model to a specific, smaller dataset and enhancing its generalizability. The results also confirm that transferable features were captured during the pre-training phase, as poor feature engineering would have resulted in less effective knowledge transfer.
The improved performance metrics for the TL model indicate that leveraging the knowledge from Domain 1 (i.e., Variants 1 and 2) and fine-tuning for Domain 2 (i.e., Variant 3) effectively addresses the challenge posed by the smaller dataset size of the latter. TL maintains and improves the model’s predictive power, offering a more reliable, efficient, and generalizable tool for predicting aircraft brake wear. This result is a testament to the robustness of TL in applications where data may be scarce or where it is imperative to generalize across related but varied domains, such as different aircraft configurations. These findings align with trends reported in the literature for similar problems involving scarce data, where TL consistently demonstrates its ability to enhance model performance by leveraging knowledge from data-rich domains. However, this study’s significant improvement in both predictive performance and computational efficiency surpasses many reported outcomes, where TL often focuses solely on improving predictive performance without addressing operational constraints like training and prediction times. This dual improvement highlights the practical applicability of TL for real-world deployment in resource-constrained environments.
6.2.2. Experiment 2.2
In this case, the optimal model trained on Domain 1 (i.e., Variant 1) comprises five dense layers with 128, 32, 64, 32, and 1 unit, respectively, as summarized in Table 10. The dropout rate is set to 11.8%, and the model is trained using the Adam optimizer with a learning rate of 0.00063 and a batch size of 32 for 67 epochs. These optimal parameters were identified after HyperOpt completed 193 iterations.
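For context, the HyperOpt runs mentioned here and in Experiment 2.1 follow the usual fmin/TPE pattern sketched below. The search-space bounds and the model-building and scoring helpers are assumptions; only the use of HyperOpt and the evaluation counts come from the text.

```python
from hyperopt import fmin, tpe, hp, Trials

# Plausible search space covering the hyperparameters reported in
# Tables 7, 10, and 18 (the exact bounds are not given in the text).
space = {
    "n_layers":   hp.quniform("n_layers", 2, 6, 1),
    "units":      hp.choice("units", [32, 64, 128, 256]),
    "dropout":    hp.uniform("dropout", 0.0, 0.5),
    "lr":         hp.loguniform("lr", -9.2, -5.0),   # about 1e-4 to 7e-3
    "batch_size": hp.choice("batch_size", [32, 64]),
    "epochs":     hp.quniform("epochs", 50, 100, 1),
}

def objective(params):
    # build_candidate and train_and_score are user-supplied: construct a
    # Keras model from `params`, train it, and return the validation MSE.
    model = build_candidate(params)
    return train_and_score(model, params)

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=193, trials=trials)
```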
The optimal model described in Table 10 is trained and tested exclusively on data from Domain 1 (i.e., Variant 1), forming the standalone model for Domain 1. The same architectural framework is then applied to train and assess a model on Domain 2 (i.e., Variants 2 and 3) data, referred to as the standalone model for Domain 2.
Table 11 presents the comparative performance of the standalone models, emphasizing the change in predictive performance when transitioning from training on a single variant to a combined dataset of two different variants.
For the standalone model trained on Variant 1 data, the MAE is relatively low at 0.0036, and the MSE of 0.0002 reflects strong concordance between predicted and actual values. The RMSE of 0.0129 and a relatively high R² of 0.6826 indicate a model that fits the training data well and explains a substantial amount of variance. However, when the same model configuration is applied to the combined data of Variants 2 and 3, all error metrics increase noticeably: MAE rises by 187.89%, MSE by 312.24%, and RMSE by 103.04%. The R² value also decreases by 16.61%, indicating a decline in the model’s ability to capture variance. Meanwhile, both training and prediction times decrease (by 30.18% and 35.78%, respectively), likely because the cumulative dataset for Domain 2 is smaller than that for Domain 1 (9316 samples for Variants 2 and 3 combined vs. 11,957 samples for Variant 1, as shown in Table 2). Another contributing factor could be greater homogeneity in the combined data, resulting in faster convergence during training.
The increased error metrics for Domain 2 suggest that a model optimized for a specific variant does not directly translate into equally effective performance on data from different or combined variants; this reinforces the complexity inherent in aircraft performance data and the nuanced differences between aircraft variants, even within the same model family. These results re-emphasize the need for model re-tuning, or even redesign, to accommodate the subtleties of a new data domain.
Subsequently, the standalone model of Domain 1 is subjected to the TL process, during which different numbers of layers are iteratively frozen and the optimal number of additional layers and their units are identified.
Figure 10 below displays the MSE score versus the number of layers frozen and added, colored by the number of units in each new layer. The minimum MSE found is 0.0007, obtained by freezing none of the layers and adding no new layers. This implies that all of the model’s weights are adjusted during the TL process, effectively reducing TL to a weight-initialization scheme.
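In Keras terms, TL reduced to pure weight initialization amounts to copying the Domain 1 weights into a structurally identical model and retraining every layer on Domain 2, roughly as follows (names are illustrative):

```python
import tensorflow as tf

def init_from_source(source: tf.keras.Model) -> tf.keras.Model:
    """Clone the architecture and seed it with the source model's weights;
    every layer stays trainable, so Domain 2 training adjusts them all."""
    target = tf.keras.models.clone_model(source)
    target.set_weights(source.get_weights())
    target.compile(optimizer="adam", loss="mse")
    return target

# tl_model = init_from_source(domain1_model)
# tl_model.fit(X_dom2_train, y_dom2_train, batch_size=32, epochs=67)
```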
The TL model’s performance is compared against the baseline standalone model for Domain 2 (i.e., Variants 2 and 3), as shown in Table 12 below.
The results in Table 12 indicate a nuanced performance enhancement when applying TL. Specifically, the MAE improves by approximately 11%, signifying a more accurate model. The MSE and RMSE see minimal improvements of about 1.82% and 0.92%, respectively, suggesting that the TL model is slightly better at reducing the magnitude of prediction errors. A slight improvement is also observed in the R² score, which increases by approximately 1.38%, indicating that the TL model explains a higher percentage of the variance in the Domain 2 test data and providing evidence that the TL process has helped the model capture the underlying pattern more effectively. Regarding computational efficiency, the training time for the TL model is nearly halved (a decrease of 47.46%), indicating a more efficient learning process, likely because the initial weights provide a better starting point. Additionally, the prediction time decreases by nearly 20%, meaning the TL model can make predictions more rapidly, an important consideration for DT applications where time efficiency is critical.
Overall, these results highlight that the TL model provides a slight edge over the standalone model in terms of predictive performance while offering substantial improvements in computational efficiency. Once again, this demonstrates the effectiveness of TL in refining a pre-existing model trained on one aircraft variant to better suit a related but distinct dataset corresponding to different variants, reinforcing its value as a method for enhancing model generalizability and efficiency in the predictive modeling of brake wear.
6.2.3. Experiment 2.3
In this case, the TL process is executed twice, sequentially. The optimal neural network model derived for Domain 1, corresponding to the Variant 1 dataset, forms the foundation; its architectural details are listed in Table 10. This predefined architecture is then used to train and evaluate separate models for each domain, yielding standalone models for Domain 2, which encompasses the Variant 2 data, and Domain 3, pertaining to the Variant 3 data. These models are individually assessed to determine their predictive performance; Table 13 and Table 14 present the results of the standalone neural network models on the different domains.
Table 13 provides the standalone model metrics for each domain. For Domain 1 (i.e., Variant 1), the model achieves the lowest MAE, MSE, and RMSE alongside the highest R² score, suggesting that it fits this domain’s data well. When the model is trained and tested on Domain 2 (i.e., Variant 2), there is a noticeable uptick in MAE and MSE and a dip in the R² score, indicating less precise predictions than for Domain 1. The deterioration in predictive performance becomes even more pronounced for Domain 3 (i.e., Variant 3), which shows the highest MAE and MSE and the lowest R² score.
Table 14 provides a comparative perspective, detailing the percentage changes in the performance metrics for Domains 2 and 3 with Domain 1 as the benchmark. The MAE and MSE increase substantially for Domains 2 and 3, implying that the model’s error rates inflate when it is transferred to these domains; this shows that the Domain 1 model is ill suited to capturing the brake wear dynamics of the other domains without adjustment. The percentage change in the R² score for Domain 3 is particularly striking, declining by over 500%; since a 100% decline from a positive baseline would already reach zero, this places Domain 3’s R² well below zero. This stark decrease clearly indicates that the model, while adequate for the Variant 1 dataset, fails to generalize to the significantly different conditions and operational parameters of Variant 3. Additionally, the reductions in training and prediction times for Domains 2 and 3 reflect their smaller dataset sizes, which require less computational effort.
These results underscore the criticality of domain-specific nuances in predictive modeling. While a model may exhibit robust performance in one domain, its performance can vary significantly when applied to another, emphasizing the need for adaptive modeling approaches—such as TL—that allow models to adjust to variations in data distributions, operational conditions, and feature relevance across different domains. As such, the standalone model of Domain 1 is subjected to the TL process, which determines the optimal number of layers to freeze and the number of new layers to add along with their units, to fine-tune the model for Domain 2.
Figure 11 below displays the MSE score versus the number of layers frozen and added, colored by the number of units in each new layer. The minimum MSE across all unit options is 0.0007, obtained by freezing none of the layers (i.e., pure weight initialization) and adding four new layers with 32 units each.
Upon completion of this TL adaptation, the resulting model’s performance is rigorously compared against the baseline standalone model for Domain 2 (i.e., Variant 2) to highlight the impact of TL in enhancing model generalizability to Domain 2. The corresponding results are shown in Table 15 below.
Table 15 compares the performance metrics of the standalone and TL models, both tested on Domain 2 data (i.e., Variant 2). The MAE increases by 10.28%, suggesting that, despite the adaptation process, the TL model is slightly less accurate in terms of average prediction error than the standalone model trained directly on Domain 2. MSE and RMSE show minimal changes, with MSE remaining unchanged and RMSE decreasing slightly by 3.48%; this indicates a minor improvement in the TL model’s ability to mitigate large errors, reflected in slightly more consistent performance despite the higher average error. There is also an improvement in the R² score, which increases by 5.79%, indicating that the TL model better explains the variability in the dataset relative to the standalone model. Regarding computational efficiency, the training time for the TL model increases by 28.57%, indicating that more computational resources are needed, likely due to the complexity introduced by the TL process (i.e., the added layers). The prediction time sees a marginal decrease of 2.52%, representing slightly faster inference.
The TL model is subsequently fine-tuned once more to adapt it specifically for Domain 3 (i.e., Variant 3).
Figure 12 shows the MSE score for different numbers of frozen and added layers. The minimum MSE on Domain 3 is obtained by freezing 16 layers of the previous TL model and adding eight new layers with 32 units each.
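The sequential adaptation can be expressed by chaining the transfer_model() helper sketched earlier, with the freeze/add counts taken from the Figure 11 and Figure 12 optima. Note that the 16 frozen layers reported here presumably count every layer in the framework’s layer list (including dropout), and the fine-tuning budgets below are assumptions:

```python
# Round 1: Domain 1 (Variant 1) -> Domain 2 (Variant 2); Figure 11 optimum.
tl_dom2 = transfer_model(domain1_model, n_frozen=0, n_new=4, units=32)
tl_dom2.fit(X_dom2_train, y_dom2_train, batch_size=32, epochs=67)  # budget assumed

# Round 2: the adapted model becomes the source for Domain 3 (Variant 3);
# Figure 12 optimum: 16 frozen layers and eight new layers of 32 units.
tl_dom3 = transfer_model(tl_dom2, n_frozen=16, n_new=8, units=32)
tl_dom3.fit(X_dom3_train, y_dom3_train, batch_size=32, epochs=67)  # budget assumed
```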
Table 16 shows the final TL model’s performance on Domain 3 compared to that of the standalone model for the same domain.
Table 16 delineates the comparative outcomes of the final TL model and the standalone model on Domain 3 (i.e., Variant 3). The TL model significantly outperforms the standalone model tailored to the same domain across several critical performance metrics. It reduces the MAE by 57.68%, a substantial improvement in the average magnitude of the errors that demonstrates the TL model’s enhanced capability to predict closer to the actual values. The MSE decreases by 75.10% and the RMSE by 50.10%, highlighting the TL model’s ability to reduce large errors. The R² score increases by 102.14%; this shift from a negative to a positive value indicates that the TL model explains some of the variance in the dataset, unlike the standalone model, which performed worse than simply predicting the mean.
Regarding computational effort, the training time for the TL model increases by 76.05%, reflecting an additional computational burden, likely due to the complexity of the TL process in this case. The prediction time remains almost unchanged, with a negligible decrease of 0.43%. These results affirm that the TL process has effectively leveraged the knowledge gained from the broader datasets of Domains 1 and 2 to enhance the model’s performance on the more challenging Domain 3 dataset. The stark improvements in the error metrics and the R² score particularly underscore the value of employing TL to address challenges posed by limited data diversity and volume, as with the dataset corresponding to Variant 3.
To complete the analysis, Table 17 contrasts the performance metrics of the standalone model and the final TL model on Domain 2 (i.e., Variant 2) after the TL process was applied to fine-tune the model to Domain 3 data. Note that all TL models were evaluated against the standalone models on the respective target domains.
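The comparison behind Table 17 amounts to scoring the final model on the earlier target domain as well as the latest one, which exposes any capability lost during the second adaptation; a sketch using the regression_metrics() helper from above (data variables are illustrative):

```python
# Score the final (Domain 3-tuned) model on both target domains.
for name, (X, y) in {
    "Domain 2 (Variant 2)": (X_dom2_test, y_dom2_test),
    "Domain 3 (Variant 3)": (X_dom3_test, y_dom3_test),
}.items():
    preds = tl_dom3.predict(X).ravel()
    print(name, regression_metrics(y, preds))
```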
The results in Table 17 show a stark divergence from the earlier adaptation, with significant regressions in the TL model’s performance metrics. MAE increases dramatically by 164.28%, indicating that the adjustments made to better suit Domain 3 have adversely affected the model’s accuracy on Domain 2. The MSE and RMSE show similar trends, with increases of 139.72% and 54.83%, respectively, confirming that the model’s predictive performance on Domain 2 deteriorated. The R² score declines from a moderately positive 0.5419 to −0.0981, a change of −118.11%. This drastic decrease indicates that the model’s ability to explain the variability in the Domain 2 dataset has become worse than that of a model merely predicting the mean of the target values.
Regarding computational efficiency, the training time was reduced by 87.68%, indicating a quicker training phase. The prediction time saw a slight increase of about 6.93%, which, although minimal, indicates that the model requires slightly more computation time for predictions despite its reduced accuracy and effectiveness.
These results illustrate that while the TL model optimized for Domain 3 has improved its performance on that specific dataset, its generalization to Domain 2 has suffered significantly, highlighting a critical challenge in TL applications: optimizing a model for one domain may significantly degrade its performance on another when the domains differ substantially in characteristics or data distribution. This necessitates careful consideration and potentially more bespoke tuning for each domain, ensuring that improvements in one area do not undermine performance elsewhere.
6.2.4. Experiment 2.4
In this scenario, Domain 1 comprises the data corresponding to Variant 2, a slightly smaller dataset than that of Variant 1 (as detailed in Table 2). The optimal model identified through HyperOpt for the Variant 2 data consists of six dense layers with 128, 128, 256, 128, 32, and 1 unit, respectively, as shown in Table 18. The dropout rate is set to 13.2%, and the model is trained using the Adam optimizer with a learning rate of 0.00068 and a batch size of 64 for 90 epochs. These optimal parameters were identified after HyperOpt completed 100 evaluations.
The above optimal model is trained and tested exclusively on data from Domain 1 (i.e., Variant 2 data). The same architectural framework is then applied to train and assess a model on Domain 2 (i.e., Variants 1 and 3) data.
Table 19 presents a comparative performance of these standalone models.
The comparative analysis in Table 19 outlines the performance of the standalone models, whose architecture was optimized for Domain 1 (i.e., Variant 2 data). The standalone model trained on Domain 1 shows an MAE of 0.0096, an MSE of 0.0007, and an RMSE of 0.0258, with an R² score of 0.5919. These metrics imply reasonable predictive performance.
Interestingly, a performance improvement is observed when the same model structure is applied to Domain 2, combining data from Variants 1 and 3. The MAE drops significantly, by approximately 54.33%, while the MSE and RMSE decrease by 69.17% and 44.48%, respectively, suggesting that the model trained on Domain 2 makes fewer large errors in its predictions. Moreover, the R² score improves slightly, increasing by 2.35%, showing that the Domain 2 model captures a marginally higher percentage of variance than the Domain 1 model. These improvements suggest that the combined data of Variants 1 and 3 offer a richer representation of the underlying brake wear patterns and a broader spectrum of operational conditions, helping the model learn more effectively. On the other hand, training and prediction times increase for the Domain 2 model by 50.24% and 53.30%, respectively, likely due to the larger combined dataset of Domain 2, which demands more computational resources for training and inference. These results emphasize the importance of comprehensive datasets encompassing various conditions for training models with improved generalization capabilities for predictive maintenance applications.
Still, the standalone model of Domain 1 is subjected to the TL process, where different numbers of layers are frozen and new layers are added.
Figure 13 below shows the MSE score by the number of layers frozen and added, colored by the number of units in each new layer. The minimum MSE found is 0.0002, obtained by freezing none of the layers and adding one new layer with 32 units.
The TL model’s performance is compared against the baseline standalone model for Domain 2 (i.e., Variants 1 and 3), as shown in Table 20 below.
The standalone model’s performance on Domain 2 already demonstrates strong predictive capability, capturing over 60% of the variance in the target variable. After applying TL, the model sees a 61.69% increase in MAE and a 7.61% improvement in R², while MSE remains constant and RMSE improves by a slight 6.03%. Notably, the rise in MAE means that, on average, the TL model’s predictions are further from the actual values than those of the standalone model. However, the improvement in the R² score indicates that, despite the larger average errors, the TL model fits the variability in the data better than the standalone model. The R² gain and the slight RMSE reduction point towards the TL model’s enhanced ability to capture the variance within the combined dataset, even if it occasionally makes larger errors; a toy example of this trade-off follows below. This underscores the balance between achieving low error rates and capturing the data’s underlying structure.
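A toy numeric illustration of this trade-off (unrelated to the actual data): a predictor that tracks the target’s trend with a constant bias can post a worse MAE than a flat mean predictor yet a better R², because R² rewards explained variance rather than absolute error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
flat = np.full_like(y, y.mean())   # always predicts the mean
biased = y + 1.25                  # follows the trend with a constant offset

print(mean_absolute_error(y, flat),   r2_score(y, flat))    # 1.20, 0.000
print(mean_absolute_error(y, biased), r2_score(y, biased))  # 1.25, 0.219 (approx.)
```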
The training time for the TL model is about 6.56% higher than that of the standalone model, while the prediction time shows a modest improvement of approximately 6.84%. These shifts in computational performance are minor but reveal that the TL process, particularly with adding a new layer, necessitates slightly more time for training, which is a reasonable trade-off considering that the data size for Domain 2 is larger than that for Domain 1. While the TL model does not uniformly surpass the standalone model across all metrics, it showcases its strength in capturing the complex variance of the data, thereby suggesting that it could be a more reliable choice in varied real-world scenarios, even if it incurs slightly higher errors on average.
The aforementioned four experiments were conducted to assess the ability of TL to adapt a model developed on one domain, comprising specific aircraft configurations, to another, highlighting its potential to improve predictive performance across varied datasets. The results underscore the value of advanced modeling techniques like TL in keeping predictive models robust and reliable across different aircraft variants and their varying operating conditions, enhancing the models’ practical utility in real-world applications. Among Experiments 2.1–2.4, Experiment 2.1 demonstrated the most effective use of TL to enhance model generalizability, showing a substantial improvement in all performance metrics when TL was applied, particularly in minimizing MAE, MSE, and RMSE, alongside a pronounced shift from a negative to a positive R² score. These results indicate that the TL model became more accurate and more capable of capturing the variance of the smaller, specific dataset of Variant 3 than the standalone model. This adaptability is important for predictive performance in diverse operational scenarios, such as the different variants of a particular widebody aircraft.
The critical takeaway from these experiments is that TL can significantly enhance a model’s generalizability and predictive accuracy when properly tuned, particularly in fields like aviation, where operational conditions can vary widely between datasets. These results also suggest that the sequence of domains on which the model is trained significantly influences the effectiveness of TL. For instance, starting with a domain that offers greater data diversity or encompasses a broader spectrum of operational conditions (e.g., Variants 1 and 2 combined) provides a stronger foundation for generalization. This ensures that the model learns more comprehensive and transferable features, which can then be fine-tuned for smaller, more specific domains like Variant 3. Conversely, starting with a less diverse or more narrowly focused domain may limit the model’s ability to capture generalizable patterns, making subsequent adaptation through TL less effective. Therefore, the choice and sequence of training domains should be strategically aligned with the data’s characteristics and the desired application to maximize the benefits of TL.