Towards Generic Failure-Prediction Models in Large-Scale Distributed Computing Systems
Abstract
1. Introduction
- We proposed a comprehensive methodology for developing and evaluating cross-layer and cross-platform failure-prediction models in DC environments.
- We conducted an in-depth exploratory data analysis (EDA) and statistically examined system failure characteristics, offering valuable insights for researchers seeking to understand complex failure patterns in DC systems.
- We developed, evaluated, and validated a range of ML-based failure-prediction models at multiple hierarchical levels (cluster and site) within a DC platform, as well as across different DC platforms.
- We proposed a hierarchical architecture that integrates the failure-prediction models to enhance DC system reliability and cost efficiency.
2. Related Work
3. Overview of Failure Traces
4. Methodology
4.1. Data Preprocessing and Analysis
4.2. Model Generation
4.2.1. Data Splitting
4.2.2. Model Selection
4.2.3. Model Training and Testing
5. Model Validation and Discussion
5.1. Intra-Platform Validations
5.1.1. Within-Layer Prediction Results
5.1.2. Cross-Layer Prediction Results
5.2. Inter-Platform Validations
5.2.1. Within-Layer Prediction Results
5.2.2. Cross-Layer Prediction Results
5.3. System Level Validations
6. Proposed System Architecture
6.1. Reliable Distributed Computing System Architecture: A Conceptual Model
6.2. Balancing Reliability and Cost in Service Design
- Cost-Optimized Service (No Explicit Reliability Requirements): For a client submitting a data analytics workload with strict cost constraints but no reliability requirements, the service provider can build the service on the system-level resource manager, which employs predictions from the system-level module. This module achieves an R² of 0.97 for TBF with XGBoost (Table 21), enabling system-wide resource allocation with 81.38% system availability (Section 4.1) at a reduced computational cost, balancing affordability against baseline service reliability.
- High-Reliability Service (No Budget Constraints): For a client submitting a mission-critical medical imaging application that demands maximal uptime with no cost constraints, the service is built on cluster-level managers using predictions from the cluster-level modules, which offer 100% FNI accuracy and an R² of 1.0 for TBF in clusters such as s1/c2 (Table 9 and Table 11). This ensures higher availability, up to 98.51% (Table 2), and minimal downtime.
- Balanced Cost-Reliability Service: For clients seeking a balance between reliability and cost, the service is built on site-level resource managers that use site-level predictions, such as an R² of 0.99 for TBF at site s4 (Table 21), with an availability of 90.23% (Table 1). This approach strategically allocates resources across sites to balance the cost and reliability of the service.
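The three service tiers above amount to a simple selection policy over manager levels. The sketch below is illustrative only: the `ClientRequest` and `select_manager_level` names and the decision rules are assumptions made for exposition, not an API defined in the paper.

```python
from dataclasses import dataclass

# Hypothetical sketch of the tiered service-design policy described above.
# The paper does not prescribe a concrete interface; names are illustrative.

@dataclass
class ClientRequest:
    needs_high_reliability: bool  # e.g., mission-critical workloads
    cost_constrained: bool        # strict budget limits

def select_manager_level(req: ClientRequest) -> str:
    """Map client requirements to the resource-manager level that serves
    them, following the cost/reliability trade-off outlined above."""
    if req.needs_high_reliability and not req.cost_constrained:
        return "cluster"  # highest availability, highest cost
    if req.cost_constrained and not req.needs_high_reliability:
        return "system"   # cheapest, baseline availability
    return "site"         # balanced cost and reliability

print(select_manager_level(ClientRequest(True, False)))  # cluster
```

A real resource manager would of course weigh continuous cost and availability targets rather than two booleans; the point here is only the mapping from requirements to manager level.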
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Table 1. Failure characteristics at the site level.

Site | #Nodes | #Failures | MTBF | SD(MTBF) | CV(MTBF) | MTTR | SD(MTTR) | CV(MTTR) | Availability (%) | Maintainability (%)
---|---|---|---|---|---|---|---|---|---|---
s1 | 259 | 69,234 | 28.77 | 73.62 | 2.56 | 13.36 | 64.93 | 4.86 | 68.29 | 31.71 |
s2 | 95 | 8161 | 41.31 | 109.66 | 2.65 | 14.11 | 72.63 | 5.15 | 74.57 | 25.43 |
s3 | 57 | 5760 | 102.71 | 196.58 | 1.91 | 5.85 | 89.97 | 15.39 | 94.62 | 5.38 |
s4 | 163 | 37,470 | 37.47 | 89.72 | 2.39 | 4.06 | 31.27 | 7.70 | 90.23 | 9.77 |
s5 | 125 | 7509 | 79.55 | 271.63 | 3.41 | 36.02 | 217.04 | 6.03 | 68.83 | 31.17 |
s6 | 127 | 14,115 | 89.73 | 207.11 | 2.31 | 6.21 | 74.45 | 11.98 | 93.52 | 6.48 |
s7 | 46 | 21,120 | 14.57 | 33.98 | 2.33 | 0.96 | 15.32 | 15.97 | 93.82 | 6.18 |
s8 | 342 | 123,467 | 22.86 | 46.82 | 2.05 | 4.29 | 39.18 | 9.12 | 84.19 | 15.81 |
s9 | 74 | 7309 | 27.62 | 72.48 | 2.62 | 6.34 | 39.73 | 6.26 | 81.32 | 18.68 |
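The Availability, Maintainability, and CV columns in these tables are consistent with the standard steady-state reliability formulas (Availability = MTBF / (MTBF + MTTR), CV = SD / mean). The tables themselves do not state the formulas, so treat this derivation as an assumption; checking row s4 reproduces its entries to within the rounding of the tabulated inputs.

```python
def availability_pct(mtbf: float, mttr: float) -> float:
    # Steady-state availability: time between failures as a fraction
    # of the full fail-repair cycle, in percent.
    return 100.0 * mtbf / (mtbf + mttr)

def maintainability_pct(mtbf: float, mttr: float) -> float:
    # Complement of availability: fraction of the cycle spent in repair.
    return 100.0 * mttr / (mtbf + mttr)

# Row s4 from the site table: MTBF = 37.47, MTTR = 4.06
print(round(availability_pct(37.47, 4.06), 2))     # ~90.22 (table: 90.23)
print(round(maintainability_pct(37.47, 4.06), 2))  # ~9.78  (table: 9.77)

# CV is the ratio of standard deviation to mean, e.g. for s4's MTBF:
print(round(89.72 / 37.47, 2))  # 2.39 (matches CV(MTBF) for s4)
```

The small 0.01 discrepancies arise because MTBF and MTTR are themselves rounded in the table.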
Table 2. Failure characteristics at the cluster level.

Cluster | #Nodes | #Failures | MTBF | SD(MTBF) | CV(MTBF) | MTTR | SD(MTTR) | CV(MTTR) | Availability (%) | Maintainability (%)
---|---|---|---|---|---|---|---|---|---|---
s1/c1 | 64 | 13,067 | 43.27 | 111.13 | 2.57 | 21.15 | 91.07 | 4.31 | 63.12 | 36.88 |
s1/c2 | 64 | 25,461 | 22.26 | 41.09 | 1.85 | 10.22 | 40.75 | 3.99 | 74.88 | 25.12 |
s1/c3 | 99 | 25,954 | 26.14 | 62.27 | 2.38 | 6.26 | 46.84 | 7.48 | 77.33 | 22.67 |
s1/c4 | 32 | 4752 | 38.23 | 117.71 | 3.08 | 47.50 | 126.78 | 2.67 | 34.47 | 65.53 |
s2/c1 | 95 | 8161 | 41.37 | 109.66 | 2.65 | 14.11 | 72.63 | 5.15 | 74.57 | 25.43 |
s3/c1 | 57 | 5760 | 102.71 | 196.58 | 1.91 | 5.85 | 89.97 | 15.39 | 94.62 | 5.38 |
s4/c1 | 106 | 34,541 | 34.40 | 85.00 | 2.47 | 4.31 | 32.24 | 7.48 | 88.87 | 11.13 |
s4/c2 | 57 | 2929 | 73.76 | 127.94 | 1.73 | 1.12 | 15.51 | 13.90 | 98.51 | 1.49 |
s5/c1 | 56 | 5348 | 101.23 | 316.88 | 3.13 | 29.11 | 167.98 | 5.77 | 77.67 | 22.33 |
s5/c2 | 69 | 2161 | 26.47 | 68.13 | 2.57 | 53.12 | 305.73 | 5.76 | 33.26 | 66.74 |
s6/c1 | 103 | 12,998 | 94.44 | 213.31 | 2.26 | 5.40 | 67.86 | 12.57 | 94.59 | 5.41 |
s6/c2 | 24 | 1117 | 35.93 | 99.69 | 2.77 | 15.69 | 127.97 | 8.16 | 69.61 | 30.39 |
s7/c1 | 46 | 21,120 | 14.57 | 33.98 | 2.33 | 0.96 | 15.32 | 15.97 | 93.82 | 6.18 |
s8/c1 | 342 | 123,467 | 22.86 | 46.82 | 2.05 | 4.29 | 39.18 | 9.12 | 84.19 | 15.81 |
s9/c1 | 74 | 7309 | 27.62 | 72.48 | 2.62 | 6.34 | 39.73 | 6.26 | 81.32 | 18.68 |
LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Split
---|---|---|---|---|---|---|---|---|---
2,907,292.09 | 3,550,861 | 0.93 | 1,391,069.55 | 1,979,144.19 | 0.98 | 528,141.4 | 764,059.13 | 1 | Training
2,909,707.78 | 3,549,290.19 | 0.93 | 1,425,669.2 | 2,008,907.7 | 0.98 | 547,655.48 | 797,043.5 | 1 | Testing
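The MAE, RMSE, and R² columns reported throughout the regression tables can be computed with scikit-learn, which the paper lists among its tools. A minimal sketch on synthetic TBF-like data (the data itself is invented here purely to exercise the metrics):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(42)
y_true = rng.uniform(0, 1e7, size=100)          # synthetic TBF-like targets
y_pred = y_true + rng.normal(0, 5e5, size=100)  # predictions with noise

mae = mean_absolute_error(y_true, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:,.2f}  RMSE={rmse:,.2f}  R2={r2:.2f}")
```

Note that RMSE is always at least as large as MAE, and an R² near 1 (as in the XGBoost columns) means the model explains almost all variance in the targets.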
Hyperparameter | Random Forest | XGBoost |
---|---|---|
Number of Trees (n_estimators) | 10 | 100 |
Maximum Depth (max_depth) | 5 | 5 |
Random Seed (random_state) | 42 | 42 |
Feature Subsampling (max_features) | 1.0 | Not tuned (default: 1.0) |
Minimum Samples per Split (min_samples_split) | 2 | Not tuned (default: 1) |
Minimum Samples per Leaf (min_samples_leaf) | 1 | Not tuned (default: 1) |
Learning Rate (learning_rate) | NA | 0.1 |
Bootstrap Sampling (bootstrap) | True | Not tuned (default: 1.0) |
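Assuming the models were built with scikit-learn (listed in the toolchain) and the xgboost package, the Random Forest settings in the table above would instantiate as below. The synthetic data and the `XGBRegressor` line (left as a comment in case the package is unavailable) are illustrative, not from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Random Forest hyperparameters as listed in the table above (TBF models).
rf = RandomForestRegressor(
    n_estimators=10,      # Number of Trees
    max_depth=5,          # Maximum Depth
    random_state=42,      # Random Seed
    max_features=1.0,     # Feature Subsampling (all features)
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,
)

# The XGBoost counterpart would be configured analogously, e.g.:
# from xgboost import XGBRegressor
# xgb = XGBRegressor(n_estimators=100, max_depth=5,
#                    learning_rate=0.1, random_state=42)

# Smoke test on synthetic data
X = np.arange(40, dtype=float).reshape(-1, 2)
y = X.sum(axis=1)
rf.fit(X, y)
print(len(rf.estimators_))  # 10
```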
LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Split
---|---|---|---|---|---|---|---|---|---
82,799.85 | 241,987.31 | 0.06 | 25,152.81 | 85,626.94 | 0.88 | 72,240.72 | 180,713.41 | 0.48 | Training
80,839.64 | 173,716.85 | 0.09 | 63,601.27 | 168,353.66 | 0.15 | 71,728.74 | 161,147.80 | 0.22 | Testing
Hyperparameter | Random Forest | XGBoost |
---|---|---|
Number of Trees (n_estimators) | 5 | 5 |
Maximum Depth (max_depth) | unlimited | unlimited |
Random Seed (random_state) | 42 | 42 |
Feature Subsampling (max_features) | 1.0 | Not tuned (default: 1.0) |
Minimum Samples per Split (min_samples_split) | 2 | Not tuned (default: 1) |
Minimum Samples per Leaf (min_samples_leaf) | 1 | Not tuned (default: 1) |
Learning Rate (learning_rate) | NA | 0.3 |
Bootstrap Sampling (bootstrap) | True | Not tuned (default: 1.0) |
Logistic Regression | Random Forest | XGBoost |
---|---|---|
2.73% | 97% | 99% |
Hyperparameter | Logistic Regression | Random Forest | XGBoost |
---|---|---|---|
Number of Trees (n_estimators) | NA | 10 | 100 |
Maximum Depth (max_depth) | unlimited | unlimited | unlimited
Random Seed (random_state) | 42 | 42 | 42 |
Feature Subsampling (max_features) | NA | max_features: ‘sqrt’ | 1.0 |
Minimum Samples per Split (min_samples_split) | NA | 2 | 1 |
Minimum Samples per Leaf (min_samples_leaf) | NA | 1 | 1 |
Learning Rate (learning_rate) | NA | NA | 0.3 |
Bootstrap Sampling (bootstrap) | NA | True | 1.0 |
Objective | penalty: ‘l2’ | criterion: ‘gini’ | objective: ‘multi:softprob’ |
Regularization Strength (C) | 1.0 | NA | NA |
LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---
1,516,222.63 | 1,878,392.72 | 0.8 | 711,677.83 | 938,202.09 | 0.95 | 249,227.43 | 340,042.93 | 0.99 | Baseline
1,503,313.88 | 1,863,247.75 | 0.8 | 748,930.95 | 1,000,889.96 | 0.94 | 311,125.64 | 427,993.66 | 0.99 | Generic
LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---
168,434.33 | 320,717.22 | 0.16 | 54,213.23 | 143,246.53 | 0.83 | 136,879.06 | 274,331.86 | 0.39 | Baseline
170,774.77 | 335,822.34 | 0.16 | 135,691.16 | 329,574.96 | 0.19 | 149,090.82 | 313,381.94 | 0.27 | Generic
Logistic Regression | Random Forest | XGBoost | Model
---|---|---|---
5.76% | 91% | 90% | Baseline
4.01% | 89% | 90% | Generic
LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---
2,929,578.49 | 3,639,185.1 | 0.93 | 1,504,051.79 | 2,236,507.97 | 0.97 | 587,965.81 | 843,305.69 | 1 | Baseline
2,970,018.64 | 3,677,263.41 | 0.93 | 1,492,496.03 | 2,197,940.25 | 0.97 | 608,063.14 | 876,199.7 | 1 | Generic
LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---
90,910.18 | 250,519.23 | 0.07 | 27,631.40 | 96,427.20 | 0.86 | 79,319.45 | 194,747.05 | 0.44 | Baseline
89,362.67 | 190,010.69 | 0.10 | 71,980.13 | 297,582.00 | −1.21 | 80,320.44 | 203,852.45 | −0.04 | Generic
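The negative R² values for the generic models (e.g., −1.21 for Random Forest) are not errors: R² drops below zero whenever a model's predictions are worse than simply predicting the mean of the test targets. A tiny worked example with scikit-learn's `r2_score` (the data is invented to illustrate the definition):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean of the targets yields exactly R^2 = 0 ...
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))
print(r2_mean)  # 0.0

# ... and predictions worse than the mean push R^2 below zero,
# which is what the negative Generic-model entries above indicate.
r2_bad = r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0]))
print(r2_bad)  # -3.0
```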
Logistic Regression | Random Forest | XGBoost | Model
---|---|---|---
2.83% | 97% | 98% | Baseline
1.59% | 96% | 98% | Generic
Cluster | LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---|---
s1/c1 | 2,199,158.49 | 3,268,040.41 | 0.88 | 1,324,695.21 | 2,316,888.16 | 0.94 | 347,184.94 | 559,061.22 | 1 | Baseline
 | 2,189,153.88 | 3,218,301.35 | 0.88 | 1,348,533.01 | 2,346,680.86 | 0.94 | 379,420.4 | 632,303.97 | 1 | Generic
s1/c2 | 2,659,912.14 | 3,273,118.61 | 0.95 | 1,749,929.32 | 2,314,754.93 | 0.98 | 596,981.65 | 791,038.14 | 1 | Baseline
 | 2,680,141.68 | 3,298,153.24 | 0.95 | 1,739,446.02 | 2,300,283.33 | 0.98 | 605,162.2 | 804,285.57 | 1 | Generic
s1/c3 | 3,233,986.4 | 4,122,216.75 | 0.77 | 970,806.41 | 1,500,475.02 | 0.97 | 459,026.57 | 709,658.3 | 0.99 | Baseline
 | 3,262,003.85 | 4,140,497.15 | 0.77 | 995,052.43 | 1,552,075.99 | 0.97 | 476,904.74 | 742,083.37 | 0.99 | Generic
s1/c4 | 5,405,297.07 | 7,121,341.45 | 0.68 | 1,230,335.78 | 1,693,237.59 | 0.98 | 299,407.66 | 438,295.97 | 1 | Baseline
 | 5,565,339.12 | 7,293,853.22 | 0.68 | 1,293,780.17 | 1,769,684.09 | 0.98 | 368,256.01 | 587,340.22 | 1 | Generic
s5/c1 | 3,908,274.31 | 6,371,524.43 | 0.61 | 1,463,884.23 | 2,141,642.88 | 0.96 | 500,899.21 | 759,944.39 | 0.99 | Baseline
 | 4,048,997.78 | 6,616,939.18 | 0.59 | 1,524,337.44 | 2,287,389.78 | 0.95 | 596,031.88 | 1,017,532.05 | 0.99 | Generic
s5/c2 | 723,416.46 | 1,077,801.67 | 0.62 | 259,957.17 | 418,968.67 | 0.94 | 78,197.26 | 118,401.72 | 1 | Baseline
 | 796,535.8 | 1,218,561.45 | 0.6 | 272,547.21 | 437,900.86 | 0.95 | 104,169.1 | 180,896.82 | 0.999 | Generic
s6/c1 | 3,153,144.41 | 4,279,153.41 | 0.8 | 1,501,901.58 | 2,182,402.99 | 0.95 | 780,128.98 | 1,180,636.25 | 0.99 | Baseline
 | 3,081,761.98 | 4,162,700.42 | 0.81 | 1,540,490.79 | 2,264,878.39 | 0.94 | 833,206.58 | 1,284,525.99 | 0.98 | Generic
s6/c2 | 1,298,294.51 | 1,539,855.13 | 0.72 | 312,892.46 | 569,396.19 | 0.96 | 102,592.13 | 194,585.32 | 1 | Baseline
 | 1,334,536.26 | 1,563,239.86 | 0.72 | 337,260.72 | 604,949.25 | 0.96 | 202,532.96 | 403,870.33 | 0.98 | Generic
Cluster | LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---|---
s1/c1 | 152,763.12 | 349,213.89 | 0.07 | 15,694.34 | 77,177.63 | 0.95 | 103,916.68 | 215,901.53 | 0.64 | Baseline
 | 157,409.89 | 381,882.06 | 0.06 | 43,177.68 | 182,658.91 | 0.78 | 109,583.82 | 240,805.48 | 0.63 | Generic
s1/c2 | 69,610.00 | 147,501.28 | 0.03 | 17,886.32 | 51,671.62 | 0.88 | 61,943.50 | 127,711.20 | 0.27 | Baseline
 | 69,486.31 | 144,658.84 | 0.03 | 47,007.45 | 108,098.98 | 0.46 | 63,099.48 | 126,937.25 | 0.25 | Generic
s1/c3 | 82,588.79 | 193,336.03 | 0.07 | 17,144.05 | 68,314.25 | 0.88 | 61,183.82 | 155,740.73 | 0.40 | Baseline
 | 82,699.51 | 199,264.32 | 0.06 | 44,046.25 | 156,888.36 | 0.42 | 61,663.23 | 167,280.48 | 0.34 | Generic
s1/c4 | 214,730.28 | 446,047.88 | 0.03 | 29,888.69 | 117,178.39 | 0.93 | 152,044.45 | 310,917.54 | 0.53 | Baseline
 | 207,948.02 | 378,717.08 | 0.04 | 85,154.97 | 283,222.59 | 0.46 | 156,238.60 | 295,533.46 | 0.41 | Generic
s5/c1 | 336,979.00 | 867,680.11 | 0.07 | 71,563.04 | 278,300.57 | 0.90 | 192,955.42 | 517,171.39 | 0.67 | Baseline
 | 357,527.44 | 977,364.90 | 0.07 | 218,493.89 | 769,349.08 | 0.42 | 240,967.19 | 769,882.47 | 0.42 | Generic
s5/c2 | 219,572.57 | 640,135.30 | 0.38 | 23,718.38 | 90,259.35 | 0.99 | 78,866.88 | 201,647.32 | 0.94 | Baseline
 | 226,461.26 | 663,247.17 | 0.13 | 67,366.99 | 299,789.34 | 0.82 | 86,299.34 | 284,238.21 | 0.84 | Generic
s6/c1 | 255,238.28 | 558,095.17 | 0.10 | 63,722.97 | 222,363.16 | 0.86 | 179,200.94 | 432,513.96 | 0.46 | Baseline
 | 252,873.31 | 581,151.54 | 0.09 | 167,258.92 | 521,851.38 | 0.27 | 187,279.23 | 505,451.88 | 0.31 | Generic
s6/c2 | 159,384.58 | 419,298.69 | 0.04 | 52,639.95 | 187,566.56 | 0.81 | 102,106.25 | 286,142.70 | 0.55 | Baseline
 | 151,952.16 | 346,023.51 | 0.03 | 146,792.12 | 412,390.40 | −0.38 | 121,590.90 | 322,413.14 | 0.16 | Generic
Cluster | Logistic Regression | Random Forest | XGBoost | Model
---|---|---|---|---
s1/c1 | 2.01% | 70% | 72% | Baseline
 | 1.70% | 68% | 71% | Generic
s1/c2 | 3.78% | 100% | 100% | Baseline
 | 2.76% | 100% | 100% | Generic
s1/c3 | 4.52% | 89% | 92% | Baseline
 | 2.16% | 86% | 91% | Generic
s1/c4 | 9.63% | 84% | 87% | Baseline
 | 9.47% | 83% | 85% | Generic
s5/c1 | 7.11% | 96% | 97% | Baseline
 | 6.37% | 96% | 97% | Generic
s5/c2 | 6.47% | 99% | 100% | Baseline
 | 3.55% | 98% | 100% | Generic
s6/c1 | 3.57% | 71% | 77% | Baseline
 | 2.19% | 66% | 76% | Generic
s6/c2 | 8.29% | 100% | 100% | Baseline
 | 6.43% | 100% | 100% | Generic
Site | LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---|---
s1 | 3,594,633.79 | 4,804,189.99 | 0.89 | 2,610,815.28 | 4,170,278.14 | 0.92 | 902,671.41 | 1,344,152.73 | 0.99 | Baseline
 | 3,603,998.9 | 4,820,869.4 | 0.89 | 2,633,029.66 | 4,183,762.66 | 0.92 | 922,609.09 | 1,375,960.08 | 0.99 | Generic
s2 | 1,875,998.15 | 2,654,180.88 | 0.89 | 1,298,825.62 | 1,832,759 | 0.95 | 388,923.52 | 542,769.17 | 1 | Baseline
 | 1,864,392.5 | 2,632,621.32 | 0.89 | 1,297,918.93 | 1,862,982.86 | 0.94 | 435,750.29 | 623,398.86 | 0.99 | Generic
s3 | 2,243,590.35 | 3,359,843.84 | 0.88 | 1,474,401.75 | 2,171,123.02 | 0.95 | 579,448.46 | 781,626.05 | 0.99 | Baseline
 | 2,308,853.7 | 3,520,793.58 | 0.87 | 1,589,971.91 | 2,469,831.48 | 0.94 | 667,541.89 | 936,021.59 | 0.99 | Generic
s5 | 3,371,720.22 | 5,644,021.54 | 0.64 | 1,629,967.04 | 2,527,218.54 | 0.93 | 468,405.11 | 767,984.11 | 0.99 | Baseline
 | 3,320,485.51 | 5,528,168.72 | 0.65 | 1,713,441.11 | 2,600,937.34 | 0.92 | 523,765.47 | 891,070.42 | 0.99 | Generic
s6 | 3,015,870.24 | 4,134,077.2 | 0.82 | 1,489,560.71 | 2,233,483.46 | 0.95 | 788,098.41 | 1,201,042.1 | 0.98 | Baseline
 | 3,013,126.08 | 4,107,923.41 | 0.81 | 1,486,652.87 | 2,231,897.67 | 0.94 | 846,030.15 | 1,297,938.08 | 0.98 | Generic
s7 | 3,812,223.14 | 4,685,561.77 | 0.59 | 1,355,112.79 | 2,109,539.43 | 0.92 | 468,925.3 | 742,990.44 | 0.99 | Baseline
 | 3,769,956.96 | 4,637,465.53 | 0.6 | 1,359,839.76 | 2,114,679.41 | 0.92 | 480,980.74 | 761,642.39 | 0.99 | Generic
s8 | 2,934,946.18 | 3,850,773.55 | 0.93 | 2,717,223.55 | 3,712,231.89 | 0.94 | 1,192,225.37 | 1,733,187.45 | 0.99 | Baseline
 | 2,942,925.57 | 3,865,924.17 | 0.93 | 2,738,509.14 | 3,748,577.23 | 0.93 | 1,207,040.17 | 1,751,338.91 | 0.99 | Generic
s9 | 1,568,903.71 | 2,132,447.66 | 0.38 | 509,430.12 | 794,061.51 | 0.91 | 221,646.51 | 338,657.31 | 0.98 | Baseline
 | 1,537,169.02 | 2,107,557.67 | 0.39 | 521,487.09 | 807,399.1 | 0.91 | 247,652.1 | 384,438.38 | 0.98 | Generic
Site | LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---|---
s1 | 98,084.42 | 243,160.00 | 0.03 | 19,472.36 | 74,867.44 | 0.91 | 86,326.38 | 204,099.35 | 0.32 | Baseline
 | 99,591.32 | 261,988.05 | 0.03 | 50,707.36 | 164,248.80 | 0.62 | 87,906.56 | 218,229.07 | 0.33 | Generic
s2 | 129,295.72 | 344,254.98 | 0.04 | 40,150.03 | 145,700.24 | 0.83 | 97,977.73 | 223,790.45 | 0.59 | Baseline
 | 123,680.56 | 277,375.12 | 0.04 | 99,671.44 | 246,767.06 | 0.24 | 101,645.88 | 266,755.79 | 0.11 | Generic
s3 | 256,350.49 | 516,752.08 | 0.20 | 80,455.05 | 233,111.39 | 0.84 | 190,387.69 | 414,668.29 | 0.48 | Baseline
 | 269,241.48 | 544,391.12 | 0.13 | 208,308.36 | 496,903.93 | 0.27 | 210,672.04 | 476,608.63 | 0.33 | Generic
s5 | 313,675.73 | 855,227.09 | 0.12 | 59,524.0 | 257,933.16 | 0.92 | 184,782.26 | 530,638.40 | 0.66 | Baseline
 | 295,104.21 | 730,672.56 | 0.14 | 148,396.99 | 498,143.30 | 0.60 | 184,808.14 | 490,943.98 | 0.61 | Generic
s6 | 244,538.51 | 539,270.31 | 0.09 | 62,333.38 | 214,946.62 | 0.86 | 174,018.99 | 425,989.95 | 0.43 | Baseline
 | 254,923.49 | 607,804.99 | 0.08 | 170,037.90 | 539,915.14 | 0.27 | 191,738.13 | 534,703.41 | 0.29 | Generic
s7 | 42,818.57 | 95,503.49 | 0.08 | 11,834.98 | 44,314.83 | 0.80 | 32,363.12 | 85,351.86 | 0.27 | Baseline
 | 40,743.34 | 85,885.05 | 0.10 | 27,689.87 | 85,674.92 | 0.10 | 31,179.39 | 79,177.38 | 0.23 | Generic
s8 | 59,907.70 | 154,166.46 | 0.06 | 20,572.11 | 69,455.39 | 0.81 | 56,645.48 | 144,189.60 | 0.18 | Baseline
 | 59,604.74 | 151,910.12 | 0.07 | 52,183.00 | 147,429.36 | 0.12 | 56,475.81 | 143,090.13 | 0.17 | Generic
s9 | 88,555.11 | 208,842.04 | 0.08 | 23,839.88 | 82,151.00 | 0.86 | 62,367.80 | 154,173.24 | 0.50 | Baseline
 | 84,642.17 | 193,272.17 | 0.07 | 60,390.07 | 174,247.89 | 0.24 | 64,478.85 | 163,684.96 | 0.33 | Generic
Site | Logistic Regression | Random Forest | XGBoost | Model
---|---|---|---|---
s1 | 2.15% | 89% | 74% | Baseline
 | 1.43% | 88% | 71% | Generic
s2 | 4.31% | 99% | 100% | Baseline
 | 3.79% | 99% | 99% | Generic
s3 | 5.21% | 100% | 100% | Baseline
 | 4.24% | 100% | 100% | Generic
s5 | 5.89% | 98% | 99% | Baseline
 | 5.56% | 97% | 98% | Generic
s6 | 3.16% | 72% | 82% | Baseline
 | 2.19% | 69% | 77% | Generic
s7 | 6.14% | 96% | 98% | Baseline
 | 5.79% | 95% | 97% | Generic
s8 | 1.48% | 99% | 100% | Baseline
 | 0.78% | 99% | 99% | Generic
s9 | 6.23% | 100% | 100% | Baseline
 | 4.06% | 100% | 100% | Generic
LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---
4,919,256.24 | 6,540,568.72 | 0.78 | 377,005.23 | 512,468.65 | 0.87 | 1,546,876.4 | 2,217,474.9 | 0.97 | Baseline
4,917,627.14 | 6,556,851.1 | 0.78 | 3,764,432.5 | 5,113,083.8 | 0.87 | 1,554,655.7 | 1,554,655.7 | 0.97 | Generic
LR MAE | LR RMSE | LR R² | RF MAE | RF RMSE | RF R² | XGB MAE | XGB RMSE | XGB R² | Model
---|---|---|---|---|---|---|---|---|---
93,686.57 | 277,284.75 | 0.08 | 26,289.11 | 103,791.21 | 0.87 | 84,889.04 | 248,664.81 | 0.26 | Baseline
94,398.16 | 274,163.97 | 0.07 | 67,538.24 | 224,107.79 | 0.39 | 86,020.85 | 251,870.95 | 0.22 | Generic
Logistic Regression | Random Forest | XGBoost | Model
---|---|---|---
1.15% | 4.03% | 95% | Baseline
0.70% | 2.95% | 94.28% | Generic
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jagannathan, S.; Sharma, Y.; Taheri, J. Towards Generic Failure-Prediction Models in Large-Scale Distributed Computing Systems. Electronics 2025, 14, 3386. https://doi.org/10.3390/electronics14173386