Article

Towards Generic Failure-Prediction Models in Large-Scale Distributed Computing Systems

1 School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast BT7 1NN, UK
2 Faculty of Engineering and Applied Science, University of Regina, Regina, SK S4S 0A2, Canada
3 Department of Mathematics and Computer Science, Karlstad University, 651 88 Karlstad, Sweden
* Author to whom correspondence should be addressed.
Electronics 2025, 14(17), 3386; https://doi.org/10.3390/electronics14173386
Submission received: 23 June 2025 / Revised: 15 August 2025 / Accepted: 22 August 2025 / Published: 26 August 2025

Abstract

The increasing complexity of Distributed Computing (DC) systems requires advanced failure-prediction models to enhance reliability and efficiency. This study proposes a comprehensive methodology for developing generic machine learning (ML) models capable of cross-layer and cross-platform failure prediction without requiring platform-specific retraining. Using the Grid5000 failure dataset from the Failure Trace Archive (FTA), we explored Linear and Logistic Regression, Random Forest, and XGBoost to predict three critical metrics: Time Between Failures (TBF), Time to Return/Repair (TTR), and Failing Node Identification (FNI). Our approach involved extensive exploratory data analysis (EDA), statistical examination of failure patterns, and model evaluation across the cluster, site, and system levels. The results demonstrate that XGBoost consistently outperforms the other models, achieving near-perfect accuracy (up to 100%) for TBF and FNI, with robust generalisability across diverse DC environments. In addition, we introduce a hierarchical DC architecture that integrates these failure-prediction models, and we demonstrate, in the form of a use case, how service providers can use these prediction models to balance service reliability and cost.

1. Introduction

Accurate failure identification and predictive modelling using Machine Learning (ML) techniques have become essential components of managing modern Distributed Computing (DC) systems [1,2]. These techniques enable system administrators and resource management units to proactively detect potential failures, thereby minimising service disruptions, optimising resource allocation, and enhancing system reliability [3]. However, the heterogeneous nature of DC systems, comprising multilayered hierarchical architectures of interconnected nodes, clusters, and sites (datacenters) (Figure 1), combined with the resource-intensive nature of ML methods, poses significant challenges. Chief among them is developing generalised failure-prediction models that enable cross-layer and cross-platform transferability without retraining, that is, models which can be trained on one layer or platform and seamlessly applied to other layers or entirely different DC platforms. This challenge arises from the resource and architectural heterogeneity of DC systems, as well as from the fundamental trade-off between prediction accuracy, cost of prediction, and cost of repair in such complex environments [4].
Node-specific failure-prediction models can achieve high accuracy, but are expensive to implement at scale. For example, deploying individual models for thousands of compute nodes would be very expensive in terms of computational resources and management overheads. In contrast, more generic failure-prediction models at the cluster, site, or system level are more cost-effective and easier to manage; however, these broader models tend to sacrifice failure-prediction accuracy as they attempt to be generic across increasingly diverse components in the DC hierarchy. This inverse relationship between model generalisation and prediction accuracy presents a significant challenge for the development of effective ML-based failure-prediction solutions for large-scale DC platforms.
To address this challenge, this study proposes the development and evaluation of generic ML models for failure detection and prediction, designed to operate across multiple hierarchical levels of diverse DC environments without requiring platform-specific retraining. To achieve this, we explored various ML models, including Linear and Logistic Regression, Random Forest, and Extreme Gradient Boosting (XGBoost), to perform both regression and classification tasks. These models were explicitly designed to predict three key metrics: (1) time between failures (TBF), to predict the time of the next failure; (2) time to return/repair (TTR), to predict the duration of the next failure; and (3) failing node identification (FNI), to predict the nodes that are likely to fail next. FNI prediction is treated as a classification problem, whereas TBF and TTR predictions are treated as regression problems. For model training and testing, we used several datasets from the Grid5000 failure benchmark obtained from the Failure Trace Archive [5]. The specific contributions of this study are as follows:
  • We proposed a comprehensive methodology for developing and evaluating cross-layer and cross-platform failure-prediction models in DC environments.
  • We conducted an in-depth exploratory data analysis (EDA) and statistically examined system failure characteristics, offering valuable insights for researchers seeking to understand complex failure patterns in DC systems.
  • We developed, evaluated, and validated a range of ML-based failure-prediction models at multiple hierarchical levels (cluster and site) within a DC platform, as well as across different DC platforms.
  • We proposed a hierarchical architecture that integrates the failure-prediction models to enhance DC system reliability and cost efficiency.
The remainder of this paper is organised as follows: Section 2 reviews the related work. Section 3 details the failure traces used in this study. Section 4 outlines the methodology employed, the system setup, and covers data analysis and model generation. Section 5 presents the results and findings, followed by the proposed system architecture in Section 6. Section 7 concludes the study and discusses future work.

2. Related Work

In this section, we examine the existing research on failure-prediction modelling in DC systems. We categorise the related literature into four main approaches: Statistics and Rule-based, Traditional Machine Learning (ML), Deep Learning, and Hybrid techniques. By examining these categories, we aim to clarify how our work contributes to and extends the state of the art in DC failure prediction.
Statistics and Rule-based Methods: Earlier approaches for fault detection relied on rule-based systems and statistical methods. Hochenbaum et al. [6] introduced two methods for detecting anomalies in cloud infrastructure data, both focusing on long-term trends. The authors enhanced the generalised Extreme Studentized Deviate (ESD) test, using time-series decomposition and robust statistics, to improve accuracy while reducing false positives. These methods account for daily and weekly patterns in metrics, such as Tweets-Per-Second and CPU usage. The authors demonstrated the effectiveness of their methods using real production data and a randomised approach to introduce anomalies, both of which help to maintain high availability and performance in cloud environments. Kourtis et al. [7] presented a method for finding anomalies in Network Function Virtualization (NFV) environments using statistical techniques. The authors combined an open-source monitoring system with statistical methods to analyse multiple metrics from the NFV infrastructure and identify unusual behaviours in network functions. This approach detects performance problems and security threats early, thereby ensuring the quality and security of the NFV services. Most statistics and rule-based methods, such as those presented above, are straightforward and interpretable but often struggle with high-dimensional data and may not capture complex fault patterns in current DC platforms [8].
Traditional Machine Learning-based Methods: In recent years, machine learning (ML) techniques have gained significant attention for fault detection and prediction modelling. Mariani et al. [1] introduced PreMiSE, a method for predicting failures and locating faults in DC platforms. It combines anomaly detection using unsupervised ML to identify unusual behaviours with signature-based prediction, which employs supervised ML to recognise known failure patterns. The authors also used clustering to group similar behaviours and identify potential failures. By integrating these methods, PreMiSE was demonstrated to reduce false alarms and improve the accuracy of failure-prediction, thereby enhancing the reliability of multi-tier DC systems. ML approaches offer significant advantages, particularly in their capacity to manage large datasets and model the complex relationships between features. However, they rely heavily on the availability of the data, which is not always accessible [9]. Additionally, the performance of these models depends on the quality and representativeness of the training data, which can restrict their effectiveness in highly dynamic DC environments [10].
Deep Learning-based Methods: These methods build on the promising results of ML approaches while addressing some of their shortcomings. For example, Long Short-Term Memory (LSTM) networks have emerged as powerful tools for fault prediction. Gao et al. [11] presented a multilayer Bidirectional Long Short-Term Memory (Bi-LSTM) model designed to predict task and job failures in cloud data centres, which often face high failure rates owing to hardware and software problems. The authors analysed historical system message logs to improve prediction accuracy and achieved very high accuracy for task and job failures in their experiments. Shen et al. [12] present a model using LSTM networks to predict hard disk drive (HDD) failures in mobile edge computing. The model analyses long-term data trends to improve prediction accuracy and introduces a method for assessing HDD health by tracking current conditions over time. Tests on real-world datasets show that their model achieves high accuracy with minimal resource usage, thereby enhancing device reliability in edge computing environments. Most of these approaches rely on LSTM’s ability to learn patterns over time and are known to outperform traditional ML models in time-series fault prediction [13,14,15]. However, they face criticism for their lack of interpretability, because their internal mechanisms are not easily understood [16]. Training deep learning models can also be computationally expensive and slow, posing additional challenges for deployment in resource-limited environments or real-time systems [17].
Hybrid Methods: These methods integrate multiple approaches to enhance the fault detection and prediction in DC platforms. Gaykar et al. [18] presented a hybrid ML model that combined various supervised learning techniques to detect and mitigate job failures in virtual machines (VMs) in DC environments. To avoid failures, the proposed model focuses on predicting underperforming nodes using metrics, such as CPU and memory loads, by integrating statistical methods with ML algorithms. The experimental results showed improved prediction accuracy while enhancing the overall system reliability. Ensemble learning techniques that aggregate the strengths of various models have also been employed to achieve superior results. Zhong et al. [19] proposed a framework (HELAD) to effectively detect network anomalies and avoid failures in DC systems. HELAD combines various ML algorithms, specifically the Damped Incremental Statistics algorithm, for feature extraction from network traffic, followed by training an Auto-encoder with a small amount of labelled data to mark abnormal scores. The output of the Auto-encoder is then used to train an LSTM network, which helps predict abnormal scores to recognise malicious activities in diverse environments. Although these hybrid approaches yield a better performance, they add complexity to the model training and deployment process, sharing concerns with the use of deep learning approaches [20].
Despite these advancements, noticeable gaps remain in the literature. First, there is a lack of models that are generalisable across diverse DC systems, both cross-layer and cross-platform. Second, real-time fault prediction remains underexplored, with high-accuracy offline models often untested in live scenarios, limiting their utility for rapid decision-making in modern DC systems. This study addresses these gaps by developing accurate and generalisable fault prediction models using a combination of ML techniques, with the aim of reducing downtime through proactive fault management in multilayered and multiple DC platforms.

3. Overview of Failure Traces

We used multiple failure traces from the Grid5000 testbed, which is a large-scale platform for experiment-driven research in parallel and distributed computing, including cloud computing, high-performance computing (HPC), Big Data, and AI [21]. The traces were sourced from the Failure Trace Archive (FTA), a public repository containing standardised tab-separated files from 26 diverse parallel and distributed computing systems [5].
Grid5000 comprises nine geographically distributed sites (sites 1 to 9), each operating as a local datacentre with its own clusters of nodes that can be independently reserved, configured, and operated. These sites are interconnected via RENATER, a dedicated 10 Gbps national backbone network [22]. The architecture is often described as a “cluster of clusters” or a “meta-datacenter,” exhibiting a hierarchical structure as illustrated in Figure 1, where each site contains multiple clusters, and each cluster consists of individual hosts or nodes. Specifically, site 1 included four clusters (c1–c4), sites 4–6 each had two clusters (c1–c2), and sites 2–3 and 7–9 each had one cluster (c1), totalling 1288 nodes.
This high degree of heterogeneity, geographical independence, and hierarchical structure among the computational resources means that the data collected from the Grid5000 system can be accurately described as originating from different DC systems. This makes the Grid5000 dataset a robust choice for developing targeted generic failure-prediction models.

4. Methodology

Figure 2 illustrates our methodology and highlights the various stages of this study. It is divided into three phases: (1) Data Preprocessing and Analysis, (2) Model Generation, and (3) Experiments and Analysis. Phase-1 focuses on data collection, preprocessing, and initial exploratory and statistical analyses, ensuring data quality, integrity, and preparation for subsequent modelling. Phase-2 involves the development and training of ML-based failure-prediction models, including model selection, design, training, and feature-importance analysis. Phase-3 evaluates the performance of the ML models using comprehensive metrics and provides critical insights and interpretations.

4.1. Data Preprocessing and Analysis

This section corresponds to Phase-1 of the methodology (Figure 2). It involves pre-processing and analysing the Grid5000 failure traces to create a dataset optimised for modelling. The original dataset, consisting of tab-separated files, is pre-processed by selecting and integrating relevant data, removing empty files (’event_state.tab’, ’node_perf.tab’), and eliminating features with no variability (e.g., component_id, platform_id). This reduced the dataset to seven key features: node_id, node_name, event_id, event_type, event_start_time, event_stop_time, and node_location. Time-related data in epoch format are transformed into a standard date-time format, and new features, such as Time between Failures (TBF), Time to Return (TTR), fault frequency, and event count, are engineered from the event_start_time, event_stop_time, and event_type features to enhance predictive and analytical capabilities.
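To make this step concrete, the following minimal pandas sketch shows one way the TBF and TTR features could be derived from the raw event timestamps. The file name, header handling, and the exact fault_frequency definition are illustrative assumptions rather than a description of our production pipeline:

```python
import pandas as pd

# Assumed: an FTA tab-separated trace whose columns include the seven
# selected features (node_id, event_id, event_type, event_start_time, ...).
df = pd.read_csv("event_trace.tab", sep="\t")

# Convert epoch seconds to a standard date-time format.
df["event_start_time"] = pd.to_datetime(df["event_start_time"], unit="s")
df["event_stop_time"] = pd.to_datetime(df["event_stop_time"], unit="s")

# Sort per node so that consecutive rows describe consecutive events.
df = df.sort_values(["node_id", "event_start_time"])

# TTR: duration of each event.
df["ttr"] = (df["event_stop_time"] - df["event_start_time"]).dt.total_seconds()

# TBF: gap between the end of one event and the start of the next, per node.
df["tbf"] = (
    df.groupby("node_id")["event_start_time"].shift(-1) - df["event_stop_time"]
).dt.total_seconds()

# Per-node aggregates used as engineered features (illustrative definitions).
df["event_count"] = df.groupby("node_id")["event_id"].transform("count")
span = (df.groupby("node_id")["event_stop_time"].transform("max")
        - df.groupby("node_id")["event_start_time"].transform("min"))
df["fault_frequency"] = df["event_count"] / span.dt.total_seconds()
```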
Exploratory Data Analysis (EDA) revealed significant variability in the features ‘node_id’, ‘node_location’, and ‘event_type’ across nine sites and 15 clusters (Figure 3). Features with ‘NULL’ or constant values (e.g., ‘num_procs’, ‘event_end_reason’) were excluded owing to their lack of predictive utility. ‘event_type’ is binary: ‘1’ for availability and ‘0’ for unavailability events (Figure 3a). ‘node_location’ showed notable variability, with site8/c1 (cluster 1 of site 8) having the highest node count and event frequency and site6/c2 the lowest (Figure 3b–d). This observation suggests location-specific failure patterns, potentially influenced by node density or external anomalies such as power outages.
Statistical analysis of TBF and TTR at the system, site, and cluster levels (Table 1 and Table 2) further highlights the high variability in the failure patterns. Availability and unavailability were calculated as MTBF/(MTBF + MTTR) and MTTR/(MTBF + MTTR), respectively. System-level analysis of the entire Grid5000 infrastructure revealed a Mean Time Between Failures (MTBF) of 32.41 h (SD = 94.24) and a Mean Time to Return (MTTR) of 7.41 h (SD = 60.24), giving an overall system availability of 32.41/(32.41 + 7.41) ≈ 81.38%. Similar to the system-level observations, the site-level analysis revealed diverse patterns, with site s3 having the highest MTBF (102.71 h) and availability (94.62%), whereas site s7 had the lowest MTBF (14.57 h) (Table 1). The cluster-level analysis indicated intra-site similarities (Table 2); for instance, c1–c2 of s1 exhibited similar TBF behaviour. Based on the observed statistical patterns, we hypothesised that cluster-specific failure-prediction models, which benefit from reduced variability, may offer greater accuracy and robustness than site-level models, whereas site-level models are likely to outperform system-level models. These findings guided the selection of the features ‘node_id’, ‘node_location’, ‘event_type’, ‘event_start_time’, ‘event_stop_time’, ‘TBF’, and ‘TTR’, as well as the model development, evaluation, and validation in Phase-2.

4.2. Model Generation

This section addresses Phase-2 of our methodology, as illustrated in Figure 2. It encompasses three components: Data Splitting, Model Selection, and Model Training and Testing. All steps were performed in Google Colab Pro (52 GB RAM, T4 GPU) [23] using Python 3.10. The setup included NumPy 1.25.2 [24] for efficient array operations and numerical computing, Pandas 2.1.4 [25] for data table management, and Scikit-learn [26] for the regression and classification algorithms.

4.2.1. Data Splitting

We applied an 80–20 split to partition the dataset into training and testing sets: approximately 80% of the data were allocated for training to enable robust model learning, and 20% were reserved for independent evaluation. Stratified sampling was employed to ensure a balanced test set, supporting an unbiased assessment of performance on unseen data. Additionally, cross-validation was used to enhance the stability of the results and reduce dependence on a single split; the models were trained on various subsets and validated on the remainder, improving robustness and providing a reliable measure of generalisation.
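A minimal sketch of this splitting strategy, assuming a feature matrix X and target vector y have already been assembled (variable names are illustrative):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 80-20 split; stratify=y keeps the class distribution balanced in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5-fold cross-validation on the training portion gives a more stable
# estimate of generalisation than any single split would provide.
scores = cross_val_score(RandomForestClassifier(random_state=42),
                         X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```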

4.2.2. Model Selection

The selection of models is guided by the nature of each problem statement, specifically whether it is a regression or classification problem, as well as by the characteristics of the dataset identified in Section 4.1. The TBF and TTR predictions are approached as regression problems, whereas FNI is treated as a classification problem. We employed Linear Regression and Logistic Regression exclusively for regression and classification tasks, respectively, while Random Forest and XGBoost (Extreme Gradient Boosting) were employed for both. We chose these models for their ability to address both types of problems effectively and to manage high-dimensional, variable data [27].
Linear Regression: This is a fundamental supervised ML technique for predictive modelling, where the goal is to predict continuous outcomes from input features. The model is expressed as $z = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$, where $z$ is the predicted continuous outcome, $b$ is the bias term, and $w_i$ is the weight of feature $x_i$ identified in Section 4.1. Both $b$ and the $w_i$ are calculated and fine-tuned during the training phase by minimising the difference between the predicted and target values, usually through least-squares optimisation [28]. This technique is widely applied in failure-prediction scenarios, such as TBF and TTR [29,30].
Logistic Regression: This is a foundational supervised ML technique primarily used for classification tasks, especially to predict the probability that an instance belongs to one of two categories, which in our case is FNI (i.e., whether a node is likely to fail or not) [31]. Logistic regression resembles linear regression, as both begin with a linear combination of input features and corresponding weights, including a bias term that is iteratively optimised during training. However, logistic regression applies a sigmoid (logistic) function $y = \frac{1}{1 + e^{-z}}$ to the linear output $z$, transforming it into a predicted probability $y$ used for classification tasks such as FNI.
Random Forest: This is an ensemble ML technique that enhances decision trees by combining multiple trees to boost prediction accuracy and robustness. Each tree in a Random Forest is trained on a random subset of the data and features, and the final prediction is derived by aggregating the outputs of all trees, typically by averaging for regression or majority voting for classification. For regression, the estimate for a new data point $x$ can be expressed as $m_{M,n}(x, \theta_1, \dots, \theta_M) = \frac{1}{M} \sum_{j=1}^{M} m_n(x, \theta_j)$, where $m_n(x, \theta_j)$ is the prediction of the $j$-th tree, $M$ is the total number of trees, and $\theta_j$ represents the random parameters (e.g., feature subsets chosen at each node) that introduce variability during tree construction [32]. For classification, each tree casts a vote for a class label (i.e., whether a node is likely to fail), and the Random Forest predicts the class with the most votes. This randomness, combined with the model’s ability to manage high-dimensional data and mitigate overfitting, makes Random Forest a powerful tool for many real-world applications, such as failure prediction.
XGBoost: This is another ensemble ML technique; unlike Random Forest, which trains independent trees in parallel, it constructs trees sequentially. Each tree is designed to correct the errors of its predecessors by minimising a loss function, such as the mean squared error for regression or log loss for classification. This is achieved through gradient boosting, where trees are added iteratively, guided by the gradient of the loss with respect to the previous predictions. For a new data point $x$, the prediction can be expressed as $f_M(x) = \sum_{j=1}^{M} h_j(x)$, where $h_j(x)$ is the output of the $j$-th tree and $M$ is the total number of trees [33]. This iterative error-correcting process enables XGBoost to effectively capture rare events, such as system failures, which are often imbalanced or buried in noisy, high-dimensional datasets. In addition, XGBoost incorporates regularisation techniques, such as L1 and L2 penalties, to prevent overfitting. These qualities make XGBoost a powerful tool for failure prediction and reliability engineering.
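Under this pairing, the candidate models could be instantiated as follows (a sketch using scikit-learn and the xgboost package with default settings; the tuned hyper-parameters actually used are reported in Section 4.2.3):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from xgboost import XGBRegressor, XGBClassifier

# Regression models for the TBF and TTR prediction tasks.
regressors = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
    "xgboost": XGBRegressor(random_state=42),
}

# Classification models for the FNI prediction task.
classifiers = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "xgboost": XGBClassifier(random_state=42),
}
```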

4.2.3. Model Training and Testing

To build generic models with cross-layer and cross-platform transferability, we began with clusters (instead of nodes), which represent the smallest collective units in Grid5000 systems. Although node-level failure-prediction models can offer higher accuracy, deploying separate models for thousands of nodes is costly and complex, and offers no generalisation. In contrast, cluster-level models are more cost-effective, easier to manage, and can achieve accuracy comparable to that of node-level models; they are also better suited to capturing the generic behaviour of systems. Among the nine sites analysed in Table 2, four (s1, s4, s5, and s6) contained multiple clusters, whereas the remaining five operated with single-cluster configurations. From the multi-cluster sites, we selected cluster 1 of site 4 (s4/c1) to build the failure-prediction models. This cluster was chosen for two key reasons: (1) it has the highest number of documented failure events (34,541) and (2) it has the largest compute node count (106) among all clusters in multi-cluster sites. The models were then evaluated using a comprehensive set of metrics: for regression tasks, the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²); for the classification models, accuracy. Additionally, we performed a comprehensive feature importance analysis for each model to identify the most significant features and quantify their relative contributions to the predictions of TBF, TTR, and FNI. This information can be used to recalibrate the models when they are deployed in production, as discussed in Section 6. To ensure reproducibility, the models and data have been made available in a public repository [34].
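These evaluation metrics map directly onto scikit-learn utilities; a sketch, assuming a fitted regressor reg_model, a fitted classifier clf_model, and the held-out split from Section 4.2.1 (all names illustrative):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Regression metrics for the TBF/TTR predictions.
y_pred = reg_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

# Classification accuracy for the FNI predictions.
acc = accuracy_score(y_test_cls, clf_model.predict(X_test_cls))

# Relative feature contributions for the tree ensembles.
importances = dict(zip(feature_names, clf_model.feature_importances_))
```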
TBF Prediction: Table 3 presents the prediction results for TBF, representing the occurrence of the next failure. The analysis revealed XGBoost as the superior model, with the lowest MAE of 528,141.4 and the highest R² score of 1 for both the training and testing datasets. This R² value indicates that the model explains 100% of the variance in TBF, suggesting a failure-prediction accuracy of 100%, an exceptional level of performance compared with the results reported in related work. This strength is underpinned by XGBoost’s balanced feature importance profile (Figure 4c), where total_event_duration (32.2%), event_id (26.7%), avg_event_duration (15.4%), and event_count (13.2%) emerge as the most influential features, highlighting XGBoost’s ability to leverage both default and engineered features drawn from the Grid5000 dataset. The XGBoost model was configured with key hyper-parameters, including n_estimators=100, max_depth=5, learning_rate=0.1, and random_state=42, optimised using grid search to minimise the root mean squared error (RMSE) (Table 4). Random Forest showed moderate performance, with a TBF prediction accuracy of 98%, driven by a focus on features such as event_id (57.3%), total_event_duration (17.3%), and avg_event_duration (11.4%), indicating its dependence on event-based features (Figure 4b). For Random Forest, the hyper-parameters were set to n_estimators=10, max_depth=5, random_state=42, max_features=1.0, min_samples_split=2, and min_samples_leaf=1, also tuned using grid search. Linear Regression, the weakest model at 93% accuracy, depended heavily on event_count (97.6%) and was negatively affected by total_event_duration (−0.4%) and fault_frequency (−100%) (Figure 4a). This suggests that the Linear Regression-based model did not leverage the available features effectively, which limited its predictive accuracy. The model produced a learned intercept of 1.1439 × 10⁹ and 12 coefficients ranging from −4.125 × 10⁸ to 4.026 × 10⁸.
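The grid search behind the reported TBF configuration could look like the following sketch; the candidate value ranges are assumptions, since only the selected values (Table 4) are reported:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
}

# GridSearchCV maximises the score, so negated RMSE minimises RMSE.
search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_tbf_train)  # y_tbf_train: TBF targets (assumed prepared)
print(search.best_params_)  # selected: n_estimators=100, max_depth=5, learning_rate=0.1
```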
TTR Prediction: Table 5 presents the prediction results for TTR, representing the duration of the next failure. Random Forest initially demonstrated strong performance during training, achieving a low MAE of 25,152.81 and an R² of 0.88 with hyper-parameters set to n_estimators=5, max_depth=None, and random_state=42 (Table 6). However, it suffered a significant performance drop owing to overfitting, as reflected in the testing results (MAE: 63,601.27, R²: 0.15). In contrast, XGBoost showed moderate but consistent performance across both the training and testing phases, with R² values of 0.48 and 0.22, respectively, indicating that it explained 48% and 22% of the variance in TTR during these phases. The model was configured with n_estimators=5, max_depth=None, and learning_rate=0.3. Linear Regression performed poorly, with an R² of 0.06 during testing, a learned intercept of 71,594.74, and 13 coefficients (ranging from −2.372 × 10⁹ to 1.239 × 10⁵), indicating a minimal ability to explain the variability in failure duration. This poor performance is consistent with the feature importance results, where the model relies heavily on attributes related to time_to_repair and time_between_failures (1.000 and 0.621, respectively). Features such as event_start_seconds (−0.012) and event_stop_seconds (0.012) have little impact, suggesting a linear focus on repair metrics that may oversimplify the data (Figure 5a). In contrast, Random Forest and XGBoost both rank time_to_repair (0.400) and time_between_failures (0.343) as their top features, with negligible contributions from event_stop_seconds (0.094) and the remaining features (Figure 5b,c). While none of the models achieved high prediction accuracy, XGBoost emerged as the most reliable method for predicting TTR owing to its consistency compared with Random Forest and Linear Regression.
FNI Prediction: Table 7 presents the prediction results for FNI, representing the next node that is likely to fail. A significant contrast in performance was observed among the three models. XGBoost, configured with n_estimators=100, learning_rate=0.3, and objective=multi:softprob (Table 8), emerged as the superior classifier with an exceptional accuracy of 99%, followed by Random Forest at 97%. In contrast, Logistic Regression showed surprisingly poor performance, with an accuracy of 2.73%. This disparity aligns with the feature importance results. XGBoost highlights total_event_duration (0.322) and event_id (0.267) as its top predictors, supported by avg_event_duration (0.154) and event_count (0.132), reflecting its strong predictive ability (Figure 6c). Random Forest, set with n_estimators=10, bootstrap sampling, and criterion=‘gini’, distributes the importance more evenly across avg_event_duration (0.238), event_count (0.217), and fault_frequency (0.214), showing its ensemble strength (Figure 6b). Logistic Regression, using L2 regularisation (penalty=‘l2’) with a regularisation strength C of 1.0 and random_state=42, relies heavily on total_event_duration (1.000), with negative contributions from event_start_time (−0.655) and event_stop_seconds (−0.163) (Figure 6a). This indicates a linear focus that likely explains its lower accuracy, owing to misalignment with the dataset’s complex patterns.
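A sketch of the FNI classifier configuration described above, assuming the node_id targets are label-encoded into consecutive integer classes, as XGBoost’s multi-class objective requires (array names are illustrative):

```python
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# node_ids_train / node_ids_test: assumed arrays of node_id targets.
encoder = LabelEncoder().fit(node_ids_train)
y_train = encoder.transform(node_ids_train)
y_test = encoder.transform(node_ids_test)

clf = XGBClassifier(
    n_estimators=100,
    learning_rate=0.3,
    objective="multi:softprob",  # per-class probabilities over candidate nodes
    random_state=42,
)
clf.fit(X_train, y_train)
print("FNI accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```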
In summary, the outstanding performance of XGBoost in predicting TBF and FNI, along with its moderate reliability for TTR, highlights the potential of ensemble methods for failure-prediction in cross-layer and cross-platform DC systems. These findings are validated and discussed in the following section.

5. Model Validation and Discussion

In this section, the models trained in Section 4.2.3 are rigorously tested across various datacenters (represented as sites in the Grid5000 system) to assess their generalisability and robustness. Validation includes both intra- and inter-platform approaches. Intra-platform validation covers all clusters of Site 4 except Cluster 1, as well as Site 4 itself. Inter-platform validation covers all other clusters and sites, excluding Site 4. To validate the models’ generalisability, their performance was compared against specialized baseline models trained and tested on the respective intra- or inter-platform domains. Using this approach, we performed a comprehensive evaluation of the proposed models, demonstrating that they achieve near-comparable prediction accuracy to the baseline models without requiring domain-specific retraining.

5.1. Intra-Platform Validations

This section concentrates on Site 4, in which Cluster 1 resources are used to build candidate generic models. Validations were performed at the cluster level (within-layer) using Cluster 2 and at the site level (cross-layer) across the entire Site 4. This approach enables a comprehensive evaluation of model performance and generalisability within and across datacentre layers.

5.1.1. Within-Layer Prediction Results

At the intra-platform within-layer level (i.e., Cluster 2 of Site 4), the models from Section 4.2.3 demonstrated performance patterns consistent with those observed in their system of origin (i.e., Cluster 1 of Site 4). The XGBoost-based model delivered the highest prediction accuracy for TBF, achieving an R² score of 0.99 and matching the specialized baseline model trained and tested on Cluster 2 of Site 4 (Table 9). The Random Forest- and Linear Regression-based generic models yielded R² values of 0.94 and 0.80, almost identical to the baseline values of 0.95 and 0.80. This highlights the superior reliability of XGBoost for TBF predictions. For TTR prediction (Table 10), all models struggled: Linear Regression and Random Forest produced low R² values of 0.16 and 0.19 for the generic models, compared with 0.16 and 0.83 for the baselines, while XGBoost reached a modest 0.27 for the generic model against 0.39 for the baseline. This highlights the inherent difficulty of forecasting repair times. For FNI, the generic XGBoost and Random Forest models achieved accuracies of 90% and 89%, nearly equivalent to the 90% and 91% of the baseline models (Table 11). These results show that the models from Cluster 1 of Site 4 perform comparably to specialized baseline models, demonstrating strong generalisation and transferability.

5.1.2. Cross-Layer Prediction Results

After validating the models on the individual clusters of Site 4, this section presents the cross-layer validation results, evaluating their performance and generalisability across the entirety of Site 4. Consistent with the cluster-level trends, the generic XGBoost-based model delivered the strongest results for TBF prediction, with a perfect R² of 1, matching the specialized baseline model trained and tested on Site 4 (Table 12). Random Forest and Linear Regression produced generic-model R² values of 0.97 and 0.93, identical to their baselines. For TTR prediction, all models struggled: XGBoost yielded a negative R² (−0.04) for the generic model, compared with 0.44 for the baseline, while Random Forest and Linear Regression produced R² values of −1.21 and 0.10 (generic) versus 0.86 and 0.07 (baseline), reflecting the challenge of modelling repair times at the site level as well (Table 13). For FNI, XGBoost achieved 98% accuracy for both the generic and baseline models, with Random Forest at 96% (generic) vs. 97% (baseline), whereas Logistic Regression trailed at 1.59% (generic) vs. 2.83% (baseline) (Table 14). These results, particularly for TBF and FNI, show that the models from Cluster 1 of Site 4 performed comparably to specialized models on Site 4, supporting their effectiveness and generalisability across datacentre layers.
These within- and cross-layer validation results for Site 4 confirm that the XGBoost- and Random Forest-based models, trained on Cluster 1 of Site 4, offer strong generalisability for failure prediction, performing comparably to specialized baseline models. They function as robust, generic tools that can be deployed across various layers of a datacentre’s resource hierarchy without requiring domain-specific retraining. The next section evaluates the validity of these models across different datacenters.

5.2. Inter-Platform Validations

This section presents the inter-platform (cross-platform) validation results starting with cluster-level (within-layer) validation, followed by site-level (cross-layer) validation. The within-layer analysis covered all clusters except Clusters 1 and 2 of Site 4, while the cross-layer analysis included all sites except Site 4.

5.2.1. Within-Layer Prediction Results

Applying the models to clusters such as s1/c1–c4, s5/c1–c2, and s6/c1–c2 showed varying performance, highlighting the challenges of inter-platform applicability. For TBF prediction (Table 15), the XGBoost-based model maintained high accuracy, with R² scores ranging from 0.98 to 1.0, closely matching the specialized baseline models trained and tested on each cluster and confirming its reliability across different clusters. Random Forest scored R² values from 0.94 to 0.98, and Linear Regression varied from 0.59 to 0.95, both performing comparably to their respective baselines. TTR prediction (Table 16) proved difficult, with Linear Regression producing consistently low R² values (0.03 to 0.13) and Random Forest achieving moderate results (R² between 0.27 and 0.82). In relative terms, XGBoost achieved the highest TTR prediction accuracy, with an R² as high as 0.84, though generally trailing the baseline models. This mirrors the pattern from the development and intra-platform tests, highlighting the challenging nature of TTR prediction. For FNI (Table 17), XGBoost and Random Forest achieved 71% to 100% and 66% to 100% accuracy, respectively, performing comparably to the baseline models across most clusters. Overall, these results suggest that the models trained and tested on Cluster 1 of Site 4, particularly the XGBoost-based model, generalise across platforms without retraining.

5.2.2. Cross-Layer Prediction Results

This section presents the cross-layer site-level validation results for all sites except Site 4. Consistent with previous findings, the XGBoost-based model, trained on Cluster 1 of Site 4, achieved the best results for TBF prediction (Table 18), maintaining R² scores of 0.98 to 0.99. Notably, a value of 0.99 was observed for sites 1–3, 5, 7, and 8, closely matching the performance of specialized baseline models trained and tested on each site. Random Forest delivered reliable R² values ranging from 0.91 to 0.94. These results align with the cluster-level findings and confirm strong cross-layer generalisability. For TTR prediction (Table 19), challenges persisted, with XGBoost reaching a peak R² of 0.61 on Site 5, approximately 26% lower than the corresponding value for Cluster 2 of the same site and generally below the baseline models. Random Forest underperformed, with R² values ranging from −0.38 to 0.68, and Linear Regression yielded the lowest performance. This reflects the ongoing difficulty of repair-time prediction while highlighting the relative strength of XGBoost. For FNI (Table 20), the XGBoost-based model achieved 71% to 100% accuracy and Random Forest 69% to 100%, performing almost identically to the baseline models across most sites, mirroring the cluster-level trends and demonstrating robust performance across layers.
These cross-layer validation results across all sites, except Site 4, further substantiate the generalisability of the XGBoost and Random Forest-based models, trained on Cluster 1 of Site 4, for failure-prediction tasks. XGBoost consistently delivered robust and reliable performance, particularly for TBF and FNI predictions, where it closely matched its baseline model, while Random Forest also showed strong performance relative to its baseline model. The complexity of TTR prediction remained evident at both cluster and site levels. Overall, these findings confirm the suitability of XGBoost and Random Forest as effective generic models capable of sustaining accurate failure prediction across diverse DC layers and environments.

5.3. System Level Validations

The conclusions drawn from Section 5.1 on intra-platform validations, together with those from Section 5.2 on inter-platform validations, provide strong evidence of the models’ generalisability. Building on this confidence, this section evaluates their performance at the meta-datacentre (system) level, across the entire Grid5000 system.
Table 21 shows that the XGBoost-based model, trained on Cluster 1 of Site 4, achieved the highest TBF prediction accuracy with an R² score of 0.97, aligning precisely with the specialized baseline model trained and tested on the entire system, with only a 2% drop from the lower levels of the system hierarchy, confirming its robust generalisability without retraining. Random Forest also performed reasonably, achieving an R² of 0.87, a 10% decline from the lower levels. In contrast, Table 22 reveals ongoing TTR prediction challenges, with XGBoost at an R² of 0.22 and Random Forest at 0.39, both lagging behind their baseline equivalents and indicating persistent difficulty across all models. For FNI (Table 23), XGBoost achieved the highest accuracy of 94.28%, nearly identical to the baseline’s 95% and only a modest 5% drop from the lower levels, further reinforcing the evidence of its strong generalisability. However, the Random Forest-based model’s performance dropped sharply to 2.95%, compared with the baseline’s 4.03%, representing an approximately 80% decline relative to the lower levels.
These intra-platform, inter-platform, and system-level validation results for the models developed in Cluster 1 of Site 4 demonstrate their strong generalisability and robustness by performing similarly to specialized baseline models trained and tested on each respective level. XGBoost consistently delivered high TBF and FNI prediction accuracies across the DC system layers, with only slight performance drops at higher levels. The Random Forest method performed well, except at the system level. TTR prediction remains a challenge across all levels. Overall, the XGBoost-based model proved to be a reliable, generic solution for failure prediction in diverse DC setups, and Random Forest suits specific layers. Their accuracy across levels without retraining supports cost-effective and proactive resource management in DC systems.

6. Proposed System Architecture

This section introduces the conceptual architecture of a reliable DC system that incorporates the failure-prediction models proposed in Section 5. In addition, a practical case is presented to show the applicability of the proposed architecture in a real-world scenario.

6.1. Reliable Distributed Computing System Architecture: A Conceptual Model

Figure 7 illustrates the hierarchical architecture of a DC system designed to integrate failure-prediction modules across multiple levels (system, site, and cluster) to enhance reliability and optimise resource utilisation. The architecture comprises compute nodes organised into clusters that are grouped into geographically distributed sites, forming the overall DC system. Each level is equipped with a dedicated failure-prediction module (PM_CT for clusters, PM_ST for sites, and PM_SS for the system). These modules leverage the ML models from Section 5 to predict the key metrics TBF, TTR, and FNI, continuously producing predictions from the monitoring data collected in the corresponding data repositories (DR_CT for clusters, DR_ST for sites, and DR_SS for the system). The predictions are fed to the corresponding resource managers (RM_CT, RM_ST, and RM_SS), which are responsible for failure prevention and mitigation and employ various failure-aware resource provisioning and allocation policies to reliably assign compute resources to service requests. They also implement fault-tolerance mechanisms (e.g., check-pointing and migration) to minimise the impact of failures on hosted services and ensure service continuity [35]. Thus, all of these components work in a closed loop, with the prediction modules closing the loop.
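To illustrate this closed loop, the following conceptual sketch pairs a prediction module with its resource manager; all class and method names here are hypothetical and serve only to show the information flow of Figure 7, not an implementation of the Grid5000 tooling:

```python
class PredictionModule:
    """PM_CT / PM_ST / PM_SS: wraps trained models and a data repository."""

    def __init__(self, models, data_repository):
        self.models = models          # e.g., {"tbf": ..., "ttr": ..., "fni": ...}
        self.repo = data_repository   # DR_CT / DR_ST / DR_SS (assumed interface)

    def predict_next_failures(self):
        # Pull the latest monitoring features for this layer and return the
        # TBF, TTR, and FNI predictions that close the loop.
        features = self.repo.latest_features()
        return {name: model.predict(features)
                for name, model in self.models.items()}


class ResourceManager:
    """RM_CT / RM_ST / RM_SS: consumes predictions and triggers mitigation."""

    def __init__(self, prediction_module):
        self.pm = prediction_module

    def step(self):
        predictions = self.pm.predict_next_failures()
        for node in predictions["fni"]:
            # Failure-aware policy: e.g., checkpoint services on a node
            # predicted to fail and migrate them before the predicted TBF.
            self.mitigate(node)

    def mitigate(self, node):
        ...  # placeholder for check-pointing / migration logic [35]
```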

6.2. Balancing Reliability and Cost in Service Design

To demonstrate the practical applicability of the proposed architecture, we consider a cloud service provider managing a DC system, structured as shown in Figure 7. By leveraging the results from Section 5, the service provider can tailor services to align with the user’s cost and reliability priorities as follows:
  • Cost-Optimized Service (No Explicit Reliability Requirements): For a client submitting a data-analytics workload with strict cost constraints but no reliability requirements, the service provider can design a service around the system-level resource manager, RM_SS. It employs predictions from the PM_SS module, which achieves an R² of 0.97 for TBF with XGBoost (Table 21), enabling system-wide resource allocation with 81.38% system availability (Section 4.1) at a reduced computational cost. This balances affordability with a baseline level of reliability.
  • High-Reliability Service (No Budget Constraints): For a client submitting a mission-critical medical-imaging application demanding maximal uptime with no cost constraints, the service can be designed around cluster-level managers (RM_CT) using predictions from PM_CT modules, offering 100% FNI accuracy and an R² of 1.0 for TBF in clusters such as s1/c2 (Table 9 and Table 11). This ensures a higher availability of 98.51% (Table 2) and minimal downtime.
  • Balanced Cost-Reliability Service: For users seeking a balance between reliability and cost, the service can be designed around site-level resource managers (RM_ST) utilising PM_ST predictions, such as an R² of 0.99 for TBF at Site 4 (Table 21) with an availability of 90.23% (Table 1). This approach strategically allocates resources across sites to balance service cost and reliability.
By integrating the hierarchical DC architecture with generalisable failure-prediction models, providers can dynamically address diverse user needs by maximising cost efficiency, reliability, or a hybrid of both while optimising infrastructure utilisation. However, the integration of ML models may influence failure patterns over time by adapting system operations, potentially requiring service providers to perform periodic model adjustments to maintain the predictive accuracy. This ongoing adaptability ensures the architecture remains responsive to changing DC conditions, enhancing its long-term effectiveness across diverse service requirements.

7. Conclusions and Future Work

This study presents a robust framework for developing generic machine-learning-based failure-prediction models tailored for large-scale distributed computing (DC) systems, addressing the need for cross-layer and cross-platform transferability. By leveraging the Grid5000 failure dataset, we designed and evaluated models based on Linear and Logistic Regression, Random Forest, and XGBoost to predict Time Between Failures (TBF), Time to Return/Repair (TTR), and Failing Node Identification (FNI). Our findings demonstrate that XGBoost consistently outperforms the other models, achieving near-perfect TBF prediction (R² up to 1.0) and high FNI accuracy (up to 100%) across the cluster, site, and system levels, with minimal performance drops at higher hierarchical levels. Random Forest also showed strong performance, particularly for TBF and FNI, while TTR prediction remained challenging across all models owing to the inherent complexity of repair-time estimation. The proposed models showed strong generalisability, enabling their application across diverse DC platforms without retraining, thereby reducing computational costs and management overhead. Additionally, we proposed a hierarchical DC system architecture that integrates failure-prediction modules, supported by a use case on balancing cost and reliability for diverse service requirements. This research lays the foundation for proactive, cost-effective, and reliable resource management in modern DC systems.
While generic models are cost-effective and easier to manage, a deeper analysis quantifying their practical benefits (e.g., computational cost and training time) based on the characteristics of the available data remains an open issue for our future work. Based on our observations, we expect that XGBoost, which delivers top performance (e.g., an R² of 1.0 for TBF), likely requires more resources owing to its ensemble nature; Linear Regression, with lower accuracy (e.g., 93% for TBF), probably uses fewer resources; and Random Forest, with balanced performance (e.g., 98% TBF accuracy), may have moderate training costs. In future work, we will propose models that link these costs to characteristics of the available data (size, mean, standard deviation, etc.). Additionally, we will explore advanced time-series models, including the autoregressive integrated moving average (ARIMA), to enhance prediction accuracy, specifically for TTR. We will also extend our analysis to datasets from other DC environments, such as Google and Los Alamos National Laboratory (LANL), to provide a more comprehensive evaluation of our proposed models. Furthermore, we aim to simulate and implement the proposed hierarchical DC system architecture with integrated prediction modules and use cases, allowing rigorous performance evaluation under real-time dynamic scenarios.

Author Contributions

Conceptualization, S.J., Y.S. and J.T.; Methodology, S.J., Y.S. and J.T.; Software, S.J.; Validation, S.J., Y.S. and J.T.; Formal analysis, S.J., Y.S. and J.T.; Investigation, Y.S.; Resources, J.T.; Data curation, S.J.; Writing—original draft, S.J.; Writing—review & editing, Y.S. and J.T.; Visualization, S.J.; Supervision, J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study was partially supported by the Knowledge Foundation of Sweden (KKS).

Data Availability Statement

All data and source codes for all algorithms are publicly available through GitHub at https://github.com/yogeshwsu/Generic_Failure_Prediction_Models.git (accessed on 21 August 2025) or https://tinyurl.com/55jwkcty (accessed on 21 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mariani, L.; Pezzè, M.; Riganelli, O.; Xin, R. Predicting failures in multi-tier distributed systems. J. Syst. Softw. 2020, 161, 110464. [Google Scholar] [CrossRef]
  2. Jassas, M.S.; Mahmoud, Q.H. Analysis of job failure and prediction model for cloud computing using machine learning. Sensors 2022, 22, 2035. [Google Scholar] [CrossRef] [PubMed]
  3. Sharma, Y.; Javadi, B.; Si, W.; Sun, D. Reliability and energy efficiency in cloud computing systems: Survey and taxonomy. J. Netw. Comput. Appl. 2016, 74, 66–85. [Google Scholar] [CrossRef]
  4. Tengku Asmawi, T.N.; Ismail, A.; Shen, J. Cloud failure prediction based on traditional machine learning and deep learning. J. Cloud Comput. 2022, 11, 47. [Google Scholar] [CrossRef]
  5. Javadi, B.; Kondo, D.; Iosup, A.; Epema, D. The Failure Trace Archive: Enabling the comparison of failure measurements and models of distributed systems. J. Parallel Distrib. Comput. 2013, 73, 1208–1223. [Google Scholar] [CrossRef]
  6. Hochenbaum, J.; Vallis, O.S.; Kejariwal, A. Automatic anomaly detection in the cloud via statistical learning. arXiv 2017, arXiv:1704.07706. [Google Scholar] [CrossRef]
  7. Kourtis, M.A.; Xilouris, G.; Gardikis, G.; Koutras, I. Statistical-based anomaly detection for NFV services. In Proceedings of the 2016 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), Palo Alto, CA, USA, 7–10 November 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 161–166. [Google Scholar]
  8. Wilson, D.; Veeramachaneni, K.; O’Reilly, U.M. Cloud scale distributed evolutionary strategies for high dimensional problems. In Proceedings of the European Conference on the Applications of Evolutionary Computation, Vienna, Austria, 3–5 April 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 519–528. [Google Scholar]
  9. Fredriksson, T.; Mattos, D.I.; Bosch, J.; Olsson, H.H. Data labeling: An empirical investigation into industrial challenges and mitigation strategies. In Proceedings of the International Conference on Product-Focused Software Process Improvement, Turin, Italy, 25–27 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 202–216. [Google Scholar]
  10. Tang, X.; Machimura, T.; Li, J.; Liu, W.; Hong, H. A novel optimized repeatedly random undersampling for selecting negative samples: A case study in an SVM-based forest fire susceptibility assessment. J. Environ. Manag. 2020, 271, 111014. [Google Scholar] [CrossRef]
  11. Gao, J.; Wang, H.; Shen, H. Task failure prediction in cloud data centers using deep learning. IEEE Trans. Serv. Comput. 2020, 15, 1411–1422. [Google Scholar] [CrossRef]
  12. Shen, J.; Ren, Y.; Wan, J.; Lan, Y. Hard disk drive failure prediction for mobile edge computing based on an LSTM recurrent neural network. Mob. Inf. Syst. 2021, 2021, 8878364. [Google Scholar] [CrossRef]
  13. Abbasimehr, H.; Paki, R. Improving time series forecasting using LSTM and attention models. J. Ambient Intell. Humaniz. Comput. 2022, 13, 673–691. [Google Scholar] [CrossRef]
  14. Das, A.; Mueller, F.; Siegel, C.; Vishnu, A. Desh: Deep learning for system health prediction of lead times to failure in hpc. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, Tempe, AZ, USA, 11–15 June 2018; pp. 40–51. [Google Scholar]
  15. Tang, X. Large-scale computing systems workload prediction using parallel improved LSTM neural network. IEEE Access 2019, 7, 40525–40533. [Google Scholar] [CrossRef]
  16. Ismail, A.A.; Gunady, M.; Corrada Bravo, H.; Feizi, S. Benchmarking deep learning interpretability in time series predictions. Adv. Neural Inf. Process. Syst. 2020, 33, 6441–6452. [Google Scholar]
  17. Thompson, N.C.; Greenewald, K.; Lee, K.; Manso, G.F. Deep learning’s diminishing returns: The cost of improvement is becoming unsustainable. IEEE Spectr. 2021, 58, 50–55. [Google Scholar] [CrossRef]
  18. Gaykar, R.S.; Khanaa, V.; Joshi, S.D. A hybrid supervised learning approach for detection and mitigation of job failure with virtual machines in distributed environments. Ing. Des Syst. D’Inf. 2022, 27, 621. [Google Scholar] [CrossRef]
  19. Zhong, Y.; Chen, W.; Wang, Z.; Chen, Y.; Wang, K.; Li, Y.; Yin, X.; Shi, X.; Yang, J.; Li, K. HELAD: A novel network anomaly detection model based on heterogeneous ensemble learning. Comput. Netw. 2020, 169, 107049. [Google Scholar] [CrossRef]
  20. Mohammed, A.; Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 757–774. [Google Scholar] [CrossRef]
  21. Balouek, D.; Carpen Amarie, A.; Charrier, G.; Desprez, F.; Jeannot, E.; Jeanvoine, E.; Lèbre, A.; Margery, D.; Niclausse, N.; Nussbaum, L.; et al. Adding Virtualization Capabilities to the Grid’5000 Testbed. In Cloud Computing and Services Science; Communications in Computer and Information Science; Ivanov, I.I., van Sinderen, M., Leymann, F., Shan, T., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2013; Volume 367, pp. 3–20. [Google Scholar] [CrossRef]
  22. Cappello, F.; Caron, E.; Dayde, M.; Desprez, F.; Jégou, Y.; Primet, P.; Jeannot, E.; Lanteri, S.; Leduc, J.; Melab, N.; et al. Grid’5000: A large scale and highly reconfigurable grid experimental testbed. In Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing, 2005, Seattle, WA, USA, 13–15 November 2005; IEEE: Piscataway, NJ, USA, 2005; p. 8. [Google Scholar]
  23. Bisong, E.; Bisong, E. Google colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Apress: Berkeley, CA, USA, 2019; pp. 59–64. [Google Scholar]
  24. Numpy 1.25.2. 2023. Available online: https://pypi.org/project/numpy/1.25.2/ (accessed on 26 December 2024).
  25. Pandas Document. 2023. Available online: https://tinyurl.com/4zvwsyt9 (accessed on 26 December 2024).
  26. scikit-learn: Machine Learning in Python. 2023. Available online: https://scikit-learn.org/stable/ (accessed on 26 December 2024).
  27. Bergui, M.; Hourri, S.; Najah, S.; Nikolov, N.S. Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques. J. Big Data 2024, 11, 98. [Google Scholar] [CrossRef]
  28. Linear Regression. 2024. Available online: https://tinyurl.com/mr3fkeph (accessed on 7 March 2025).
  29. Adamu, H.; Mohammed, B.; Maina, A.B.; Cullen, A.; Ugail, H.; Awan, I. An approach to failure prediction in a cloud based environment. In Proceedings of the 2017 IEEE 5th International Conference on Future Internet of Things and Cloud (FiCloud), Prague, Czech Republic, 21–23 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 191–197. [Google Scholar]
  30. Gupta, N.; Vaidya, N.H. Byzantine fault-tolerant parallelized stochastic gradient descent for linear regression. In Proceedings of the 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 24–29 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 415–420. [Google Scholar]
  31. Shi, P.; Wang, P.; Zhang, H. Distributed Logistic Regression for Separated Massive Data. In Proceedings of the CCF Conference on Big Data, Wuhan, China, 26–28 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 285–296. [Google Scholar]
  32. Random Forest. 2025. Available online: https://tinyurl.com/2ecnyukc (accessed on 7 March 2025).
  33. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. Xgboost: Extreme Gradient Boosting; R Package Version 0.4-2; 2015; Volume 1, pp. 1–4. [Google Scholar]
  34. Jagannathan, S.; Sharma, Y.; Taheri, J. Generic_Failure_Prediction_Models. 2025. Available online: https://tinyurl.com/55jwkcty (accessed on 3 August 2025).
  35. Sharma, Y.; Taheri, J.; Si, W.; Sun, D.; Javadi, B. Dynamic resource provisioning for sustainable cloud computing systems in the presence of correlated failures. IEEE Trans. Sustain. Comput. 2020, 6, 641–654. [Google Scholar] [CrossRef]
Figure 1. Multilayered Distributed Computing (DC) system consisting of a hierarchical arrangement of compute Sites (ST), Clusters (CL), and Nodes (CN).
Figure 2. Multistage methodology used to predict failures in Distributed Computing (DC) systems.
Figure 3. Frequency distributions of Numerical (a,b) and Categorical (c,d) features in Grid5000 failure traces.
Figure 4. Feature importance for Time Between Failures (TBF) prediction: (a) Linear Regression, (b) Random Forest, (c) XGBoost.
Figure 5. Feature importance for Time to Return/Repair (TTR): (a) Linear Regression, (b) Random Forest, (c) XGBoost.
Figure 6. Feature importance for Failing Node Identification (FNI): (a) Logistic Regression, (b) Random Forest, (c) XGBoost.
Figure 7. Hierarchical distributed computing system architecture with failure-prediction modules.
Table 1. Site-level statistical analysis of Grid5000 failure traces.

| Site | #Nodes | #Failures | MTBF | SD(MTBF) | CV(MTBF) | MTTR | SD(MTTR) | CV(MTTR) | Availability | Maintainability |
|------|--------|-----------|------|----------|----------|------|----------|----------|--------------|-----------------|
| s1 | 259 | 69,234 | 28.77 | 73.62 | 2.56 | 13.36 | 64.93 | 4.86 | 68.29 | 31.71 |
| s2 | 95 | 8161 | 41.31 | 109.66 | 2.65 | 14.11 | 72.63 | 5.15 | 74.57 | 25.43 |
| s3 | 57 | 5760 | 102.71 | 196.58 | 1.91 | 5.85 | 89.97 | 15.39 | 94.62 | 5.38 |
| s4 | 163 | 37,470 | 37.47 | 89.72 | 2.39 | 4.06 | 31.27 | 7.70 | 90.23 | 9.77 |
| s5 | 125 | 7509 | 79.55 | 271.63 | 3.41 | 36.02 | 217.04 | 6.03 | 68.83 | 31.17 |
| s6 | 127 | 14,115 | 89.73 | 207.11 | 2.31 | 6.21 | 74.45 | 11.98 | 93.52 | 6.48 |
| s7 | 46 | 21,120 | 14.57 | 33.98 | 2.33 | 0.96 | 15.32 | 15.97 | 93.82 | 6.18 |
| s8 | 342 | 123,467 | 22.86 | 46.82 | 2.05 | 4.29 | 39.18 | 9.12 | 84.19 | 15.81 |
| s9 | 74 | 7309 | 27.62 | 72.48 | 2.62 | 6.34 | 39.73 | 6.26 | 81.32 | 18.68 |
Table 2. Cluster-level statistical analysis of Grid5000 failure traces.

| Cluster | #Nodes | #Failures | MTBF | SD(MTBF) | CV(MTBF) | MTTR | SD(MTTR) | CV(MTTR) | Availability | Maintainability |
|---------|--------|-----------|------|----------|----------|------|----------|----------|--------------|-----------------|
| s1/c1 | 64 | 13,067 | 43.27 | 111.13 | 2.57 | 21.15 | 91.07 | 4.31 | 63.12 | 36.88 |
| s1/c2 | 64 | 25,461 | 22.26 | 41.09 | 1.85 | 10.22 | 40.75 | 3.99 | 74.88 | 25.12 |
| s1/c3 | 99 | 25,954 | 26.14 | 62.27 | 2.38 | 6.26 | 46.84 | 7.48 | 77.33 | 22.67 |
| s1/c4 | 32 | 4752 | 38.23 | 117.71 | 3.08 | 47.50 | 126.78 | 2.67 | 34.47 | 65.53 |
| s2/c1 | 95 | 8161 | 41.37 | 109.66 | 2.65 | 14.11 | 72.63 | 5.15 | 74.57 | 25.43 |
| s3/c1 | 57 | 5760 | 102.71 | 196.58 | 1.91 | 5.85 | 89.97 | 15.39 | 94.62 | 5.38 |
| s4/c1 | 106 | 34,541 | 34.40 | 85.00 | 2.47 | 4.31 | 32.24 | 7.48 | 88.87 | 11.13 |
| s4/c2 | 57 | 2929 | 73.76 | 127.94 | 1.73 | 1.12 | 15.51 | 13.90 | 98.51 | 1.49 |
| s5/c1 | 56 | 5348 | 101.23 | 316.88 | 3.13 | 29.11 | 167.98 | 5.77 | 77.67 | 22.33 |
| s5/c2 | 69 | 2161 | 26.47 | 68.13 | 2.57 | 53.12 | 305.73 | 5.76 | 33.26 | 66.74 |
| s6/c1 | 103 | 12,998 | 94.44 | 213.31 | 2.26 | 5.40 | 67.86 | 12.57 | 94.59 | 5.41 |
| s6/c2 | 24 | 1117 | 35.93 | 99.69 | 2.77 | 15.69 | 127.97 | 8.16 | 69.61 | 30.39 |
| s7/c1 | 46 | 21,120 | 14.57 | 33.98 | 2.33 | 0.96 | 15.32 | 15.97 | 93.82 | 6.18 |
| s8/c1 | 342 | 123,467 | 22.86 | 46.82 | 2.05 | 4.29 | 39.18 | 9.12 | 84.19 | 15.81 |
| s9/c1 | 74 | 7309 | 27.62 | 72.48 | 2.62 | 6.34 | 39.73 | 6.26 | 81.32 | 18.68 |
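For ease of interpretation, the CV, Availability, and Maintainability columns in Tables 1 and 2 are consistent with the standard reliability definitions sketched below (e.g., for site s3: 102.71/(102.71 + 5.85) × 100 ≈ 94.62). This is a reading aid, not a quotation of the authors' derivation.

```latex
% Definitions consistent with the values tabulated above
\mathrm{CV}(X) = \frac{\mathrm{SD}(X)}{\mathrm{mean}(X)}, \qquad
\mathrm{Availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF}+\mathrm{MTTR}} \times 100\%, \qquad
\mathrm{Maintainability} = \frac{\mathrm{MTTR}}{\mathrm{MTBF}+\mathrm{MTTR}} \times 100\%
```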
Table 3. Time Between Failures (TBF) prediction results for Cluster 1 of Site 4.

| Split | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|-------|------|------|------|------|------|------|------|------|------|
| Training | 2,907,292.09 | 3,550,861 | 0.93 | 1,391,069.55 | 1,979,144.19 | 0.98 | 528,141.4 | 764,059.13 | 1 |
| Testing | 2,909,707.78 | 3,549,290.19 | 0.93 | 1,425,669.2 | 2,008,907.7 | 0.98 | 547,655.48 | 797,043.5 | 1 |
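For readers reproducing these figures, the sketch below shows how the MAE, RMSE, and R² values reported in Tables 3–22 can be computed with the standard scikit-learn metrics [26]. It is a minimal illustration: `y_true` and `y_pred` are placeholder arrays, not data from the paper, whose own evaluation code lives in the companion repository [34].

```python
# Minimal sketch of the regression metrics reported in the tables,
# assuming the standard scikit-learn definitions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return (MAE, RMSE, R²); MAE and RMSE are in the target's own unit."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2

# Usage with placeholder arrays:
# mae, rmse, r2 = regression_report(np.array([1.0, 2.0, 3.0]),
#                                   np.array([1.1, 1.9, 3.2]))
```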
Table 4. Hyperparameters for Random Forest and XGBoost models for Time Between Failures (TBF) prediction for Cluster 1 of Site 4.

| Hyperparameter | Random Forest | XGBoost |
|----------------|---------------|---------|
| Number of Trees (n_estimators) | 10 | 100 |
| Maximum Depth (max_depth) | 5 | 5 |
| Random Seed (random_state) | 42 | 42 |
| Feature Subsampling (max_features) | 1.0 | Not tuned (default: 1.0) |
| Minimum Samples per Split (min_samples_split) | 2 | Not tuned (default: 1) |
| Minimum Samples per Leaf (min_samples_leaf) | 1 | Not tuned (default: 1) |
| Learning Rate (learning_rate) | NA | 0.1 |
| Bootstrap Sampling (bootstrap) | True | Not tuned (default: 1.0) |
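As an illustration of how the Table 4 settings map onto estimator objects, the sketch below instantiates both regressors with the tuned values and leaves the "Not tuned" XGBoost parameters at their library defaults. This mirrors the table, not the authors' exact training script (available in the companion repository [34]); variable names are placeholders.

```python
# Sketch of the Table 4 hyperparameters expressed as estimator configurations.
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

rf_tbf = RandomForestRegressor(
    n_estimators=10,    # Number of Trees
    max_depth=5,        # Maximum Depth
    random_state=42,    # Random Seed
    max_features=1.0,   # Feature Subsampling
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,
)

xgb_tbf = XGBRegressor(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    random_state=42,
    # Remaining parameters left at library defaults ("Not tuned" in Table 4).
)

# rf_tbf.fit(X_train, y_train)  # X_train/y_train: placeholder feature/target arrays
# xgb_tbf.fit(X_train, y_train)
```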
Table 5. Time to Return/Repair (TTR) prediction results for Cluster 1 of Site 4.

| Split | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|-------|------|------|------|------|------|------|------|------|------|
| Training | 82,799.85 | 241,987.31 | 0.06 | 25,152.81 | 85,626.94 | 0.88 | 72,240.72 | 180,713.41 | 0.48 |
| Testing | 80,839.64 | 173,716.85 | 0.09 | 63,601.27 | 168,353.66 | 0.15 | 71,728.74 | 161,147.80 | 0.22 |
Table 6. Hyperparameters for Random Forest and XGBoost models for Time to Return/Repair (TTR) prediction for Cluster 1 of Site 4.

| Hyperparameter | Random Forest | XGBoost |
|----------------|---------------|---------|
| Number of Trees (n_estimators) | 5 | 5 |
| Maximum Depth (max_depth) | unlimited | unlimited |
| Random Seed (random_state) | 42 | 42 |
| Feature Subsampling (max_features) | 1.0 | Not tuned (default: 1.0) |
| Minimum Samples per Split (min_samples_split) | 2 | Not tuned (default: 1) |
| Minimum Samples per Leaf (min_samples_leaf) | 1 | Not tuned (default: 1) |
| Learning Rate (learning_rate) | NA | 0.3 |
| Bootstrap Sampling (bootstrap) | True | Not tuned (default: 1.0) |
Table 7. Failing Node Identification (FNI) for Cluster 1 of Site 4 (s4/c1).

| Logistic Regression | Random Forest | XGBoost |
|---------------------|---------------|---------|
| 2.73% | 97% | 99% |
Table 8. Hyperparameters for Logistic Regression, Random Forest, and XGBoost models for Failing Node Identification (FNI) for Cluster 1 of Site 4.

| Hyperparameter | Logistic Regression | Random Forest | XGBoost |
|----------------|---------------------|---------------|---------|
| Number of Trees (n_estimators) | NA | 10 | 100 |
| Maximum Depth (max_depth) | unlimited | unlimited | unlimited |
| Random Seed (random_state) | 42 | 42 | 42 |
| Feature Subsampling (max_features) | NA | ‘sqrt’ | 1.0 |
| Minimum Samples per Split (min_samples_split) | NA | 2 | 1 |
| Minimum Samples per Leaf (min_samples_leaf) | NA | 1 | 1 |
| Learning Rate (learning_rate) | NA | NA | 0.3 |
| Bootstrap Sampling (bootstrap) | NA | True | 1.0 |
| Objective | penalty: ‘l2’ | criterion: ‘gini’ | objective: ‘multi:softprob’ |
| Regularization Strength (C) | 1.0 | NA | NA |
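Similarly, a minimal sketch of the Table 8 classifier configurations is given below. It is illustrative only: "unlimited" depth is expressed as `max_depth=None` in scikit-learn, while the XGBoost depth is left at the library default, since the sklearn-style XGBoost wrapper has no direct "unlimited" setting.

```python
# Sketch of the Table 8 classifier configurations; illustrative, not the
# authors' training script.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

lr_fni = LogisticRegression(penalty="l2", C=1.0, random_state=42)

rf_fni = RandomForestClassifier(
    n_estimators=10,
    max_depth=None,        # "unlimited" in Table 8
    max_features="sqrt",
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,
    criterion="gini",
    random_state=42,
)

xgb_fni = XGBClassifier(
    n_estimators=100,
    learning_rate=0.3,
    objective="multi:softprob",  # one class per candidate node
    random_state=42,
    # max_depth left at the library default; Table 8 lists it as unlimited.
)
```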
Table 9. Time Between Failures (TBF) prediction for Cluster 2 of Site 4 (s4/c2).

| Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|-------|------|------|------|------|------|------|------|------|------|
| Baseline | 1,516,222.63 | 1,878,392.72 | 0.8 | 711,677.83 | 938,202.09 | 0.95 | 249,227.43 | 340,042.93 | 0.99 |
| Generic | 1,503,313.88 | 1,863,247.75 | 0.8 | 748,930.95 | 1,000,889.96 | 0.94 | 311,125.64 | 427,993.66 | 0.99 |
Table 10. Time to Return/Repair (TTR) prediction for Cluster 2 of Site 4 (s4/c2).

| Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|-------|------|------|------|------|------|------|------|------|------|
| Baseline | 168,434.33 | 320,717.22 | 0.16 | 54,213.23 | 143,246.53 | 0.83 | 136,879.06 | 274,331.86 | 0.39 |
| Generic | 170,774.77 | 335,822.34 | 0.16 | 135,691.16 | 329,574.96 | 0.19 | 149,090.82 | 313,381.94 | 0.27 |
Table 11. Failing Node Identification (FNI) for Cluster 2 of Site 4 (s4/c2).

| Model | Logistic Regression | Random Forest | XGBoost |
|-------|---------------------|---------------|---------|
| Baseline | 5.76% | 91% | 90% |
| Generic | 4.01% | 89% | 90% |
Table 12. Time Between Failures (TBF) prediction for Site 4 (s4).

| Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|-------|------|------|------|------|------|------|------|------|------|
| Baseline | 2,929,578.49 | 3,639,185.1 | 0.93 | 1,504,051.79 | 2,236,507.97 | 0.97 | 587,965.81 | 843,305.69 | 1 |
| Generic | 2,970,018.64 | 3,677,263.41 | 0.93 | 1,492,496.03 | 2,197,940.25 | 0.97 | 608,063.14 | 876,199.7 | 1 |
Table 13. Time to Return/Repair (TTR) prediction for Site 4 (s4).

| Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|-------|------|------|------|------|------|------|------|------|------|
| Baseline | 90,910.18 | 250,519.23 | 0.07 | 27,631.40 | 96,427.20 | 0.86 | 79,319.45 | 194,747.05 | 0.44 |
| Generic | 89,362.67 | 190,010.69 | 0.10 | 71,980.13 | 297,582.00 | −1.21 | 80,320.44 | 203,852.45 | −0.04 |
Table 14. Failing Node Identification (FNI) for Site 4.

| Model | Logistic Regression | Random Forest | XGBoost |
|-------|---------------------|---------------|---------|
| Baseline | 2.83% | 97% | 98% |
| Generic | 1.59% | 96% | 98% |
Table 15. Time Between Failures (TBF) prediction for clusters.

| Cluster | Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|---------|-------|------|------|------|------|------|------|------|------|------|
| s1/c1 | Baseline | 2,199,158.49 | 3,268,040.41 | 0.88 | 1,324,695.21 | 2,316,888.16 | 0.94 | 347,184.94 | 559,061.22 | 1 |
| s1/c1 | Generic | 2,189,153.88 | 3,218,301.35 | 0.88 | 1,348,533.01 | 2,346,680.86 | 0.94 | 379,420.4 | 632,303.97 | 1 |
| s1/c2 | Baseline | 2,659,912.14 | 3,273,118.61 | 0.95 | 1,749,929.32 | 2,314,754.93 | 0.98 | 596,981.65 | 791,038.14 | 1 |
| s1/c2 | Generic | 2,680,141.68 | 3,298,153.24 | 0.95 | 1,739,446.02 | 2,300,283.33 | 0.98 | 605,162.2 | 804,285.57 | 1 |
| s1/c3 | Baseline | 3,233,986.4 | 4,122,216.75 | 0.77 | 970,806.41 | 1,500,475.02 | 0.97 | 459,026.57 | 709,658.3 | 0.99 |
| s1/c3 | Generic | 3,262,003.85 | 4,140,497.15 | 0.77 | 995,052.43 | 1,552,075.99 | 0.97 | 476,904.74 | 742,083.37 | 0.99 |
| s1/c4 | Baseline | 5,405,297.07 | 7,121,341.45 | 0.68 | 1,230,335.78 | 1,693,237.59 | 0.98 | 299,407.66 | 438,295.97 | 1 |
| s1/c4 | Generic | 5,565,339.12 | 7,293,853.22 | 0.68 | 1,293,780.17 | 1,769,684.09 | 0.98 | 368,256.01 | 587,340.22 | 1 |
| s5/c1 | Baseline | 3,908,274.31 | 6,371,524.43 | 0.61 | 1,463,884.23 | 2,141,642.88 | 0.96 | 500,899.21 | 759,944.39 | 0.99 |
| s5/c1 | Generic | 4,048,997.78 | 6,616,939.18 | 0.59 | 1,524,337.44 | 2,287,389.78 | 0.95 | 596,031.88 | 1,017,532.05 | 0.99 |
| s5/c2 | Baseline | 723,416.46 | 1,077,801.67 | 0.62 | 259,957.17 | 418,968.67 | 0.94 | 78,197.26 | 118,401.72 | 1 |
| s5/c2 | Generic | 796,535.8 | 1,218,561.45 | 0.6 | 272,547.21 | 437,900.86 | 0.95 | 104,169.1 | 180,896.82 | 0.999 |
| s6/c1 | Baseline | 3,153,144.41 | 4,279,153.41 | 0.8 | 1,501,901.58 | 2,182,402.99 | 0.95 | 780,128.98 | 1,180,636.25 | 0.99 |
| s6/c1 | Generic | 3,081,761.98 | 4,162,700.42 | 0.81 | 1,540,490.79 | 2,264,878.39 | 0.94 | 833,206.58 | 1,284,525.99 | 0.98 |
| s6/c2 | Baseline | 1,298,294.51 | 1,539,855.13 | 0.72 | 312,892.46 | 569,396.19 | 0.96 | 102,592.13 | 194,585.32 | 1 |
| s6/c2 | Generic | 1,334,536.26 | 1,563,239.86 | 0.72 | 337,260.72 | 604,949.25 | 0.96 | 202,532.96 | 403,870.33 | 0.98 |
Table 16. Time to Return/Repair (TTR) prediction for clusters.

| Cluster | Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|---------|-------|------|------|------|------|------|------|------|------|------|
| s1/c1 | Baseline | 152,763.12 | 349,213.89 | 0.07 | 15,694.34 | 77,177.63 | 0.95 | 103,916.68 | 215,901.53 | 0.64 |
| s1/c1 | Generic | 157,409.89 | 381,882.06 | 0.06 | 43,177.68 | 182,658.91 | 0.78 | 109,583.82 | 240,805.48 | 0.63 |
| s1/c2 | Baseline | 69,610.00 | 147,501.28 | 0.03 | 17,886.32 | 51,671.62 | 0.88 | 61,943.50 | 127,711.20 | 0.27 |
| s1/c2 | Generic | 69,486.31 | 144,658.84 | 0.03 | 47,007.45 | 108,098.98 | 0.46 | 63,099.48 | 126,937.25 | 0.25 |
| s1/c3 | Baseline | 82,588.79 | 193,336.03 | 0.07 | 17,144.05 | 68,314.25 | 0.88 | 61,183.82 | 155,740.73 | 0.40 |
| s1/c3 | Generic | 82,699.51 | 199,264.32 | 0.06 | 44,046.25 | 156,888.36 | 0.42 | 61,663.23 | 167,280.48 | 0.34 |
| s1/c4 | Baseline | 214,730.28 | 446,047.88 | 0.03 | 29,888.69 | 117,178.39 | 0.93 | 152,044.45 | 310,917.54 | 0.53 |
| s1/c4 | Generic | 207,948.02 | 378,717.08 | 0.04 | 85,154.97 | 283,222.59 | 0.46 | 156,238.60 | 295,533.46 | 0.41 |
| s5/c1 | Baseline | 336,979.00 | 867,680.11 | 0.07 | 71,563.04 | 278,300.57 | 0.90 | 192,955.42 | 517,171.39 | 0.67 |
| s5/c1 | Generic | 357,527.44 | 977,364.90 | 0.07 | 218,493.89 | 769,349.08 | 0.42 | 240,967.19 | 769,882.47 | 0.42 |
| s5/c2 | Baseline | 219,572.57 | 640,135.30 | 0.38 | 23,718.38 | 90,259.35 | 0.99 | 78,866.88 | 201,647.32 | 0.94 |
| s5/c2 | Generic | 226,461.26 | 663,247.17 | 0.13 | 67,366.99 | 299,789.34 | 0.82 | 86,299.34 | 284,238.21 | 0.84 |
| s6/c1 | Baseline | 255,238.28 | 558,095.17 | 0.10 | 63,722.97 | 222,363.16 | 0.86 | 179,200.94 | 432,513.96 | 0.46 |
| s6/c1 | Generic | 252,873.31 | 581,151.54 | 0.09 | 167,258.92 | 521,851.38 | 0.27 | 187,279.23 | 505,451.88 | 0.31 |
| s6/c2 | Baseline | 159,384.58 | 419,298.69 | 0.04 | 52,639.95 | 187,566.56 | 0.81 | 102,106.25 | 286,142.70 | 0.55 |
| s6/c2 | Generic | 151,952.16 | 346,023.51 | 0.03 | 146,792.12 | 412,390.40 | −0.38 | 121,590.90 | 322,413.14 | 0.16 |
Table 17. Failing Node Identification (FNI) for clusters.

| Cluster | Model | Logistic Regression | Random Forest | XGBoost |
|---------|-------|---------------------|---------------|---------|
| s1/c1 | Baseline | 2.01% | 70% | 72% |
| s1/c1 | Generic | 1.70% | 68% | 71% |
| s1/c2 | Baseline | 3.78% | 100% | 100% |
| s1/c2 | Generic | 2.76% | 100% | 100% |
| s1/c3 | Baseline | 4.52% | 89% | 92% |
| s1/c3 | Generic | 2.16% | 86% | 91% |
| s1/c4 | Baseline | 9.63% | 84% | 87% |
| s1/c4 | Generic | 9.47% | 83% | 85% |
| s5/c1 | Baseline | 7.11% | 96% | 97% |
| s5/c1 | Generic | 6.37% | 96% | 97% |
| s5/c2 | Baseline | 6.47% | 99% | 100% |
| s5/c2 | Generic | 3.55% | 98% | 100% |
| s6/c1 | Baseline | 3.57% | 71% | 77% |
| s6/c1 | Generic | 2.19% | 66% | 76% |
| s6/c2 | Baseline | 8.29% | 100% | 100% |
| s6/c2 | Generic | 6.43% | 100% | 100% |
Table 18. Time Between Failures (TBF) prediction for sites.

| Site | Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|------|-------|------|------|------|------|------|------|------|------|------|
| s1 | Baseline | 3,594,633.79 | 4,804,189.99 | 0.89 | 2,610,815.28 | 4,170,278.14 | 0.92 | 902,671.41 | 1,344,152.73 | 0.99 |
| s1 | Generic | 3,603,998.9 | 4,820,869.4 | 0.89 | 2,633,029.66 | 4,183,762.66 | 0.92 | 922,609.09 | 1,375,960.08 | 0.99 |
| s2 | Baseline | 1,875,998.15 | 2,654,180.88 | 0.89 | 1,298,825.62 | 1,832,759 | 0.95 | 388,923.52 | 542,769.17 | 1 |
| s2 | Generic | 1,864,392.5 | 2,632,621.32 | 0.89 | 1,297,918.93 | 1,862,982.86 | 0.94 | 435,750.29 | 623,398.86 | 0.99 |
| s3 | Baseline | 2,243,590.35 | 3,359,843.84 | 0.88 | 1,474,401.75 | 2,171,123.02 | 0.95 | 579,448.46 | 781,626.05 | 0.99 |
| s3 | Generic | 2,308,853.7 | 3,520,793.58 | 0.87 | 1,589,971.91 | 2,469,831.48 | 0.94 | 667,541.89 | 936,021.59 | 0.99 |
| s5 | Baseline | 3,371,720.22 | 5,644,021.54 | 0.64 | 1,629,967.04 | 2,527,218.54 | 0.93 | 468,405.11 | 767,984.11 | 0.99 |
| s5 | Generic | 3,320,485.51 | 5,528,168.72 | 0.65 | 1,713,441.11 | 2,600,937.34 | 0.92 | 523,765.47 | 891,070.42 | 0.99 |
| s6 | Baseline | 3,015,870.24 | 4,134,077.2 | 0.82 | 1,489,560.71 | 2,233,483.46 | 0.95 | 788,098.41 | 1,201,042.1 | 0.98 |
| s6 | Generic | 3,013,126.08 | 4,107,923.41 | 0.81 | 1,486,652.87 | 2,231,897.67 | 0.94 | 846,030.15 | 1,297,938.08 | 0.98 |
| s7 | Baseline | 3,812,223.14 | 4,685,561.77 | 0.59 | 1,355,112.79 | 2,109,539.43 | 0.92 | 468,925.3 | 742,990.44 | 0.99 |
| s7 | Generic | 3,769,956.96 | 4,637,465.53 | 0.6 | 1,359,839.76 | 2,114,679.41 | 0.92 | 480,980.74 | 761,642.39 | 0.99 |
| s8 | Baseline | 2,934,946.18 | 3,850,773.55 | 0.93 | 2,717,223.55 | 3,712,231.89 | 0.94 | 1,192,225.37 | 1,733,187.45 | 0.99 |
| s8 | Generic | 2,942,925.57 | 3,865,924.17 | 0.93 | 2,738,509.14 | 3,748,577.23 | 0.93 | 1,207,040.17 | 1,751,338.91 | 0.99 |
| s9 | Baseline | 1,568,903.71 | 2,132,447.66 | 0.38 | 509,430.12 | 794,061.51 | 0.91 | 221,646.51 | 338,657.31 | 0.98 |
| s9 | Generic | 1,537,169.02 | 2,107,557.67 | 0.39 | 521,487.09 | 807,399.1 | 0.91 | 247,652.1 | 384,438.38 | 0.98 |
Table 19. Time to Return/Repair (TTR) prediction for sites.

| Site | Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|------|-------|------|------|------|------|------|------|------|------|------|
| s1 | Baseline | 98,084.42 | 243,160.00 | 0.03 | 19,472.36 | 74,867.44 | 0.91 | 86,326.38 | 204,099.35 | 0.32 |
| s1 | Generic | 99,591.32 | 261,988.05 | 0.03 | 50,707.36 | 164,248.80 | 0.62 | 87,906.56 | 218,229.07 | 0.33 |
| s2 | Baseline | 129,295.72 | 344,254.98 | 0.04 | 40,150.03 | 145,700.24 | 0.83 | 97,977.73 | 223,790.45 | 0.59 |
| s2 | Generic | 123,680.56 | 277,375.12 | 0.04 | 99,671.44 | 246,767.06 | 0.24 | 101,645.88 | 266,755.79 | 0.11 |
| s3 | Baseline | 256,350.49 | 516,752.08 | 0.20 | 80,455.05 | 233,111.39 | 0.84 | 190,387.69 | 414,668.29 | 0.48 |
| s3 | Generic | 269,241.48 | 544,391.12 | 0.13 | 208,308.36 | 496,903.93 | 0.27 | 210,672.04 | 476,608.63 | 0.33 |
| s5 | Baseline | 313,675.73 | 855,227.09 | 0.12 | 59,524.0 | 257,933.16 | 0.92 | 184,782.26 | 530,638.40 | 0.66 |
| s5 | Generic | 295,104.21 | 730,672.56 | 0.14 | 148,396.99 | 498,143.30 | 0.60 | 184,808.14 | 490,943.98 | 0.61 |
| s6 | Baseline | 244,538.51 | 539,270.31 | 0.09 | 62,333.38 | 214,946.62 | 0.86 | 174,018.99 | 425,989.95 | 0.43 |
| s6 | Generic | 254,923.49 | 607,804.99 | 0.08 | 170,037.90 | 539,915.14 | 0.27 | 191,738.13 | 534,703.41 | 0.29 |
| s7 | Baseline | 42,818.57 | 95,503.49 | 0.08 | 11,834.98 | 44,314.83 | 0.80 | 32,363.12 | 85,351.86 | 0.27 |
| s7 | Generic | 40,743.34 | 85,885.05 | 0.10 | 27,689.87 | 85,674.92 | 0.10 | 31,179.39 | 79,177.38 | 0.23 |
| s8 | Baseline | 59,907.70 | 154,166.46 | 0.06 | 20,572.11 | 69,455.39 | 0.81 | 56,645.48 | 144,189.60 | 0.18 |
| s8 | Generic | 59,604.74 | 151,910.12 | 0.07 | 52,183.00 | 147,429.36 | 0.12 | 56,475.81 | 143,090.13 | 0.17 |
| s9 | Baseline | 88,555.11 | 208,842.04 | 0.08 | 23,839.88 | 82,151.00 | 0.86 | 62,367.80 | 154,173.24 | 0.50 |
| s9 | Generic | 84,642.17 | 193,272.17 | 0.07 | 60,390.07 | 174,247.89 | 0.24 | 64,478.85 | 163,684.96 | 0.33 |
Table 20. Failing Node Identification (FNI) for sites.

| Site | Model | Logistic Regression | Random Forest | XGBoost |
|------|-------|---------------------|---------------|---------|
| s1 | Baseline | 2.15% | 89% | 74% |
| s1 | Generic | 1.43% | 88% | 71% |
| s2 | Baseline | 4.31% | 99% | 100% |
| s2 | Generic | 3.79% | 99% | 99% |
| s3 | Baseline | 5.21% | 100% | 100% |
| s3 | Generic | 4.24% | 100% | 100% |
| s5 | Baseline | 5.89% | 98% | 99% |
| s5 | Generic | 5.56% | 97% | 98% |
| s6 | Baseline | 3.16% | 72% | 82% |
| s6 | Generic | 2.19% | 69% | 77% |
| s7 | Baseline | 6.14% | 96% | 98% |
| s7 | Generic | 5.79% | 95% | 97% |
| s8 | Baseline | 1.48% | 99% | 100% |
| s8 | Generic | 0.78% | 99% | 99% |
| s9 | Baseline | 6.23% | 100% | 100% |
| s9 | Generic | 4.06% | 100% | 100% |
Table 21. Time Between Failures (TBF) prediction for the system.

| Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|-------|------|------|------|------|------|------|------|------|------|
| Baseline | 4,919,256.24 | 6,540,568.72 | 0.78 | 3,770,052.3 | 5,124,686.5 | 0.87 | 1,546,876.4 | 2,217,474.9 | 0.97 |
| Generic | 4,917,627.14 | 6,556,851.1 | 0.78 | 3,764,432.5 | 5,113,083.8 | 0.87 | 1,554,655.7 | 1,554,655.7 | 0.97 |
Table 22. Time to Return/Repair (TTR) prediction for the system.

| Model | Linear Regression MAE | Linear Regression RMSE | Linear Regression R² | Random Forest MAE | Random Forest RMSE | Random Forest R² | XGBoost MAE | XGBoost RMSE | XGBoost R² |
|-------|------|------|------|------|------|------|------|------|------|
| Baseline | 93,686.57 | 277,284.75 | 0.08 | 26,289.11 | 103,791.21 | 0.87 | 84,889.04 | 248,664.81 | 0.26 |
| Generic | 94,398.16 | 274,163.97 | 0.07 | 67,538.24 | 224,107.79 | 0.39 | 86,020.85 | 251,870.95 | 0.22 |
Table 23. Failing Node Identification (FNI) for the system.

| Model | Logistic Regression | Random Forest | XGBoost |
|-------|---------------------|---------------|---------|
| Baseline | 1.15% | 4.03% | 95% |
| Generic | 0.70% | 2.95% | 94.28% |