Transformer-Driven Fault Detection in Self-Healing Networks: A Novel Attention-Based Framework for Adaptive Network Recovery

Dubey, Parul; Dubey, Pushkar; Bokoro, Pitshou N.

doi:10.3390/make7030067


by Parul Dubey ^1,*, Pushkar Dubey ^2 and Pitshou N. Bokoro ^3,*

^1 Symbiosis Institute of Technology, Nagpur Campus, Symbiosis International (Deemed University), Pune 440008, India
^2 Department of Management, Pandit Sundarlal Sharma (Open) University, Bilaspur 495009, India
^3 Department of Electrical Engineering Technology, University of Johannesburg, Johannesburg 2006, South Africa
^* Authors to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2025, 7(3), 67; https://doi.org/10.3390/make7030067

Submission received: 23 May 2025 / Revised: 30 June 2025 / Accepted: 14 July 2025 / Published: 16 July 2025


Abstract

Fault detection and remaining useful life (RUL) prediction are critical tasks in self-healing network (SHN) environments and industrial cyber–physical systems. These domains demand intelligent systems capable of handling dynamic, high-dimensional sensor data. However, existing optimization-based approaches often struggle with imbalanced datasets, noisy signals, and delayed convergence, limiting their effectiveness in real-time applications. This study utilizes two benchmark datasets—EFCD and SFDD—which represent electrical and sensor fault scenarios, respectively. These datasets pose challenges due to class imbalance and complex temporal dependencies. To address this, we propose a novel hybrid framework combining Attention-Augmented Convolutional Neural Networks (AACNN) with transformer encoders, enhanced through Enhanced Ensemble-SMOTE for balancing the minority class. The model captures spatial features and long-range temporal patterns and learns effectively from imbalanced data streams. The novelty lies in the integration of attention mechanisms and adaptive oversampling in a unified fault-prediction architecture. Model evaluation is based on multiple performance metrics, including accuracy, F1-score, MCC, RMSE, and score*. The results show that the proposed model outperforms state-of-the-art approaches, achieving up to 97.14% accuracy and a score* of 0.419, with faster convergence and improved generalization across both datasets.

Keywords: fault detection; self-healing networks (SHNs); attention-augmented CNN; transformer encoder

1. Introduction

According to Cisco’s Annual Internet Report (2018–2023) [1], global IP traffic is expected to reach 396 exabytes per month by 2023, up from 122 exabytes in 2017, while the number of connected devices is projected to exceed 29.3 billion by 2023. This exponential rise in network scale, complexity, and data flow has significantly increased the frequency and severity of network faults across digital infrastructures [2,3]. About three-quarters of network outages are caused by the unplanned malfunction of system elements (hardware, software, and configurations). For high-density networks, such as smart cities, 5G architectures, and cloud service architectures, this downtime may cause serious service disruption, data loss, and millions of dollars in annual financial losses for network operators [1]. These facts highlight the need for intelligent, autonomous fault-detection schemes that adapt in real time and maintain high accuracy.

Self-healing networks (SHNs) represent a pivotal advancement in this direction [4,5]. SHNs are designed to automatically detect, diagnose, and recover from faults without human intervention, thereby ensuring uninterrupted service delivery. While traditional fault-detection systems in SHNs have relied on rule-based mechanisms or sequential deep learning models such as Variational Autoencoders (VAEs), Deep Belief Networks (DBNs), and Markov Random Fields (MRFs), they often suffer from limitations in scalability, latency, and contextual awareness [6,7]. Their inability to dynamically prioritize critical temporal or spatial features hampers fault-classification accuracy, especially under non-stationary conditions where network behaviors evolve over time [8,9]. To mitigate these limitations, this paper introduces a transformer-based fault-detection framework specifically designed for self-healing networks. Although originally designed for natural language processing (NLP), transformers have proven effective at modeling long-range dependencies and at attention-driven feature selection across time series, vision, and multivariate sensor data. Their self-attention mechanisms provide a global view of input sequences, which helps identify unusual events that occur over time and across the network. By leveraging these properties, our model aims to enhance real-time fault identification, reduce false alarms, and facilitate faster network recovery. Convolutional layers in our model extract localized spatial features from vibration signals, leveraging the proven effectiveness of CNNs in pattern recognition across various domains [10].

The proposed framework is also positioned within the broader context of cyber–physical systems (CPSs), which integrate computation, networking, and physical processes. In CPS environments, physical components such as sensors, actuators, and devices are tightly coupled with computational logic and control systems through bidirectional communication channels. These systems are increasingly used in manufacturing, smart grids, and autonomous infrastructure, where real-time fault detection and adaptive responses are critical [11].

Modeling CPSs requires the robust abstraction of cyber–physical interactions. Recent studies have explored Petri nets for capturing scheduling and data-driven monitoring behaviors in CPSs [12], as well as self-adaptive optimization techniques to enhance flexibility in CPS production systems [13]. Our fault-detection framework aligns with these directions by enabling dynamic learning from multivariate sensor data and adapting to temporal changes without manual reconfiguration.

The novel part of this study involves combining transformer encoders with specific traffic features from SHNs, using a flexible method to adjust hyperparameters. Unlike prior cascaded deep learning (DL) architectures, our approach uses attention-based temporal modeling to detect and classify faults more accurately across dynamic environments, even with imbalanced or incomplete datasets. The overall architecture of the proposed fault detection and RUL prediction framework is illustrated in Figure 1, highlighting the integration of EE-SMOTE-based preprocessing, attention-augmented CNN, and transformer encoders to model spatiotemporal patterns in fault-prone self-healing network environments.

The scientific contributions of this research are outlined as follows:

We propose a novel transformer-based model for detecting and classifying faults in self-healing networks that has improved adaptability to time-variant traffic and system behavior.
We introduce a hybrid optimization mechanism that combines metaheuristic tuning with adaptive learning to optimize transformer parameters in dynamic settings.
We validate our framework on benchmark SHN datasets—Electrical Fault Classification and Detection (EFCD) and Sensor Fault-Detection Data (SFDD)—demonstrating its superior performance over existing deep learning-based models.
We provide a modular and scalable architecture suitable for real-world deployments in smart grids, IoT environments, and high-availability cloud networks.

To facilitate clarity and coherence, the remainder of this paper is organized as follows. Section 2 presents a comprehensive review of the related literature, focusing on existing methodologies in fault detection, self-healing networks, and deep learning-based approaches. Section 3 identifies the key research gaps and articulates the problem statement that motivates the proposed framework. Section 4 provides an overview of the benchmark datasets employed, along with their relevance to industrial and cyber–physical fault scenarios. Section 5 details the proposed methodology, elaborating on the architecture of the Transformer-AACNN framework, including preprocessing techniques, model components, and loss functions. Section 6 describes the experimental setup, including hardware configurations, model hyperparameters, and evaluation metrics. Section 7 presents and analyzes the results, including ablation studies and comparative evaluations with state-of-the-art models. Section 8 highlights potential real-time applications of the proposed framework across various cyber–physical domains. Section 9 discusses current limitations and outlines directions for future research. Finally, Section 10 concludes the paper by summarizing the contributions and significance of the study.

2. Literature Review

Detecting and addressing faults in SHNs is a critical concern in modern network management and telecommunications. Over the years, several methodologies have emerged to enhance fault-detection precision and adaptability in dynamic network environments.

Wang et al. [14] developed an autonomous fault-diagnosis system for heterogeneous networks, based on an ensemble model that combines the strengths of multiple learning methods. The system iteratively improved fault diagnosis by aggregating diverse base learners, although it emphasized moving beyond low error rates to address misclassifications.

Rodríguez et al. [15] developed a self-healing (SH) method employing mobile agents and the Trickle gossiping protocol for efficient topological data dissemination. The mobile agent approach achieved low bandwidth overhead but had higher memory and message exchange costs compared to Trickle.

Reshmi et al. [16] proposed a predictive diagnostic system tailored for 5G networks that utilized time series-based performance forecasting. This solution enabled automated anomaly detection through various performance indicators, offering proactive fault management. Xu et al. [17] introduced the Multiple Adaptive Rat Swarm Optimizer (MARSO), integrating adaptive learning exemplars and adaptive population sizing. It outperformed several optimization algorithms on CEC2017 benchmarks, demonstrating promising capabilities in fault-tolerant systems.

Hafaiedh et al. [18] developed a distributed formal model for SH behavior specification and verification in autonomous systems. The model enabled non-expert users to define fault-detection and recovery procedures while facilitating both qualitative and quantitative validation using statistical model checking. Yavuz et al. [19] suggested the use of Particle Swarm Optimization (PSO) for selecting and integrating suitable machine learning models in SHN environments. Their approach proved flexible and effective across different system topologies, verified via PSCAD-Python simulations.

Almutairi et al. [20] proposed Quantum Dwarf Mongoose Optimization with Ensemble Deep Learning-Based Intrusion Detection (QDMO-EDLID) for cyber–physical systems. Their simulation results demonstrated superior performance in comparison to other ensemble-based security models [21]. The base paper by Caleb et al. introduces the SAND-SHN (Serial Adaptive Network-based Detection for SHN) technique, integrating deep learning components like VAE, DBN, and DMRF with an Enhanced Ensemble-SMOTE (EE-SMOTE) oversampling module and hyperparameter tuning via the Revised Fire Hawk Optimizer (RPFHO). The model’s multi-layered cascaded design significantly improves classification performance under imbalanced conditions. The reviewed literature highlights the existing gaps in SHN fault diagnosis and motivates the design of the proposed architecture. By addressing both optimization and data imbalance, the study makes a meaningful contribution to the domain of intelligent fault-detection systems.

3. Research Gap and Problem Statement

3.1. Research Gap

Despite the significant advancements in SHN fault detection, key gaps remain:

Existing models lack global contextual learning and struggle with spatiotemporal fault patterns.
Most approaches use static architectures, limiting adaptability to real-time changes.
Fault-detection models often perform poorly on imbalanced or noisy datasets.
The transformer architecture, known for attention-based learning, remains underutilized in SHN applications.

3.2. Problem Statement

With increasing network complexity and traffic volume, timely and accurate fault detection in SHNs has become a critical challenge. Conventional deep learning approaches, such as MSCAN, struggle to model long-range dependencies, learn dynamic network behavior, and handle imbalanced or noisy data. These shortcomings impair their ability to detect faults in real time, particularly when faults are subtle, delayed, or scattered across the network. Condition monitoring and fault identification in SHNs therefore urgently require an intelligent, adaptive model that learns the temporal and spatial characteristics of SHNs and accurately identifies and classifies faults in a changing environment.

4. Dataset Description

To demonstrate the effectiveness of the Transformer–Attention-Augmented Convolutional Neural Network (AACNN) framework for fault detection and RUL prediction in SHNs, two publicly available benchmark datasets were used: the Electrical Fault Classification and Detection (EFCD) dataset and the Sensor Fault-Detection Dataset (SFDD). The two datasets cover a variety of fault scenarios in electrical systems and sensor networks, respectively, and thus provide a useful evaluation ground for the proposed model in terms of both signal diversity and operating uncertainty.

The EFCD dataset [22], obtained from Kaggle, contains 12,000 labeled time-series records generated to emulate electrical system responses under different operating conditions. Each instance records 18 attributes, covering measurements of line voltages, line currents, power factors, and total harmonic distortions. The signals span six fault classes, including short circuits, overloads, earth faults, and phase imbalances. The dataset offers a solid platform for testing classification accuracy in interference-prone environments with multi-modal feature sets and class imbalance.

The SFDD dataset [23], also obtained from Kaggle, consists of 10,000 time-series samples intended for fault diagnosis in sensor networks. Each record consists of four columns: sensor ID, timestamp, sensor reading, and fault condition. The dataset includes normal readings and several simulated faulty behaviors, including sensor disconnection, stuck-at-zero faults, bias drift, and abrupt noise spikes. This benchmark allows the robustness of the model to be evaluated, including the detection of subtle anomalies in resource-constrained settings such as industrial IoT.

A comparison between the two datasets is shown in Table 1. Compared with the SFDD dataset, the EFCD dataset has a larger feature space (18 attributes versus four columns), a larger sample size (12,000 versus 10,000 records), and more complex fault categories, so the pair can be used for cross-domain generalization tests. Before the model was applied, both datasets were preprocessed through normalization, outlier detection, and stratified splitting into training, validation, and test sets.

Together, these two datasets provide a firm basis for comparing fault-detection models across different types of systems. The EFCD dataset allows for testing in feature-rich, noisy environments, while the SFDD dataset enables evaluation in low-dimensional sensor settings. The proposed Transformer-AACNN framework was trained and tested on each dataset individually and on their combination to benchmark performance, generalization, and resilience across domains.

5. Proposed Methodology

This study introduces a hybrid deep learning framework combining an Attention-Augmented Convolutional Neural Network (AACNN) and a transformer encoder to predict the RUL of rotating machinery components. The model leverages the strengths of both local attention mechanisms in CNNs and global sequence modeling in transformers to address fault-detection challenges in SHNs under varying operating conditions. The proposed framework builds upon the core principles of the transformer architecture, originally introduced in [24].

5.1. Data Preprocessing and Class Imbalance Handling Using EE-SMOTE

The raw sensor signals collected from the EFCD and SFDD datasets exhibit significant class imbalance, where failure (minority class) instances are substantially outnumbered by healthy (majority class) instances [25,26]. Such an imbalance can severely impair the learning capacity of supervised models by biasing predictions toward the majority class and reducing sensitivity to rare but critical failure events [27,28]. To address this, we employ an Enhanced Ensemble-SMOTE (EE-SMOTE) strategy as a robust preprocessing step before model training.

EE-SMOTE is designed to improve upon traditional SMOTE by integrating ensemble-based intelligence, outlier filtering, and local structural awareness into the oversampling process. The architecture and workflow of EE-SMOTE are illustrated in Figure 2.

The process begins with an imbalanced dataset composed of majority- and minority-class samples. First, the minority-class instances are optionally clustered using methods such as k-means or DBSCAN to preserve local topological structure. This clustering helps ensure that synthetic samples are generated in dense regions of the minority class distribution rather than in sparse or noisy zones.

Next, an outlier-filtering module removes noisy or isolated minority instances based on a distance-thresholding or density-based method. This prevents misleading interpolation near anomalous data points [29,30]. Then, an ensemble of classifiers (e.g., Random Forest, XGBoost) is trained to assess the importance of each minority instance. Samples that are frequently misclassified or deemed important by the ensemble are selected for oversampling.

The EE-SMOTE module was tuned through empirical evaluation on the validation set by adjusting parameters such as the number of nearest neighbors k = 5, the ensemble voting threshold θ = 0.6, and the use of DBSCAN for local structure preservation. These were selected based on prior studies and calibrated through grid search to ensure stable learning in imbalanced conditions. To assess the criticality of EE-SMOTE, we performed an ablation study (Section 7.3), comparing the full model with and without EE-SMOTE. The model without EE-SMOTE showed a clear decline in performance—e.g., score* dropped from 0.419 to 0.342 on the SFDD dataset. This confirms that while the AACNN and transformer components are strong individually, EE-SMOTE plays a significant role in boosting sensitivity to rare fault classes and improving generalization.

Following selection, synthetic samples are generated using a modified SMOTE algorithm, which interpolates between a given minority instance and its nearest neighbors. The resulting samples are then validated through the ensemble to retain only informative synthetic data that improves classification boundaries. The final output is a balanced augmented dataset, where the minority-class size is approximately equal to the majority class. This balanced dataset is then passed into the AACNN + Transformer model for feature extraction and prediction.
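The interpolation step at the core of this procedure can be sketched as follows. This is a minimal illustration of plain SMOTE in standard-library Python, not the authors' EE-SMOTE implementation: the clustering, outlier-filtering, and ensemble-validation stages are omitted, and the function name `smote_oversample` and the toy data are hypothetical. The neighbor count k = 5 matches the value reported above.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors (plain SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy minority class with two features per sample.
minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1], [1.3, 2.0], [1.0, 1.8]]
new_points = smote_oversample(minority, n_new=4)
```

Every synthetic point lies on the segment between two existing minority samples, which is why filtering outliers beforehand matters: an isolated noisy sample would otherwise seed synthetic points in a region that does not belong to the minority class.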

5.2. Health-Stage Segmentation

Equation (1) defines the Root Mean Square (RMS), a fundamental statistical measure used to quantify the magnitude of vibration signals [31,32]. The vibration signal x(t) is divided into healthy and degradation stages based on a Root Mean Square (RMS) health indicator:

\mathrm{RMS}(x) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2}

(1)

The first prediction time (FPT) is established when RMS exceeds a predefined threshold over consecutive time windows, marking the start of degradation.
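A minimal sketch of this segmentation rule is given below. The window size, the threshold, and the number of consecutive windows are not specified in the text, so the values used here are assumptions chosen only for illustration, and `first_prediction_time` is a hypothetical helper name.

```python
import math


def rms(window):
    """Root mean square of one window of samples (Equation (1))."""
    return math.sqrt(sum(v * v for v in window) / len(window))


def first_prediction_time(signal, window_size, threshold, consecutive=3):
    """Return the index of the window that starts the first run of
    `consecutive` windows whose RMS exceeds `threshold`, or None."""
    run = 0
    n_windows = len(signal) // window_size
    for w in range(n_windows):
        window = signal[w * window_size:(w + 1) * window_size]
        run = run + 1 if rms(window) > threshold else 0
        if run == consecutive:
            return w - consecutive + 1  # start of the persistent exceedance
    return None


# Hypothetical signal: four quiet windows followed by four high-amplitude ones.
signal = [0.1, -0.1] * 8 + [1.0, -1.0] * 8
fpt = first_prediction_time(signal, window_size=4, threshold=0.5, consecutive=3)
```

Requiring several consecutive exceedances, rather than a single one, keeps an isolated spike from being mistaken for the onset of degradation.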

5.3. Attention-Augmented CNN (AACNN)

AACNN modules are used to extract localized degradation features with embedded attention. Each convolutional block is enhanced using channel attention (CA) and spatial attention (SA) mechanisms.

Equation (2) describes the channel attention (CA) mechanism, which captures inter-channel dependencies using pooled feature representations.

\mathrm{CA}(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)

(2)

where F is the feature map, MLP is a two-layer perceptron, and σ is the sigmoid activation.

Equation (3) expresses the spatial attention (SA) mechanism that highlights important spatial locations by applying a convolution over pooled maps.

\mathrm{SA}(F) = \sigma\left(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\right)

(3)

Equation (4) defines the final refined feature map after the sequential application of channel- and spatial-attention mechanisms.

F' = \mathrm{SA}(\mathrm{CA}(F) \cdot F) \cdot F

(4)

This helps the model emphasize important frequency–time regions of degradation.
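The three attention equations can be traced in a small standard-library sketch. Several simplifications are assumptions made for illustration: the feature map is one-dimensional (channels by time steps), so the 7 × 7 convolution becomes a size-7 convolution; the MLP has no biases; and all weights are random rather than learned. The final step follows Equation (4) as printed, multiplying the spatial map by the original F.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mlp(vec, w1, w2):
    """Two-layer perceptron with a ReLU hidden layer (no biases)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(F, w1, w2):
    """Equation (2): one weight per channel from avg- and max-pooled features."""
    avg = [sum(ch) / len(ch) for ch in F]
    mx = [max(ch) for ch in F]
    return [sigmoid(a + m) for a, m in zip(mlp(avg, w1, w2), mlp(mx, w1, w2))]


def spatial_attention(F, kernel):
    """Equation (3), 1-D analogue: one weight per time step from a size-7
    convolution over the channel-wise average and max maps."""
    length, half = len(F[0]), len(kernel[0]) // 2
    pooled = [
        [sum(ch[t] for ch in F) / len(F) for t in range(length)],
        [max(ch[t] for ch in F) for t in range(length)],
    ]
    out = []
    for t in range(length):
        acc = 0.0
        for row, weights in zip(pooled, kernel):
            for j, w in enumerate(weights):
                idx = t + j - half
                if 0 <= idx < length:  # zero padding at the borders
                    acc += w * row[idx]
        out.append(sigmoid(acc))
    return out


def refine(F, w1, w2, kernel):
    """Equation (4) as written in the text: F' = SA(CA(F) * F) * F."""
    ca = channel_attention(F, w1, w2)
    scaled = [[c * v for v in ch] for c, ch in zip(ca, F)]
    sa = spatial_attention(scaled, kernel)
    return [[s * v for s, v in zip(sa, ch)] for ch in F]


rng = random.Random(0)
C, L, hidden = 4, 10, 2  # hypothetical sizes
F = [[rng.uniform(-1, 1) for _ in range(L)] for _ in range(C)]
w1 = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(hidden)]
w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(C)]
kernel = [[rng.uniform(-1, 1) for _ in range(7)] for _ in range(2)]
F_refined = refine(F, w1, w2, kernel)
```

Because both attention maps pass through a sigmoid, every weight lies between 0 and 1, so the refined map can only attenuate features, never amplify them.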

5.4. Transformer Encoder

The AACNN output F′ is reshaped into a sequence and passed to the transformer encoder to capture long-range temporal dependencies. Equation (5) presents the scaled dot-product attention used within the transformer encoder to compute context-aware representations.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V

(5)

Equation (6) defines the sinusoidal positional encoding used to retain order information in input token sequences.

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)

(6)
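Equations (5) and (6) can be illustrated with a short standard-library sketch. It covers a single attention head and, as a simplifying assumption, omits the learned projection matrices, so the position-encoded tokens serve directly as Q, K, and V. The input values are hypothetical.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding of Equation (6); d_model is assumed even."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention of Equation (5) for a single head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Hypothetical 3-step sequence with 4 features per step.
x = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
pe = positional_encoding(len(x), 4)
tokens = [[a + b for a, b in zip(row, p)] for row, p in zip(x, pe)]
context = attention(tokens, tokens, tokens)
```

Each output row is a convex combination of the value rows, which is how a time step can draw on information from any other step regardless of distance.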

5.5. Domain-Invariant and Domain-Specific Learning

To ensure generalization across varying operating conditions, the transformer outputs are passed through:

A shared feature extractor F_shared;
A domain-specific extractor F_dom.

Loss Functions: These extracted features are optimized through domain adaptation loss functions that encourage both generalizability and domain relevance. The loss functions are as follows:

Correlation Alignment Loss (domain-invariance): The Correlation Alignment Loss used in this study is inspired by the CORAL method [28,29], which minimizes the domain shift by aligning the second-order statistics of source and target feature representations. This technique is crucial for ensuring domain-invariant learning in fault-detection scenarios where sensor dynamics differ across deployments. Equation (7) represents the correlation loss function, which penalizes redundancy across class-specific feature representations.

L_{cor} = \frac{2}{N(N-1)} \sum_{i \neq j} \lVert C_i - C_j \rVert_F^2

(7)

Inter-domain Exploration Loss (similarity regularization): Equation (8) defines the exploration loss, which enforces consistency between the shared feature representations of similar instances.

L_{exp} = \lVert F_{shared}(x_i) - F_{shared}(x_j) \rVert_2^2

(8)
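The two losses can be sketched in plain Python under stated assumptions. The text does not define N or C_i, so this sketch follows the CORAL convention: C_i is taken to be the covariance matrix of the features from domain i, N the number of domains, and the sum is taken over unordered domain pairs, which makes the factor 2/(N(N − 1)) a simple average. The function names and data are hypothetical.

```python
def covariance(features):
    """Sample covariance matrix of a list of feature vectors (rows = samples)."""
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    return [
        [sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in features) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]


def correlation_alignment_loss(domains):
    """Equation (7), read as the mean squared Frobenius distance between the
    covariance matrices of all unordered domain pairs (an assumption)."""
    covs = [covariance(f) for f in domains]
    n = len(covs)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += sum((a - b) ** 2
                         for ra, rb in zip(covs[i], covs[j])
                         for a, b in zip(ra, rb))
    return 2.0 * total / (n * (n - 1))


def exploration_loss(f_i, f_j):
    """Equation (8): squared Euclidean distance between two shared feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(f_i, f_j))


# Hypothetical features: the target domain is the source scaled by 2.
src = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
tgt = [[2.0 * v for v in row] for row in src]
l_cor = correlation_alignment_loss([src, tgt])
l_exp = exploration_loss([1.0, 2.0], [4.0, 6.0])
```

The alignment loss is zero only when the domains share the same second-order statistics, so minimizing it pushes the shared extractor toward features whose spread does not depend on the deployment.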

5.6. Multi-Task Regression and Prediction

This architectural choice aligns with best practices in multi-task learning, where separate lightweight regressors are used to avoid negative transfer and enhance specialization. The rationale for this approach, including structural advantages and learning strategies, follows the methodologies discussed in [33,34], which explore domain-aware regression networks and adaptive prediction strategies in transformer-based health monitoring systems.

Each domain-specific representation is passed to its own regressor R_k. Equation (9) gives the final RUL prediction as a weighted sum of the outputs of the regressors operating on shared features:

\hat{y} = \sum_{k=1}^{K} w_k R_k(F_{shared}(x))

(9)

where the domain weights w_k are derived using a domain classifier C_{dom}.
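The fusion step can be sketched as follows. The text does not state how the classifier outputs are normalized, so a softmax over raw classifier scores is assumed here to obtain weights that sum to 1. The regressor outputs and scores are hypothetical numbers.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_rul(regressor_outputs, domain_logits):
    """Equation (9): weighted sum of per-domain RUL estimates, with weights
    obtained by normalizing domain-classifier scores so they sum to 1."""
    weights = softmax(domain_logits)
    fused = sum(w * r for w, r in zip(weights, regressor_outputs))
    return fused, weights


# Hypothetical outputs of K = 3 regressors and equal domain-classifier scores.
y_hat, weights = fuse_rul([120.0, 100.0, 140.0], [0.0, 0.0, 0.0])
```

With equal scores the fusion reduces to a plain average; as the classifier grows confident that a sample resembles one source domain, that domain's regressor dominates the prediction.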

5.7. Overall Loss Function

Equation (10) aggregates multiple loss components, including task-specific and regularization terms, to form the total training loss.

L_{total} = L_{RUL} + \alpha L_{cor} + \beta L_{exp} + \gamma L_{dom}

(10)

where L_{RUL} = \mathrm{MSE}(y, \hat{y}) and L_{dom} is the cross-entropy loss for domain classification.

The architectural flow of the proposed AACNN + Transformer model, detailing the convolutional feature extraction, attention mechanisms, and temporal modeling, is illustrated in Figure 3.

5.8. Algorithm: Transformer-AACNN Fault Detection

The proposed method for fault detection and remaining useful life (RUL) prediction is outlined in Algorithm 1.

Algorithm 1: Transformer-AACNN-Based Fault Detection and RUL Prediction

Input:
Multivariate vibration signals \chi = \{x_1, x_2, x_3, \dots, x_T\}
Labeled source domains D_s = \{(\chi_s, y_s)\}_{s=1}^{K}
Unlabeled target domain D_t = \{\chi_t\}
Output:
Predicted Remaining Useful Life \hat{y}_t for target domain instances
Step 1: Health-Stage Segmentation
Compute RMS: \mathrm{RMS}(x) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^2}
Identify first prediction time (FPT) when RMS shows a persistent increase.
Extract degradation-stage data \chi_{deg} \subset \chi.
Step 2: Data Augmentation via Time-VQVAE
Encode vibration sequence x \in \chi_{deg} to latent vector z_q(x): z_q(x) = \arg\min_{z_k \in z} \lVert E(x) - z_k \rVert_2
Generate synthetic sequence: \hat{x} = D(z_q(x))
Loss function: L_{VQ} = \lVert x - \hat{x} \rVert_2^2 + \lVert sg[E(x)] - z_q(x) \rVert_2^2 + \beta \lVert E(x) - sg[z_q(x)] \rVert_2^2
Step 3: Local Feature Extraction via AACNN
Apply channel attention (CA): \mathrm{CA}(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)
Apply spatial attention (SA): \mathrm{SA}(F) = \sigma\left(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\right)
Obtain attended feature map: F' = \mathrm{SA}(\mathrm{CA}(F) \cdot F) \cdot F
Step 4: Global Temporal Modeling via Transformer
Add positional encodings: PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
Compute multi-head self-attention: \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V
Step 5: Domain-Invariant and Domain-Specific Feature Learning
Extract domain-invariant features F_{shared} and domain-specific features F_{dom}
Domain alignment loss: L_{cor} = \frac{2}{N(N-1)} \sum_{i \neq j} \lVert C_i - C_j \rVert_F^2
Inter-domain exploration loss: L_{exp} = \lVert F_{shared}(x_i) - F_{shared}(x_j) \rVert_2^2
Step 6: Multi-Task Regression and Domain Weighting
Train task-specific regressors R_k on each source domain
Predict RUL with weighted fusion: \hat{y}_t = \sum_{k=1}^{K} w_k R_k(F_{shared}(x_t))
Domain weights from classifier C_{dom}: w_k = [C_{dom}(F_{dom}(x_t))]_k, \quad \sum_{k=1}^{K} w_k = 1
Step 7: Optimize Total Loss
L_{total} = L_{RUL} + \alpha L_{cor} + \beta L_{exp} + \gamma L_{dom} + \delta L_{VQ}
where L_{RUL} = \mathrm{MSE}(y, \hat{y})
End Algorithm
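The quantization in Step 2 of Algorithm 1 amounts to a nearest-neighbor lookup in a codebook, sketched below in standard-library Python. The encoder E and decoder D are not modeled; the codebook, the encoder output, and the commitment weight β = 0.25 are assumptions made for illustration (the text does not report β).

```python
import math


def quantize(encoding, codebook):
    """Return the index and vector of the codebook entry nearest to E(x)."""
    index = min(range(len(codebook)),
                key=lambda k: math.dist(encoding, codebook[k]))
    return index, codebook[index]


def vq_loss(x, x_hat, encoding, z_q, beta=0.25):
    """Value of L_VQ. The stop-gradient sg[.] only changes how gradients flow,
    so the codebook and commitment terms share the same numerical distance."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    dist = sum((a - b) ** 2 for a, b in zip(encoding, z_q))
    return recon + dist + beta * dist


# Hypothetical 2-D codebook with three entries and one encoder output.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
index, z_q = quantize([0.9, 1.2], codebook)
loss = vq_loss([1.0], [1.0], [0.9, 1.2], z_q)
```

In a trained model the two distance terms differ in effect even though they share a value: one moves the codebook toward the encoder output and the other keeps the encoder committed to its chosen entry.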

The combination of AACNN, transformer, and EE-SMOTE was not selected arbitrarily but was based on a principled architectural rationale tailored for the challenges of fault detection in SHNs. First, AACNN efficiently captures local degradation patterns from sensor signals using spatial- and channel-attention mechanisms, which are crucial for identifying abrupt local anomalies. Second, transformers are well suited for modeling long-term temporal dependencies and sequence-level dynamics, which are common in fault-progression data. Third, EE-SMOTE was specifically integrated to address the class imbalance problem, which is prevalent in failure datasets where faulty instances are underrepresented.

This combination allows the model to capture both localized abnormalities and long-range dependencies, while ensuring balanced learning. The architecture was validated through ablation studies (see Section 7.3), where each component was independently removed to evaluate its impact, clearly showing that the full model achieved the best results.

6. Experimental Setup

To evaluate the effectiveness and generalizability of the proposed Transformer-AACNN-based fault-detection framework, extensive experiments were conducted on benchmark datasets under diverse operating conditions. This section outlines the computational environment, dataset configurations, model training parameters, and evaluation metrics employed in the study.

6.1. Hardware and Software Configuration

The experiments were conducted using a high-performance computing environment as detailed in Table 2, ensuring efficient training of the Transformer-AACNN model across multiple runs. All experiments were performed using the following system setup:

6.2. Dataset Configuration

The primary benchmark used is the IEEE PHM 2012 Bearing Dataset, collected using the PRONOSTIA platform under three distinct operating conditions. This dataset contains horizontal and vertical vibration signals sampled at 25.6 kHz, recorded every 10 s, with each sample comprising 2560 points. The operating conditions, load settings, and bearing configurations used in the EFCD and SFDD datasets are summarized in Table 3, delineating the source and target domains for model training and evaluation.

6.3. Model Configuration and Hyperparameters

The detailed configuration of the proposed Transformer-AACNN model, including architectural parameters, training settings, and loss formulation, is provided in Table 4. The Transformer-AACNN model was trained using the following hyperparameter settings:

6.4. Evaluation Metrics

The performance of the proposed model is evaluated using the following standard metrics:

Root Mean Square Error (RMSE): Equation (11) computes the Root Mean Square Error (RMSE) as a standard performance metric to measure regression accuracy.

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}

(11)

Mean Absolute Error (MAE): Equation (12) defines the Mean Absolute Error (MAE), representing the average magnitude of prediction errors.

\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|

(12)
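Both metrics are direct to compute; a minimal sketch with hypothetical RUL values follows.

```python
import math


def rmse(y_true, y_pred):
    """Equation (11): square root of the mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Equation (12): mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted RUL values.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 40.0]
error_rmse = rmse(y_true, y_pred)
error_mae = mae(y_true, y_pred)
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few predictions are far off, so reporting both shows whether the error is spread evenly or concentrated in outliers.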

Score* (PHM 2012-specific scoring metric): Score* is a domain-specific performance metric defined by the PHM 2012 challenge. It assigns asymmetric penalties to early and late RUL predictions, prioritizing timely maintenance decisions. The metric is defined such that:
➢ Higher values are better, with a perfect score of 1.0 indicating completely accurate predictions across all instances.
➢ Because each term is an exponential, the score stays above 0 and approaches 0 for severely misaligned predictions; overestimated RUL (late fault detection, Er_i ≤ 0) is penalized more steeply than underestimated RUL.

Equation (13) introduces the score* metric, which asymmetrically penalizes early and late predictions in line with the PHM 2012 evaluation standard.

\mathrm{Score}^{*} = \frac{1}{N} \sum_{i=1}^{N} A_i, \quad A_i = \begin{cases} e^{-\ln(0.5) \cdot (Er_i / 5)}, & \text{if } Er_i \leq 0 \\ e^{+\ln(0.5) \cdot (Er_i / 20)}, & \text{if } Er_i > 0 \end{cases}

(13)

where Er_i = \frac{y_i - \hat{y}_i}{y_i} \times 100\%.
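A minimal sketch of the metric follows. It uses the PHM 2012 challenge form, in which the exponent is non-positive for both signs of the error, so each per-instance term lies in (0, 1] and the mean equals 1.0 only for exact predictions. The sample values are hypothetical.

```python
import math


def percent_error(y_true, y_pred):
    """Er_i = 100 * (y_i - y_hat_i) / y_i; requires y_i != 0."""
    return 100.0 * (y_true - y_pred) / y_true


def score_star(y_true, y_pred):
    """Mean PHM 2012 accuracy term over all instances."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        er = percent_error(t, p)
        if er <= 0:  # predicted RUL too long (late): steeper penalty
            total += math.exp(-math.log(0.5) * (er / 5.0))
        else:        # predicted RUL too short (early): gentler penalty
            total += math.exp(math.log(0.5) * (er / 20.0))
    return total / len(y_true)


# Hypothetical cases: exact, 5% overestimate, 20% underestimate.
score = score_star([100.0, 100.0, 100.0], [100.0, 105.0, 80.0])
```

A 5% overestimate and a 20% underestimate both halve the per-instance term, which makes the asymmetry concrete: predicting a failure too late costs four times as much per percentage point as predicting it too early.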

Although the proposed model integrates multiple deep components (AACNN blocks, transformer encoders, and EE-SMOTE preprocessing), its training remains feasible within reasonable time constraints. On an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Bengaluru, India), the full model takes approximately 2.3 h to train for 200 epochs with a batch size of 64. The EE-SMOTE preprocessing phase adds approximately 9–11 min, depending on dataset size, due to ensemble voting and outlier filtering. However, this step is performed only once before training.

For real-world deployment, inference latency is more relevant than training time. As discussed in Section 8, the floating-point model achieves an inference latency of ~7 ms per sample on GPU, making it viable for edge applications with moderate compute. Further model compression (e.g., pruning, quantization) will be explored in future work to support deployment on low-power devices without compromising accuracy.

7. Results and Discussion

This section presents and analyses the results obtained from evaluating the proposed Transformer-AACNN framework on the EFCD and SFDD datasets. The evaluation was carried out in a phased manner, beginning with a preliminary analysis of data integrity through outlier detection, followed by model performance assessment using key metrics such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and a domain-specific scoring function. The results are discussed in terms of accuracy, robustness to data imbalance, and generalizability across domains.

In addition to reporting the performance of the proposed model, we conduct an ablation study to assess the contribution of each module (AACNN, transformer, and EE-SMOTE) to the final outcome. Finally, we compare our results with existing models, including those presented in the base study, to establish the superiority and scientific significance of the proposed architecture.

7.1. Outlier Detection Results

Before training any models, we first conducted a thorough outlier analysis to ensure that the datasets were reliable and clean. A univariate statistical analysis of the raw vibration signals in the EFCD and SFDD datasets was performed using the Z-score method of outlier detection, in which data points with an absolute Z-score greater than 3 were flagged as suspected outliers.

In the EFCD dataset of multivariate electrical measurements, we detected 842 outliers among 12,000 records, about 7.02% of the records. For the SFDD dataset of time-stamped sensor values, 625 anomalies were identified among 10,000 test samples, which makes up 6.25% of the test data. These outliers were then filtered or linearly interpolated according to their position within the time-series sequence. The result of this cleaning process is visualized in Figure 4, which shows histograms of the distributions and a boxplot analysis of the datasets. Prior to cleaning, the datasets exhibited irregular spikes and heavy-tailed distributions, which were largely normalized after preprocessing. This preprocessing was necessary for stable learning and to prevent extreme values from biasing the predictions. The preprocessed datasets were subsequently used in the training and evaluation steps described in the following sections.
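
The thresholding and interpolation steps described above can be sketched as follows. This is an illustrative reconstruction rather than the exact preprocessing code: the use of the population standard deviation, the function names, and the toy signal are assumptions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Indices whose absolute Z-score exceeds the threshold (3 in this study)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs((v - mean) / std) > threshold]

def interpolate_outliers(values, outlier_idx):
    """Replace flagged points by linear interpolation between clean neighbours."""
    cleaned = list(values)
    bad = set(outlier_idx)
    good = [i for i in range(len(values)) if i not in bad]
    for i in outlier_idx:
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None and right is None:
            continue  # no clean reference point available
        if left is None:
            cleaned[i] = values[right]
        elif right is None:
            cleaned[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            cleaned[i] = values[left] + weight * (values[right] - values[left])
    return cleaned

# Toy signal with one injected spike (illustrative only).
signal = [float(i % 5) for i in range(30)]
signal[15] = 100.0

outliers = zscore_outliers(signal)
cleaned = interpolate_outliers(signal, outliers)
print(outliers, cleaned[15])  # [15] 2.5
```

The spike at index 15 is replaced by the midpoint of its two clean neighbours (4.0 and 1.0). Points at the start or end of a sequence fall back to the nearest clean value, which is one plausible reading of handling outliers "according to their position".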

7.2. The Results of the Proposed Transformer-AACNN Model

We evaluated the proposed Transformer-AACNN model on two benchmark datasets (EFCD and SFDD) with three widely used evaluation metrics (RMSE, MAE, and score*). Together, these measures assess the model’s accuracy and its sensitivity to subtle deviations in fault-progression patterns across different operating domains.

The results show that the model achieves consistently low prediction error and generalizes stably across both datasets. The proposed model achieved an RMSE of 0.148, an MAE of 0.102, and a score* of 0.402 on the EFCD dataset. On the SFDD dataset, the RMSE, MAE, and score* are 0.136, 0.095, and 0.419, respectively. These results demonstrate that the hybrid architecture is able to learn meaningful temporal and spatial features from imbalanced and noisy input signals.

Figure 5 compares the model’s performance across the EFCD and SFDD datasets. The model performs slightly better on SFDD (lower RMSE and MAE and a higher score*), possibly because the structure of the SFDD dataset is relatively simple and its sensor-level fault characteristics are more distinguishable. The Transformer-AACNN structure, aided by EE-SMOTE preprocessing, offers a good trade-off between prediction precision and fault sensitivity, which supports its use in real-world fault-detection applications.

7.3. Ablation Study (Extended: EFCD and SFDD)

To assess the individual and combined effects of the major components in the proposed Transformer-AACNN framework—namely the attention-augmented CNN (AACNN), transformer encoder, and EE-SMOTE balancing module—we conducted an extensive ablation study on both the EFCD and SFDD datasets.

For each dataset, we evaluated four model variants:

AACNN only: Captures localized spatial features with channel and spatial attention.
Transformer only: Focuses on global temporal dependencies.
AACNN + Transformer: Combines local and temporal learning but excludes oversampling.
AACNN + Transformer + EE-SMOTE (Proposed): Full model integrating all modules.

The evaluation metrics—RMSE, MAE, and score*—are summarized in Table 5. The results indicate that each component meaningfully contributes to the overall performance. On both datasets, the standalone AACNN and transformer variants exhibited higher error rates compared to the combined architectures. The integration of EE-SMOTE further improved the learning stability by addressing class imbalance in the degradation signals, leading to the best performance in all metrics.

Figure 6 visualizes the results, revealing a consistent reduction in RMSE and MAE and a noticeable rise in score* as each component is integrated. The proposed model consistently outperforms all other variants across both datasets.
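
To make the contribution of the final component concrete, the relative changes can be computed from the values in Table 5. The sketch below is illustrative arithmetic over the reported numbers; the dictionary layout and function names are assumptions.

```python
# (RMSE, MAE, score*) per variant, copied from Table 5.
TABLE_5 = {
    "EFCD": {
        "AACNN only": (0.192, 0.138, 0.291),
        "Transformer only": (0.176, 0.121, 0.310),
        "AACNN + Transformer": (0.162, 0.111, 0.329),
        "Proposed": (0.148, 0.102, 0.402),
    },
    "SFDD": {
        "AACNN only": (0.181, 0.130, 0.298),
        "Transformer only": (0.163, 0.114, 0.318),
        "AACNN + Transformer": (0.149, 0.103, 0.342),
        "Proposed": (0.136, 0.095, 0.419),
    },
}

def relative_change(before, after):
    """Percentage change from before to after (negative means a reduction)."""
    return (after - before) / before * 100.0

def smote_effect(dataset):
    """Change in each metric when EE-SMOTE is added to AACNN + Transformer."""
    before = TABLE_5[dataset]["AACNN + Transformer"]
    after = TABLE_5[dataset]["Proposed"]
    return tuple(relative_change(b, a) for b, a in zip(before, after))

for dataset in TABLE_5:
    rmse_pct, mae_pct, score_pct = smote_effect(dataset)
    print(f"{dataset}: RMSE {rmse_pct:+.1f}%, MAE {mae_pct:+.1f}%, "
          f"score* {score_pct:+.1f}%")
```

On these numbers, adding EE-SMOTE lowers RMSE and MAE by roughly 8 to 9% and raises score* by about 22% on both datasets, which is the largest single-step gain in score* across the four variants.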

7.4. Comparative Analysis with Models from the Base Paper

To evaluate the performance improvements offered by the proposed Transformer-AACNN framework, we conducted a comparative analysis against several optimization-based techniques reported in the base study, including RSO-MSCAN, DMO-MSCAN, BFGO-MSCAN, FHO-MSCAN, and RPFHO-MSCAN. The evaluation was carried out on both the EFCD and SFDD datasets, using three standard metrics: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and score* (as defined in the PHM 2012 challenge).

The results from the base models indicate that the best-performing model, RPFHO-MSCAN, achieved score* values of 0.2949 on EFCD and 0.3313 on SFDD. In comparison, our new Transformer-AACNN model, which was improved with EE-SMOTE preprocessing, reached better score* values of 0.402 on EFCD and 0.419 on SFDD, along with lower RMSE and MAE values. These improvements demonstrate the model’s superior ability to generalize across domains while effectively minimizing prediction errors.

To establish a more holistic comparison, we further evaluated the performance of baseline models—RSO-MSCAN, DMO-MSCAN, BFGO-MSCAN, FHO-MSCAN, and RPFHO-MSCAN—across multiple evaluation metrics, including accuracy, F1-score, precision, recall, and Matthews Correlation Coefficient (MCC). These metrics were extracted from the results presented in the base paper using k-fold cross-validation. The performance metrics for both EFCD and SFDD datasets are summarized in Table 6. Among the models, RPFHO-MSCAN consistently delivered the highest values across all metrics for both datasets, achieving 94.12% accuracy and an MCC of 88.24 on EFCD and identical scores on SFDD. Notably, the BFGO-MSCAN model also exhibited strong performance on EFCD, while FHO-MSCAN stood out on SFDD.

This comparative assessment reveals that while heuristic optimization models are capable of delivering reliable accuracy and balance across precision and recall, there remains variability in their convergence behavior and generalization capacity. These limitations motivated the development of our Transformer-AACNN framework, which seeks to address both performance and stability in learning. The proposed model, though evaluated in a separate section, surpasses these baselines in several core metrics, highlighting its effectiveness in real-world SHN fault-detection scenarios.

Additionally, we performed paired t-tests to compare our proposed model against the strongest baseline (RPFHO-MSCAN) on both EFCD and SFDD datasets. The improvements in score* were found to be statistically significant at p < 0.01, confirming that the gains are not due to chance or specific train-test splits. These statistical results support the model’s stability and generalizability.
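
For readers unfamiliar with the procedure, a paired t-test over matched runs can be sketched as below. The per-run scores are HYPOTHETICAL placeholders, not results from this study, and the function name is an assumption. The critical value 4.604 is the standard two-sided threshold for p < 0.01 with 4 degrees of freedom (five paired runs).

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of differences
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# HYPOTHETICAL per-run scores for two models on the same five splits.
model_a = [0.80, 0.82, 0.79, 0.83, 0.81]
model_b = [0.70, 0.73, 0.70, 0.72, 0.71]

t_stat, dof = paired_t_statistic(model_a, model_b)
T_CRIT_TWO_SIDED_P01_DF4 = 4.604  # tabulated Student's t critical value

print(f"t = {t_stat:.2f}, df = {dof}")
print("significant at p < 0.01:", abs(t_stat) > T_CRIT_TWO_SIDED_P01_DF4)
```

Pairing matters here because both models are evaluated on the same splits: the test examines the per-split differences, so variation between splits that affects both models equally does not inflate the variance.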

7.5. Convergence Curve Comparison

To visualize the learning behavior of the proposed Transformer-AACNN model against established optimization-based methods, a comparative convergence analysis was conducted. Figure 7 presents the cost function curves over 50 iterations for the baseline models—RSO-MSCAN, DMO-MSCAN, BFGO-MSCAN, FHO-MSCAN, and RPFHO-MSCAN—as well as the proposed Transformer-AACNN framework. As depicted, traditional metaheuristic algorithms such as RSO and DMO exhibit slower and more irregular descent patterns, indicating delayed convergence and potential local minima entrapment. The RPFHO-MSCAN model, as reported in the base paper, achieves the best convergence among the classical methods.

In contrast, the proposed Transformer-AACNN model demonstrates a significantly smoother and more consistent reduction in cost, reaching lower values more efficiently. This behavior underscores the strength of combining attention-based temporal modeling with channel/spatial convolutional encoding, further enhanced by EE-SMOTE preprocessing. The observed trend confirms the architectural advantage of the proposed model, both in terms of optimization speed and cost minimization effectiveness.

To provide a comprehensive evaluation of the proposed Transformer-AACNN framework, a detailed comparative analysis was performed against the best-performing model from the base paper—RPFHO-MSCAN. As presented in Table 7, the two models differ significantly in terms of their underlying methodology, learning architecture, data-handling capabilities, and performance outcomes. RPFHO-MSCAN, a metaheuristic model grounded in swarm intelligence, relies on iterative search strategies and handcrafted features. Although it demonstrated competitive accuracy and convergence in the original study, its limitations include sensitivity to initialization, slower convergence, and limited adaptability to imbalanced data.

In contrast, the proposed Transformer-AACNN uses deep learning to learn features automatically through attention-enhanced convolutional and transformer layers, and it applies EE-SMOTE preprocessing to reduce class imbalance and improve generalization. Empirically, the model outperforms RPFHO-MSCAN on all key metrics (accuracy, F1-score, MCC, score*) for both the EFCD and SFDD datasets. It also provides greater training stability and stronger generalization capabilities, which makes it well suited to real-world use in complex, failure-prone self-healing networks.

Figure 8 presents a comparative evaluation of multiple fault-detection models—namely RSO-MSCAN, DMO-MSCAN, BFGO-MSCAN, FHO-MSCAN, RPFHO-MSCAN, and the proposed AACNN + Transformer model—across two benchmark datasets: EFCD and SFDD. Each model’s performance is depicted using mean score* values accompanied by error bars reflecting standard deviation over five independent runs. Distinct colors represent different models, while markers differentiate the datasets: circles for EFCD and squares for SFDD. Notably, the proposed model outperforms all existing approaches on both datasets, exhibiting higher mean scores and lower variability, thereby demonstrating both robustness and accuracy. The proximity of the EFCD and SFDD results for each model suggests consistent generalization across varying data distributions. This visualization also supports the statistical soundness of the results, reinforcing the claim that the proposed model provides a significant improvement over state-of-the-art baselines in terms of predictive performance.

7.6. Generalizability to Real-World Data

While the EFCD and SFDD datasets used in our experiments are widely recognized for benchmarking predictive maintenance models, we acknowledge that they are relatively clean and partially simulated. To assess the generalizability of our proposed AACNN + Transformer framework to real-world industrial settings, we introduced controlled levels of noise (Gaussian and salt-and-pepper) and missing data patterns (missing completely at random, MCAR, and missing at random, MAR) into the test sets. Our model retained a high degree of predictive accuracy, showing less than a 3% drop in performance metrics such as F1-score and MCC, which demonstrates resilience to data irregularities. Additionally, the architecture’s hierarchical feature extraction (via AACNN) and temporal attention (via transformers) contribute to its robustness, enabling it to handle temporal gaps and sensor artifacts. These outcomes reinforce our model’s potential deployment feasibility in dynamic and less ideal industrial environments, where noise and missingness are common. The results, summarized in Table 8, demonstrate minimal degradation in performance, highlighting the model’s strong generalizability.
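
The four corruption types named above can be sketched as simple test-set transforms. The noise levels, probabilities, threshold, and the companion "driver" channel are hypothetical choices for illustration; the values actually used in the experiments are not stated here.

```python
import math
import random

def add_gaussian_noise(signal, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in signal]

def add_salt_and_pepper(signal, prob, rng):
    """With probability prob, replace a sample by the signal's min or max."""
    low, high = min(signal), max(signal)
    return [rng.choice((low, high)) if rng.random() < prob else x
            for x in signal]

def drop_mcar(signal, prob, rng):
    """MCAR: every sample is dropped (None) with the same probability."""
    return [None if rng.random() < prob else x for x in signal]

def drop_mar(signal, driver, threshold, prob, rng):
    """MAR: dropping depends only on an observed companion channel."""
    return [None if d > threshold and rng.random() < prob else x
            for x, d in zip(signal, driver)]

rng = random.Random(0)  # fixed seed so the corruption is reproducible
signal = [math.sin(0.1 * t) for t in range(200)]  # toy sensor channel
driver = [t % 10 for t in range(200)]             # toy observed channel

noisy = add_gaussian_noise(signal, sigma=0.05, rng=rng)
spiky = add_salt_and_pepper(signal, prob=0.02, rng=rng)
mcar = drop_mcar(signal, prob=0.10, rng=rng)
mar = drop_mar(signal, driver, threshold=7, prob=0.50, rng=rng)

print(sum(v is None for v in mcar), sum(v is None for v in mar))
```

The difference between the two missingness patterns is visible in the code: under MCAR the drop decision ignores all data, whereas under MAR it is conditioned on a variable that remains observed.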

8. Real-Time Applications

The proposed Transformer-AACNN framework, designed for predictive fault detection and RUL estimation, exhibits strong potential for deployment in various real-time industrial and cyber–physical systems. Owing to its hybrid deep learning architecture and robust performance across diverse datasets, this model is well-positioned for integration into intelligent maintenance systems and adaptive monitoring platforms.

One of the most immediate applications lies in smart manufacturing and Industry 4.0 environments, where machinery and interconnected components continuously generate vibration signals, sensor readings, and operational logs [35,36]. The proposed model can process such multivariate time-series data in real time to detect early signs of degradation, enabling predictive maintenance scheduling and reducing unplanned downtimes. Its attention mechanism enables an effective representation of local anomalies and long-term temporal trends, which is especially important in high-speed production scenarios.

Moreover, the framework is well suited to aerospace systems and autonomous systems such as driverless vehicles, in which detecting faults in actuators, sensors, and subsystems in real time is of paramount importance to operational safety and mission success. Its ability to generalize from imbalanced and noisy datasets provides robustness against the anomalies prevalent in edge environments.

In smart-grid environments, especially self-healing networked architectures, this model could be integrated into substation monitoring devices so that transformer failures, power quality issues, or sensor malfunctions could be anticipated. When paired with edge computing platforms, the system is capable of running with low latency while being able to autonomously respond to potential faults, enabling energy resiliency and grid optimization.

In addition, the Transformer-AACNN model shows potential in the field of healthcare diagnostics, especially predictive maintenance of medical devices such as ventilators, infusion pumps, and MRI machines, where continuous signal monitoring is critical. With accurate predictions based on multivariate measurements, serious failures can be anticipated early enough to avert them. Figure 9 gives an overview of the real-time domains in which the proposed Transformer-AACNN + EE-SMOTE model can be applied: smart manufacturing, autonomous vehicles, energy systems, healthcare, and IoT fault analytics.

To support the claim of real-time applicability, we conducted an inference time analysis on both the EFCD and SFDD datasets using a system with an NVIDIA RTX 3090 GPU and a batch size of 64. The average per-sample inference latency for the full Transformer-AACNN model (floating-point version) was observed to be:

7.3 milliseconds/sample on the EFCD dataset.
6.9 milliseconds/sample on the SFDD dataset.

This latency is well within acceptable bounds for many industrial and edge applications. For example:

In smart manufacturing, predictive maintenance typically requires inference within 100 ms to 1 s to facilitate equipment shutdown or rescheduling.
In aerospace sensor networks and autonomous vehicles, real-time fault-detection thresholds are more stringent (e.g., <10 ms), which the current model satisfies under GPU execution.

Although the model has not yet been compressed, these timings indicate readiness for real-time deployment in systems with modest edge-GPU support. Future work will involve benchmarking pruned and quantized variants of this model, targeting <3 ms/sample latency on ARM-based or embedded devices.
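
A per-sample latency figure of this kind can be obtained with a simple timing harness such as the sketch below. It assumes that latency is computed as total batched inference time divided by the number of samples processed, which is one plausible reading of the reported numbers. `dummy_predict` is a hypothetical stand-in for the trained model’s forward pass.

```python
import time

def per_sample_latency_ms(predict_batch, batch, warmup=3, repeats=20):
    """Average wall-clock time per sample, in milliseconds."""
    for _ in range(warmup):  # warm-up calls are excluded from timing
        predict_batch(batch)
    start = time.perf_counter()
    for _ in range(repeats):
        predict_batch(batch)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(batch)) * 1000.0

def meets_budget(latency_ms, budget_ms):
    """True if the measured latency is strictly below the budget."""
    return latency_ms < budget_ms

def dummy_predict(batch):
    """Hypothetical placeholder for the model's batched inference call."""
    return [sum(sample) for sample in batch]

# Batch size 64 and input size 2560 follow Table 4.
batch = [[0.0] * 2560 for _ in range(64)]
measured = per_sample_latency_ms(dummy_predict, batch)

# Reported latencies checked against the budgets discussed in the text.
reported_ms = {"EFCD": 7.3, "SFDD": 6.9}
budgets_ms = {"smart manufacturing": 100.0,
              "aerospace/autonomous": 10.0,
              "embedded target": 3.0}
for dataset, latency in reported_ms.items():
    for name, budget in budgets_ms.items():
        print(dataset, name, meets_budget(latency, budget))
```

When timing a PyTorch model on GPU, the device should be synchronized (for example with `torch.cuda.synchronize()`) before reading the clock, because CUDA kernels are launched asynchronously and the host timer would otherwise stop too early.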

In general, the proposed framework is not domain specific and applies to any smart environment that requires high-precision, low-latency fault detection across many dynamic sensors and large data volumes. With its modular design, the platform can be easily integrated into legacy controls, IoT environments, and cloud-edge infrastructures to serve as a cornerstone in bringing predictive intelligence and operational safety to today’s connected cyber–physical systems.

9. Limitations and Future Work

Although the proposed Transformer-AACNN model has demonstrated significant advancements in fault detection and RUL estimation for SHNs, several limitations remain that provide valuable direction for future research. Addressing these limitations will enhance the model’s real-world applicability, computational efficiency, and adaptability across domains [37]. The key limitations and future directions are described as follows:

High Computational Complexity: The integrated architecture of attention-augmented CNNs and transformer encoders demands substantial computational resources, particularly during training. This may restrict real-time deployment on low-power or edge devices. Future work can explore model compression techniques such as pruning, quantization, or knowledge distillation to reduce inference cost without compromising accuracy.
Limited Evaluation Scope: The model has been validated primarily on benchmark datasets (EFCD and SFDD), which may not fully reflect the variability, noise, and anomalies present in real-world SHNs. To ensure robustness and broader generalization, future studies should incorporate more diverse datasets and test the model under real-time industrial conditions.
Dependence on Labeled Data: As a supervised learning framework, the model relies heavily on well-annotated data. In practical settings, obtaining labeled fault data is often costly and limited. A future direction is to develop semi-supervised or self-supervised learning methods that reduce reliance on labeled datasets while maintaining predictive performance.
Absence of Multi-task Learning: The current model handles fault classification and RUL prediction as separate tasks. Integrating a unified multi-task learning framework could enable joint optimization, reduce training time, and improve the interpretability of shared features across tasks.
Fixed Oversampling Strategy: The EE-SMOTE module currently applies a static data-balancing strategy. However, real-time systems often deal with dynamic data distributions. Adaptive oversampling methods that evolve during training could enhance the model’s responsiveness to concept drift and class imbalance over time.
Limited Interpretability for Industrial Use: While attention mechanisms enhance internal interpretability, domain-specific visualization tools and explainable AI modules are needed for practical deployment. Future research may involve integrating SHAP- or Grad-CAM-based explanations to make predictions more transparent for field engineers.

10. Conclusions

This study introduced a hybrid deep learning model, Transformer-AACNN with EE-SMOTE, for fault detection and RUL estimation in self-healing networks. By merging attention-augmented CNNs with transformer encoders, the model captured both spatial and temporal patterns in imbalanced sensor data. The use of EE-SMOTE further improved the model’s sensitivity to minority fault classes.

Evaluated on EFCD and SFDD datasets, the proposed model outperformed traditional optimization-based methods, including RPFHO-MSCAN, across key metrics such as accuracy, F1-score, and score*. It also exhibited faster convergence and greater training stability. With strong generalization ability and real-time potential, this framework offers a robust solution for predictive maintenance in industrial and cyber–physical systems. Future work will focus on improving computational efficiency and exploring semi-supervised learning for broader applicability.

Author Contributions

Conceptualization, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); methodology, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); software, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); validation, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); formal analysis, P.N.B.; investigation, P.D. (Pushkar Dubey); resources, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); data curation, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); writing—original draft preparation, P.D. (Parul Dubey); writing—review and editing, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); visualization, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); supervision, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); project administration, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey); funding acquisition, P.D. (Parul Dubey), P.N.B. and P.D. (Pushkar Dubey). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The authors confirm that all relevant data were included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Cisco Annual Internet Report (2018–2023) White Paper. 2022. Cisco. 23 January 2022. Available online: https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html (accessed on 15 February 2025).
Liu, D.; Zhang, S.; Wang, S.; Zhou, M.; Du, J. Realization and Research of Self-healing Technology of Power Communication Equipment Based on Power Safety and Controllability. Energy Inform. 2025, 8, 1.
Chen, K.-M.; Chang, T.-H.; Wang, K.-C.; Lee, T.-S. Machine Learning Based Automatic Diagnosis in Mobile Communication Networks. IEEE Trans. Veh. Technol. 2019, 68, 10081–10093.
Mateo, S.; Dusaric, I.; Cardozo, N. Learning Recovery Strategies for Dynamic Self-healing in Reactive Systems. In Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS ’24), Lisbon, Portugal, 14–15 April 2024.
Fang, H.; Yu, P.; Tan, C.; Zhang, J.; Lin, D.; Zhang, L.; Zhang, Y.; Li, W.; Meng, L. Self-Healing in Knowledge-Driven Autonomous Networks: Context, Challenges, and Future Directions. IEEE Netw. 2024, 38, 425–432.
Devi, S.K.; Thenmozhi, R.; Kumar, D.S. Self-Healing IoT Sensor Networks with Isolation Forest Algorithm for Autonomous Fault Detection and Recovery. In Proceedings of the 2024 International Conference on Automation and Computation (AUTOCOM), São Paulo Expo, Brazil, 14–16 March 2024; pp. 451–456.
Joma, A.; Chihi, I.; Sidhom, L. Fault Diagnosis and Self-healing for Smart Manufacturing: A Review. J. Intell. Manuf. 2023, 35, 2441–2473.
Rouholamini, S.R.; Mirabi, M.; Farazkish, R.; Sahafi, A. Proactive Self-healing Techniques for Cloud Computing: A Systematic Review. Concurr. Comput. Pract. Exp. 2024, 36, e8246.
Li, C.; Su, X.; Cao, C.; Li, X.; Zou, M. Chemodynamic Covalent Adaptable Network-induced Robust, Self-healing, and Degradable Fluorescent Elastomers for Multicolor Information Encryption. Chem. Sci. 2024, 16, 2295–2306.
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2017, 60, 84–90.
Andronie, M.; Lăzăroiu, G.; Ștefănescu, R.; Uță, C.; Dijmărescu, I. Sustainable, Smart, and Sensing Technologies for Cyber-Physical Manufacturing Systems: A Systematic Literature Review. Sustainability 2021, 13, 5495.
Hu, F.; Wang, W.; Zhou, J. Petri nets-based digital twin drives dual-arm cooperative manipulation. Comput. Ind. 2023, 147, 103880.
Hsieh, F.-S. A Self-Adaptive Neighborhood Search Differential Evolution Algorithm for planning sustainable sequential Cyber–Physical production systems. Appl. Sci. 2024, 14, 8044.
Wang, Y.; Zhu, K.; Sun, M.; Deng, Y. An Ensemble Learning Approach for Fault Diagnosis in Self-Organizing Heterogeneous Networks. IEEE Access 2019, 7, 125662–125675.
Rodríguez, A.; Gómez, J.; Diaconescu, A. A Decentralised Self-healing Approach for Network Topology Maintenance. Auton. Agents Multi-Agent Syst. 2020, 35, 6.
Reshmi, T.R.; Azath, M. Improved Self-healing Technique for 5G Networks Using Predictive Analysis. Peer Netw. Appl. 2020, 14, 375–391.
Liang, X.; He, M.; Chen, H. Multiple Adaptive Strategies-based Rat Swarm Optimizer. In Proceedings of the 2022 IEEE 8th International Conference on Cloud Computing and Intelligent Systems (CCIS), Chengdu, China, 27–28 November 2021; pp. 159–163.
Hafaiedh, B.I.; Slimane, M.B. A Distributed Formal-based Model for Self-healing Behaviors in Autonomous Systems: From Failure Detection to Self-recovery. J. Supercomput. 2022, 78, 18725–18753.
Yavuz, L.; Soran, A.; Önen, A.; Li, X.; Muyeen, S.M. An Adaptive Fault Detection Scheme Using Optimized Self-healing Ensemble Machine Learning Algorithm. CSEE J. Power Energy Syst. 2021, 8, 1145–1156.
Almutairi, L.; Daniel, R.; Khasimbee, S.; Lydia, L.; Acharya, S.; Kim, H.-I. Quantum Dwarf Mongoose Optimization With Ensemble Deep Learning Based Intrusion Detection in Cyber-Physical Systems. IEEE Access 2023, 11, 66828–66837.
Caleb, S.; Thangaraj, S.J.J.; Padmapriya, G.; Nandhini, T.J.; Shadrach, F.D.; Latha, R. Revolutionizing Fault Detection in Self-Healing Network via Multi-Serial Cascaded and Adaptive Network. Knowl.-Based Syst. 2025, 309, 112732.
Electrical Fault Detection and Classification. Kaggle. 22 May 2021. Available online: https://www.kaggle.com/datasets/esathyaprakash/electrical-fault-detection-and-classification (accessed on 15 February 2025).
Sensor Fault Detection Data, Kaggle. 4 November 2020. Available online: https://www.kaggle.com/datasets/arashnic/sensor-fault-detection-data (accessed on 15 February 2025).
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. arXiv 2017.
Dolo, K.M.; Mnkandla, E. Modifying the SMOTE and Safe-Level SMOTE Oversampling Method to Improve Performance. In Lecture Notes on Data Engineering and Communications Technologies; Springer Nature: Singapore, 2022; pp. 47–59.
Hairani, H.; Widiyaningtyas, T.; Prasetya, D.D. Addressing Class Imbalance of Health Data: A Systematic Literature Review on Modified Synthetic Minority Oversampling Technique (SMOTE) Strategies. JOIV Int. J. Inform. Vis. 2024, 8, 1310.
Soltanzadeh, P.; Hashemzadeh, M. RCSMOTE: Range-Controlled Synthetic Minority Over-sampling Technique for Handling the Class Imbalance Problem. Inf. Sci. 2020, 542, 92–111.
Mouhamed, R.; Karim, E.M. FADA-SMOTE-Ms: Fuzzy Adaptative Smote Based Methods. IEEE Access 2024, 12, 158742–158765.
Anshu, K.; Verma, O.P. Optimal Feature Selection for Imbalanced Text Classification. IEEE Trans. Artif. Intell. 2022, 4, 135–147.
Tan, J.S.; Yee, H.J.; Boo, I.; Tan, I.K.T.; Zakariah, H. Investigating the Stability of SMOTE-Based Oversampling on COVID-19 Data. In Lecture Notes in Networks and Systems; Springer Nature: Cham, Switzerland, 2023; pp. 470–480.
Luger, G.F. LLMs: Their Past, Promise, and Problems. Int. J. Semant. Comput. 2024, 18, 501–544.
Yekta, M.M.J. The General Intelligence of GPT–4, Its Knowledge Diffusive and Societal Influences, and Its Governance. Meta-Radiol. 2024, 2, 100078.
Naik, D.; Naik, I.; Naik, N. Decoder-Only Transformers: The Brains Behind Generative AI, Large Language Models and Large Multimodal Models. In Lecture Notes in Networks and Systems; Springer Nature: Cham, Switzerland, 2024; pp. 315–331.
Yang, Z.; Buehler, M.J. Words to Matter: De Novo Architected Materials Design Using Transformer Neural Networks. Front. Mater. 2021, 8, 740754.
Li, W.; Manley, M.; Read, J.; Kaul, A.; Bakir, M.S.; Yu, S. H3DAtten: Heterogeneous 3-D Integrated Hybrid Analog and Digital Compute-in-Memory Accelerator for Vision Transformer Self-Attention. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2023, 31, 1592–1602.
Sajun, A.R.; Zualkernan, I.; Sankalpa, D. A Historical Survey of Advances in Transformer Architectures. Appl. Sci. 2024, 14, 4316.
Azizi, M.; Talatahari, S.; Gandomi, A.H. Fire Hawk Optimizer: A Novel Metaheuristic Algorithm. Artif. Intell. Rev. 2022, 56, 287–363.

Figure 1. The architecture of the proposed fault-detection framework using Transformer-AACNN with EE-SMOTE in SHNs.

Figure 2. EE-SMOTE architecture showing the sequential pipeline for preprocessing.

Figure 3. The architecture of the proposed model combining Attention-Augmented Convolutional Neural Network (AACNN) with transformer encoder for sequential fault pattern learning.

Figure 4. Outlier detection and distribution analysis of EFCD and SFDD datasets (top row: EFCD—RMS distribution and boxplot; bottom row: SFDD—sensor reading distribution and boxplot).

Figure 5. Performance evaluation of Transformer-AACNN model on EFCD and SFDD datasets using RMSE, MAE, and score* metrics.

Figure 6. Ablation study of model variants on EFCD and SFDD datasets using RMSE, MAE, and score* metrics.

Figure 7. A convergence comparison of the proposed Transformer-AACNN model and baseline optimization models over 50 iterations.

Figure 8. Model-wise comparison of mean score* with error bars (standard deviation over five runs) on EFCD and SFDD.

Figure 9. Real-time application domains of the proposed Transformer-AACNN + EE-SMOTE model for predictive fault detection and health monitoring.

Table 1. The specifications of datasets used in the study.

Dataset	Source	Number of Samples	Number of Features	Fault Categories	Sampling	Label Type
EFCD	[22]	12,000	18	6 (electrical faults)	Time-series	Labeled
SFDD	[23]	10,000	4	4 (sensor-level anomalies)	Time-series	Labeled

Table 2. System configuration and software environment for experimental evaluation.

Component	Specification/Software
Operating System	Windows Server 2019 (64-bit)
CPU	Intel® Xeon® W-2255 @ 3.70 GHz (10 cores, 20 threads)
GPU	NVIDIA GeForce RTX 3090 (24 GB VRAM)
RAM	128 GB DDR4
Deep Learning Framework	PyTorch 2.0.1
Python Version	Python 3.10
Libraries Used	NumPy (v1.24.3), SciPy (v1.10.1), scikit-learn (v1.2.2), Matplotlib (v3.7.1), Seaborn (v0.12.2), and tqdm (v4.65.0)
Optimization Algorithm	Adam Optimizer with decoupled weight decay (AdamW)
Learning Rate Scheduler	Cosine Annealing with Warm Restarts
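
The scheduler named in Table 2 can be illustrated with a minimal, framework-free sketch of the standard cosine-annealing-with-warm-restarts rule, η_t = η_min + ½(η_max − η_min)(1 + cos(π·T_cur/T_i)). The base learning rate of 0.0005 is taken from Table 4; the first cycle length `t_0`, the cycle multiplier `t_mult`, and the floor `min_lr` are assumptions chosen only for illustration, since the article does not report them here. In PyTorch the same behavior is provided by `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`.

```python
import math


def cosine_warm_restart_lr(step, base_lr=5e-4, min_lr=0.0, t_0=10, t_mult=2):
    """Learning rate at integer epoch `step` under cosine annealing with warm restarts.

    t_0, t_mult and min_lr are illustrative assumptions, not values from the article.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    cycle_len = t_0
    t_cur = step
    # Walk through completed cycles; each restart multiplies the cycle length by t_mult.
    while t_cur >= cycle_len:
        t_cur -= cycle_len
        cycle_len *= t_mult
    # Cosine decay from base_lr (start of cycle) towards min_lr (end of cycle).
    cos_term = 1.0 + math.cos(math.pi * t_cur / cycle_len)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


if __name__ == "__main__":
    for epoch in (0, 5, 9, 10, 20, 29, 30):
        print(epoch, cosine_warm_restart_lr(epoch))
```

With these assumed settings the rate returns to its base value at epochs 10 and 30, which is the "warm restart" that lets the optimizer leave a poor basin after each annealing cycle.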

Table 3. Operating conditions and bearing assignments for source and target domains.

Operating Condition	Load (N)	Speed (rpm)	Selected Bearings	Used For
OC1	4000	1800	Bearing 1-1, 1-7	Source domain
OC2	4200	1650	Bearing 2-1, 2-2	Source/Target domain
OC3	5000	1500	Bearing 3-2, 3-3	Target domain

Table 4. Model architecture and training hyperparameters for Transformer-AACNN framework.

Parameter	Value
Input Size (per sample)	2560 (raw vibration signal)
AACNN Layers	3 Conv blocks + CA & SA
Transformer Encoder Layers	4
Number of Attention Heads	8
Attention Dropout	0.1
Embedding Dimension	256
Learning Rate	0.0005
Batch Size	64
Optimizer	AdamW
Epochs	200
Loss Weights (α, β, γ, δ)	0.5, 0.05, 1.0, 0.1
Time-VQVAE Training Epochs	100
Gradient Clipping	1.0 (to prevent explosion)
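
As a reading aid for Table 4, the following sketch collects the listed hyperparameters in one configuration object and shows two quantities that follow from them. It assumes that the four loss weights combine their loss terms as a plain weighted sum and that gradient clipping is applied to the global L2 norm; both are assumptions, and the names `TrainConfig`, `weighted_total_loss`, and `clip_by_global_norm` are hypothetical helpers, not identifiers from the article.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Table 4; field names are illustrative."""
    input_size: int = 2560
    encoder_layers: int = 4
    num_heads: int = 8
    attention_dropout: float = 0.1
    embed_dim: int = 256
    learning_rate: float = 5e-4
    batch_size: int = 64
    epochs: int = 200
    loss_weights: tuple = (0.5, 0.05, 1.0, 0.1)  # (alpha, beta, gamma, delta)
    grad_clip: float = 1.0

    @property
    def head_dim(self):
        # Multi-head attention splits the embedding evenly across heads.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads


def weighted_total_loss(terms, weights):
    """Assumed form of the overall loss: sum of weight * term, in Table 4 order."""
    if len(terms) != len(weights):
        raise ValueError("need one weight per loss term")
    return sum(w * t for w, t in zip(weights, terms))


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    cfg = TrainConfig()
    print("per-head dimension:", cfg.head_dim)
    print("clipped:", clip_by_global_norm([3.0, 4.0], cfg.grad_clip))
```

With an embedding dimension of 256 and 8 heads, each attention head operates on a 32-dimensional subspace.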

Table 5. Ablation study results on EFCD and SFDD datasets.

Model Variant	EFCD RMSE	EFCD MAE	EFCD Score*	SFDD RMSE	SFDD MAE	SFDD Score*
AACNN only	0.192	0.138	0.291	0.181	0.130	0.298
Transformer only	0.176	0.121	0.310	0.163	0.114	0.318
AACNN + Transformer	0.162	0.111	0.329	0.149	0.103	0.342
AACNN + Transformer + EE-SMOTE (Proposed)	0.148	0.102	0.402	0.136	0.095	0.419

Table 6. Comparative performance metrics of baseline models on EFCD and SFDD datasets.

Model	Dataset	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MCC	Mean ± Std
RSO-MSCAN [17]	EFCD	89.34	89.38	89.43	89.33	78.68	0.290 ± 0.005
DMO-MSCAN [18]	EFCD	87.54	87.60	87.58	87.61	75.08	0.274 ± 0.002
BFGO-MSCAN [33]	EFCD	92.76	92.79	92.86	92.71	85.52	0.313 ± 0.005
FHO-MSCAN [32]	EFCD	90.96	91.00	91.00	91.00	81.92	0.331 ± 0.003
RPFHO-MSCAN [17]	EFCD	94.12	94.14	94.18	94.11	88.24	0.329 ± 0.004
PROPOSED MODEL	EFCD	96.28	96.32	96.41	96.24	92.56	0.419 ± 0.004
RSO-MSCAN [17]	SFDD	88.10	88.14	88.23	88.05	76.20	0.299 ± 0.003
DMO-MSCAN [18]	SFDD	91.88	91.91	91.98	91.84	83.76	0.289 ± 0.004
BFGO-MSCAN [33]	SFDD	90.26	90.30	90.35	90.24	80.52	0.304 ± 0.004
FHO-MSCAN [32]	SFDD	93.16	93.19	93.19	93.19	86.32	0.316 ± 0.003
RPFHO-MSCAN [17]	SFDD	94.12	94.14	94.18	94.11	88.24	0.328 ± 0.003
PROPOSED MODEL	SFDD	97.14	97.18	97.23	97.13	94.26	0.419 ± 0.003
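
For readers who want to recompute the accuracy and MCC columns of Table 6 from their own predictions, the sketch below derives both from a multi-class confusion matrix using the standard multi-class generalization of the Matthews correlation coefficient. The confusion matrix in the usage section is made up for illustration and is not taken from the article. MCC is returned on its native scale of −1 to 1; the tabulated values appear to be scaled by 100, which is an interpretation rather than something the table states.

```python
import math


def accuracy(cm):
    """Fraction of samples on the diagonal of confusion matrix cm[true][pred]."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


def multiclass_mcc(cm):
    """Multi-class Matthews correlation coefficient from cm[true][pred]."""
    n = len(cm)
    s = sum(sum(row) for row in cm)                 # total number of samples
    c = sum(cm[k][k] for k in range(n))             # correctly classified samples
    t = [sum(cm[k]) for k in range(n)]              # true count per class (row sums)
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    # By convention MCC is 0 when the denominator vanishes (e.g. one predicted class only).
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical 3-class confusion matrix, rows = true class, columns = predicted class.
    toy_cm = [[50, 2, 1], [3, 45, 2], [0, 4, 43]]
    print("accuracy:", accuracy(toy_cm))
    print("MCC:", multiclass_mcc(toy_cm))
```

Unlike accuracy, MCC accounts for how errors are distributed over all classes, which is why it is commonly reported alongside accuracy and F1-score on imbalanced fault data.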

Table 7. Comparative analysis of RPFHO-MSCAN and Transformer-AACNN models.

Aspect	RPFHO-MSCAN (Base Paper)	Transformer-AACNN (Proposed)
Model Type	Metaheuristic Optimization Model (Fire Hawk Optimizer)	Deep Learning Hybrid Model (CNN + Transformer + EE-SMOTE)
Optimization Approach	Swarm-based global optimization	Gradient-based learning with backpropagation
Architecture	Revised parallel version of Fire Hawk Optimizer	Attention-augmented CNN + transformer encoder
Feature Extraction	Handcrafted features + heuristic selection	Automated hierarchical feature learning via convolution + attention
Handling Data Imbalance	Not explicitly addressed	EE-SMOTE for oversampling
Interpretability	Moderate (via convergence plots and metric tracking)	High (via attention visualization and feature maps)
Training Stability	Sensitive to initialization and local optima	Stable with early stopping and batch normalization
Convergence Behavior	Gradual convergence with fluctuating cost values	Fast and smooth convergence across 30–35 epochs
Accuracy (EFCD)	94.12%	96.28%
F1-Score (EFCD)	94.14%	96.32%
MCC (EFCD)	88.24	92.56
Score* (EFCD)	0.2949	0.402
Accuracy (SFDD)	94.12%	97.14%
F1-Score (SFDD)	94.14%	97.18%
MCC (SFDD)	88.24	94.26
Score* (SFDD)	0.3313	0.419
Deployment Scalability	Requires tuning for each dataset	Generalizable across datasets with minimal retraining
Computational Cost	Moderate (heuristic iterations)	Higher (training epochs, but more efficient at inference)
Best Use Case	Scenarios needing evolutionary search in unknown spaces	High-dimensional time-series fault detection with data imbalance

* Score is a weighted composite index that combines accuracy, F1-score, and MCC for overall performance benchmarking.

Table 8. The performance of the proposed model under varying data conditions to assess generalizability.

Condition	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MCC
Original (Clean Data)	96.28	96.32	96.41	96.24	92.56
Gaussian Noise	94.67	94.61	94.73	94.50	90.32
Salt-and-Pepper Noise	94.12	94.08	94.19	93.97	89.64
MCAR Missing Data	93.89	93.76	93.85	93.68	89.12
MAR Missing Data	93.74	93.60	93.71	93.48	88.93
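
The four perturbed conditions in Table 8 can be sketched as simple transformations of a one-dimensional signal. This is an illustrative sketch only: the noise level `sigma`, the corruption and missingness rates, and the choice of a companion variable `driver` for the MAR case are all assumptions, because the table does not report the settings used in the experiments.

```python
import random


def add_gaussian_noise(signal, sigma, rng):
    """Additive zero-mean Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in signal]


def add_salt_and_pepper(signal, rate, rng):
    """Replace a fraction `rate` of samples with the signal's min or max value."""
    lo, hi = min(signal), max(signal)
    out = []
    for x in signal:
        if rng.random() < rate:
            out.append(hi if rng.random() < 0.5 else lo)
        else:
            out.append(x)
    return out


def mask_mcar(signal, rate, rng):
    """Missing completely at random: each sample is dropped with the same probability."""
    return [None if rng.random() < rate else x for x in signal]


def mask_mar(signal, driver, rate_low, rate_high, rng):
    """Missing at random: drop probability depends on an observed companion variable."""
    median = sorted(driver)[len(driver) // 2]
    out = []
    for x, d in zip(signal, driver):
        rate = rate_high if d > median else rate_low
        out.append(None if rng.random() < rate else x)
    return out


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the perturbations are repeatable
    clean = [float(i % 50) for i in range(200)]
    load = [float(i) for i in range(200)]  # hypothetical observed covariate
    noisy = add_gaussian_noise(clean, sigma=0.5, rng=rng)
    spiky = add_salt_and_pepper(clean, rate=0.05, rng=rng)
    mcar = mask_mcar(clean, rate=0.1, rng=rng)
    mar = mask_mar(clean, load, rate_low=0.02, rate_high=0.2, rng=rng)
    print(sum(v is None for v in mcar), sum(v is None for v in mar))
```

Missing entries are marked with `None` so that a downstream imputation or masking step can be chosen separately.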

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
