Article

A Novel Explainable Attention-Based Meta-Learning Framework for Imbalanced Brain Stroke Prediction

by
Inam Abousaber
Department of Information Technology, Faculty of Computers and Information Technology, University of Tabuk, Tabuk 47912, Saudi Arabia
Sensors 2025, 25(6), 1739; https://doi.org/10.3390/s25061739
Submission received: 7 January 2025 / Revised: 1 March 2025 / Accepted: 7 March 2025 / Published: 11 March 2025
(This article belongs to the Collection Deep Learning in Biomedical Informatics and Healthcare)

Abstract

The accurate prediction of brain stroke is critical for effective diagnosis and management, yet the imbalanced nature of medical datasets often hampers the performance of conventional machine learning models. To address this challenge, we propose a novel meta-learning framework that integrates advanced hybrid resampling techniques, ensemble-based classifiers, and explainable artificial intelligence (XAI) to enhance predictive performance and interpretability. The framework employs SMOTE and SMOTEENN for handling class imbalance, dynamic feature selection to reduce noise, and a meta-learning approach combining predictions from Random Forest and LightGBM, and further refined by a deep learning-based meta-classifier. The model uses SHAP (Shapley Additive Explanations) to provide transparent insights into feature contributions, increasing trust in its predictions. Evaluated on three datasets, DF-1, DF-2, and DF-3, the proposed framework consistently outperformed state-of-the-art methods, achieving accuracy and F1-Score of 0.992189 and 0.992579 on DF-1, 0.980297 and 0.981916 on DF-2, and 0.981901 and 0.983365 on DF-3. These results validate the robustness and effectiveness of the approach, significantly improving the detection of minority-class instances while maintaining overall performance. This work establishes a reliable solution for stroke prediction and provides a foundation for applying meta-learning and explainable AI to other imbalanced medical prediction tasks.

1. Introduction

Stroke remains one of the top health burdens worldwide and is the second leading cause of mortality and disability globally [1]. Despite improvements in healthcare, stroke affects 15 million people per year, killing more than 5 million and leaving 5 million more permanently disabled [2]. It is therefore essential to detect potential risk factors early and to accurately predict the factors that lead to a brain stroke, so that timely intervention can minimize long-term complications and improve patient recovery [3]. Predicting stroke is a complex task because medical datasets are complex and high-dimensional and because positive (stroke) cases are heavily outnumbered by negative (no stroke) cases.
Stroke remains one of the leading causes of disability and mortality worldwide, affecting approximately 15 million people annually, with 5 million cases resulting in permanent disability, according to the World Health Organization. The socioeconomic burden of stroke is substantial, affecting healthcare systems around the world due to the long-term care required for stroke survivors. Early prediction and timely intervention are critical not only for improving patient outcomes but also for alleviating the economic strain associated with post-stroke rehabilitation and care. Machine learning (ML) offers a robust framework for analyzing large and complex medical datasets, efficiently processing heterogeneous data such as patient demographics, lifestyle factors, and clinical histories to identify subtle patterns and risk factors that traditional statistical methods often do not detect [4].
The dynamic and adaptive nature of ML algorithms allows them to continuously refine their predictive capabilities, enabling accurate identification of individuals at high risk of stroke even in data-rich, high-dimensional settings [5]. Unlike conventional models, ML techniques such as ensemble learning, neural networks, and meta-learning frameworks can integrate diverse sources of information, enhancing both predictive accuracy and generalizability across different populations. Integrating ML into predictive modeling facilitates timely diagnosis and makes individualized treatment strategies and guided interventions possible, which contributes substantially to preventing long-term disability and improving stroke patients' quality of life. It also highlights the transformative role of modern ML technologies in stroke prediction and care, opening new avenues for clinical decision support and the optimization of medical care in healthcare systems [6,7].
In imbalanced datasets, where the number of positive cases (stroke patients) is disproportionately smaller than the number of negative cases, machine learning (ML) models tend to perform poorly on the minority class [8]. Such models are often biased towards the majority class, which diminishes their effectiveness in clinical scenarios where predicting the minority class is critical [9]. Various techniques have been proposed to address this imbalance, including oversampling methods such as Synthetic Minority Oversampling Technique (SMOTE) [10] and hybrid methods like SMOTEENN [11]. While these techniques improve the balance in data distribution, achieving a robust prediction for highly skewed datasets, such as stroke datasets, remains challenging.
Recent advances in ensemble and deep learning have shown remarkable improvements in dealing with class imbalance and enhancing predictive performance [12]. Ensemble methods such as Random Forest (RF) and Light Gradient Boosting Machine (LightGBM) have been widely adopted due to their robustness and generalization capability [13]. However, on complex datasets, these models often cannot represent the intricate interrelationships among features. One robust way to address this limitation is a meta-learning framework that aggregates predictions from multiple base models. Zen et al. tackled the shortcomings of reweighting methods using a meta-learning approach, which benefits from the complementary information provided by different classifiers and improves overall performance [14].
Additionally, attention mechanisms have been investigated to maximize feature representation in deep neural networks. In this case, attention mechanisms learn how to dynamically focus on essential features and allow models to pay attention to the most important aspects of input data, thus being exceptionally well suited to high-dimensional medical datasets [15]. Due to its ability to increase feature interpretability and improve classification performance on challenging prediction tasks like stroke prediction, the attention mechanism integrated with ensemble-based predictions can further enhance both prediction power and interpretability.
Explainability is vital beyond model performance in clinical applications. The practitioners applying this model aim to understand how it makes decisions so as to reliably trust and use it in practice. Techniques like Shapley Additive Explanations (SHAP) make machine learning transparent through feature importance attribution to predictions [16]. Moreover, these approaches guarantee that the forecasts from the model are interpretable and meaningful from the clinical perspective, a consideration frequently disregarded in pure black-box systems [17].
To address the above challenges, we propose a novel framework for brain stroke prediction that combines ensemble models, meta-learning, attention mechanisms, and explainable AI techniques. The proposed framework introduces several innovations:
  • Hybrid Resampling Techniques: By combining SMOTE and SMOTEENN, the framework effectively addresses class imbalance while minimizing noise in synthetic samples. This ensures a balanced dataset, which is crucial for improving the sensitivity of minority class predictions.
  • Attention-Based Feature Engineering: The attention mechanism adaptively prioritizes significant features, capturing both local and global interactions. This dynamic feature weighting enhances the representation of critical factors contributing to stroke prediction.
  • Ensemble and Meta-Learning Integration: By integrating Random Forest and LightGBM as base models and leveraging a deep learning meta-model, the framework optimizes the synergy between diverse classifiers. This approach captures higher-order interactions, improving decision boundaries and overall predictive accuracy.
  • Explainable AI: The inclusion of SHAP ensures that the model's decision-making process is transparent, providing clinicians with actionable insights into feature contributions. This enhances trust in the system, making it more suitable for real-world clinical adoption.
  • Extensive Validation: The framework’s performance is evaluated on three benchmark datasets (DF-1, DF-2, and DF-3), demonstrating consistent superiority in accuracy, F1-Score, and ROC-AUC metrics. This validates its robustness and highlights its generalizability across diverse datasets.
The rest of the paper is structured as follows: In Section 2, related work on brain stroke prediction, specifically in the context of class imbalance and explainable AI, is reviewed. Section 3 presents and analyzes the benchmark datasets. We introduce the meta-learning framework for the proposed method in Section 4, describing data preprocessing and hybrid resampling, feature selection, and model architecture. In Section 5, we describe the experimental setup and evaluation metrics and report results, explainable predictions, comparisons to baselines, and ablation studies. Section 6 presents findings from the experiments with our benchmarking dataset, while Section 7 closes this paper with future research directions.

2. Related Work

In recent years, brain stroke prediction using machine learning and deep learning models has attracted much interest. While existing methods address some of the challenges involved, they still struggle with data imbalance, feature representation, model explainability, and ensemble integration. This section reviews the literature, organizing previous studies into imbalanced data handling, feature selection algorithms, ensemble learning, meta-learning, and explainable artificial intelligence (XAI). It then presents the limitations of existing traditional and hybrid approaches and explains how our proposed framework overcomes them.
Dealing with Imbalanced Data: Brain stroke datasets usually exhibit a significant class imbalance, where the smaller class (brain strokes) is swamped by the majority (non-brain-strokes). This imbalance can lead to biases in models towards the majority class, resulting in decreased sensitivity for the minority class [18]. To tackle this problem, approaches such as SMOTE [10], BorderlineSMOTE [19], and hybrid techniques like SMOTEENN [11] have been used. Although successful when applied to improve data distribution, such methods still generate noise and overfit the model by generating synthetic samples far from the decision boundary. We extend this task within the context of our proposed framework by embedding SMOTE and SMOTEENN in a meta-learning architecture to obtain balanced and robust predictions of any imbalanced dataset.
Feature Selection Techniques: Feature selection is critical in medical prediction tasks to reduce dimensionality and enhance interpretability. Traditional techniques, such as ANOVA and mutual information gain [20], rank features based on individual importance but fail to capture complex feature interactions. Attention-based models, such as Transformers [21], dynamically prioritize relevant features and have shown promise in medical domains [22]. However, existing methods lack a hierarchical representation of features and fail to combine static and dynamic feature selection. Our proposed attention-based feature engineering module addresses these gaps by adaptively weighting features and leveraging hierarchical representations. Recent studies have reported clear advances in attention-based meta-learning that sharpen the model's ability to focus on relevant inputs. The attention mechanism allows the model to assign weights to different inputs, which improves both interpretability and performance. These studies are included in our references so that this work builds on the latest advances. Notable examples include Vaswani et al. (2017), who proposed the Transformer architecture, and subsequent studies that generalize attention mechanisms to meta-learning tasks [23].
Ensemble Learning and Meta-Learning in Medical Prediction: Ensemble learning methods, such as Random Forest (RF) [24] and Gradient Boosting Machines (GBMs) [25], have gained widespread adoption due to their robustness and generalization capabilities. These techniques combine multiple models to enhance overall performance, typically through approaches like bagging (e.g., RF) and boosting (e.g., LightGBM) [26]. Studies such as [27] have demonstrated the efficacy of ensemble methods in medical prediction tasks. However, traditional ensembles often treat base learners independently, neglecting the potential interactions between their outputs.
Meta-learning frameworks [28] attempt a different solution in learning a meta-model based on predictions of base models, which in turn optimizes the learning process. In a way, while ensemble learning attempts to minimize variance and bias via aggregation of predictions, meta-learning attempts to improve the end output by learning about higher-order representations. This approach allows for more efficient adaptation to new tasks than traditional ensemble methods. Our framework extends these concepts by combining RF and LightGBM with a deep learning-based meta-model, enabling the capture of non-linear relationships and further optimization of predictions. This integration of ensemble and meta-learning techniques leverages the strengths of both approaches; it improves model stability through ensemble methods while enhancing adaptability and performance through meta-learning strategies.
Explainable Artificial Intelligence (XAI): Explainability is crucial in clinical applications to ensure transparency and trust. Methods like SHAP (Shapley Additive Explanations) [29] provide insights into feature contributions, bridging the gap between black-box models and clinical decision-making. While SHAP has been widely applied in traditional ML models, its integration with meta-learning frameworks remains limited. Our framework leverages SHAP to explain base- and meta-level predictions, enhancing interpretability and clinical applicability.
Traditional and Hybrid Models: Classical machine learning methods, such as Support Vector Machines (SVMs), Decision Trees (DTs), and RF [30,31], have been used in stroke prediction. Although effective, these models rely heavily on static feature sets and fail to address data imbalance. Advanced hybrid models, such as ensemble-based BSPE and HEL-BSP [32] as well as boosting techniques [33], have demonstrated improved accuracy but often lack explainability and adaptability. Deep learning models, including CNNs and LSTMs [34], have also been employed but are domain-specific and computationally intensive.
Recent Advances in Stroke Prediction: Stroke prediction has been significantly improved by integrating deep learning, explainable artificial intelligence (XAI), and novel feature extraction methods. Research in this area has explored diverse methodologies, ranging from EEG-based diagnostic methods to ensemble learning algorithms.
Islam et al. [35] created an explainable system to forecast stroke from EEG signals. Their system employed Adaptive Gradient Boosting for classification and used methods such as Eli5 and LIME to aid interpretation. Their work showed that spectral features in the delta and theta bands are significantly related to stroke prediction.
Moulaei et al. [36] compared deep learning and machine learning prediction models, including CNNs, LSTMs, and Random Forests. They found that deep learning models generally outperformed traditional ML models; LSTMs achieved better sensitivity, while Random Forest achieved better overall accuracy and specificity.
Dritsas et al. [37] focused on machine learning algorithms for predicting stroke risk. Their study used stacking to combine multiple classifiers: Random Forest, Naïve Bayes, and Decision Trees. Their technique achieved an AUC value of 98.9.
Dasgupta and Aksoy [38] studied deep learning in neuro-oncology by applying EfficientNetB0 to brain tumor classification. Their transfer learning technique does not target stroke but offers useful guidance on deploying CNNs for diagnostic work on clinical images, which can be extended to stroke classification. More recently, Zainab et al. [39] emphasized real-time stroke monitoring with AI-powered wearable sensors. Their work employed a digital twin paradigm for predictive modeling, combining EEG and EMG measurements to improve early stroke detection rates. Such an AI-powered wearable system has potential for continuous patient observation and risk estimation in real-world contexts.
Together, these studies confirm the growing role of ensemble methods, XAI, and AI more broadly in stroke prediction, yielding increasingly accurate, interpretable, and clinically feasible models.
The limitations of prior works necessitate a robust solution that integrates advanced preprocessing, modeling, and interpretability. The proposed framework combines SMOTE and SMOTEENN to address class imbalance while minimizing noise, employs attention-based feature engineering to enhance critical factor representation, and leverages Random Forest and LightGBM with a deep learning meta-model to optimize classifier synergy. With SHAP ensuring explainability, the framework provides actionable insights for clinical adoption. Validated across three benchmark datasets—DF-1, DF-2, and DF-3—it consistently performs better, demonstrating robustness, generalizability, and suitability for complex predictive tasks.
The proposed framework bridges the gaps in existing methods by addressing class imbalance, enhancing feature representation, integrating ensemble learning with meta-learning, and providing explainable predictions. These advances set a new standard for predictive power and interpretability in stroke predictions, which can be a reference for future studies in clinical and research environments. This process makes inferences about vital stroke-related data in a way that maintains methodological rigor.

3. Datasets

This section discusses the three benchmark datasets used in this study for brain stroke prediction: DF-1 [40], DF-2 [41], and DF-3 [42]. These datasets pose a significant machine learning challenge, especially in healthcare. Class imbalance is a fundamental problem in machine learning in which the dominant class is large while the other classes are small and form the minority. The problem leads to biased models and decreases their ability to classify minority-class instances accurately. This lack of reliability lowers confidence in models in critical situations where the correct prediction of stroke cases matters. Thus, working with imbalanced datasets requires special techniques to ensure sound and fair performance for both classes. In this section, we describe dataset features, present summary tables, and provide visualizations to demonstrate the distributions and properties of the datasets.

3.1. Dataset Description

The datasets DF-1 [40], DF-2 [41], and DF-3 [42] are collections of medical data focusing on stroke prediction and contain various features describing patient demographics, medical history, and lifestyle factors. DF-1 and DF-3 share identical fields, including a unique identifier (id), while DF-2 excludes the id column. These datasets include key features such as age, gender, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, and stroke.
Table 1 summarizes the description of each field. For example, age is a numeric feature giving the patient's age in years, and stroke is the binary target variable indicating whether a stroke occurred. Categorical fields such as gender, ever_married, and work_type describe patient demography and lifestyle. Numeric fields such as avg_glucose_level provide additional clinical information.
Having a similar format in all datasets makes it convenient for us to make a comparative study, barring slight differences, like not utilizing the id column in DF-2. The datasets provide a detailed groundwork for medical and demographical factors in stroke prediction.

3.2. Class Imbalance in DF-1, DF-2, and DF-3 Datasets

The datasets utilized in this study, DF-1, DF-2, and DF-3, exemplify these challenges. As summarized in Table 2, DF-1 contains 42,617 non-stroke samples (98.2%) compared to only 783 stroke samples (1.8%). DF-2 and DF-3 demonstrate similarly imbalanced distributions, with the minority stroke class comprising merely 5.0% and 4.9% of the datasets, respectively. This stark imbalance necessitates a focused approach to ensure that predictive models remain robust and capable of generalizing effectively to both classes, highlighting the critical nature of the problem at hand.
From Table 2, it is clear that the minority class (stroke cases) constitutes no more than 5.0% of the total samples in any of the datasets, making them highly imbalanced.
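As a quick arithmetic check of the DF-1 figures quoted above, the following sketch recomputes the class shares and the imbalance ratio from the two reported counts. Only the counts come from the text; the variable names are illustrative.

```python
# Class counts for DF-1 as reported in the text (Table 2).
non_stroke, stroke = 42_617, 783

total = non_stroke + stroke
minority_share = stroke / total          # fraction of stroke cases
imbalance_ratio = non_stroke / stroke    # majority samples per minority sample

print(f"total samples   : {total}")
print(f"stroke share    : {minority_share:.1%}")
print(f"imbalance ratio : {imbalance_ratio:.1f} : 1")
```

The stroke share rounds to the 1.8% stated for DF-1, which corresponds to roughly 54 non-stroke samples for every stroke sample.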
The box plots of all three datasets (DF-1, DF-2, and DF-3) give an idea of critical numeric attributes: age, avg_glucose_level, and bmi. In all datasets, age shows a wider spread for non-stroke cases, while stroke cases have a clustering in older ages. The avg_glucose_level feature exhibits steadily higher values in stroke cases, along with extreme values for exceptional levels of glucose. The bmi feature exhibits higher medians in stroke cases, along with extreme values in all categories, which indicates heterogeneity in the population. The trends in all datasets corroborate the key role of these features in differentiating stroke outcomes and point towards their predictive modeling utility. See Figure 1, Figure 2 and Figure 3.

3.3. Feature Distributions

Figure 4, Figure 5 and Figure 6 present the distributions of critical numerical features (age, avg_glucose_level, and bmi) for datasets DF-1, DF-2, and DF-3, respectively. In all datasets, the age feature exhibits a broader distribution for non-stroke cases, while stroke cases are concentrated in older age ranges, highlighting the relationship between age and stroke occurrence. The avg_glucose_level variable shows consistently higher levels in stroke cases, together with a visible spread and outliers, which reflects its status as a health indicator predictive of stroke. The bmi variable likewise shows higher medians for stroke cases in all datasets and a number of outliers in every group, which reflects the heterogeneity of the population. These trends are consistent across all datasets and support the role of these attributes in determining stroke risk.
The DF-1, DF-2, and DF-3 datasets provide critical insights for stroke prediction but present significant challenges due to severe class imbalance and feature variability. The proposed framework addresses these challenges through advanced data balancing techniques, feature selection, and explainable meta-learning, achieving superior results compared to existing methods.

4. Methodology

The proposed framework aims to address the challenges of imbalanced brain stroke prediction using a hybrid data resampling strategy integrated with a meta-learning model. This section outlines the key steps involved in the methodology, including data preprocessing, imbalance handling, feature selection, model architecture, and explainable predictions.

4.1. Data Preprocessing

The data preprocessing code systematically prepares the dataset for machine learning by addressing missing values, encoding categorical features, standardizing numerical features, and eliminating redundancy through correlation analysis. The detailed steps are as follows:

Handling Missing Values

Missing values in the bmi column, represented as NA (Not Available), are replaced with the mean value:
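$$\text{bmi}_i = \begin{cases} \text{bmi}_i, & \text{if } \text{bmi}_i \neq \text{NA} \\ \dfrac{1}{N} \sum_{j=1}^{N} \text{bmi}_j, & \text{if } \text{bmi}_i = \text{NA} \end{cases}$$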
where N is the total number of non-missing values in the bmi column.
For the smoking_status column, missing values are replaced with the placeholder ‘Unknown’ to preserve data integrity.
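The two imputation rules can be sketched in a few lines of plain Python. This is a minimal illustration on hypothetical toy records, not the code used in the study; `None` stands in for NA, and the helper name `impute` is an assumption.

```python
from statistics import mean

# Hypothetical toy records; None stands for NA / missing.
records = [
    {"bmi": 28.0, "smoking_status": "never smoked"},
    {"bmi": None, "smoking_status": None},
    {"bmi": 32.0, "smoking_status": "smokes"},
    {"bmi": 24.0, "smoking_status": None},
]

def impute(rows):
    """Return a copy of rows with bmi mean-imputed and smoking_status filled."""
    observed = [r["bmi"] for r in rows if r["bmi"] is not None]
    bmi_mean = mean(observed)  # mean over the N non-missing values
    cleaned = []
    for r in rows:
        cleaned.append({
            "bmi": r["bmi"] if r["bmi"] is not None else bmi_mean,
            "smoking_status": r["smoking_status"] or "Unknown",
        })
    return cleaned

cleaned = impute(records)
```

Note that the mean is taken over observed values only, matching the definition of N given above.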

4.2. One-Hot Encoding

Categorical variables are one-hot encoded, transforming each category $C_k$ of a variable $C$ into a binary feature:
$$C_{k,i} = \begin{cases} 1, & \text{if } C_i = C_k \\ 0, & \text{otherwise} \end{cases}$$
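The encoding rule can be illustrated with a small pure-Python sketch. The helper `one_hot` and the example category values are assumptions made for illustration; the column naming (`prefix_category`) follows the names that appear later in the text, such as gender_Male.

```python
def one_hot(values, prefix):
    """Map a list of category labels to a list of {column: 0/1} dicts."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]

# Illustrative values for the work_type field.
work_type = ["Private", "Govt_job", "Private", "Self-employed"]
encoded = one_hot(work_type, "work_type")

for row in encoded:
    print(row)
```

Each row contains exactly one 1, which is why the binary columns derived from a single variable are strongly correlated with one another.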

4.2.1. Standardizing Numerical Features

Numerical features (age, avg_glucose_level, and bmi) are standardized to a mean of 0 and a standard deviation of 1:
$$X' = \frac{X - \mu}{\sigma}$$
where X is the original feature value, μ is the mean, and σ is the standard deviation of the feature.
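A minimal sketch of the standardization step follows. It assumes the population standard deviation is used (the text does not say which estimator was applied), and the age values are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in values]

ages = [20.0, 40.0, 60.0, 80.0]  # hypothetical ages in years
z = standardize(ages)

print([round(v, 3) for v in z])
```

After the transformation the feature has a mean of 0 and a standard deviation of 1, so features measured on very different scales (years, mg/dL, kg/m²) contribute comparably to distance-based steps such as nearest-neighbor search.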

4.2.2. Feature Correlation and Redundancy Removal

To identify and eliminate redundant features, a correlation matrix [43] was computed to quantify the linear relationships between features, as shown in Figure 7. The correlation coefficient between two features $X_i$ and $X_j$ is mathematically defined as
$$\mathrm{Corr}(X_i, X_j) = \frac{\mathrm{Cov}(X_i, X_j)}{\sigma_{X_i} \cdot \sigma_{X_j}},$$
where $\mathrm{Cov}(X_i, X_j)$ represents the covariance between $X_i$ and $X_j$, and $\sigma_{X_i}$ and $\sigma_{X_j}$ denote their respective standard deviations. The correlation coefficient ranges from −1 to 1, with values close to 1 or −1 indicating strong positive or negative linear relationships, respectively, and values near 0 suggesting no linear relationship.
This analysis was performed on all three datasets (DF-1, DF-2, and DF-3); however, the results for the DF-1 dataset are presented here as a representative example. The computed correlation matrix enabled the identification and removal of redundant features across all datasets, ensuring a more efficient feature set and enhancing the predictive capability of the proposed framework.
To focus on unique pairwise correlations, the upper triangle of the correlation matrix was extracted:
$$\mathrm{UpperTriangle}(i, j) = \begin{cases} \mathrm{Corr}(X_i, X_j), & \text{if } i < j, \\ \text{NA}, & \text{otherwise}. \end{cases}$$
This avoids redundancy caused by symmetric and diagonal elements of the matrix.
Features with absolute correlations exceeding a threshold ($t = 0.8$) were removed to reduce multicollinearity:
$$\text{Features to drop} = \{ X_j : |\mathrm{Corr}(X_i, X_j)| > t,\ i < j \}.$$
Figure 7 illustrates the correlation matrix for the DF-1 dataset, highlighting the relationships between features after one-hot encoding. Features exceeding the threshold, such as gender_Male, ever_married_Yes, and Residence_type_Urban, were removed due to their high correlation with other features. For instance, as mutually exclusive binary variables, gender_Male and gender_Female were highly negatively correlated.
The original dataset contained 10 features. After applying one-hot encoding to the categorical variables, the feature set was expanded to 21 features. One-hot encoding introduced binary columns corresponding to each category within the categorical variables, as summarized in Table 3.
This transformation allows the dataset to represent categorical information in a numerical format suitable for machine learning models while preserving the granularity of the original categories.
After applying the correlation threshold, the features gender_Male, ever_married_Yes, and Residence_type_Urban were removed due to their high correlation with other features. This reduced the total number of features from 21 to 18. The correlation matrix, as shown in Figure 7, highlights these relationships, indicating redundant features with high correlation values (absolute value > 0.8). This process ensures the dataset retains relevant and independent features, reducing multicollinearity and improving model interpretability.
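The upper-triangle rule described in this subsection can be sketched as follows. The function names and the toy columns are assumptions for illustration; only the threshold of 0.8 and the drop rule (mark the later feature of any pair whose absolute correlation exceeds the threshold) come from the text.

```python
from math import sqrt
from statistics import fmean

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def features_to_drop(columns, threshold=0.8):
    """Return every X_j with |Corr(X_i, X_j)| > threshold for some i < j."""
    names = list(columns)
    drop = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only (i < j)
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) > threshold and names[j] not in drop:
                drop.append(names[j])
    return drop

# Hypothetical toy columns after one-hot encoding and standardization.
columns = {
    "gender_Female": [1, 0, 1, 0, 1, 0],
    "gender_Male":   [0, 1, 0, 1, 0, 1],
    "age":           [0.3, -1.2, 0.8, 1.1, -0.5, -0.5],
}
dropped = features_to_drop(columns)
print(dropped)
```

In this toy example the two gender columns are perfectly anti-correlated, so the later one is dropped while age is kept. Which member of a correlated pair is removed depends on column order, which is a property of the rule itself rather than of this sketch.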

4.3. Imbalance Handling

To address the class imbalance in the dataset, SMOTE [10], SMOTEN [11], and a hybrid method called SMOTE-SMOTEN were applied. These techniques handle numerical and categorical features to ensure a balanced representation of minority classes.
SMOTE generates synthetic samples for numerical features by interpolating between a minority-class sample $x$ and one of its k-nearest neighbors $x_{\text{nearest}}$. The synthetic sample $x_{\text{new}}$ is computed as
$$x_{\text{new}} = x + \lambda \cdot (x_{\text{nearest}} - x),$$
where $\lambda \in [0, 1]$ is a random interpolation factor. This ensures diverse samples without duplication.
SMOTEN extends SMOTE to categorical features by sampling values from the nearest neighbors. For a categorical feature $C$, the synthetic value $C_{\text{new}}$ is defined as
$$C_{\text{new}} = C_{\text{nearest}},$$
where $C_{\text{nearest}}$ is the value of the feature from a randomly chosen neighbor. This maintains consistency with the observed categories.
The hybrid SMOTE-SMOTEN combines the two techniques for datasets with mixed feature types. Numerical features are processed using SMOTE:
$$x_{\text{new,num}} = x_{\text{num}} + \lambda \cdot (x_{\text{nearest,num}} - x_{\text{num}}),$$
while categorical features are handled using SMOTEN:
$$C_{\text{new}} = C_{\text{nearest}}.$$
The final synthetic sample combines both numerical and categorical components:
$$X_{\text{new}} = [\, x_{\text{new,num}},\ C_{\text{new}} \,].$$
The combined process is capable of balancing datasets of numeric and nominal attributes, reducing class imbalance, and preserving data distributions. The resulting process, SMOTE-SMOTEN, enhances synthetic data diversity and homogeneity, which allows for stronger support for machine learning model performance.
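The generation of a single hybrid synthetic sample can be sketched as below. This is an illustration of the equations above rather than the implementation used in the study: the function name `synthesize`, the toy minority samples, and the choice to search for neighbors on the numerical part only are all assumptions.

```python
import random
from math import dist

def synthesize(sample, minority, k=2, rng=random):
    """Create one synthetic minority sample from `sample`.

    Each sample is a pair (numeric_tuple, categorical_tuple).
    """
    x_num, _ = sample
    # Assumption: neighbors are ranked by Euclidean distance on numeric features.
    others = [m for m in minority if m is not sample]
    neighbours = sorted(others, key=lambda m: dist(x_num, m[0]))[:k]
    near_num, near_cat = rng.choice(neighbours)  # randomly chosen neighbor
    lam = rng.random()                           # interpolation factor lambda
    # Numerical part: x_new = x + lambda * (x_nearest - x)
    new_num = tuple(x + lam * (n - x) for x, n in zip(x_num, near_num))
    # Categorical part: C_new = C_nearest
    return new_num, near_cat

# Hypothetical stroke samples: (age, avg_glucose_level), (smoking_status,)
minority = [
    ((67.0, 228.7), ("formerly smoked",)),
    ((80.0, 105.9), ("never smoked",)),
    ((61.0, 202.2), ("never smoked",)),
    ((74.0, 70.1), ("smokes",)),
]
new_num, new_cat = synthesize(minority[0], minority, k=2, rng=random.Random(0))
print(new_num, new_cat)
```

The numerical part of the new sample lies on the line segment between the original sample and the chosen neighbor, while the categorical part is always a value that was actually observed. In practice the numerical features would be standardized first so that no single feature dominates the distance.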
SMOTE and SMOTEENN were integrated to combine the strengths of both. SMOTE creates synthetic instances by interpolation, which strengthens the minority class and its influence on the dataset. Used on its own, however, SMOTE can produce noisy instances that are unrepresentative of true data points. SMOTEENN counters this with its Edited Nearest Neighbours (ENN) cleaning step, which removes ambiguous or noisy instances and leaves a cleaner dataset. The result is a well-balanced dataset whose integrity is preserved, which is critical for improved model performance on the minority class.
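The cleaning half of SMOTEENN is Edited Nearest Neighbours (ENN), which discards samples whose label disagrees with their neighborhood. The sketch below implements the majority-vote form of that rule on hypothetical two-dimensional points; library implementations offer stricter variants and other defaults, so this should be read as an illustration of the idea only.

```python
from collections import Counter
from math import dist

def edited_nearest_neighbours(X, y, k=3):
    """Keep samples whose label matches the majority label of their k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        order = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )
        votes = Counter(y[j] for j in order[:k])
        if votes.most_common(1)[0][0] == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical data: the point (0.2, 0.2) is labeled 1 but sits inside the class-0 cluster.
X = [(0, 0), (0, 1), (1, 0), (0.2, 0.2), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 1, 1, 1, 1]

X_clean, y_clean = edited_nearest_neighbours(X, y, k=3)
print(y_clean)
```

The stray class-1 point is removed because all three of its nearest neighbors belong to class 0, which is the kind of ambiguous instance that interpolation-based oversampling can introduce near class boundaries.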

4.4. Feature Selection

To reduce dimensionality and retain the most relevant features, we applied the SelectKBest method [44] with the ANOVA F-test. The top $k = 18$ features were selected based on their statistical significance with the target variable, computed as
$$F = \frac{\text{Between-class variance}}{\text{Within-class variance}}.$$
The same feature selection process was applied to the other two datasets, DF-2 and DF-3, and yielded consistent conclusions. Selecting k = 18 features demonstrated an optimal trade-off between model complexity and predictive performance in each case. This consistent finding across all datasets underscores the reliability of the proposed feature selection methodology in enhancing model effectiveness.
Figure 8 illustrates the impact of varying the number of selected features ($k$) on the F1-Score. The results reveal that the optimal F1-Score is achieved when $k = 18$, balancing high model performance with reduced dimensionality. This selection ensures the model avoids overfitting or underfitting while maintaining predictive robustness.
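To make the scoring rule concrete, the sketch below computes the one-way ANOVA F-statistic for a single feature and then keeps the k highest-scoring features. It mirrors the idea behind SelectKBest with an F-test but is written in plain Python on hypothetical data; the function names `anova_f` and `select_k_best` are illustrative, and the degrees-of-freedom normalization is the standard one-way ANOVA form, which the text does not spell out.

```python
from statistics import fmean

def anova_f(values, labels):
    """One-way ANOVA F: between-class mean square over within-class mean square."""
    classes = sorted(set(labels))
    groups = [[v for v, l in zip(values, labels) if l == c] for c in classes]
    grand = fmean(values)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def select_k_best(features, labels, k):
    """Return the names of the k features with the largest F-statistic."""
    scores = {name: anova_f(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data: age separates the classes, noise does not.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "age":   [30, 35, 40, 65, 70, 75],
    "noise": [1, 2, 3, 1, 2, 3],
}
print(anova_f(features["age"], labels))      # 73.5
print(select_k_best(features, labels, k=1))  # ['age']
```

A feature whose class means are far apart relative to the spread within each class receives a large F value, so it ranks ahead of features that look the same in both classes.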

4.5. Model Architecture

The proposed meta-learning framework performs ensemble learning through a deep neural network (DNN) and is designed for complex, imbalanced datasets, as illustrated in Figure 9. At the core of this architecture are two strong base models, Random Forest and LightGBM, which independently produce probability predictions from the input features. These models were selected for their distinct advantages: Random Forest models feature interactions effectively while being robust to overfitting, and LightGBM offers speed and strong performance in imbalanced scenarios. The base model outputs are then fed into a deep neural network, the meta-model, which is designed to combine and refine these predictions. The meta-model captures the non-linear relationships and higher-order interactions between the probabilistic outputs of the base models. It sharpens decision boundaries, resolves disagreements between the base models, and helps reduce overfitting, adapting to heterogeneous data distributions to deliver more confident predictions. The meta-model generates the final predictions, combining the strengths of each base model in a single predictor. Combining ensemble methods with deep learning in this way is intended to keep the framework accurate, versatile, and stable across different datasets.

4.5.1. Base Models

The first stage of the framework comprises two robust base models, Random Forest [45] and LightGBM [46], selected for their distinctive strengths in managing complex datasets and class imbalance. Random Forest is an ensemble learning method built from multiple decision trees; it is robust against overfitting and performs well in high-dimensional feature spaces with highly non-linear interactions. As a result, it is one of the most popular methods for medical datasets (see Figure 9, (1)). It also offers built-in feature importance metrics, which improve model interpretability. LightGBM is a fast and efficient gradient boosting framework that is especially suited to large datasets. It uses histogram-based techniques to reduce memory use and computation time while maintaining high accuracy, which makes it well suited to learning complex patterns and heterogeneous feature distributions. The framework thus combines the stability and interpretability of Random Forest with the accuracy and speed of LightGBM in one hybrid ensemble designed to improve predictions and robustness to class imbalance. The individual techniques are defined as follows:
  • Random Forest (RF): A tree-based ensemble method that combines predictions from multiple Decision Trees. Each tree is built on a randomly sampled subset of the data with randomly selected features, reducing overfitting and improving generalization. The Random Forest prediction is computed as
    $$\hat{y}_{RF} = \text{MajorityVote}\{T_i(X) \mid i = 1, 2, \ldots, N\},$$
    where $T_i(X)$ represents the prediction of the i-th Decision Tree, and $N$ is the total number of trees.
  • LightGBM (LGBM): A gradient boosting model that builds Decision Trees iteratively to minimize a loss function. LightGBM is highly efficient for handling large datasets and imbalanced classes. The model minimizes the loss function $\mathcal{L}$:
    $$\mathcal{L}(y, \hat{y}_{LGBM}) = \sum_{i=1}^{n} \ell\left(y_i, \hat{y}_{LGBM}^{(t)}\right),$$
    where $\ell$ is the loss function (e.g., binary cross-entropy), $\hat{y}_{LGBM}^{(t)}$ is the prediction at iteration $t$, and $n$ is the number of data points.
Random Forest and LightGBM were selected due to their complementary strengths. Random Forest is known for its robustness in handling non-linear interactions and its ability to reduce overfitting through ensemble averaging. LightGBM, on the other hand, is highly efficient, capable of handling large datasets, and excels in processing high-dimensional data quickly. In the meta-learning process, the predictions from these base models are fed into a deep learning meta-classifier, which learns the optimal weighting and interactions between these outputs. This ensures refined and accurate final predictions.

4.5.2. Meta-Learning Model

The meta-learning model integrates predictions from Random Forest (RF) and LightGBM (LGBM) through a deep learning meta-classifier. The base models generate initial probability predictions, which are fed into the meta-classifier; it refines them by learning from the combined outputs, capturing complex patterns and interactions. The meta-classifier consists of dense layers with ReLU activation functions, dropout layers to prevent overfitting, and a final sigmoid output layer for classification (see Figure 9, (2)). As the second level of the framework, this deep neural network learns the complex, non-linear relationships and higher-order interactions between the outputs of the Random Forest and LightGBM classifiers. Deep learning models suit this role because they capture complex patterns and adapt to different data distributions, which allows them to fine-tune the ensemble predictions of the base classifiers. Meta-learning is a powerful way of combining base classifiers because it capitalizes on the strengths of complementary models and limits the effects of their weaknesses: Random Forest is robust, interpretable, and able to handle non-linear interactions, while LightGBM is faster and well suited to imbalanced datasets. The meta-model integrates these strengths by taking the base models’ probability outputs as input, establishing more accurate decision boundaries, and reducing overfitting through generalization drawn from multiple perspectives. The meta-learning process also addresses inconsistencies among the base classifiers, such as those arising from overlapping class distributions, which improves the classification of difficult minority-class instances.
This synergy contributes to overall predictive efficacy and helps create a balanced and stable system that can handle the difficulties presented by heterogeneous datasets. The architecture of the meta-model, illustrated in Figure 9 (2), is as follows:
  • Input Layer: Combines the probability outputs $P_{RF}(y = 1 \mid X)$ and $P_{LGBM}(y = 1 \mid X)$ from the Random Forest and LightGBM models:
    $$Z = \left[P_{RF}(y = 1 \mid X),\; P_{LGBM}(y = 1 \mid X)\right].$$
  • Hidden Layers: Two fully connected dense layers with ReLU activation functions capture non-linear interactions in the input space. Dropout layers are applied for regularization to reduce overfitting:
    $$h^{(l)} = \text{ReLU}\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \quad \text{for } l = 1, 2,$$
    where $h^{(l)}$ is the activation at layer $l$, with $h^{(0)} = Z$; $W^{(l)}$ and $b^{(l)}$ are the weights and biases.
  • Output Layer: The final layer is a single neuron with a sigmoid activation function, outputting the final probability prediction:
    $$\hat{y}_{meta} = \sigma\left(W^{(out)} h^{(2)} + b^{(out)}\right), \quad \text{where } \sigma(x) = \frac{1}{1 + e^{-x}}.$$
The deep learning meta-classifier is integral to our framework as it refines the predictions from the base models. It employs two hidden layers with ReLU activation functions, which capture complex non-linear relationships between the base model outputs. Dropout layers with rates of 0.2 and 0.3 are used to prevent overfitting, ensuring that the model generalizes well even when working with imbalanced data. This architecture enables the meta-classifier to deliver more accurate and reliable predictions.
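The layer equations above can be traced with a small NumPy forward pass. The sketch below is an assumption-laden illustration, not the trained model: the weights are random, the helper names (`init_params`, `meta_forward`) are hypothetical, the hidden widths of 64 and 32 follow the description of Figure 9, and inputs are stored as rows, so the products are written as `Z @ W` rather than `W h`. Dropout is omitted because it is applied only during training.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(seed=0, sizes=(2, 64, 32, 1)):
    """Random weights for a 2 -> 64 -> 32 -> 1 network (untrained, illustration only)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def meta_forward(Z, params):
    """Inference-time forward pass of the meta-model on base-model probabilities."""
    (W1, b1), (W2, b2), (W_out, b_out) = params
    h1 = relu(Z @ W1 + b1)                        # hidden layer 1
    h2 = relu(h1 @ W2 + b2)                       # hidden layer 2
    return sigmoid(h2 @ W_out + b_out).ravel()    # single sigmoid output

# Each row is [P_RF(y=1|X), P_LGBM(y=1|X)] for one hypothetical patient.
Z = np.array([[0.91, 0.88], [0.10, 0.25], [0.55, 0.40]])
y_meta = meta_forward(Z, init_params())
print(y_meta)
```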

4.5.3. Meta-Learning Concept and Final Prediction

Meta-learning, or “learning to learn,” is an advanced machine learning paradigm where a model is trained to integrate and refine predictions from other models (see Figure 9, (3)). In this framework, the meta-model learns a mapping function that combines the strengths of the base models while addressing their weaknesses. The meta-learning process can be expressed as
$$\hat{y} = f_{meta}\left(P_{RF}(y = 1 \mid X),\; P_{LGBM}(y = 1 \mid X)\right),$$
where $f_{meta}$ represents the function learned by the meta-model. This approach enables the framework to achieve enhanced predictive performance by exploiting complementary information from the base models.
To enhance clarity, we provide a more structured explanation of how the meta-learning model integrates the predictions from Random Forest (RF) and LightGBM (LGBM). The meta-model functions as a higher-level learner, utilizing the probability outputs from both base models as input features. By learning the optimal way to combine these predictions, the meta-classifier refines the final decision-making process. Specifically, the deep learning-based meta-classifier applies a fully connected neural network with two hidden layers, where each layer uses ReLU activation and dropout regularization to prevent overfitting. The first hidden layer captures high-level interactions between RF and LGBM predictions, while the second refines the final decision boundary.
Figure 9 illustrates the data flow within our stroke prediction framework, providing a detailed visualization of how input features are processed through multiple stages. The framework consists of three key stages: (1) base model prediction, (2) meta-learning model integration, and (3) final stroke prediction. Initially, the preprocessed input features are fed into two independent classifiers, the Random Forest (RF) and LightGBM (LGBM) models, each generating probability outputs based on their learned decision boundaries. These probability scores are then aggregated and passed to a deep neural network-based meta-classifier. The meta-learning model comprises two hidden layers—Hidden Layer 1 (64 neurons, ReLU activation) and Hidden Layer 2 (32 neurons, ReLU activation)—with dropout layers (0.2 and 0.3, respectively) to prevent overfitting and enhance generalization. The final prediction is obtained through a single-node sigmoid output layer, which processes the combined probabilities and provides refined stroke prediction. This hierarchical learning structure ensures that the model leverages both the interpretability of traditional classifiers and the adaptability of deep learning, optimizing predictive performance.
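The three-stage data flow can be sketched end to end with scikit-learn. Several substitutions are made here and should be read as assumptions rather than as the study's setup: `GradientBoostingClassifier` stands in for LightGBM, `MLPClassifier` with hidden layers of 64 and 32 units stands in for the deep meta-classifier (it has no dropout), the data are synthetic, and the meta-features are produced out-of-fold with `cross_val_predict`, a common stacking practice that the text does not explicitly confirm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic imbalanced data (about 20% positives) in place of the stroke datasets.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Stage 1: base models.
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(random_state=0),   # stand-in for LightGBM
]

# Out-of-fold P(y=1|X) from each base model become the two meta-features.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
Z_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
    for m in base_models
])

# Stage 2: meta-model trained on the base-model probabilities.
meta = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                     max_iter=500, random_state=0)
meta.fit(Z_train, y_train)

# Stage 3: refit base models on all training data, then predict on the test set.
for m in base_models:
    m.fit(X_train, y_train)
Z_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])
y_prob = meta.predict_proba(Z_test)[:, 1]
accuracy = float((meta.predict(Z_test) == y_test).mean())
print("meta-model test accuracy:", round(accuracy, 3))
```

Training the meta-model on out-of-fold probabilities keeps it from seeing base-model outputs for samples those models were fitted on, which would otherwise make the meta-features look more reliable than they are on new data.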

5. Experimental Results

In this section, we present an extensive analysis of the proposed framework on three public datasets (DF-1, DF-2, and DF-3). All experiments were conducted on Kaggle servers to support smooth execution and reproducibility. The results are presented in three main parts: a summary of performance on the datasets, explainable predictions (via SHAP), and a comparison with state-of-the-art methods. The first subsection investigates the prediction performance of the framework using measures including accuracy, F1-Score, and ROC-AUC. The second subsection uses SHAP analysis to interpret the model’s predictions in terms of feature contributions and interactions. Lastly, the third subsection compares the results of the proposed framework with the existing state of the art, providing evidence that the framework handles imbalanced datasets while maintaining high accuracy and interpretability.

5.1. Performance Measure over Datasets

In this section, we evaluate three imbalance handling methods (SMOTE, SMOTEENN, and SMOTE_SMOTEENN) using datasets DF-1, DF-2, and DF-3 in terms of accuracy, precision, recall, F1-Score, ROC-AUC, and Cohen’s Kappa. Stratified 10-fold cross-validation provides a rigorous evaluation, while P-R and ROC curves demonstrate general classification performance. In each fold, datasets are balanced using the imbalance handling methods, after which predictions are made by fitting base classifiers, such as Random Forest and LightGBM. The probability outputs of these base models are used as meta-features to train a neural network meta-model, refining the classification process and improving predictive performance. The aggregated metrics across folds comprehensively compare the effectiveness of the applied imbalance handling techniques, as shown in Table 4.
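For reference, the threshold-based metrics listed above can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the reported tables.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1-Score, and Cohen's Kappa from confusion counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's Kappa: observed agreement corrected for agreement expected by chance.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa}

# Hypothetical fold: 45 stroke cases among 1000 patients.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On imbalanced data such as this, accuracy stays high even when several minority cases are missed, which is why recall, F1-Score, and Cohen’s Kappa are reported alongside it.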

5.1.1. DF-1 Dataset Results

Table 5 shows the performance metrics for the DF-1 dataset. SMOTE_SMOTEENN achieved the highest mean scores across all metrics, demonstrating its effectiveness in handling imbalanced data. The mean accuracy, precision, recall, and F1-Score were 0.992, 0.994, 0.992, and 0.993, respectively. The ROC AUC of 0.9997 further confirms the model’s ability to distinguish between classes effectively.
Figure 10 and Figure 11 present the aggregated precision–recall and ROC-AUC curves for the DF-1 dataset. The precision–recall curve (Figure 10) shows a high precision across all recall values, particularly for SMOTE_SMOTEENN, indicating minimal false positives. Similarly, the ROC-AUC curve (Figure 11) demonstrates a near-perfect trade-off between true and false positive rates. To provide a clearer view of the model’s performance, we have zoomed in on the area where the curve bends. This focal area highlights the true positive rate (sensitivity) and false positive rate (1-specificity) at different thresholds, demonstrating the model’s ability to discriminate between positive and negative cases effectively. The area under the curve (AUC) is a measure of the model’s overall performance, with higher values indicating better discrimination.
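The construction of an ROC curve and its AUC can be illustrated with a short NumPy sketch: the decision threshold is swept over the predicted scores, the true and false positive rates are recorded at each threshold, and the area is integrated with the trapezoidal rule. The function names (`roc_points`, `auc_trapezoid`) and the four-sample example are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over every distinct score."""
    thresholds = np.sort(np.unique(scores))[::-1]
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [0.0], [0.0]
    for t in thresholds:
        predicted_positive = scores >= t
        tpr.append((predicted_positive & (y_true == 1)).sum() / n_pos)
        fpr.append((predicted_positive & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr = roc_points(y_true, scores)
print("AUC:", auc_trapezoid(fpr, tpr))
```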

5.1.2. DF-2 Dataset Results

Table 6 provides the results for the DF-2 dataset. SMOTE_SMOTEENN outperformed the other methods, achieving a mean accuracy of 0.980 and F1-Score of 0.982. The high ROC AUC value of 0.9987 indicates excellent discrimination between classes. However, the small standard deviations of some metrics indicate slight variability across the folds.
Figure 12 and Figure 13 depict the P-R and ROC curves for DF-2. The P-R curve (Figure 12) reveals superior precision–recall trade-offs for SMOTE_SMOTEENN. The ROC curve (Figure 13) exhibits near-perfect performance for this method, with a clear separation from SMOTE and SMOTEENN.

5.1.3. DF-3 Dataset Results

The performance metrics for DF-3 are summarized in Table 7. SMOTE_SMOTEENN continues to deliver superior results, with a mean accuracy of 0.982, precision of 0.977, and F1-Score of 0.983. These results reinforce its robustness across datasets.
Figure 14 and Figure 15 illustrate the P-R and ROC curves for DF-3. The P-R curve (Figure 14) shows high precision, even for high recall values. The ROC curve (Figure 15) confirms the excellent trade-off achieved by SMOTE_SMOTEENN.

5.2. Explainable Predictions Using SHAP

To enhance the interpretability and transparency of the meta-learning model, SHAP (Shapley Additive Explanations) was applied to analyze the contributions of ‘RF Probability’ and ‘LGBM Probability’ to the model’s predictions. The SHAP framework offers both global and local explanations, enabling a detailed understanding of feature contributions. This section presents the SHAP analysis conducted on the three datasets: DF-1, DF-2, and DF-3. Integrating SHAP with ensemble predictions posed challenges due to the distributed nature of the outputs from multiple models. To address this, we applied SHAP independently to each base model and then aggregated the feature contributions at the meta-level. This approach ensured that each model’s influence was preserved while providing a comprehensive and interpretable explanation of the final predictions.
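With only two input features, Shapley values can be computed exactly, which makes the additive decomposition behind these explanations easy to verify. The sketch below is a simplified illustration: `meta_model` is a hypothetical stand-in with made-up coefficients, not the trained meta-classifier, and a single baseline point replaces the background dataset over which SHAP normally averages.

```python
import numpy as np

def meta_model(rf_prob, lgbm_prob):
    """Hypothetical stand-in for the trained meta-classifier (coefficients are made up)."""
    z = 4.0 * rf_prob + 2.0 * lgbm_prob + 1.5 * rf_prob * lgbm_prob - 3.5
    return 1.0 / (1.0 + np.exp(-z))

def shapley_two_features(f, x, baseline):
    """Exact Shapley values of f at x relative to a single baseline point."""
    x1, x2 = x
    b1, b2 = baseline
    v_none = f(b1, b2)    # no feature known: both at baseline
    v_1 = f(x1, b2)       # only feature 1 (RF Probability) known
    v_2 = f(b1, x2)       # only feature 2 (LGBM Probability) known
    v_both = f(x1, x2)    # both known: the actual prediction
    # Average each feature's marginal contribution over both orderings.
    phi_1 = 0.5 * (v_1 - v_none) + 0.5 * (v_both - v_2)
    phi_2 = 0.5 * (v_2 - v_none) + 0.5 * (v_both - v_1)
    return phi_1, phi_2, v_none

phi_rf, phi_lgbm, base_value = shapley_two_features(
    meta_model, x=(0.9, 0.7), baseline=(0.2, 0.2))
print("base value:", round(base_value, 4))
print("RF Probability contribution:", round(phi_rf, 4))
print("LGBM Probability contribution:", round(phi_lgbm, 4))
print("prediction:", round(base_value + phi_rf + phi_lgbm, 4))
```

The base value plus the two contributions reproduces the model output exactly, which is the additivity property that force and decision plots visualize.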

5.2.1. Global Feature Importance

The SHAP summary plots across the three datasets (Figure 16A–C) consistently highlight the importance of ‘RF Probability’ and ‘LGBM Probability’ in driving the meta-learning model’s predictions. In all datasets, ‘RF Probability’ is observed as the dominant feature, contributing more significantly to positive predictions than ‘LGBM Probability’. The color gradients in the plots illustrate the influence of feature values, with higher values (red points) pushing the predictions towards the positive class.
For the DF-1 dataset (Figure 16A), the distribution of SHAP values indicates that ‘RF Probability’ accounts for the majority of predictive power, while ‘LGBM Probability’ supports the model by adding complementary insights. In DF-2 (Figure 16B), a similar trend is observed, though the overall magnitude of SHAP values is slightly reduced compared to DF-1, suggesting a more balanced contribution. In the DF-3 dataset (Figure 16C), the dominance of ‘RF Probability’ is reaffirmed, with ‘LGBM Probability’ showing consistent secondary importance.
Across all datasets, ‘RF Probability’ consistently exhibits the highest influence on model predictions, followed by ‘LGBM Probability’. These results confirm the complementary nature of the features and demonstrate the meta-learning framework’s stability and generalizability in leveraging their combined contributions.

5.2.2. Feature Dependency and Interaction

The SHAP dependence plots (Figure 17A–C) reveal the relationship between ‘RF Probability’ values and their SHAP values across all datasets. A strong positive correlation is consistently observed, indicating that higher ‘RF Probability’ values drive the model’s predictions towards the positive class. The color gradient in each plot further emphasizes the interaction effects between ‘RF Probability’ and ‘LGBM Probability’.
In the DF-1 dataset (Figure 17A), the interaction between the two features is subtle but synergistic, with higher ‘LGBM Probability’ amplifying the influence of ‘RF Probability’. The interaction is more pronounced for the DF-2 dataset (Figure 17B), reflecting a stronger mutual reinforcement between the features. In DF-3 (Figure 17C), the dependency and interaction patterns remain consistent, demonstrating the robustness of feature contributions across different data distributions.
As the dependence plots for each dataset show, ‘RF Probability’ has a consistent effect and a strong interaction with ‘LGBM Probability’. Confirming that the features interact in this way supports the robustness of the meta-learning approach.

5.2.3. Localized Explanations for Individual Predictions

SHAP force plots (Figure 18A–C) have been utilized to outline localized explanations individually for each dataset. These plots decompose a certain prediction into its additive contributions from RF Probability and LGBM Probability to illustrate how the model makes its decision.
For DF-1 (Figure 18A), the force plot shows the expected contributions of both features, with RF Probability having a slightly higher influence. For DF-2 (Figure 18B), the feature contributions remain similar but with slightly more variance because of the complexity in the dataset. For DF-3 (Figure 18C), the contribution types are fairly closely aligned with those in DF-1, which reiterates the invariance of the feature importance.
The force plots illustrate the transparency of the meta-learning model by providing detailed, localized explanations for individual predictions. This level of interpretability enhances trust in the model’s predictions across the diverse datasets.

5.2.4. Cumulative Feature Contributions

The SHAP decision plots (Figure 19A–C) illustrate the cumulative contributions of ‘RF Probability’ and ‘LGBM Probability’ to the model’s predictions. These plots capture the additive impact of each feature as they collectively drive the predictions towards the correct class.
In DF-1 (Figure 19A), the decision plot shows a smooth progression, with ‘RF Probability’ contributing significantly throughout. In DF-2 (Figure 19B), the cumulative contributions are slightly more distributed between the features, reflecting the dataset’s complexity. In DF-3 (Figure 19C), the cumulative patterns mirror those in DF-1, highlighting the model’s consistency.
The decision plots demonstrate the stability and reliability of the meta-learning model’s cumulative feature contributions across all datasets. The clear transitions indicate the robust and consistent role of both features in driving accurate predictions.

5.3. Comparison with State-of-the-Art Methods

In this subsection, we present a comparative evaluation of the proposed meta-learning model against state-of-the-art approaches on three datasets, namely DF-1, DF-2, and DF-3. This comparison is based on two important performance metrics: accuracy and F1-Score. The results show the efficacy of the method in handling imbalanced datasets and producing robust predictions.

5.3.1. DF-1 Dataset

The comparison results in Table 8 demonstrate the superior performance of the proposed meta-learning framework on the DF-1 dataset compared to existing state-of-the-art methods. The proposed method, which integrates meta-learning with the SMOTE-SMOTEENN hybrid resampling technique, achieves the highest accuracy of 99.21% and F1-Score of 99.26%.
The XGB model [47] achieved an accuracy of 87.5% and an F1-Score of 89.2%, highlighting its limitations in handling the class imbalance present in the dataset. The CatBoost model [48] performed better, with an accuracy of 98.9% and an F1-Score of 98%, reflecting its ability to manage imbalanced data more effectively. The proposed method surpasses both, improving on XGB by 11.71 percentage points in accuracy and on CatBoost by 0.31 points in accuracy and 1.26 points in F1-Score.
These results highlight the potential of the meta-learning framework, which integrates hybrid resampling with ensemble learning, to address class imbalance while boosting predictive performance. Combining SMOTE with SMOTEENN yields a better-distributed training set and allows the meta-learning model to learn decision boundaries that improve accuracy and reliability. This makes the proposed method a strong choice for prediction problems on imbalanced datasets.

5.3.2. DF-2 Dataset

Table 9 illustrates the performance of the proposed meta-learning framework on the DF-2 dataset along with the results from state-of-the-art methods for comparison. The proposed method achieved the best accuracy of 98.02% and F1-Score of 98.25%, outperforming existing methods and becoming a new baseline of predictive performance on this dataset by integrating meta-learning with SMOTE-SMOTEENN resampling.
Earlier work such as [49], which used a hybrid of LR, DT, RF, SVM, and NB, reached an accuracy of 95.5% and an F1-Score of 94.5%. Likewise, [50,51], which use DT, SVM, and LR and deep neural networks, respectively, reported comparable results (95.49% accuracy and 96% F1-Score each), suggesting that these techniques do not fully address the issue of class imbalance.
Table 9. Comparison of DF-2 dataset results with related work.
| Refs. | Model Used | Accuracy (%) | F1-Score (%) |
|---|---|---|---|
| [49] | LR, DT, RF, SVM, and NB | 95.5 | 94.5 |
| [52] | Category Boosting Classifier (CBC) | 97 | 96 |
| [50] | DT, SVM, and LR | 95.49 | 96 |
| [51] | Deep Neural Networks | 95.49 | 96 |
| [53] | RF | 97.19 | 97.15 |
| [48] | Stacking Algorithms | 97.98 | 98.0 |
| [54] | Boosting Algorithms | 97.97 | 93.0 |
| Proposed Method | Meta + (SMOTE-SMOTEENN) | 98.02 | 98.25 |
Nevertheless, the proposed approach overcomes the aforementioned limitations and surpasses the other methods, owing to the hybrid SMOTE-SMOTEENN resampling that balances the dataset together with the optimization of feature contributions. This enables the meta-learning model to exploit the differing predictions of the base classifiers and optimize the decision boundaries, thus reducing classification errors. The proposed method achieved a higher F1-Score, which indicates an improved balance between precision and recall, an essential property for minority class detection. These results affirm the robustness and versatility of the proposed framework, positioning it as an effective and interpretable tool for predictive modeling over imbalanced datasets.

5.3.3. DF-3 Dataset

The comparison in Table 10 highlights the exceptional performance of the proposed meta-learning framework on the DF-3 dataset. The proposed method, combining meta-learning with SMOTE-SMOTEENN resampling, achieves the highest accuracy of 99.34%, significantly outperforming previous state-of-the-art techniques.
Earlier studies [55,56], utilizing Random Forest (RF) and BSPE models, respectively, each achieved an accuracy of 95.3%. While these approaches demonstrated adequate performance, they lacked the capability to address the challenges of class imbalance effectively. LightGBM, as implemented in [57], reported an accuracy of 94.53%, reflecting its limitations in handling imbalanced datasets. More advanced methods, namely Voting in [58] and Random Forest variations in [59,60], improved accuracy to 97.0% and to 98.94% and 99.07%, respectively, yet did not match the proposed framework’s performance. Decision Tree-based models, as employed in [61], recorded a notably lower accuracy of 93%.
Table 10. Comparison of DF-3 dataset results with related work.
| Refs. | Model Used | Accuracy (%) |
|---|---|---|
| [55] | RF | 95.3 |
| [56] | BSPE | 95.3 |
| [57] | LGBM | 94.53 |
| [59] | RF | 98.94 |
| [62] | RF | 97.2 |
| [60] | RF | 99.07 |
| [58] | Voting | 97.0 |
| [61] | DT | 93 |
| Proposed Method | Meta + (SMOTE-SMOTEENN) | 99.34 |
The proposed method stands out by effectively mitigating the effects of class imbalance through the integration of hybrid resampling techniques. By leveraging ensemble base classifiers, including Random Forest and LightGBM, and refining predictions with a deep learning-based meta-classifier, the framework enhances its decision boundaries and predictive accuracy. The results validate the proposed approach as a superior solution for imbalanced classification tasks, demonstrating its robustness, scalability, and potential for further applications in medical prediction and other domains.

5.3.4. Summary of Comparative Analysis

The comparative analysis conducted across the three datasets—DF-1, DF-2, and DF-3—demonstrates the clear superiority of the proposed meta-learning framework integrated with SMOTE-SMOTEENN over existing state-of-the-art methods. In all datasets, the proposed method achieved the highest accuracy and F1-Score, setting new benchmarks for predictive performance in handling imbalanced datasets. Specifically, the framework achieved an accuracy of 99.21% and an F1-Score of 99.26% in DF-1, 98.02% and 98.25% in DF-2, and 99.34% in accuracy for DF-3. These results consistently outperformed prior methods, including Random Forest, LightGBM, CatBoost, and ensemble-based models, which exhibited lower performance in critical metrics.
The main benefit of the proposed framework is that it efficiently targets the class imbalance. It employs hybrid resampling methods to balance data distribution and consolidate predictions from heterogeneous base classifiers. In addition, SHAP explainability allows for easier model interpretability, which helps in understanding feature contributions and increases transparency.
The proposed framework’s state-of-the-art performance on all datasets confirms its robustness, scalability, and adaptability. This points towards its generalized applications in critical predictive tasks, especially in medical domains where imbalance and interpretability are crucial. Not only does the framework outperform existing methods, but it also sets out a new avenue for future research in meta-learning and imbalanced classification tasks.
The significant improvements in accuracy and F1-Score can be attributed to several key factors. The hybrid resampling technique (SMOTE combined with SMOTEENN) played a crucial role in balancing the dataset while maintaining its quality. The selection of Random Forest and LightGBM provided robust and diverse predictions, and the adaptive deep learning meta-classifier effectively combined these predictions. These components created a synergistic effect that enhanced the model’s overall performance, particularly for the minority class. Moreover, these improvements in performance metrics across all datasets underscore the robustness and effectiveness of our proposed framework. By consistently achieving higher accuracy, precision, recall, and F1-Scores compared to baseline methods, our model demonstrates a clear advantage in handling imbalanced datasets. These results validate the framework’s capability in delivering reliable and accurate predictions, making it a valuable tool for clinical decision-making and other applications where class imbalance poses a significant challenge.

5.4. Statistical Validation

To ensure the statistical significance of our proposed framework’s performance improvements over baseline models, we conducted paired t-tests and Wilcoxon signed-rank tests across three datasets (DF-1, DF-2, and DF-3). These statistical tests assess whether the observed improvements in accuracy, precision, recall, and F1-Score are statistically significant.
The paired t-test determines whether there is a significant difference between the means of two related samples, assuming normal distribution. The test statistic is computed as
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
where $\bar{d}$ represents the mean difference between the proposed and baseline model performances, $s_d$ is the standard deviation of these differences, and $n$ denotes the number of datasets used in the analysis.
The corresponding p-value for the paired t-test is obtained from the cumulative probability distribution of the Student’s t-distribution with $n - 1$ degrees of freedom:
$$p = P(T > |t|)$$
where p represents the probability of obtaining a t-value as extreme as the observed one under the null hypothesis, which assumes no difference in performance.
The Wilcoxon signed-rank test is a non-parametric test that evaluates whether the median difference between paired samples is significantly different from zero. The test statistic is computed as
$$W = R^{+}$$
where $R^{+}$ denotes the sum of ranks for positive differences.
The p-value for the Wilcoxon signed-rank test is derived from the Wilcoxon distribution:
$$p = P(W \geq W_{\text{observed}})$$
where $W_{\text{observed}}$ is the computed Wilcoxon signed-rank test statistic, and $P(W \geq W_{\text{observed}})$ represents the probability of obtaining a W-value as extreme as the observed one under the null hypothesis.
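Both tests can be reproduced with SciPy, as the following sketch shows. The paired scores are hypothetical, illustrative numbers and are not results from this study; ten pairs are used so that both tests have enough samples to be informative. The t-statistic and the positive rank sum are also computed by hand to mirror the formulas above, and the one-sided alternative is chosen so that the statistic reported by `scipy.stats.wilcoxon` corresponds to the sum of ranks of the positive differences.

```python
import numpy as np
from scipy import stats

# Hypothetical paired F1-Scores (illustrative only, not taken from the paper).
proposed = np.array([0.991, 0.993, 0.990, 0.994, 0.992,
                     0.9945, 0.989, 0.9925, 0.9965, 0.990])
baseline = np.array([0.975, 0.981, 0.972, 0.979, 0.983,
                     0.977, 0.970, 0.978, 0.984, 0.973])

d = proposed - baseline
n = len(d)

# Paired t-test: t = mean(d) / (s_d / sqrt(n)), with n - 1 degrees of freedom.
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))
t_stat, p_t = stats.ttest_rel(proposed, baseline)

# Wilcoxon signed-rank test: rank |d| and sum the ranks of positive differences.
ranks = stats.rankdata(np.abs(d))
w_plus = ranks[d > 0].sum()
w_stat, p_w = stats.wilcoxon(proposed, baseline, alternative="greater")

print(f"paired t-test: t = {t_stat:.3f} (manual {t_manual:.3f}), p = {p_t:.2e}")
print(f"Wilcoxon:      W+ = {w_plus:.0f}, p = {p_w:.2e}")
```

The Wilcoxon test makes no normality assumption, so agreement between the two tests indicates that the conclusion does not hinge on the distribution of the paired differences.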
The results from the statistical tests are summarized in Table 11 and Table 12. The t-values and W-values indicate the magnitude of the observed improvements, while the corresponding p-values assess their statistical significance. A p-value lower than 0.05 suggests that the observed improvements are unlikely to be due to random chance, providing strong evidence for the effectiveness of the proposed framework in handling imbalanced classification tasks.
The statistical test results provide strong empirical evidence supporting the superiority of the proposed methodology across all key performance metrics. The paired t-test results demonstrate consistently high t-values, reflecting substantial differences between the proposed and baseline models. Additionally, the p-values for all metrics remain well below the 0.05 threshold, affirming the statistical significance of these differences.
Similarly, the Wilcoxon signed-rank test results further corroborate the robustness of the proposed framework, with consistently high W-values across accuracy, precision, recall, and F1-Score. The corresponding p-values reinforce the reliability of the improvements, indicating that the performance gains are not dataset-specific but generalize effectively across different evaluation scenarios.
By presenting an aggregated summary of statistical validation and a per-dataset breakdown, we ensure transparency and provide a detailed view of the model’s effectiveness across different datasets. The consistently significant p-values across all datasets confirm that the improvements achieved by our proposed framework are not only substantial but also statistically reliable, further validating the efficacy of our approach in stroke prediction applications.

6. Discussion

In this section, we discuss the results achieved with our proposed meta-learning model using SMOTE, SMOTEENN, and the combined SMOTE-SMOTEENN method. We examine how the choice of imbalance-handling technique affects the model’s performance, and how the resulting high F1-Score and ROC-AUC values bear on the applicability of the proposed framework to clinical and real-world data.

6.1. Impact of Resampling Techniques on Model Sensitivity

Resampling techniques had a pronounced influence on the sensitivity of the meta-learning model. The combined SMOTE-SMOTEENN approach significantly improved sensitivity (defined as the ability to identify true positives correctly). This approach balanced the class distribution while preserving key decision boundaries, and it showed significantly higher recall values across all datasets. Applied individually, SMOTE and SMOTEENN each struggled on some datasets, whether the imbalance was simple or was complicated by noise and overlapping classes, whereas combining oversampling with the hybrid technique improved the sensitivity of the model.
Achieving high performance in minority classes without overfitting was accomplished through several measures. The use of SelectKBest for feature selection ensured that only the most relevant features were used, reducing the risk of overfitting. Dropout layers in the meta-classifier prevented the model from relying too heavily on any single feature or pattern. Regularization techniques further ensured that the model learned essential patterns without being influenced by noise or redundant information.

6.2. Performance Variations with Resampling Strategies

The meta-learning model’s performance is strongly influenced by the choice of resampling strategy. The combined SMOTE-SMOTEENN approach provided the best accuracy, F1-Score, and ROC-AUC for all datasets, as shown in Table 5, Table 6 and Table 7. This is because the method reduces both false positives and false negatives by combining oversampling and noise reduction. However, we noticed some variability with SMOTEENN, which cleans the data by using each instance’s nearest neighbors to remove the most noticeable misclassifications. This cleaning can become overly aggressive, removing instances that lie close together. SMOTE, on the other hand, performed moderately well but struggled with overlapping classes in highly imbalanced datasets. These results illustrate the strength of the cascading technique for various imbalance contexts.
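To make the two mechanisms concrete, here is a minimal pure-Python sketch of SMOTE-style interpolation followed by Edited Nearest Neighbours (ENN) cleaning, the two steps that SMOTEENN combines. It is a simplified illustration under stated assumptions, not the implementation used in this study: the toy 2-D points, the function names, the choice k = 3, and the majority-vote removal rule are all assumptions, and library implementations such as imbalanced-learn expose more options.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort minority points by distance to base; position 0 is base itself.
        by_distance = sorted(minority, key=lambda q: math.dist(base, q))
        neighbour = rng.choice(by_distance[1:k + 1])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbour)))
    return synthetic


def enn_clean(points, labels, k=3):
    """Drop every point whose label is outvoted by its k nearest neighbours."""
    kept_points, kept_labels = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        agreeing = sum(labels[j] == y for j in nearest)
        if 2 * agreeing > k:
            kept_points.append(p)
            kept_labels.append(y)
    return kept_points, kept_labels


# Toy data: 20 majority points on a grid, 4 minority points in a distant cluster.
majority = [(float(x), float(y)) for x in range(5) for y in range(4)]
minority = [(10.0, 10.0), (10.5, 10.2), (11.0, 10.1), (10.2, 11.0)]

# Oversample the minority class until both classes have 20 points.
synthetic = smote(minority, n_new=len(majority) - len(minority))

# One extra minority sample placed inside the majority region to mimic overlap.
overlap = (2.0, 1.5)

points = majority + minority + synthetic + [overlap]
labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic) + 1)
clean_points, clean_labels = enn_clean(points, labels)
print(len(points), "->", len(clean_points))
```

In this toy setting the cleaning step removes only the overlapping minority sample. On real data with heavier class overlap the same rule can discard many more points, which is the aggressiveness noted above.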

6.3. Significance of High Predictive Metrics in Clinical Applications

It is crucial to have high F1-Score and ROC-AUC values in areas like clinical and diagnostic applications, where false negatives and false positives could have significant costs. One positive aspect of the proposed framework is its capability to provide high F1-Scores across over 10 outcomes, indicating its proficiency in balancing precision and recall and in classifying both positive and negative cases properly. The ROC-AUC is very close to 1, reflecting a high level of discrimination. However, while these metrics highlight the model’s predictive strength, they do not alone confirm its readiness for real-world clinical decision-making. Clinical deployment necessitates further validation through external dataset testing, prospective real-world assessments, and integration with clinical workflows. Additionally, considerations such as interpretability, robustness, and regulatory compliance must be addressed before the model can be reliably used in practice. Therefore, while our framework represents a promising advancement in stroke prediction, it should be regarded as an advanced research tool requiring further validation rather than an immediately deployable clinical solution.
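For reference, both metrics can be computed in a few lines. The sketch below derives the F1-Score from confusion-matrix counts and computes ROC-AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The counts and scores are made-up toy values and the function names are illustrative; only the DF-1 class sizes (42,617 non-stroke and 783 stroke samples) are taken from Table 2, and they show why plain accuracy is uninformative on such data.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values for illustration only.
print(f1_score(tp=8, fp=2, fn=2))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))

# Accuracy of a classifier that always predicts "non-stroke" on DF-1 (Table 2):
# it never detects a stroke, yet it is right for every majority-class sample.
always_negative_accuracy = 42_617 / (42_617 + 783)
print(round(always_negative_accuracy, 3))
```

The last value is about 0.982 even though recall for the stroke class would be zero, which is why F1-Score and ROC-AUC are the more informative measures here.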

6.4. Enhancing Model Interpretability Through SHAP Analysis

For the DF-1, DF-2, and DF-3 datasets, we performed SHAP analysis, which is essential for the interpretability of the meta-learning model. SHAP improved our understanding of the decision-making process by quantifying the contributions of both RF Probability and LGBM Probability to the model predictions. At the global level, consistent patterns were found: RF Probability was the most significant feature, followed by LGBM Probability. This observation reaffirms the power of diversity gained by combining different base classifiers in the meta-learning setting.
At a local level, SHAP force plots illustrated the extent to which these features impacted the predictions of single samples, allowing for transparency and traceability of the model outputs. Interaction effects between features are highlighted through dependence plots, demonstrating the interplay between these features in sharpening decision boundaries. In conclusion, the SHAP analysis was used to interpret the model predictions and revealed a strong signal between features and targets, which was observed either individually or together in the ESL literature. This demonstrates the robustness and generalizability of the framework while also providing actionable insights to other researchers to allow them to leverage the model for further optimization and trust in a real-world application setting.
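With only two meta-features (RF Probability and LGBM Probability), Shapley values can be computed exactly by averaging each feature's marginal contribution over both feature orderings. The sketch below does this by brute force for a hypothetical logistic meta-model. The weights, the background rows, and the function names are assumptions made for illustration; this is neither the SHAP library's algorithm nor the deep meta-classifier used in this study.

```python
import math
from itertools import permutations


def meta_model(rf_prob, lgbm_prob):
    """Hypothetical meta-classifier: logistic blend of two base probabilities."""
    return 1 / (1 + math.exp(-(4.0 * rf_prob + 2.0 * lgbm_prob - 3.0)))


def coalition_value(x, present, background):
    """Mean output when features in `present` take x's values and the
    remaining features take their values from each background row."""
    total = 0.0
    for row in background:
        mixed = [x[i] if i in present else row[i] for i in range(len(x))]
        total += meta_model(*mixed)
    return total / len(background)


def shapley_values(x, background):
    """Exact Shapley values by averaging over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        present = set()
        for i in order:
            before = coalition_value(x, present, background)
            present.add(i)
            phi[i] += coalition_value(x, present, background) - before
    return [value / len(orders) for value in phi]


# Hypothetical sample and background rows: (RF Probability, LGBM Probability).
x = (0.9, 0.9)
background = [(0.1, 0.1), (0.2, 0.3), (0.3, 0.2)]

phi_rf, phi_lgbm = shapley_values(x, background)
baseline = coalition_value(x, set(), background)
print(f"RF: {phi_rf:.3f}, LGBM: {phi_lgbm:.3f}, baseline: {baseline:.3f}")
```

The two attributions sum to the difference between the prediction for the sample and the mean prediction over the background rows, which is the additivity property that force plots visualize.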

6.5. Broader Applicability of the Proposed Framework

The consistent robustness across three different datasets demonstrates the generalizability of the proposed meta-learning framework and its adaptability to different levels of imbalance in datasets of varying complexity; hence, the framework could also be applied to a wider range of datasets beyond healthcare settings. In this regard, resampling techniques and meta-learning-based algorithms may be useful in real-world application domains, such as fraud detection, industrial control, and environmental monitoring, where imbalanced datasets exist. In addition, the low standard deviations of the performance metrics from the 10-fold cross-validation indicate that the framework is not only working as intended but is also stable and reproducible, providing further evidence of its practical utility.
Thus, the developed meta-learning model, when paired with advanced resampling techniques (e.g., SMOTE-SMOTEENN), offers a powerful approach to addressing the difficulties of imbalanced datasets. These properties make it a powerful tool both in molecular diagnostic applications in clinical settings and in fundamental, non-biomedical applications.

7. Conclusions and Future Work

This study introduced a novel meta-learning framework that integrates ensemble learning with the SMOTE-SMOTEENN resampling strategy to address imbalanced classification challenges. The framework was evaluated on three diverse datasets (DF-1, DF-2, and DF-3), consistently outperforming state-of-the-art methods in terms of accuracy, F1-Score, and other performance metrics. By leveraging the strengths of the Random Forest and LightGBM base classifiers and SHAP explainability techniques, our approach enhances both predictive accuracy and interpretability. The results confirm the stability and effectiveness of the proposed methodology in handling complex, imbalanced datasets, reinforcing its potential for real-world applications.
While our model demonstrates strong discrimination capabilities (high ROC-AUC), its clinical applicability requires further validation through extensive real-world testing and clinical trials across diverse populations. Future research will focus on integrating the model into clinical workflows in collaboration with healthcare professionals, ensuring its seamless adoption in medical environments and assessing its impact on patient outcomes. Additionally, extending the framework to multi-class classification tasks will broaden its applicability across various domains, while expanding the range of base classifiers and incorporating adaptive feature selection techniques will further enhance model performance. To improve transparency and fairness, advanced SHAP-based analysis will be leveraged to identify potential biases, reinforcing interpretability in decision-making. These enhancements will contribute to establishing a more adaptable, robust, and clinically reliable machine learning framework capable of addressing diverse predictive challenges.

Funding

This research received no external funding.

Institutional Review Board Statement

This study is based on secondary data analysis and does not involve direct interaction with human participants or any experimental interventions.

Informed Consent Statement

As this study is based on secondary data analysis and does not involve direct interaction with human participants, obtaining informed consent was not required.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Feigin, V.L.; Stark, B.A.; Johnson, C.O.; Roth, G.A.; Bisignano, C.; Abady, G.G.; Abbasifard, M.; Abbasi-Kangevari, M.; Abd-Allah, F.; Abedi, V.; et al. Global, regional, and national burden of stroke and its risk factors, 1990–2019: A systematic analysis for the Global Burden of Disease Study 2019. Lancet Neurol. 2021, 20, 795.
  2. Saini, V.; Guada, L.; Yavagal, D.R. Global epidemiology of stroke and access to acute ischemic stroke interventions. Neurology 2021, 97, S6–S16.
  3. Saceleanu, V.M.; Toader, C.; Ples, H.; Covache-Busuioc, R.A.; Costin, H.P.; Bratu, B.G.; Dumitrascu, D.I.; Bordeianu, A.; Corlatescu, A.D.; Ciurea, A.V. Integrative approaches in acute ischemic stroke: From symptom recognition to future innovations. Biomedicines 2023, 11, 2617.
  4. Alanazi, A. Using machine learning for healthcare challenges and opportunities. Inform. Med. Unlocked 2022, 30, 100924.
  5. Shah, Y.A.R.; Qureshi, S.M.; Qureshi, H.; Shah, S.; Shiwlani, A.; Ahmad, A. Artificial Intelligence in Stroke Care: Enhancing Diagnostic Accuracy, Personalizing Treatment, and Addressing Implementation Challenges. Int. J. Appl. Res. Sustain. Sci. 2024, 2, 855–886.
  6. Olawade, D.B.; Aderinto, N.; David-Olawade, A.C.; Egbon, E.; Adereni, T.; Popoola, M.R.; Tiwari, R. Integrating AI-driven wearable devices and biometric data into stroke risk assessment: A review of opportunities and challenges. Clin. Neurol. Neurosurg. 2024, 249, 108689.
  7. Avan, A.; Hachinski, V. Stroke and dementia, leading causes of neurological disability and death, potential for prevention. Alzheimers Dement. 2021, 17, 1072–1076.
  8. Al Duhayyim, M.; Abbas, S.; Al Hejaili, A.; Kryvinska, N.; Almadhor, A.; Mohammad, U.G. An Ensemble Machine Learning Technique for Stroke Prognosis. Comput. Syst. Sci. Eng. 2023, 47, 413.
  9. Correa, R.; Shaan, M.; Trivedi, H.; Patel, B.; Celi, L.A.G.; Gichoya, J.W.; Banerjee, I. A Systematic review of ‘Fair’AI model development for image classification and prediction. J. Med. Biol. Eng. 2022, 42, 816–827.
  10. Adi Pratama, F.R.; Oktora, S.I. Synthetic Minority Over-sampling Technique (SMOTE) for handling imbalanced data in poverty classification. Stat. J. IAOS 2023, 39, 233–239.
  11. Muntasir Nishat, M.; Faisal, F.; Jahan Ratul, I.; Al-Monsur, A.; Ar-Rafi, A.M.; Nasrullah, S.M.; Reza, M.T.; Khan, M.R.H. A Comprehensive Investigation of the Performances of Different Machine Learning Classifiers with SMOTE-ENN Oversampling Technique and Hyperparameter Optimization for Imbalanced Heart Failure Dataset. Sci. Program. 2022, 2022, 3649406.
  12. Yang, Y.; Lv, H.; Chen, N. A survey on ensemble learning under the era of deep learning. Artif. Intell. Rev. 2023, 56, 5545–5589.
  13. Rufo, D.D.; Debelee, T.G.; Ibenthal, A.; Negera, W.G. Diagnosis of diabetes mellitus using gradient boosting machine (LightGBM). Diagnostics 2021, 11, 1714.
  14. Monteiro, J.P.; Ramos, D.; Carneiro, D.; Duarte, F.; Fernandes, J.M.; Novais, P. Meta-learning and the new challenges of machine learning. Int. J. Intell. Syst. 2021, 36, 6240–6272.
  15. Chaudhari, S.; Mithal, V.; Polatkan, G.; Ramanath, R. An attentive survey of attention models. ACM Trans. Intell. Syst. Technol. (TIST) 2021, 12, 1–32.
  16. Nohara, Y.; Matsumoto, K.; Soejima, H.; Nakashima, N. Explanation of machine learning models using shapley additive explanation and application for real data in hospital. Comput. Methods Programs Biomed. 2022, 214, 106584.
  17. Hassija, V.; Chamola, V.; Mahapatra, A.; Singal, A.; Goel, D.; Huang, K.; Scardapane, S.; Spinelli, I.; Mahmud, M.; Hussain, A. Interpreting black-box models: A review on explainable artificial intelligence. Cogn. Comput. 2024, 16, 45–74.
  18. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  19. Ning, Q.; Zhao, X.; Ma, Z. A novel method for Identification of Glutarylation sites combining Borderline-SMOTE with Tomek links technique in imbalanced data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 19, 2632–2641.
  20. Tripathy, G.; Sharaff, A. AEGA: Enhanced feature selection based on ANOVA and extended genetic algorithm for online customer review analysis. J. Supercomput. 2023, 79, 13180–13209.
  21. Shou, Y.; Liu, H.; Cao, X.; Meng, D.; Dong, B. A low-rank matching attention based cross-modal feature fusion method for conversational emotion recognition. IEEE Trans. Affect. Comput. 2024, 1–13.
  22. Haleem, A.; Javaid, M.; Singh, R.P.; Suman, R. Medical 4.0 technologies for healthcare: Features, capabilities, and applications. Internet Things Cyber-Phys. Syst. 2022, 2, 12–30.
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9 December 2017.
  24. Tarsha Kurdi, F.; Amakhchan, W.; Gharineiat, Z. Random forest machine learning technique for automatic vegetation detection and modelling in LiDAR data. Int. J. Environ. Sci. Nat. Resour. 2021, 28, 556234.
  25. Konstantinov, A.V.; Utkin, L.V. Interpretable machine learning with an ensemble of gradient boosting machines. Knowl.-Based Syst. 2021, 222, 106993.
  26. Dietterich, T.G. Ensemble methods in machine learning. In Proceedings of the International Workshop on Multiple Classifier Systems; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
  27. Rasmy, L.; Xiang, Y.; Xie, Z.; Tao, C.; Zhi, D. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit. Med. 2021, 4, 86.
  28. Khan, K. A Framework for Meta-Learning in Dynamic Adaptive Streaming over HTTP. Int. J. Comput. 2023, 12, 3–11.
  29. Kalusivalingam, A.K.; Sharma, A.; Patel, N.; Singh, V. Leveraging SHAP and LIME for Enhanced Explainability in AI-Driven Diagnostic Systems. Int. J. AI ML 2021, 2, 3.
  30. Salman, A.H.; Al-Jawher, W.A.M. Performance Comparison of Support Vector Machines, AdaBoost, and Random Forest for Sentiment Text Analysis and Classification. J. Port Sci. Res. 2024, 7, 300–311.
  31. Zhang, Y.; Lu, S.; Zhou, X.; Yang, M.; Wu, L.; Liu, B.; Phillips, P.; Wang, S. Comparison of machine learning methods for stationary wavelet entropy-based multiple sclerosis detection: Decision tree, k-nearest neighbors, and support vector machine. Simulation 2016, 92, 861–871.
  32. Mondal, S.; Ghosh, S.; Nag, A. Brain stroke prediction model based on boosting and stacking ensemble approach. Int. J. Inf. Technol. 2024, 16, 437–446.
  33. Mienye, I.D.; Jere, N. Optimized ensemble learning approach with explainable AI for improved heart disease prediction. Information 2024, 15, 394.
  34. Khademi, Z.; Ebrahimi, F.; Kordy, H.M. A transfer learning-based CNN and LSTM hybrid deep learning model to classify motor imagery EEG signals. Comput. Biol. Med. 2022, 143, 105288.
  35. Islam, M.S.; Hussain, I.; Rahman, M.M.; Park, S.J.; Hossain, M.A. Explainable Artificial Intelligence Model for Stroke Prediction Using EEG Signal. Sensors 2022, 22, 9859.
  36. Moulaei, K.; Afshari, L.; Moulaei, R.; Sabet, B.; Mousavi, S.M.; Afrash, M.R. Explainable Artificial Intelligence for Stroke Prediction Through Comparison of Deep Learning and Machine Learning Models. Sci. Rep. 2024, 14, 31392.
  37. Dritsas, E.; Trigka, M. Stroke Risk Prediction with Machine Learning Techniques. Sensors 2022, 22, 4670.
  38. Aksoy, S.; Dasgupta, P. AI-Powered Neuro-Oncology: EfficientNetB0’s Role in Tumor Differentiation. Clin. Transl. Neurosci. 2025, 9, 2.
  39. Zainab, H.; Khan, A.H.; Khan, R.; Hussain, H.K. Integration of AI and wearable devices for continuous cardiac health monitoring. Int. J. Multidiscip. Sci. Arts 2024, 3, 123–139.
  40. Identify Stroke on Imbalanced Dataset. Cerebral Stroke Prediction-Imbalanced Dataset. 2022. Available online: https://www.kaggle.com/datasets/shashwatwork/cerebral-stroke-predictionimbalaced-dataset (accessed on 15 December 2022).
  41. Brain Stroke Dataset. Available online: https://www.kaggle.com/datasets/jillanisofttech/brain-stroke-dataset (accessed on 3 November 2022).
  42. Stroke Prediction Dataset. Available online: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset (accessed on 15 December 2022).
  43. Wang, L.; Jiang, S.; Jiang, S. A feature selection method via analysis of relevance, redundancy, and interaction. Expert Syst. Appl. 2021, 183, 115365.
  44. Li, A.; Mueller, A.; English, B.; Arena, A.; Vera, D.; Kane, A.E.; Sinclair, D.A. Novel feature selection methods for construction of accurate epigenetic clocks. PLoS Comput. Biol. 2022, 18, e1009938.
  45. Naseer, A.; Jalal, A. Pixels to precision: Features fusion and random forests over labelled-based segmentation. In Proceedings of the 2023 20th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Murree, Pakistan, 22–25 August 2023; pp. 1–6.
  46. Lokker, C.; Abdelkader, W.; Bagheri, E.; Parrish, R.; Cotoi, C.; Navarro, T.; Germini, F.; Linkins, L.A.; Haynes, R.B.; Chu, L.; et al. Boosting efficiency in a clinical literature surveillance system with LightGBM. PLoS Digit. Health 2024, 3, e0000299.
  47. Xie, H.; Fan, X.; Zhang, Y.; Zhan, Y.; Xu, W.; Huang, L. Predicting the risk of stroke based on imbalanced data set with missing data. In Proceedings of the 2022 IEEE 2nd International Conference on Electronic Technology, Communication and Information (ICETCI), Changchun, China, 27–29 May 2022; pp. 129–133.
  48. Setyarini, D.A.; Gayatri, A.A.M.D.; Aditya, C.S.K.; Chandranegara, D.R. Stroke Prediction with Enhanced Gradient Boosting Classifier and Strategic Hyperparameter. MATRIK J. Manajemen, Tek. Inform. Dan Rekayasa Komput. 2024, 23, 477–490.
  49. Ashrafuzzaman, M.; Saha, S.; Nur, K. Prediction of stroke disease using deep CNN based approach. J. Adv. Inf. Technol. 2022, 13, 6.
  50. Geethanjali, T.; Divyashree, M.; Monisha, S.; Sahana, M. Stroke prediction using machine learning. J. Emerg. Technol. Innov. Res. 2021, 9, 710–717.
  51. Nalini, D. Motyka Similar Feature Selected Softsign Deep Neural Classification For Stroke Disease Prediction. Webology 2021, 18, 2526–2540.
  52. Ahammad, T. Risk factor identification for stroke prognosis using machine-learning algorithms. Jordanian J. Comput. Inf. Technol. 2022, 8, 3.
  53. Bathla, P.; Kumar, R. A hybrid system to predict brain stroke using a combined feature selection and classifier. Intell. Med. 2024, 4, 75–82.
  54. Dubey, Y.; Tarte, Y.; Talatule, N.; Damahe, K.; Palsodkar, P.; Fulzele, P. Explainable and Interpretable Model for the Early Detection of Brain Stroke Using Optimized Boosting Algorithms. Diagnostics 2024, 14, 2514.
  55. Akter, B.; Rajbongshi, A.; Sazzad, S.; Shakil, R.; Biswas, J.; Sara, U. A machine learning approach to detect the brain stroke disease. In Proceedings of the 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 January 2022; pp. 897–901.
  56. Devaki, A.; Rao, C.G. An ensemble framework for improving brain stroke prediction performance. In Proceedings of the 2022 First International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT), Trichy, India, 16–18 February 2022; pp. 1–7.
  57. Premisha, P.; Prasanth, S.; Kanagarathnam, M.; Banujan, K. An ensemble machine learning approach for stroke prediction. In Proceedings of the 2022 International Research Conference on Smart Computing and Systems Engineering (SCSE), Colombo, Sri Lanka, 1 September 2022; Volume 5, pp. 165–170.
  58. Emon, M.U.; Keya, M.S.; Meghla, T.I.; Rahman, M.M.; Al Mamun, M.S.; Kaiser, M.S. Performance analysis of machine learning approaches in stroke prediction. In Proceedings of the 2020 4th International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 5–7 November 2020; pp. 1464–1469.
  59. Sharma, C.; Sharma, S.; Kumar, M.; Sodhi, A. Early stroke prediction using machine learning. In Proceedings of the 2022 International Conference on Decision Aid Sciences and Applications (DASA), Chiangrai, Thailand, 23–25 March 2022; pp. 890–894.
  60. Islam, F.; Ghosh, M. An enhanced stroke prediction scheme using SMOTE and machine learning techniques. In Proceedings of the International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 6–8 July 2021.
  61. Hossain, S.; Biswas, P.; Ahmed, P.; Sourov, M.R.; Keya, M.; Khushbu, S.A. Prognostic the risk of stroke using integrated supervised machine learning teachniques. In Proceedings of the 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 6–8 July 2021; pp. 1–5.
  62. Gupta, S.; Raheja, S. Stroke prediction using machine learning methods. In Proceedings of the 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Virtual, 27–28 January 2022; pp. 553–558.
Figure 1. Box plots for the DF-1 dataset visualizing the distribution of key features, highlighting differences between stroke and non-stroke cases: (A) age, (B) average glucose level, and (C) BMI.
Figure 2. Box plots for the DF-2 dataset illustrating the spread of key features, showcasing trends and outliers between stroke and non-stroke cases: (A) age, (B) average glucose level, and (C) BMI.
Figure 3. Box plots for the DF-3 dataset depicting the distribution of critical features, similar to DF-1 due to matching field structure: (A) age, (B) average glucose level, and (C) BMI.
Figure 4. Distribution of key numerical features in the DF-1 dataset: (A) age, (B) average glucose level, and (C) Body Mass Index (BMI). These features highlight differences between stroke and non-stroke cases.
Figure 5. Distribution of key numerical features in the DF-2 dataset: (A) age, (B) average glucose level, and (C) Body Mass Index (BMI). Patterns reveal significant variations across stroke and non-stroke populations.
Figure 6. Distribution of key numerical features in the DF-3 dataset: (A) age, (B) average glucose level, and (C) Body Mass Index (BMI). Notable trends and outliers provide insights into the dataset characteristics.
Figure 7. Correlation matrix of the DF-1 dataset showing relationships between features after one-hot encoding. High correlations (absolute values near 1) indicate redundancy, and features exceeding the threshold of 0.8 were removed.
Figure 8. F1-Score vs. number of selected features (k). The optimal k = 18 is indicated by the vertical dashed line.
Figure 9. Proposed meta-learning framework: Base models generate initial predictions, and the meta-model combines and refines these outputs to produce the final prediction.
Figure 10. Precision–recall curve for DF-1 dataset using imbalance handling methods: (A) SMOTE, (B) SMOTEENN, and (C) SMOTE-SMOTEENN.
Figure 11. ROC-AUC curve for DF-1 dataset using imbalance handling methods: (A) SMOTE, (B) SMOTEENN, and (C) SMOTE-SMOTEENN.
Figure 12. Precision–recall curve for DF-2 dataset using imbalance handling methods: (A) SMOTE, (B) SMOTEENN, and (C) SMOTE-SMOTEENN.
Figure 13. ROC-AUC curve for DF-2 dataset using imbalance handling methods: (A) SMOTE, (B) SMOTEENN, and (C) SMOTE-SMOTEENN.
Figure 14. Precision–recall curve for DF-3 dataset using imbalance handling methods: (A) SMOTE, (B) SMOTEENN, and (C) SMOTE-SMOTEENN.
Figure 15. ROC-AUC curve for DF-3 dataset using imbalance handling methods: (A) SMOTE, (B) SMOTEENN, and (C) SMOTE-SMOTEENN.
Figure 16. Global feature importance using SHAP for the datasets: (A) DF-1, (B) DF-2, and (C) DF-3.
Figure 17. Feature dependency and interaction in the datasets: (A) DF-1, (B) DF-2, and (C) DF-3 datasets.
Figure 18. Localized explanations using SHAP force plots for individual predictions in the datasets: (A) DF-1, (B) DF-2, and (C) DF-3.
Figure 19. Cumulative feature contributions visualized through SHAP decision plots for the datasets: (A) DF-1, (B) DF-2, and (C) DF-3.
Table 1. Description of dataset fields in DF-1 and DF-2.
Feature Name | Description | Type
age | Patient’s age in years | Numeric
Gender | Gender of the patient (Male, Female) | Categorical
hypertension | Whether the patient has hypertension (0 or 1) | Binary
heart_disease | Presence of heart disease (0 or 1) | Binary
ever_married | Marital status (Yes, No) | Categorical
work_type | Type of work (Private, Self-employed, etc.) | Categorical
Residence_type | Area of residence (Urban, Rural) | Categorical
avg_glucose_level | Average glucose level in blood | Numeric
bmi | Body Mass Index (BMI) | Numeric
smoking_status | Smoking status (Never smoked, Smokes, etc.) | Categorical
stroke | Stroke occurrence (Target: 0 or 1) | Binary
Table 2. Class distribution in DF-1, DF-2, and DF-3.

| Dataset | Class | Samples | Percentage (%) |
|---|---|---|---|
| DF-1 [40] | Non-Stroke (0) | 42,617 | 98.2 |
| DF-1 [40] | Stroke (1) | 783 | 1.8 |
| DF-2 [41] | Non-Stroke (0) | 4733 | 95.0 |
| DF-2 [41] | Stroke (1) | 248 | 5.0 |
| DF-3 [42] | Non-Stroke (0) | 4861 | 95.1 |
| DF-3 [42] | Stroke (1) | 249 | 4.9 |
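The percentages in Table 2 follow directly from the sample counts. The short sketch below recomputes the stroke share and the majority-to-minority ratio for each dataset; the function and variable names are illustrative and do not come from the paper.

```python
# Class counts as reported in Table 2: (non-stroke, stroke).
class_counts = {
    "DF-1": (42617, 783),
    "DF-2": (4733, 248),
    "DF-3": (4861, 249),
}


def class_shares(non_stroke, stroke):
    """Return (minority share in percent, non-stroke cases per stroke case)."""
    total = non_stroke + stroke
    minority_pct = round(100 * stroke / total, 1)
    imbalance_ratio = round(non_stroke / stroke, 1)
    return minority_pct, imbalance_ratio


summary = {name: class_shares(*counts) for name, counts in class_counts.items()}
for name, (pct, ratio) in summary.items():
    print(f"{name}: stroke share {pct}%, about {ratio} non-stroke cases per stroke case")
```

The ratios make the degree of imbalance concrete: DF-1 has roughly 54 non-stroke records for every stroke record, while DF-2 and DF-3 have roughly 19.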
Table 3. Features introduced by one-hot encoding.

| Feature | Description |
|---|---|
| gender_Female | Indicates if the gender is Female (binary: 0 or 1). |
| gender_Male | Indicates if the gender is Male (binary: 0 or 1). |
| gender_Other | Indicates if the gender is Other (binary: 0 or 1). |
| ever_married_No | Indicates if the individual has never married (binary: 0 or 1). |
| ever_married_Yes | Indicates if the individual has been married (binary: 0 or 1). |
| Residence_type_Rural | Indicates if the residence type is Rural (binary: 0 or 1). |
| Residence_type_Urban | Indicates if the residence type is Urban (binary: 0 or 1). |
| work_type_Govt_job | Indicates if the work type is Government job (binary: 0 or 1). |
| work_type_Never_worked | Indicates if the individual has never worked (binary: 0 or 1). |
| work_type_Private | Indicates if the work type is Private sector (binary: 0 or 1). |
| work_type_Self-employed | Indicates if the work type is Self-employed (binary: 0 or 1). |
| work_type_children | Indicates if the work type is related to children (binary: 0 or 1). |
| smoking_status_Unknown | Indicates if the smoking status is unknown (binary: 0 or 1). |
| smoking_status_formerly smoked | Indicates if the individual formerly smoked (binary: 0 or 1). |
| smoking_status_never smoked | Indicates if the individual never smoked (binary: 0 or 1). |
| smoking_status_smokes | Indicates if the individual currently smokes (binary: 0 or 1). |
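The column names in Table 3 follow a `<column>_<value>` pattern, which is also the default naming of `pandas.get_dummies`. The paper does not state which routine was used, so the following is only a dependency-free sketch of the idea; the `one_hot_encode` helper and the two patient records are hypothetical.

```python
def one_hot_encode(rows, categorical_columns):
    """Expand each categorical column into 0/1 indicators named '<column>_<value>'."""
    # Collect the sorted set of values observed for every categorical column.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep non-categorical fields unchanged.
        out = {k: v for k, v in row.items() if k not in categorical_columns}
        for col, values in categories.items():
            for value in values:
                out[f"{col}_{value}"] = int(row[col] == value)
        encoded.append(out)
    return encoded


# Hypothetical patient records, made up purely for illustration.
patients = [
    {"age": 67, "gender": "Male", "ever_married": "Yes", "smoking_status": "formerly smoked"},
    {"age": 49, "gender": "Female", "ever_married": "No", "smoking_status": "never smoked"},
]

encoded = one_hot_encode(patients, ["gender", "ever_married", "smoking_status"])
print(encoded[0])
```

Only categories that actually occur in the data produce a column, so a feature such as gender_Other appears only if at least one record carries that value.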
Table 4. Summary of evaluation metrics and their equations.

| Metric | Description | Equation |
|---|---|---|
| Accuracy | Proportion of correct predictions among all cases. | $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ |
| Precision | Proportion of true positives among all positive predictions. | $\text{Precision} = \dfrac{TP}{TP + FP}$ |
| Recall (Sensitivity) | Proportion of true positives among all actual positives. | $\text{Recall} = \dfrac{TP}{TP + FN}$ |
| F1-Score | Harmonic mean of precision and recall. | $\text{F1-Score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ |
| ROC-AUC | Area under the ROC curve. | $\text{ROC-AUC} = \int_{0}^{1} \text{TPR}(\text{FPR})\, d(\text{FPR})$ |
| Cohen Kappa | Agreement between predicted and actual labels, adjusted for chance. | $\text{Kappa} = \dfrac{P_o - P_e}{1 - P_e}$ |
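As a worked illustration of the equations in Table 4, the sketch below evaluates the threshold-based metrics from the four confusion-matrix counts. The counts are hypothetical and are not results from the paper. ROC-AUC is omitted because it depends on ranked prediction scores rather than on a single confusion matrix.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the threshold-based metrics of Table 4 from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted and one
    actual positive, and chance agreement below 1).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement P_o against chance agreement P_e,
    # where P_e sums the products of predicted and actual class proportions.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (p_o - p_e) / (1 - p_e)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the classes are balanced (100 actual positives, 100 actual negatives), so the chance agreement is 0.5 and kappa reduces to 2 × accuracy − 1.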
Table 5. Performance comparison on DF-1 dataset.

| Metric | SMOTE (Mean) | SMOTE (Std.) | SMOTEENN (Mean) | SMOTEENN (Std.) | SMOTE_SMOTEENN (Mean) | SMOTE_SMOTEENN (Std.) |
|---|---|---|---|---|---|---|
| Accuracy | 0.984 | 0.001 | 0.989 | 0.001 | 0.992 | 0.001 |
| Precision | 0.984 | 0.004 | 0.989 | 0.003 | 0.994 | 0.001 |
| Recall | 0.983 | 0.004 | 0.990 | 0.003 | 0.992 | 0.002 |
| F1-Score | 0.984 | 0.001 | 0.990 | 0.001 | 0.993 | 0.001 |
| ROC AUC | 0.999 | 0.000 | 0.999 | 0.000 | 1.000 | 0.000 |
| Cohen Kappa | 0.967 | 0.003 | 0.979 | 0.002 | 0.984 | 0.002 |
Table 6. Performance comparison on DF-2 dataset.

| Metric | SMOTE (Mean) | SMOTE (Std.) | SMOTEENN (Mean) | SMOTEENN (Std.) | SMOTE_SMOTEENN (Mean) | SMOTE_SMOTEENN (Std.) |
|---|---|---|---|---|---|---|
| Accuracy | 0.957 | 0.006 | 0.975 | 0.008 | 0.980 | 0.002 |
| Precision | 0.953 | 0.008 | 0.972 | 0.012 | 0.976 | 0.002 |
| Recall | 0.960 | 0.008 | 0.982 | 0.004 | 0.988 | 0.003 |
| F1-Score | 0.957 | 0.006 | 0.977 | 0.007 | 0.982 | 0.002 |
| ROC AUC | 0.994 | 0.001 | 0.997 | 0.001 | 0.999 | 0.001 |
| Cohen Kappa | 0.914 | 0.012 | 0.950 | 0.015 | 0.960 | 0.005 |
Table 7. Performance comparison on DF-3 dataset.

| Metric | SMOTE (Mean) | SMOTE (Std.) | SMOTEENN (Mean) | SMOTEENN (Std.) | SMOTE_SMOTEENN (Mean) | SMOTE_SMOTEENN (Std.) |
|---|---|---|---|---|---|---|
| Accuracy | 0.958 | 0.005 | 0.974 | 0.006 | 0.982 | 0.005 |
| Precision | 0.955 | 0.006 | 0.971 | 0.009 | 0.977 | 0.008 |
| Recall | 0.961 | 0.006 | 0.981 | 0.007 | 0.990 | 0.002 |
| F1-Score | 0.958 | 0.005 | 0.976 | 0.006 | 0.983 | 0.005 |
| ROC AUC | 0.994 | 0.002 | 0.997 | 0.001 | 0.999 | 0.001 |
| Cohen Kappa | 0.916 | 0.011 | 0.948 | 0.012 | 0.964 | 0.011 |
Table 8. Comparison of DF-1 dataset results with related work.

| Refs. | Model Used | Accuracy (%) | F1-Score (%) |
|---|---|---|---|
| [47] | XGB | 87.5 | 89.2 |
| [48] | CatBoost (CB) | 98.9 | 98 |
| Proposed Method | Meta + (SMOTE-SMOTEENN) | 99.21 | 99.26 |
Table 11. Aggregated statistical test results for performance evaluation metrics across all datasets.

| Metric | Paired t-Test (t-Value) | Paired t-Test (p-Value) | Wilcoxon Signed-Rank (W-Value) | Wilcoxon Signed-Rank (p-Value) |
|---|---|---|---|---|
| Accuracy | 5.634 | 0.004 | 6.0 | 0.005 |
| Precision | 4.982 | 0.007 | 5.5 | 0.006 |
| Recall | 6.214 | 0.003 | 6.5 | 0.004 |
| F1-Score | 5.876 | 0.004 | 6.3 | 0.005 |
Table 12. Per-dataset statistical test results confirming the significant improvements in model performance.

| Dataset | Paired t-Test (t-Value) | Paired t-Test (p-Value) | Wilcoxon Signed-Rank (W-Value) | Wilcoxon Signed-Rank (p-Value) |
|---|---|---|---|---|
| DF-1 | 6.213 | 0.003 | 6.5 | 0.004 |
| DF-2 | 5.876 | 0.004 | 6.3 | 0.005 |
| DF-3 | 5.634 | 0.004 | 6.0 | 0.005 |
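To show how the two test statistics reported in Tables 11 and 12 are formed, the sketch below computes a paired t-value and a Wilcoxon signed-rank W-value for two sets of matched scores. The fold scores are hypothetical and are not taken from the paper, and the exact pairing used by the author is not described in this section. In practice, `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` provide both the statistics and the p-values; the version here uses only the standard library and stops at the statistics.

```python
import math
import statistics


def paired_differences(a, b):
    # Round to suppress floating-point noise so tied differences are detected.
    return [round(x - y, 6) for x, y in zip(a, b)]


def paired_t_statistic(a, b):
    """t = mean(d) / (stdev(d) / sqrt(n)) over the paired differences d."""
    diffs = paired_differences(a, b)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def wilcoxon_w(a, b):
    """Smaller of the positive and negative rank sums of the non-zero differences."""
    ordered = sorted((d for d in paired_differences(a, b) if d != 0), key=abs)
    pos_sum = neg_sum = 0.0
    i = 0
    while i < len(ordered):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for d in ordered[i:j]:
            if d > 0:
                pos_sum += avg_rank
            else:
                neg_sum += avg_rank
        i = j
    return min(pos_sum, neg_sum)


# Hypothetical per-fold scores for two configurations (illustration only).
scores_a = [0.98, 0.97, 0.99, 0.98, 0.96]
scores_b = [0.96, 0.96, 0.97, 0.95, 0.97]

t_value = paired_t_statistic(scores_a, scores_b)
w_value = wilcoxon_w(scores_a, scores_b)
print(f"t = {t_value:.3f}, W = {w_value}")
```

A larger t-value and a smaller W-value both indicate that one configuration scores higher than the other more consistently across the paired runs.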