Search Results (697)

Search Parameters:
Keywords = oversampling techniques

35 pages, 2740 KB  
Article
Prediction of Depression Risk on Social Media Using Natural Language Processing and Explainable Machine Learning
by Ronewa Mabodi, Elliot Mbunge, Tebogo Makaba and Nompumelelo Ndlovu
Appl. Sci. 2026, 16(7), 3489; https://doi.org/10.3390/app16073489 - 3 Apr 2026
Abstract
Major Depressive Disorder (MDD) is a significant global health burden that contributes to disability and reduced quality of life. Its impact extends beyond individuals, placing emotional, social, and economic strain on families and healthcare systems worldwide. Despite its prevalence, MDD remains widely misunderstood, with limited mental health literacy and persistent stigma often preventing individuals from seeking help. This research explored the prediction of MDD utilising social media data via Natural Language Processing (NLP), Machine Learning (ML), and explainable Machine Learning (xML) techniques. The research aimed at identifying depressive indicators on X (formerly Twitter) and developing interpretable models for depression risk detection. The study’s methodology followed the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework to ensure a systematic approach to data analysis. Data was collected via X’s API and processed using regex-based noise removal, normalisation, tokenisation, and lemmatisation. Symptoms were mapped to DSM-5-TR criteria at the post-level, with user-level MDD risk assessed based on symptom persistence over a two-week period. Risk levels were classified as No Risk, Monitor, and High Risk to facilitate early intervention. Six ML models were trained and tested, while the Synthetic Minority Over-sampling Technique (SMOTE) was applied to mitigate class imbalance. The dataset was partitioned into training and testing sets using an 80:20 split. ML models were evaluated, and the Extreme Gradient Boosting model outperformed the others. Extreme Gradient Boosting achieved an accuracy of 0.979, F1-score of 0.970, and ROC-AUC of 0.996, surpassing benchmark results reported in prior studies. Explainability techniques, such as LIME and tree-based feature importance, enhance model transparency and clinical interpretability. Depressed mood consistently emerged as the highest-weighted predictor across different models. 
The findings highlight the value of aligning ML models with validated diagnostic frameworks to improve trustworthiness and reduce false positives. Future research can expand beyond text-based analysis by incorporating multimodal features to broaden diagnostic depth. Full article
(This article belongs to the Special Issue Deep Learning and Machine Learning in Information Systems)
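The entry above applies SMOTE before an 80:20 train/test evaluation. The core idea of SMOTE, placing a synthetic minority point on the line segment between a minority sample and one of its nearest minority-class neighbours, can be sketched in plain Python; the function name and toy points below are illustrative, not the authors' code.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a
    random minority sample and one of its k nearest minority-class
    neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, nb)))
    return synthetic

# Toy 2-D minority class; generate 3 synthetic points to double it.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
new = smote_sketch(minority, n_new=3)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies inside the bounding box of the minority class, which is both SMOTE's strength and, in noisy regions, its weakness.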

25 pages, 12554 KB  
Article
An Explainable Artificial Intelligence-Driven Framework for Predicting Groundwater Irrigation Suitability in Hard-Rock Aquifers: Moving Beyond Traditional Bivariate Diagnostics
by Mohamed Hussein Yousif, Quanrong Wang, Anurag Tewari, Abara A. Biabak Indrick, Hafizou M. Sow, Yousif Hassan Mohamed Salh and Wakeel Hussain
Water 2026, 18(7), 854; https://doi.org/10.3390/w18070854 - 2 Apr 2026
Abstract
Groundwater is the primary source of irrigation in many semi-arid hard-rock aquifer regions. Yet, its suitability assessment is often hindered by the nonlinear hydrochemical dynamics that traditional bivariate tools, such as the U.S. Salinity Laboratory (USSL) diagram, cannot adequately resolve. To overcome this limitation, we developed an explainable artificial intelligence (XAI) framework that predicts irrigation suitability categories directly from hydrochemical variables, without relying on calculated indices. Using 1872 post-monsoon groundwater samples from Telangana, India, we trained three ensemble tree-based classifiers (Random Forest, LightGBM, and XGBoost) on 11 hydrochemical variables (Na+, K+, Ca2+, Mg2+, HCO3, Cl, F, NO3, SO42−, pH, and total hardness). Class imbalance was addressed using the Synthetic Minority Over-sampling Technique (SMOTE), and model hyperparameters were optimized with Optuna. Among the tested models, LightGBM achieved the best performance (balanced accuracy = 0.938). Model interpretability was enabled using Shapley Additive Explanations (SHAP), supported by Piper and Gibbs diagrams, revealing a critical distinction between sodicity-driven salinity and hardness-driven mineralization, identifying calcium-saturated waters for which gypsum amendment can be chemically futile. To bridge the gap between algorithmic accuracy and operational simplicity, we distilled SHAP explanations into linear heuristics and quantified the trade-off between accuracy and simplicity. Accordingly, we proposed a tiered hydrochemical triage framework in which quantitative heuristics handled approximately 62.5% of the routine samples, while XAI resolved the complex and ambiguous cases. Overall, the proposed framework transforms classic suitability assessment tools into an adaptable, evidence-informed, proactive decision-support system for sustainable agricultural water management under increasing environmental stress. Full article

28 pages, 4366 KB  
Article
Temporal Transformer with Conditional Tabular GAN for Credit Card Fraud Detection: A Sequential Deep Learning Approach
by Jiaying Chen, Yiwen Liang, Jingyi Liu and Mengjie Zhou
Mathematics 2026, 14(7), 1183; https://doi.org/10.3390/math14071183 - 1 Apr 2026
Abstract
Credit card fraud detection remains a critical challenge in financial security, characterized by severe class imbalance and the need to capture complex temporal patterns in transaction sequences. Traditional machine learning approaches treat transactions as independent events, failing to model the sequential nature of user behavior and suffering from inadequate handling of minority class samples. In this paper, we propose an integrated framework that combines generative modeling and time-aware sequential learning for credit card fraud detection. Our approach addresses two fundamental limitations: (1) we model transaction histories as temporal sequences using a Transformer-based architecture that captures both long-term dependencies and abrupt behavioral changes through multi-head self-attention mechanisms, and (2) we employ CTGAN to generate high-quality synthetic fraudulent samples, providing more effective oversampling than conventional techniques like SMOTE. The Time-Aware Transformer incorporates temporal encoding and position-aware attention to preserve transaction order and time intervals, while CTGAN learns the complex conditional distributions of fraudulent transactions to produce realistic synthetic samples. We evaluate our framework on the IEEE-CIS Fraud Detection dataset, demonstrating significant improvements over representative classical and sequential deep-learning baselines. Experimental results show that our method achieves superior performance with an AUC-ROC of 0.982, precision of 0.891, recall of 0.876, and F1-score of 0.883, outperforming the representative baselines considered in this study, including traditional machine learning models, standalone deep learning architectures, and supervised sequential neural models. Ablation studies confirm the individual contributions of both the sequential modeling component and the generative oversampling strategy. 
Our work demonstrates that combining temporal sequence modeling with generative synthesis provides a robust solution for imbalanced fraud detection, with potential applications extending to other domains requiring sequential pattern recognition under extreme class imbalance. Full article

22 pages, 2045 KB  
Article
GA-SMOTE-RF Enhanced Kalman Filter with Adaptive Noise Reduction
by Yiming Wang, Hui Zou, Yuzhou Liu, Tianchang Qiao, Xinyuan Xu, Yihang Li, Changxun He, Shunv Zhou, Hanjie Wang, Qingqing Geng and Qiqi Song
Sensors 2026, 26(7), 2165; https://doi.org/10.3390/s26072165 - 31 Mar 2026
Abstract
Low-noise free-space laser communication has widespread applications in military and rescue fields, but atmospheric turbulence severely affects communication quality. This paper proposes an intelligent classification and adaptive noise reduction system that integrates genetic algorithms (GA), the synthetic minority oversampling technique (SMOTE), random forest (RF), and Kalman filtering, significantly improving turbulence channel interference classification accuracy and communication quality. Simulation results show that the system achieves a classification accuracy of 98.27%, with a corresponding F1-score of 0.9732 and MCC of 0.9653, far exceeding algorithms such as SVM and KNN. After noise reduction, the average RMSE for 400 signal groups is 0.6983, with zero estimated delay, and the mean and standard deviation of the innovation sequence are −0.0049 and 0.6960, respectively, demonstrating excellent signal quality and efficient real-time processing capabilities. Beyond synthetic simulations, we conducted real-world FSO data studies to validate practical applicability. A 24-h field experiment collected 283 real FSO measurement windows, on which the proposed GA–SMOTE–RF method achieves 0.308 RMSE and 0.75% Average Regret in Kalman filter parameter selection, outperforming KNN and SVM and confirming practical applicability for real-world FSO systems. Full article
(This article belongs to the Special Issue Antenna Technology for Advanced Communication and Sensing Systems)

22 pages, 3647 KB  
Article
Addressing Class Imbalance in Predicting Student Performance Using SMOTE and GAN Techniques
by Fatema Mohammad Alnassar, Tim Blackwell, Elaheh Homayounvala and Matthew Yee-king
Appl. Sci. 2026, 16(7), 3274; https://doi.org/10.3390/app16073274 - 28 Mar 2026
Abstract
Virtual Learning Environments (VLEs) have become increasingly popular in education, particularly with the rise of remote learning during the COVID-19 pandemic. Assessing student performance in VLEs is challenging, and the accurate prediction of final results is of great interest to educational institutions. Machine learning classification models have been shown to be effective in predicting student performance, but the accuracy of these models depends on the dataset’s size, diversity, quality, and feature type. Class imbalance is a common issue in educational datasets, but there is a lack of research on addressing this problem in predicting student performance. In this paper, we present an experimental design that addresses class imbalance in predicting student performance by using the Synthetic Minority Over-sampling Technique (SMOTE) and Generative Adversarial Network (GAN) technique. We compared the classification performance of seven machine learning models (i.e., Multi-Layer Perceptron (MLP), Decision Trees (DT), Random Forests (RF), Extreme Gradient Boosting (XGBoost), Categorical Boosting (CATBoost), K-Nearest Neighbors (KNN), and Support Vector Classifier (SVC)) using different dataset combinations, and our results show that SMOTE techniques can improve model performance, and GAN models can generate useful simulated data for classification tasks. Among the SMOTE resampling methods, SMOTE NN produced the strongest performance for the RF model, achieving a Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) of 0.96 and a Type II error rate of 8%. For the generative data experiments, the XGBoost model demonstrated the best performance when trained on the GAN-generated dataset balanced using SMOTE NN, attaining a ROC AUC of 0.97 and a reduced Type II error rate of 3%. These results indicate that the combined use of class balancing techniques and generative synthetic data augmentation can enhance student outcome prediction performance. Full article
(This article belongs to the Topic Explainable AI in Education)

28 pages, 1320 KB  
Article
WCGAN-GA-RF: Healthcare Fraud Detection via Generative Adversarial Networks and Evolutionary Feature Selection
by Junze Cai, Shuhui Wu, Yawen Zhang, Jiale Shao and Yuanhong Tao
Information 2026, 17(4), 315; https://doi.org/10.3390/info17040315 - 24 Mar 2026
Abstract
Healthcare fraud poses significant risks to insurance systems, undermining both financial sustainability and equitable access to care. Accurate detection of fraudulent claims is therefore critical to ensuring the integrity of healthcare insurance operations. However, the increasing sophistication of fraud techniques and limited data availability have undermined the performance of traditional detection approaches. To address these challenges, this paper proposes WCGAN-GA-RF, an integrated fraud detection framework that synergistically combines Wasserstein Conditional Generative Adversarial Network with gradient penalty (WCGAN-GP) for synthetic data generation, genetic algorithm-based feature selection (GA-RF) for dimensionality reduction, and Random Forest (RF) for classification. The proposed framework was empirically validated on a real-world dataset of 16,000 healthcare insurance claims from a Chinese healthcare technology firm, characterized by a 16:1 class imbalance ratio (5.9% fraudulent samples) and 118 original features. Using a stratified 80/20 train–test split with results averaged over five independent runs, the WCGAN-GA-RF framework achieved a precision of 96.47±0.5%, a recall of 97.05±0.4%, and an F1-score of 96.26±0.4%. Notably, the GA-RF component achieved a 65% feature reduction (from 80 to 28 features) while maintaining competitive detection accuracy. Comparative experiments demonstrate that the proposed approach outperforms conventional oversampling methods, including Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), and Adaptive Synthetic Sampling (ADASYN), particularly in handling high-dimensional, severely imbalanced healthcare fraud data. Full article

14 pages, 1332 KB  
Article
Leakage-Free Evaluation for Employee Attrition Prediction on Tabular Data
by Ana Maria Căvescu and Alina Nirvana Popescu
Information 2026, 17(3), 308; https://doi.org/10.3390/info17030308 - 23 Mar 2026
Abstract
In the context of employee attrition prediction using imbalanced tabular data, we propose a reproducible, leakage-aware evaluation protocol and validate it on the IBM HR Attrition dataset. We perform the train/test split prior to any rebalancing; SMOTE (Synthetic Minority Over-sampling Technique) is applied exclusively within the training portion of each fold in stratified 5-fold cross-validation, while the test set remains untouched. One-Hot Encoding is performed consistently using pd.get_dummies. We benchmark Logistic Regression, Random Forest, ExtraTrees, LightGBM, and XGBoost using imbalance-aware metrics: F1 for the minority class, PR-AUC reported as Average Precision (AP), and ROC-AUC reported both in cross-validation and on the held-out test set. XGBoost attains the best mean AP in cross-validation (0.556 ± 0.056). Logistic Regression achieves the highest mean F1 (0.439 ± 0.048), while LightGBM yields the best mean ROC-AUC (0.791 ± 0.026). On the test set, XGBoost achieves a precision value of 0.65 and a recall value of 0.45 at a fixed threshold of 0.5. Overall, the results highlight a trade-off between stable minority-class detection (Logistic Regression) and stronger risk ranking performance (boosting models) under class imbalance. Full article
(This article belongs to the Section Artificial Intelligence)
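The leakage-aware protocol described above, rebalancing only inside the training portion of each stratified fold, can be sketched with a stratified splitter and a simple random-duplication oversampler standing in for SMOTE. All names and the toy labels below are illustrative, not the paper's code.

```python
import random
from collections import Counter

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds preserve the
    class proportions of `labels`."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():          # deal each class round-robin
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

def oversample_train(train_idx, labels, seed=0):
    """Balance the TRAINING indices only, duplicating minority-class
    rows at random (a stand-in for SMOTE); the test fold is untouched."""
    rng = random.Random(seed)
    counts = Counter(labels[i] for i in train_idx)
    target = max(counts.values())
    out = list(train_idx)
    for cls, n in counts.items():
        pool = [i for i in train_idx if labels[i] == cls]
        out += rng.choices(pool, k=target - n)
    return out

labels = [0] * 40 + [1] * 10  # 4:1 imbalance, 50 samples
for train, test in stratified_kfold(labels):
    balanced = oversample_train(train, labels)
    # resampling never sees the held-out fold
    assert set(balanced).isdisjoint(test)
```

The key property is that every index added by the oversampler comes from the training fold, so no information about the held-out fold leaks into the fitted model, which is exactly what splitting after rebalancing would violate.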

30 pages, 9811 KB  
Article
Audio-Based Screening of Respiratory Diseases Using Machine Learning: A Methodological Framework Evaluated on a Clinically Validated COVID-19 Cough Dataset
by Arley Magnolia Aquino-García, Humberto Pérez-Espinosa, Javier Andreu-Perez and Ansel Y. Rodríguez González
Mach. Learn. Knowl. Extr. 2026, 8(3), 80; https://doi.org/10.3390/make8030080 - 20 Mar 2026
Abstract
The development of AI-driven computational methods has enabled rapid and non-invasive analysis of respiratory sounds using acoustic data, particularly cough recordings. Although the COVID-19 pandemic accelerated research on cough-based acoustic analysis, many early studies were limited by insufficient data quality, lack of standardized protocols, and limited reproducibility due to data scarcity. In this study, we propose an audio analysis framework for cough-based respiratory disease screening research using COVID-19 as a clinically validated case dataset. All analyses were conducted on a single clinically acquired multicentric dataset collected under standardized conditions in certified laboratories in Mexico and Spain, comprising cough recordings from 1105 individuals. Model training and testing were performed exclusively within this dataset. The framework incorporates signal preprocessing and a comparative evaluation of segmentation strategies, showing that segmented cough analysis significantly outperforms full-signal analysis. Class imbalance was addressed using the Synthetic Minority Over-sampling Technique (SMOTE) for CNN2D models and the supervised Resample filter implemented in WEKA for classical machine learning models, both applied exclusively to the training subset to generate balanced training sets and prevent data leakage. Feature extraction and classification were carried out using Random Forest, Support Vector Machine (SVM), XGBoost, and a 2D Convolutional Neural Network (CNN2D), with hyperparameter optimization via AutoML. The proposed framework achieved a best balanced screening performance of 85.58% sensitivity and 86.65% specificity (Random Forest with GeMAPSvB01), while the highest-specificity configuration reached 93.90% specificity with 18.14% sensitivity (CNN2D with SMOTE and AutoML). These results demonstrate the methodological feasibility of the proposed framework under the evaluated conditions. Full article

41 pages, 1130 KB  
Article
A Weighted Average-Based Heterogeneous Datasets Integration Framework for Intrusion Detection Using a Hybrid Transformer–MLP Model
by Hesham Kamal and Maggie Mashaly
Technologies 2026, 14(3), 180; https://doi.org/10.3390/technologies14030180 - 16 Mar 2026
Abstract
In today’s digital era, cyberattacks pose a critical threat to networks of all scales, from local systems to global infrastructures. Intrusion detection systems (IDSs) are essential for identifying and mitigating such threats. However, existing machine learning-based IDS often suffer from low detection accuracy, heavy reliance on manual feature extraction, and limited coverage of attack categories. To address these limitations, we propose a modular, deployment-ready intrusion detection framework that integrates multiple heterogeneous datasets through a hybrid transformer–multilayer perceptron (Transformer–MLP) architecture. The system employs three parallel Transformer–MLP models, each specialized for a distinct dataset, whose probabilistic outputs are fused using a weighted decision-level strategy. Unlike traditional feature-level fusion, this strategy ensures module independence, eliminates the need for global retraining when adding new components, and provides seamless modular scalability. The framework accurately identifies twenty-one traffic categories, including one benign and twenty attack classes, derived from a unified mapping across multiple heterogeneous sources to ensure a consistent cross-dataset taxonomy. By combining advanced contextual representation learning with ensemble-based probabilistic fusion, the framework demonstrates high detection accuracy and practical applicability in real-world network environments. The Transformer module captures complex contextual dependencies, while the MLP performs final classification. Class imbalance is mitigated via adaptive synthetic sampling (ADASYN), synthetic minority over-sampling technique (SMOTE), edited nearest neighbor (ENN), and class weight adjustments. 
Empirical evaluation demonstrates the framework’s high effectiveness: for binary classification, it achieves 99.98% on CICIDS2017, 99.19% on NSL-KDD, and 99.98% on NF-BoT-IoT-v2; for two-stage multi-class classification, 99.56%, 99.55%, and 97.75%; and for one-phase multi-class classification, 99.73%, 99.07%, and 98.23%, respectively. Moreover, the framework enables real-time deployment with 4.8–6.9 ms latency, 9800–14,200 fps throughput, and 412–458 MB memory. These results outperform existing multi-dataset IDS approaches, highlighting the architectural effectiveness, robustness, and practical applicability of the proposed framework. Full article
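Among the imbalance remedies this abstract lists (ADASYN, SMOTE, ENN, class weight adjustment), class weighting is the simplest to show concretely. A common heuristic, the "balanced" weighting used by several ML libraries, sets w_c = N / (K · n_c) so that every class contributes the same total weight to the loss; the class names below are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """'Balanced' heuristic: w_c = N / (K * n_c), where N is the
    sample count, K the number of classes, and n_c the size of
    class c. Rare classes get proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# 90 benign flows vs 10 attack flows (labels are invented).
w = balanced_class_weights(["benign"] * 90 + ["attack"] * 10)
# attack weight: 100 / (2 * 10) = 5.0; benign: 100 / 180 ≈ 0.556
```

With these weights, the total weighted mass of each class is equal (90 × 0.556 ≈ 10 × 5.0), so a classifier's loss is no longer dominated by the majority class.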

22 pages, 4100 KB  
Article
Explainable Machine Learning-Based Urban Waterlogging Prediction Framework
by Yinghua Deng and Xin Lu
Urban Sci. 2026, 10(3), 156; https://doi.org/10.3390/urbansci10030156 - 13 Mar 2026
Abstract
Urban waterlogging has become a critical challenge to urban sustainability under the combined pressures of rapid urbanization and increasingly frequent extreme weather events. However, traditional predictive models struggle to achieve real-time, point-specific early warning effectively, primarily due to the interference of redundant high-dimensional data and the inability to handle severe data imbalance. This study proposes a lightweight and interpretable machine learning framework for real-time waterlogging hotspot prediction, based on a multi-dimensional feature space. Specifically, we implement a Lasso-based mechanism to distill 37 multi-source variables into five core determinants. This process effectively isolates dominant environmental drivers while filtering noise. To further overcome the recall bottleneck, we propose a Synthetic Minority Over-sampling Technique based on Weighted Distance and Cleaning (SMOTE-WDC) algorithm that incorporates weighted feature distances and density-based noise cleaning. Validating the framework on datasets from Shenzhen (2023–2024), we demonstrate that a Gradient Boosting Decision Tree (GBDT) model integrated with this strategy achieves optimal performance using only five features, yielding an F1-score of 0.808 and an Area Under the Precision-Recall Curve (AUC-PR) of 0.895. Notably, a Recall of 0.882 is attained, representing a 4.6% improvement over the baseline. This study contributes a cost-effective, high-sensitivity approach to disaster risk reduction, advancing predictive urban waterlogging management. Full article

28 pages, 3210 KB  
Article
Employee Attrition Prediction: An Explanatory and Statistically Robust Ensemble Learning Model
by Ghalia Nassreddine, Jamil Hammoud, Obada Al-Khatib and Mohamad Al Majzoub
Computers 2026, 15(3), 185; https://doi.org/10.3390/computers15030185 - 12 Mar 2026
Abstract
Organizational productivity and workforce management are highly affected by employee attrition. Thus, an employee attrition prediction system may allow human resource management to enhance the workplace by minimizing attrition. This study proposes a new and interpretable ensemble learning framework for employee attrition prediction. The model integrates SHapley Additive exPlanations (SHAP)-based feature selection, Optuna hyperparameter optimization, and dual explainability using SHAP and Local Interpretable Model-agnostic Explanations (LIME). Random oversampling (ROS) is used to address class imbalance. The proposed framework allows for both global and local interpretability, enabling actionable insights into retention drivers. It was assessed using two benchmark datasets: the Kaggle HR Analytics dataset (14,999 records) and the IBM HR dataset (1470 records). The results revealed that the most impactful factors on employee attrition are promotion history, tenure, job satisfaction, workload, average monthly hours, overtime, and financial incentives. Furthermore, the proposed model achieved exceptional performance on both datasets. On the Kaggle dataset, it reached an accuracy of 98.72%, an F1-score of 97.29%, and an ROC–AUC of 0.994, while on the IBM dataset, it produced an accuracy of 97.72%, an F1-score of 97.74%, and an ROC–AUC of 0.995. Moreover, the proposed approach shows high computational efficiency, demonstrating that it is suitable for real-world deployment. These findings indicate that integrating explainable AI techniques, resampling tools, and automated hyperparameter tuning can achieve robust, accurate, and actionable employee attrition predictions, supporting HR managers’ decision-making. Full article
(This article belongs to the Special Issue Machine Learning: Innovation, Implementation, and Impact)
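The entry above couples random oversampling with Optuna hyperparameter optimization. The essence of such a tuner, sampling candidate parameters, scoring them, and keeping the best, can be illustrated with a minimal random search; the search space and toy objective below are invented for illustration and are not the paper's setup or Optuna's API.

```python
import random

def random_search(objective, space, n_trials=30, seed=0):
    """Minimal random-search tuner: draw one value from each
    parameter's candidate list, score the combination, and keep
    the best-scoring one seen so far."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Hypothetical search space; the objective rewards more trees
# and a max_depth near 8.
space = {"n_estimators": [100, 200, 400], "max_depth": [4, 8, 16]}
score, params = random_search(
    lambda p: p["n_estimators"] - 10 * abs(p["max_depth"] - 8), space
)
```

Optuna improves on this sketch by sampling adaptively (e.g. tree-structured Parzen estimators) and pruning unpromising trials, but the interface idea, an objective mapping a parameter dict to a score, is the same.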

21 pages, 474 KB  
Article
Performance Evaluation of Machine Learning and Deep Learning Models for Credit Risk Prediction
by Irvine Mapfumo and Thokozani Shongwe
J. Risk Financial Manag. 2026, 19(3), 210; https://doi.org/10.3390/jrfm19030210 - 11 Mar 2026
Abstract
Credit risk prediction is essential for financial institutions to effectively assess the likelihood of borrower defaults and manage associated risks. This study presents a comparative analysis of deep learning architectures and traditional machine learning models on imbalanced credit risk datasets. To address class imbalance, we employ three resampling techniques: Synthetic Minority Over-sampling Technique (SMOTE), Edited Nearest Neighbors (ENN), and the hybrid SMOTE-ENN. We evaluate the performance of various models, including multilayer perceptron (MLP), convolutional neural network (CNN), long short-term memory (LSTM), gated recurrent unit (GRU), logistic regression, decision tree, support vector machine (SVM), random forest, adaptive boosting, and extreme gradient boosting. The analysis reveals that SMOTE-ENN combined with MLP achieves the highest F1-score of 0.928 (accuracy 95.4%) on the German dataset, while SMOTE-ENN with random forest attains the best F1-score of 0.789 (accuracy 82.1%) on the Taiwanese dataset. SHapley Additive exPlanations (SHAP) are employed to enhance model interpretability, identifying key drivers of credit default. These findings provide actionable guidance for developing transparent, high-performing, and robust credit risk assessment systems. Full article
(This article belongs to the Section Financial Technology and Innovation)
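The ENN half of the hybrid SMOTE-ENN resampler mentioned above can be sketched directly: a sample is dropped when the majority of its k nearest neighbours carry a different label, which cleans noisy points near the class boundary after SMOTE has oversampled. The toy data below is illustrative, not from the study.

```python
def enn_clean(X, y, k=3):
    """Edited Nearest Neighbours: drop any sample whose label
    disagrees with the majority label of its k nearest neighbours
    (squared Euclidean distance)."""
    keep = []
    for i, (p, label) in enumerate(zip(X, y)):
        nbrs = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, X[j])),
        )[:k]
        votes = [y[j] for j in nbrs]
        if votes.count(label) * 2 > len(votes):  # neighbourhood agrees
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# A lone class-1 point sitting inside a class-0 cluster is removed.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (10, 11), (11, 10)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
Xc, yc = enn_clean(X, y)
```

Running SMOTE first and ENN second, as SMOTE-ENN does, both balances the classes and prunes the synthetic or original points that end up surrounded by the opposite class.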

9 pages, 514 KB  
Proceeding Paper
Predictive Analytics for Inventory Backorder Optimization Using Machine Learning
by Thean Pheng Lim, Shi Yean Wong, Wei Chien Ng and Guat Guan Toh
Eng. Proc. 2026, 128(1), 13; https://doi.org/10.3390/engproc2026128013 - 9 Mar 2026
Abstract
The need for effective inventory management in the transition from “Just-in-Time” to “Just-in-Case” supply chain strategies was addressed by developing a machine learning model to predict inventory backorders. Using a large stock keeping unit dataset, five supervised learning algorithms, namely, logistic regression, random forest, k-nearest neighbours, Naïve Bayes, and gradient boosting, were implemented with Python 3.13. Data imbalance was managed using the synthetic minority over-sampling technique, while power transformation was applied to improve data distribution and model performance. Among the models, random forest demonstrated the highest prediction accuracy at 98% and a strong receiver operating characteristic score of 0.897, making it the best model for backorder prediction. This approach enhances supply chain resilience and proactive inventory control, enabling manufacturers to mitigate risks of stockouts and optimize resource planning. It is necessary to incorporate advanced balancing techniques, hyperparameter tuning, and cross-validation methods to improve predictive performance further. Full article

35 pages, 5289 KB  
Article
Sentiment Classification of Amazon Product Reviews Based on Machine and Deep Learning Techniques: A Comparative Study
by Eman Daraghmi and Noora Zyadeh
Future Internet 2026, 18(3), 138; https://doi.org/10.3390/fi18030138 - 7 Mar 2026
Viewed by 460
Abstract
Sentiment classification plays a crucial role in analyzing customer feedback to identify market trends, enhance product recommendations, and improve customer satisfaction. This study focuses on sentiment analysis of Amazon reviews using two major datasets—Fine Food Reviews and Unlocked Mobile Reviews—which exhibit label imbalance. To address this challenge, both oversampling and undersampling techniques were applied to balance the datasets. Various machine learning (ML) algorithms, including Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), Naïve Bayes (NB), and Gradient Boosting Machine (GBM), as well as deep learning (DL) models such as Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and transformer-based models like RoBERTa, were implemented. After data cleaning and preprocessing, models were trained, and performance was evaluated. The results indicate that oversampling significantly enhances classification accuracy, particularly for the Fine Food dataset. Among ML models, Random Forest achieved the highest accuracy due to its ensemble approach and robustness in handling high-dimensional data. DL models, particularly RoBERTa, also demonstrated superior performance owing to their capacity to capture contextual dependencies. The findings emphasize the importance of data balancing for optimal sentiment analysis and contribute valuable insights toward advancing automated opinion classification in e-commerce applications. Full article
(This article belongs to the Section Big Data and Augmented Intelligence)
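A minimal baseline of the ML side of this comparison — bag-of-words features feeding a linear classifier — can be sketched as a TF-IDF plus logistic-regression pipeline. The toy reviews and labels below are invented for illustration and are far smaller than the Amazon datasets the study uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny invented corpus: 1 = positive, 0 = negative
reviews = ["great phone, love it", "terrible battery, waste of money",
           "excellent quality food", "awful taste, never again",
           "love the screen quality", "waste of money, broken on arrival"]
labels = [1, 0, 1, 0, 1, 0]

# TF-IDF features -> logistic regression, trained end to end
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)
pred = model.predict(["love this product", "terrible waste"])
```

The DL models in the study (CNN, LSTM, RoBERTa) replace the fixed TF-IDF features with learned representations, which is where their edge on contextual sentiment comes from.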

30 pages, 2628 KB  
Article
Predicting Bond Defaults in China: A Double-Ensemble Model Leveraging SMOTE for Class Imbalance
by Chongwen Tian and Rong Li
Big Data Cogn. Comput. 2026, 10(3), 81; https://doi.org/10.3390/bdcc10030081 - 6 Mar 2026
Viewed by 402
Abstract
This study proposes the Double-Ensemble Learning Classification with SMOTE (DELC-SMOTE), a novel hierarchical framework designed to address the critical challenge of severe class imbalance in financial bond default prediction. The model integrates the Synthetic Minority Over-sampling Technique (SMOTE) into a two-phase ensemble architecture. The first phase employs introspective stacking, where six heterogeneous base learners are individually enhanced through algorithm-specific balancing and meta-learning. The second phase fuses these optimized experts via performance-weighted voting. Empirical analysis utilizes a comprehensive dataset of 10,440 Chinese corporate bonds (522 defaults, ~5% default rate) sourced from Wind and CSMAR databases. Given the high cost of both false negatives and false positives in risk assessment, the Geometric Mean (G-mean) and Specificity are employed as primary evaluation metrics. Results demonstrate that the proposed DELC-SMOTE model significantly outperforms individual base classifiers and benchmark ensemble variants, achieving a G-mean of 0.9152 and a Specificity of 0.8715 under the primary experimental setting. The model exhibits robust performance across varying imbalance ratios (2%, 10%, 20%) and strong resilience against data noise, perturbations, and outliers. These findings indicate that the synergistic integration of data-level resampling within a diversified, two-tiered ensemble structure effectively mitigates class imbalance bias and enhances predictive reliability. The framework offers a robust and generalizable tool for actionable default risk assessment in imbalanced financial datasets. Full article
(This article belongs to the Section Data Mining and Machine Learning)
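The two headline metrics of this study, Specificity and the Geometric Mean, follow directly from the confusion-matrix counts: Specificity = TN / (TN + FP), and G-mean is the square root of Sensitivity times Specificity. A short sketch with hypothetical counts (not the paper's results):

```python
def specificity(tn, fp):
    """Fraction of actual negatives (non-defaults) correctly identified."""
    return tn / (tn + fp)

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity; penalises a model
    that sacrifices one class for the other, unlike plain accuracy."""
    sensitivity = tp / (tp + fn)
    return (sensitivity * specificity(tn, fp)) ** 0.5

# hypothetical test-set counts: 90 defaults, 910 non-defaults
tp, fn, tn, fp = 82, 8, 800, 110
spec = specificity(tn, fp)
gm = g_mean(tp, fn, tn, fp)
```

On a ~5% default-rate dataset like this one, accuracy alone would reward predicting "no default" everywhere, which is why G-mean and Specificity are the more informative choices.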
