Article

Data Balancing Techniques for Predicting Student Dropout Using Machine Learning

Department of Information and Communication Sciences and Engineering, The Nelson Mandela African Institution of Science and Technology, Arusha P.O. Box 447, Tanzania
Submission received: 28 January 2023 / Revised: 19 February 2023 / Accepted: 21 February 2023 / Published: 27 February 2023

Abstract

Predicting student dropout is a challenging problem in the education sector, largely because student dropout data are imbalanced: the number of registered students is always higher than the number of students who drop out. A model developed without accounting for this imbalance may generalize poorly. In this study, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining satisfactory overall classification performance. Random Over Sampling, Random Under Sampling, Synthetic Minority Over Sampling Technique (SMOTE), SMOTE with Edited Nearest Neighbor, and SMOTE with Tomek links were tested, along with three popular classification models: Logistic Regression, Random Forest, and Multi-Layer Perceptron. Publicly accessible datasets from Tanzania and India were used to evaluate the effectiveness of the balancing techniques and prediction models. The results indicate that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample. Furthermore, with the confusion matrix as the evaluation metric, Logistic Regression correctly classified the largest number of dropout students (57,348 for the Uwezo dataset and 13,430 for the India dataset). Applying these models allows at-risk students to be predicted precisely and dropout rates to be reduced.
Keywords: student dropout; prediction; machine learning; classification; data sampling; imbalanced datasets
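
To make the two simplest balancing techniques named in the abstract concrete, the following is a minimal sketch of Random Over Sampling and Random Under Sampling in plain Python. It is not the author's implementation: the function names, the toy rows, and the 8-to-2 class split are hypothetical and chosen only for illustration. The SMOTE-based variants are not shown, since they synthesize new minority samples rather than duplicating or discarding existing ones.

```python
import random
from collections import Counter


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sampled with replacement
            out_labels.append(cls)
    return out_rows, out_labels


def random_under_sample(rows, labels, seed=0):
    """Keep a random subset of each larger class so that every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for r in rng.sample(pool, target):  # sampled without replacement
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 8 enrolled students (label 0), 2 dropouts (label 1).
rows = [[i, i % 3] for i in range(10)]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_over_sample(rows, labels)
under_rows, under_labels = random_under_sample(rows, labels)

print(Counter(over_labels))   # class counts after over sampling
print(Counter(under_labels))  # class counts after under sampling
```

In a cross-validated evaluation, resampling of this kind is usually applied to the training folds only, so that the held-out fold keeps the original class distribution.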

Share and Cite

MDPI and ACS Style

Mduma, N. Data Balancing Techniques for Predicting Student Dropout Using Machine Learning. Data 2023, 8, 49. https://doi.org/10.3390/data8030049

