A Study on Dropout Prediction for University Students Using Machine Learning
Abstract
1. Introduction
2. Related Work
Ref# | Source Data | Dropout Rate | Imbalance Processing | Algorithms | Measure | Best Score (Algorithm) |
---|---|---|---|---|---|---|
[19] | 2018 academic records from 7718 students | 4.5% | SMOTE, ADASYN | Balanced Bagging, DNN, DT, | F1-score | 0.976 (DT) |
[8] | 2015~2021 academic records from 67,060 students | 5% | SMOTE + Tomek, SMOTE + ENN | CatBoost + XGBoost | F1-score | 0.808 |
[9] | 2017~2021 survey from 3075 students | 6.4% | - | DT, NB, RF, Ridge Regression | Precision | 0.739 (Ridge Regression) |
[29] | 2011~2019 academic records from 331 students | 37.5% | ROS | DNN, RF, XGBoost | F1-score | 0.81 (RF) |
[30] | Academic records from 1418 students | 55.2% | SMOTE+Tomek | GB, RF, SVM | F1-score | 0.804 (SVM) |
[32] | 2017~2018 academic records from 2097 students | 72.8% | - | LR, DNN | Accuracy | 0.768 (DNN) |
[33] | 2016~2017 academic records from 366 students | - | - | DNN, LR, NB, SVM | F1-score | 0.96 (NB) |
Proposed method | 2013–2022 academic records from 20,050 students | 4.5% | SMOTE | DNN, DT, LR, RF, SVM | F1-score | 0.817 (RF) |
3. Proposed Method
3.1. Data Description
3.2. Feature Selection
- ① In the new table, records are stored by SID. To achieve this, all records that share an SID in the source tables are converted into a single record whose attributes hold summarized values. The new table is therefore called the summary table.
- ② Merging all source tables could add up to 150 attributes to the summary table, but not all of them have a significant impact on student dropout. Training on attributes with little relevance to dropout may reduce prediction accuracy, so only attributes that correlate highly with dropout are extracted and added to the summary table.
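Step ② can be illustrated with a small correlation filter. The sketch below is a minimal example under stated assumptions: the Pearson coefficient is used as the correlation measure, the threshold of 0.5 is arbitrary, and the column values are made up for illustration rather than taken from the study's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_attributes(columns, target, threshold):
    """Keep attributes whose absolute correlation with the target reaches the threshold."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}

# Toy, made-up values (hypothetical; not the study's data).
columns = {
    "Grade":   [3.7, 0.5, 3.2, 1.6, 3.9, 1.1],
    "NumF":    [0, 4, 0, 3, 0, 2],
    "NumBook": [2, 1, 2, 1, 1, 2],
}
dropout = [0, 1, 0, 1, 0, 1]

selected = select_attributes(columns, dropout, threshold=0.5)
print(sorted(selected))
```

In this toy data, `Grade` and `NumF` pass the filter while `NumBook` does not, because its correlation with the dropout flag is only about -0.33.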
3.3. Model Implementation
# Data preparation: select the input attributes, standardize them, and split 80/20.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df[["AdmType", "NumSem", "Grade", … ]].values
T = df["Dropout"].values
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, T_train, T_test = train_test_split(X_scaled, T, test_size=0.2)

# Logistic regression (LR)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=100)
model.fit(X_train, T_train)

# Decision tree (DT)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=10, random_state=0)
model.fit(X_train, T_train)

# Random forest (RF)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, T_train)

# Support vector machine (SVM)
from sklearn.svm import SVC
model = SVC(C=100)
model.fit(X_train, T_train)

# Deep neural network (DNN): two hidden layers and a sigmoid output for binary classification
from tensorflow import keras
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(7,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X_train, T_train, epochs=30)

# Light gradient boosting machine (LightGBM)
from lightgbm import LGBMClassifier
model = LGBMClassifier(n_estimators=100, random_state=0)
model.fit(X_train, T_train)
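One way the snippets above could be combined into an end-to-end run is sketched below. This is not the authors' pipeline: the data are synthetic stand-ins generated with `make_classification`, and two choices are assumptions added here. The split is stratified so that the rare positive class appears at the same rate in both partitions, and a scikit-learn pipeline is used so that the scaler is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 7 features, roughly 5% positives (illustrative only).
X, T = make_classification(
    n_samples=2000, n_features=7, n_informative=4, n_redundant=1,
    weights=[0.95, 0.05], random_state=0,
)

# stratify=T keeps the positive-class rate the same in the train and test splits.
X_train, X_test, T_train, T_test = train_test_split(
    X, T, test_size=0.2, stratify=T, random_state=0
)

# The pipeline fits StandardScaler on the training data only, then the classifier.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, T_train)
T_pred = model.predict(X_test)

scores = {
    "precision": precision_score(T_test, T_pred, zero_division=0),
    "recall": recall_score(T_test, T_pred, zero_division=0),
    "f1": f1_score(T_test, T_pred, zero_division=0),
}
print(scores)
```

Any of the other classifiers listed above could be substituted for the random forest in the second pipeline step.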
4. Experimental Results
4.1. Performance Measure
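The accuracy, precision, recall, and F1-score used in this article follow the standard confusion-matrix definitions. As a minimal sketch, with dropout treated as the positive class, they could be computed as follows; the function names and the example counts are illustrative assumptions, not values from the study.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, fn, tn):
    """Compute the four measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }

# Hypothetical counts: 100 actual dropouts, 900 non-dropouts.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
print(metrics)

# Consistency check using the precision and recall reported for LightGBM.
lightgbm_f1 = round(f1_from(0.867, 0.814), 3)
print(lightgbm_f1)
```

Applied to the precision and recall reported for LightGBM (0.867 and 0.814), the formula gives 0.840 after rounding, which agrees with the reported F1-score.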
4.2. Model Performance
4.3. Influence of Oversampling
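Oversampling methods such as SMOTE balance the classes by synthesizing extra minority (dropout) samples. The sketch below illustrates only the core interpolation idea in plain Python. It is a simplified stand-in, not the imbalanced-learn implementation or the authors' code, and the function name, the parameter `k`, and the toy points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy minority class: the four corners of the unit square (hypothetical data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

To balance a training set, `n_new` would typically be set to the difference between the majority and minority class sizes, and the synthetic points would be added to the training split only, never to the test split.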
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
References
- Kim, D.; Kim, S. Sustainable education: Analyzing the determinants of university student dropout by nonlinear panel data models. Sustainability 2018, 10, 954.
- Martinho, V.R.D.C.; Nunes, C.; Minussi, C.R. An intelligent system for prediction of school dropout risk group in higher education classroom based on artificial neural networks. In Proceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, Washington, DC, USA, 4–6 November 2013; pp. 159–166.
- Jain, P.; Chhabra, H.; Chauhan, U.; Prakash, K.; Gupta, A.; Soliman, M.S.; Islam, M.S.; Islam, M.T. Machine learning assisted hepta band THz metamaterial absorber for biomedical applications. Sci. Rep. 2023, 13, 1792.
- Jain, P.; Chhabra, H.; Chauhan, U.; Singh, D.K.; Anwer, T.M.K.; Ahammad, S.H.; Hossain, M.A.; Rashed, A.N.Z. Multiband Metamaterial absorber with absorption prediction by assisted machine learning. Mater. Chem. Phys. 2023, 307, 128180.
- Prenkaj, B.; Velardi, P.; Stilo, G.; Distante, D.; Faralli, S. A survey of machine learning approaches for student dropout prediction in online courses. ACM Comput. Surv. (CSUR) 2020, 53, 1–34.
- Alyahyan, E.; Düştegör, D. Predicting academic success in higher education: Literature review and best practices. Int. J. Educ. Technol. High. Educ. 2020, 17, 3.
- Mduma, N.; Khamisi, K.; Dina, M. A Survey of Machine Learning Approaches and Techniques for Student Dropout Prediction. Data Sci. J. 2019, 18, 1–10.
- Kim, S.; Choi, E.; Jun, Y.K.; Lee, S. Student Dropout Prediction for University with High Precision and Recall. Appl. Sci. 2023, 13, 6275.
- Jeong, S.H. A study on the development of university students dropout prediction model using classification technique. J. Converg. Cons. 2022, 5, 174–185.
- Park, C. Development of prediction model to improve dropout of cyber university. J. Korea Academia-Ind. Coop. Soc. 2020, 21, 380–390.
- Onah, D.F.; Sinclair, J.; Boyatt, R. Dropout rates of massive open online courses: Behavioral patterns. In Proceedings of the 6th International Conference on Education and New Learning Technologies, Barcelona, Spain, 7–9 July 2014; pp. 5825–5834.
- Liyanagunawardena, T.R.; Parslow, P.; Williams, S. Dropout: MOOC participants’ perspective. In Proceedings of the EMOOCs 2014, the Second MOOC European Stakeholders Summit, Lausanne, Switzerland, 10–12 February 2014; pp. 95–100.
- Xing, W.; Du, D. Dropout prediction in MOOCs: Using deep learning for personalized intervention. J. Educ. Comput. Res. 2019, 57, 547–570.
- McDonald, G.C. Ridge regression. Wiley Interdiscip. Rev. Comput. Stat. 2009, 1, 93–100.
- Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
- Ho, T.K. Random Decision Forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; pp. 278–282.
- Meyer, D.; Wien, F.T. Support vector machines. R News 2001, 1, 23–26.
- Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
- Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the IEEE 2005 International Conference on Advances in Intelligent Computing, Hefei, China, 23–26 August 2005; Volume 16, pp. 878–887.
- Barros, T.M.; Souza Neto, P.A.; Silva, I.; Guedes, L.A. Predictive Models for Imbalanced Data: A School Dropout Perspective. Educ. Sci. 2019, 9, 275.
- Hido, S.; Kashima, H.; Takahashi, Y. Roughly balanced bagging for imbalanced data. Stat. Anal. Data Min. ASA Data Sci. J. 2009, 2, 412–426.
- Batista, G.E.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13 August 2016; pp. 785–794.
- Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363.
- Webb, G.I.; Keogh, E.; Miikkulainen, R. Naïve Bayes. Encycl. Mach. Learn. 2010, 15, 713–714.
- Da Silva, M.; Diogo, E.; Solteiro, P.; Eduardo, J.; Arsénio, R.; de Moura, O.; Paulo, B.; Barroso, J. Forecasting Students Dropout: A UTAD University Study. Future Internet 2022, 14, 76.
- Fernández-García, A.J.; Preciado, J.C.; Melchor, F.; Rodriguez-Echeverria, R.; Conejero, J.M.; Sánchez-Figueroa, F. A real-life machine learning experience for predicting university dropout at different stages using academic data. IEEE Access 2021, 9, 133076–133090.
- Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
- Sandoval-Palis, I.; Naranjo, D.; Vidal, J.; Gilar-Corbi, R. Early Dropout Prediction Model: A Case Study of University Leveling Course Students. Sustainability 2020, 12, 9314.
- Shynarbek, N.; Orynbassar, A.; Sapazhanov, Y.; Kadyrov, S. Prediction of Student’s Dropout from a University Program. In Proceedings of the 16th International Conference on Electronics Computer and Computation (ICECCO), Kaskelen, Kazakhstan, 25–26 November 2021; pp. 1–4.
- Exponential Smoothing. Available online: https://en.wikipedia.org/wiki/Exponential_smoothing (accessed on 28 August 2023).
- Seaborn, Statistical Data Visualization. Available online: https://seaborn.pydata.org (accessed on 28 August 2023).
- Scikit-Learn. Available online: https://en.wikipedia.org/wiki/Scikit-learn (accessed on 28 August 2023).
- Keras. Available online: https://www.tensorflow.org/guide/keras (accessed on 28 August 2023).
- Hu, Z.; Zhang, J.; Ge, Y. Handling vanishing gradient problem using artificial derivative. IEEE Access 2021, 9, 22371–22377.
- Lee, S.; Chung, J.Y. The machine learning-based dropout early warning system for improving the performance of dropout prediction. Appl. Sci. 2019, 9, 3093.
- Moon, G.B.; Kim, J.W.; Lee, J.S. Early prediction model of student performance based on deep neural network using massive LMS log data. J. Korea Contents Assoc. 2021, 21, 10.
Grade table (28 columns) | ||
SID | Student ID | Number (11 digits) |
Year | Year enrolled | Number (4 digits) |
Semester | Semester enrolled | Number (1 or 2) |
Grade | Average grade | Number (0~4.5) |
NumCourse | Number of courses | Number |
NumF | Number of courses receiving an F grade | Number |
AcademicStatus table (8 columns) | ||
SID | Student ID | Number (11 digits) |
Year | Year enrolled | Number (4 digits) |
Semester | Semester enrolled | Number (1 or 2) |
Status | Enrollment status: Admission(0), Enrollment(1), Leave-of-absence(2), Transfer(3), Dropout(4), Graduation(5) | Category |
Desc | Additional information about the Status field (if required) | String |
Transfer | Transfer information: Transfer(1) or Not(0) | Boolean |
Scholarship table (8 columns) | ||
SID | Student ID | Number (11 digits) |
Year | Year enrolled | Number (4 digits) |
Semester | Semester enrolled | Number (1 or 2) |
Tuition | Tuition paid | Number |
Scholarship | Scholarship received | Number |
Counsel table (4 columns) | ||
SID | Student ID | Number (11 digits) |
Year | Year enrolled | Number (4 digits) |
Semester | Semester enrolled | Number (1 or 2) |
NumCouns | Number of counseling sessions attended | Number |
ExtraCourse table (16 columns) | ||
SID | Student ID | Number (11 digits) |
Year | Year enrolled | Number (4 digits) |
Semester | Semester enrolled | Number (1 or 2) |
NumExtra | Number of extracurricular courses enrolled | Number |
NumVolun | Number of volunteer participations | Number |
BookLoan table (16 columns) | ||
SID | Student ID | Number (11 digits) |
Year | Year enrolled | Number (4 digits) |
Semester | Semester enrolled | Number (1 or 2) |
NumBook | Number of books borrowed | Number |
AvgPeriod | Average rental period | Number |
StudentInfo table (70 columns) | ||
SID | Student ID | Number (11 digits) |
Name | Student name | String |
Dept | Department or division name | String |
Major | Major name | String |
AdmYear | Year of admission | Number (4 digits) |
AdmType | Type of admission: Admission(0), Transfer(1) | Category |
Region | Region of the high school from which the student graduated | Category |
SID | Year | Semester | Grade | NumCourse | … |
---|---|---|---|---|---|
2012xxx010 | 2012 | 1 | 3.53 | 17 | … |
2012xxx010 | 2012 | 2 | 2.69 | 16 | … |
2012xxx010 | 2013 | 1 | 3.17 | 18 | … |
2012xxx010 | 2013 | 2 | 2.53 | 16 | … |
2012xxx010 | 2014 | 1 | 2.93 | 17 | … |
2012xxx010 | 2014 | 2 | 4.07 | 17 | … |
2012xxx010 | 2015 | 1 | 3.78 | 19 | … |
2012xxx010 | 2015 | 2 | 3.72 | 17 | … |
2012xxx010 | 2016 | 1 | - | - | … |
2012xxx011 | 2012 | 1 | 0.45 | 16 | … |
2012xxx011 | 2012 | 2 | - | - | … |
… | … | … | … | … | … |
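The per-semester rows above have to be collapsed into one record per SID before they can enter the summary table. The sketch below shows one possible aggregation under stated assumptions: a plain mean is used as a stand-in summary function, "-" entries are treated as missing, and the output field name `NumGraded` is hypothetical. The plain mean is not the study's summary function; the summary table lists a Grade of 3.73 for the first student, whereas the mean of the rows shown here is 3.30.

```python
from collections import defaultdict

# Rows copied from the Grade sample above: (SID, Year, Semester, Grade).
# None stands for the "-" entries (no grade recorded for that semester).
grade_rows = [
    ("2012xxx010", 2012, 1, 3.53),
    ("2012xxx010", 2012, 2, 2.69),
    ("2012xxx010", 2013, 1, 3.17),
    ("2012xxx010", 2013, 2, 2.53),
    ("2012xxx010", 2014, 1, 2.93),
    ("2012xxx010", 2014, 2, 4.07),
    ("2012xxx010", 2015, 1, 3.78),
    ("2012xxx010", 2015, 2, 3.72),
    ("2012xxx010", 2016, 1, None),
    ("2012xxx011", 2012, 1, 0.45),
    ("2012xxx011", 2012, 2, None),
]

def summarize_grades(rows):
    """Collapse per-semester rows into one record per SID."""
    grades_by_sid = defaultdict(list)
    for sid, _year, _semester, grade in rows:
        if grade is not None:  # skip semesters without a recorded grade
            grades_by_sid[sid].append(grade)
    return {
        sid: {"Grade": round(sum(g) / len(g), 2), "NumGraded": len(g)}
        for sid, g in grades_by_sid.items()
    }

grade_summary = summarize_grades(grade_rows)
print(grade_summary)
```

The number of graded semesters (8 and 1) matches the NumSem values listed for these two students in the summary table.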
SID | Year | Semester | Status | … |
---|---|---|---|---|
2012xxx012 | 2012 | 1 | Admission(0) | … |
2012xxx012 | 2012 | 2 | Enrollment(1) | … |
2012xxx012 | 2013 | 1 | Leave-of-absence(2) | … |
2012xxx012 | 2013 | 2 | Leave-of-absence(2) | … |
2012xxx012 | 2014 | 1 | Dropout(4) | … |
2012xxx013 | 2012 | 1 | Transfer(3) | … |
2012xxx013 | 2012 | 2 | Enrollment(1) | … |
2012xxx013 | 2013 | 1 | Dropout(4) | … |
… | … | … | … | … |
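The status histories above can be reduced to the NumSem, NumAbs, and Dropout attributes of the summary table. The sketch below is one possible derivation, not the authors' code. It assumes that each history ends with the student's final status, and that semesters marked Admission, Enrollment, or Transfer count as enrolled semesters; the constant and function names are hypothetical.

```python
# Status codes as listed in the AcademicStatus table description.
ADMISSION, ENROLLMENT, LEAVE, TRANSFER, DROPOUT, GRADUATION = range(6)

# Histories copied from the sample rows above, in chronological order.
status_history = {
    "2012xxx012": [ADMISSION, ENROLLMENT, LEAVE, LEAVE, DROPOUT],
    "2012xxx013": [TRANSFER, ENROLLMENT, DROPOUT],
}

def summarize_status(codes):
    """Derive NumSem, NumAbs, and Dropout from one student's status history."""
    # Assumption: these three statuses mark a semester the student was enrolled in.
    num_sem = sum(code in (ADMISSION, ENROLLMENT, TRANSFER) for code in codes)
    # Count consecutive leave-of-absence semesters right before the final status.
    num_abs = 0
    for code in reversed(codes[:-1]):
        if code != LEAVE:
            break
        num_abs += 1
    return {
        "NumSem": num_sem,
        "NumAbs": num_abs,
        "Dropout": int(codes[-1] == DROPOUT),
    }

status_summary = {sid: summarize_status(codes) for sid, codes in status_history.items()}
print(status_summary)
```

For the two sample students this yields NumSem 2, NumAbs 2, Dropout 1 and NumSem 2, NumAbs 0, Dropout 1, which agrees with their rows in the summary table.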
Attribute | Type | Description | Source Table |
---|---|---|---|
Grade | Number | Average grade | Grade |
NumF | Number | Number of F grades in the last semester | Grade |
NumSem | Number | Number of semesters enrolled | AcademicS… |
NumAbs | Number | Number of consecutive semesters of leave-of-absence right before graduation or dropout | AcademicS… |
Dropout | Boolean | Final status: dropout(1) or non-dropout(0) | AcademicS… |
Scholar | Number | Scholarship received in the last semester | Scholarship |
NumCouns | Number | Number of counseling sessions attended in the last semester | Counsel |
NumExtra | Number | Number of extracurricular courses taken in the last semester enrolled | ExtraCourse |
NumBook | Number | Number of book loans in the last semester | BookLoan |
Dept | Number | Department number | StudentInfo |
AdmType | Category | Admission type: Freshman(0), Transfer(1) | StudentInfo |
Region | Category | Region of the high school from which the student graduated | StudentInfo |
SID | Grade | NumF | NumSem | NumAbs | Scholar | NumCouns | … | AdmType | Region | Dropout |
---|---|---|---|---|---|---|---|---|---|---|
2012xxx010 | 3.73 | 0 | 8 | 0 | 0 | 2 | … | 0 | 1 | 0 |
2012xxx011 | 0.45 | 4 | 1 | 0 | 0 | 0 | … | 0 | 1 | 1 |
2012xxx012 | 3.21 | 0 | 2 | 2 | 0 | 1 | … | 0 | 1 | 1 |
2012xxx013 | 1.65 | 3 | 2 | 0 | 0 | 0 | … | 1 | 3 | 1 |
… | … | … | … | … | … | … | … | … | … | … |
Algorithm | Parameter | Description | Value |
---|---|---|---|
LR | C | Regularization parameter used to prevent overfitting (regularization strength is inversely proportional to C) | 100 |
DT | max_depth | Maximum depth of the decision tree, limited to prevent overfitting | 10 |
DT | random_state | Random seed to choose data that make up the tree | 0 |
RF | n_estimators | Number of trees that make up the forest | 100 |
RF | random_state | Random seed to choose data that make up the trees | 0 |
SVM | C | Regularization parameter used to prevent overfitting (regularization strength is inversely proportional to C) | 100 |
DNN | layer-1: units | Dimensionality of the output space (# of hidden nodes) | 128 |
DNN | layer-1: activation | Activation function to calculate output for the next layer | 'relu' |
DNN | layer-1: input_shape | Dimensionality of the input space (# of input attributes) | 7 |
DNN | layer-2: units | Dimensionality of the output space (# of hidden nodes) | 32 |
DNN | layer-2: activation | Activation function to calculate output for the next layer | 'relu' |
DNN | layer-3: units | Dimensionality of the output space (# of output nodes) | 1 |
DNN | layer-3: activation | Activation function to calculate the final output | 'sigmoid' |
DNN | optimizer | Algorithm that updates the network weights to minimize the loss function | 'adam' |
DNN | loss | Loss function to calculate error or deviation in the learning process | 'binary_crossentropy' |
DNN | epochs | Number of epochs to train the model | 50 |
LightGBM | n_estimators | Number of boosted trees | 100 |
LightGBM | random_state | Random seed to choose data that make up the trees | 0 |
Algorithm | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Logistic Regression (LR) | 0.927 | 0.815 | 0.637 | 0.715 |
Decision Tree (DT) | 0.947 | 0.856 | 0.755 | 0.802 |
Random Forest (RF) | 0.953 | 0.883 | 0.777 | 0.827 |
Support Vector Machine (SVM) | 0.942 | 0.825 | 0.753 | 0.787 |
Deep Neural Network (DNN) | 0.947 | 0.843 | 0.776 | 0.808 |
Light Gradient Boosting Machine (LightGBM) | 0.955 | 0.867 | 0.814 | 0.840 |
Mean | 0.945 | 0.848 | 0.752 | 0.796 |
Standard Deviation | 0.010 | 0.026 | 0.061 | 0.044 |
Measure | Ref. No. (Existing) | Algorithm (Existing) | Score (Existing) | Algorithm (Proposed) | Score (Proposed) |
---|---|---|---|---|---|
F1-score | [8] | CatBoost+XGBoost | 0.808 | LightGBM | 0.840 |
Precision | [9] | Ridge Regression | 0.739 | LightGBM | 0.867 |
F1-score | [29] | RF | 0.810 | LightGBM | 0.840 |
F1-score | [30] | SVM | 0.804 | LightGBM | 0.840 |
Accuracy | [32] | DNN | 0.768 | LightGBM | 0.955 |
Algorithm + SMOTE | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Logistic Regression (LR) | 0.868 | 0.523 | 0.883 | 0.657 |
Decision Tree (DT) | 0.929 | 0.706 | 0.868 | 0.778 |
Random Forest (RF) | 0.949 | 0.797 | 0.866 | 0.830 |
Support Vector Machine (SVM) | 0.921 | 0.668 | 0.890 | 0.763 |
Deep Neural Network (DNN) | 0.928 | 0.691 | 0.904 | 0.784 |
Light Gradient Boosting Machine (LightGBM) | - | - | - | - |
Mean | 0.919 | 0.677 | 0.882 | 0.762 |
Standard Deviation | 0.031 | 0.099 | 0.016 | 0.064 |
Algorithm + ADASYN | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Logistic Regression (LR) | 0.841 | 0.472 | 0.920 | 0.624 |
Decision Tree (DT) | 0.919 | 0.663 | 0.880 | 0.756 |
Random Forest (RF) | 0.947 | 0.778 | 0.880 | 0.826 |
Support Vector Machine (SVM) | 0.900 | 0.602 | 0.890 | 0.718 |
Deep Neural Network (DNN) | 0.910 | 0.625 | 0.932 | 0.749 |
Light Gradient Boosting Machine (LightGBM) | - | - | - | - |
Mean | 0.903 | 0.628 | 0.901 | 0.735 |
Standard Deviation | 0.039 | 0.111 | 0.024 | 0.073 |
Algorithm + Borderline-SMOTE | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Logistic Regression (LR) | 0.840 | 0.470 | 0.913 | 0.621 |
Decision Tree (DT) | 0.920 | 0.668 | 0.885 | 0.761 |
Random Forest (RF) | 0.948 | 0.790 | 0.871 | 0.829 |
Support Vector Machine (SVM) | 0.912 | 0.640 | 0.885 | 0.743 |
Deep Neural Network (DNN) | 0.915 | 0.642 | 0.925 | 0.758 |
Light Gradient Boosting Machine (LightGBM) | - | - | - | - |
Mean | 0.907 | 0.642 | 0.896 | 0.742 |
Standard Deviation | 0.040 | 0.114 | 0.022 | 0.076 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cho, C.H.; Yu, Y.W.; Kim, H.G. A Study on Dropout Prediction for University Students Using Machine Learning. Appl. Sci. 2023, 13, 12004. https://doi.org/10.3390/app132112004