Predicting Student Dropout and Academic Success
Abstract
1. Introduction
2. Data Description
3. Materials and Methods
3.1. Data Preprocessing
- Prepare National Competition Data. The data from the National Competition for Access to Higher Education (CNAES) are received every year, once the competition results are available, as a Microsoft Access database. We developed a Visual Basic for Applications (VBA) program that collects the required information from the Microsoft Access databases (one per year) and exports a CSV file (competition.csv) containing one row per student, with the fields of the “Academic data at enrollment” group described in Table 1.
- Prepare Student Records Data. In this step, the CSV file received from the AMS with the students’ records is prepared for processing in the next steps. This file contains 13,992 rows and 398 columns, many of which are duplicated or irrelevant to our study. In summary, this step comprises deleting the records of students enrolled in old courses that no longer accept enrollments, deleting the records of students with irrelevant modes of enrollment, such as Erasmus, selecting and renaming the relevant columns, and eliminating duplicated rows. At the end of this step, all data in the “Demographic data” and “Socioeconomic data” groups (see Table 1) are gathered for use in the next steps.
- Prepare Student Evaluations Data. In this step, the CSV file with all the information on students’ evaluation attempts is processed. For each student retained in the previous step, the attributes of the groups “Academic data at the end of 1st semester” and “Academic data at the end of 2nd semester” (see Table 1) are calculated.
- Merge and Preprocess Data. All data gathered in the previous steps are merged into a single dataset, to which the attributes of the “Macroeconomic data” group are added. Then, we performed rigorous data preprocessing to handle anomalies, unexplainable outliers, and missing values. Finally, each student is classified as dropout, enrolled, or graduate depending on their situation at the end of the normal duration of the course (3 years, except for Nursing, which takes 4 years). The result is the final dataset, available at https://doi.org/10.5281/zenodo.5777339 (accessed on 10 October 2022).
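The four steps above can be sketched in Python with pandas. This is only an illustration under stated assumptions, not the authors' pipeline (which starts from a VBA export): every column name (`student_id`, `enrollment_type`, `graduation_year`, `withdrawal_year`, and so on), the toy rows, the placeholder unemployment values, and the exact rule used to assign the target are hypothetical. The only facts taken from the text are the removal of Erasmus records and duplicated rows, the merge of the intermediate files, and the 3-year normal duration with a 4-year exception for Nursing.

```python
import pandas as pd

NAN = float("nan")


def normal_duration(course: str) -> int:
    """Normal course duration in years: 3, except Nursing, which takes 4."""
    return 4 if course == "Nursing" else 3


def classify(row) -> str:
    """Assumed rule: look at the student's situation at the normal end year."""
    end_year = row["enrollment_year"] + normal_duration(row["course"])
    if row["graduation_year"] <= end_year:   # comparisons with NaN are False
        return "Graduate"
    if row["withdrawal_year"] <= end_year:
        return "Dropout"
    return "Enrolled"


def build_dataset(competition, records, evaluations, macro):
    # Student records: drop out-of-scope enrollment modes and duplicated rows.
    records = records[records["enrollment_type"] != "erasmus"].drop_duplicates()

    # Student evaluations: per-student counts for the 1st semester.
    first_sem = evaluations[evaluations["semester"] == 1]
    first_sem = first_sem.assign(passed=first_sem["passed"].astype(int))
    academic = (
        first_sem.groupby("student_id")
        .agg(evaluations_1st=("passed", "size"), approved_1st=("passed", "sum"))
        .reset_index()
    )

    # Merge everything, then add the macroeconomic data by enrollment year.
    merged = (
        records.merge(competition, on="student_id", how="inner")
        .merge(academic, on="student_id", how="left")
        .merge(macro, on="enrollment_year", how="left")
    )

    # Students without any evaluation attempt get zero counts.
    counts = ["evaluations_1st", "approved_1st"]
    merged[counts] = merged[counts].fillna(0).astype(int)

    merged["Target"] = merged.apply(classify, axis=1)
    return merged.reset_index(drop=True)


# Toy inputs (all values invented for the example).
competition = pd.DataFrame(
    {"student_id": [1, 2, 3, 4], "application_mode": [1, 8, 12, 6]}
)
records = pd.DataFrame({
    "student_id": [1, 1, 2, 3, 4],
    "enrollment_type": ["regular", "regular", "regular", "regular", "erasmus"],
    "course": ["Nursing", "Nursing", "Tourism", "Agronomy", "Tourism"],
    "enrollment_year": [2012, 2012, 2013, 2013, 2013],
    "graduation_year": [2016, 2016, 2017, NAN, NAN],
    "withdrawal_year": [NAN, NAN, NAN, 2014, NAN],
})
evaluations = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2],
    "semester": [1, 1, 1, 1, 2],
    "passed": [True, True, False, True, True],
})
macro = pd.DataFrame(
    {"enrollment_year": [2012, 2013], "unemployment_rate": [15.5, 16.2]}
)

dataset = build_dataset(competition, records, evaluations, macro)
print(dataset[["student_id", "course", "approved_1st", "Target"]])
```

In this toy run, the student who finishes one year after the normal duration is labeled as enrolled, which matches the idea that the label reflects the situation at the end of the normal duration rather than the final outcome.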
3.2. Data Analysis
3.2.1. Descriptive Analysis
3.2.2. Imbalanced Data
3.2.3. Multi-collinearity
3.2.4. Feature Importance
3.3. Compliances
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
AMS | Academic Management System |
CATBOOST | CatBoost |
CSV | Comma-separated values |
DGES | Direção Geral do Ensino Superior |
DPO | Data Protection Officer |
GDPR | General Data Protection Regulation |
LIGHTGBM | Light Gradient Boosting Machine |
PAE | Enterprise Application Platform |
RF | Random Forest |
XGBOOST | Extreme Gradient Boosting |
Appendix A
Attribute | Values |
---|---|
Marital status | 1—Single |
Marital status | 2—Married |
Marital status | 3—Widower |
Marital status | 4—Divorced |
Marital status | 5—Facto union |
Marital status | 6—Legally separated |
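Attribute | Values |
---|---|
Nationality | 1—Portuguese |
Nationality | 2—German |
Nationality | 3—Spanish |
Nationality | 4—Italian |
Nationality | 5—Dutch |
Nationality | 6—English |
Nationality | 7—Lithuanian |
Nationality | 8—Angolan |
Nationality | 9—Cape Verdean |
Nationality | 10—Guinean |
Nationality | 11—Mozambican |
Nationality | 12—Santomean |
Nationality | 13—Turkish |
Nationality | 14—Brazilian |
Nationality | 15—Romanian |
Nationality | 16—Moldova (Republic of) |
Nationality | 17—Mexican |
Nationality | 18—Ukrainian |
Nationality | 19—Russian |
Nationality | 20—Cuban |
Nationality | 21—Colombian |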
Attribute | Values |
---|---|
Application mode | 1—1st phase—general contingent |
Application mode | 2—Ordinance No. 612/93 |
Application mode | 3—1st phase—special contingent (Azores Island) |
Application mode | 4—Holders of other higher courses |
Application mode | 5—Ordinance No. 854-B/99 |
Application mode | 6—International student (bachelor) |
Application mode | 7—1st phase—special contingent (Madeira Island) |
Application mode | 8—2nd phase—general contingent |
Application mode | 9—3rd phase—general contingent |
Application mode | 10—Ordinance No. 533-A/99, item b2) (Different Plan) |
Application mode | 11—Ordinance No. 533-A/99, item b3 (Other Institution) |
Application mode | 12—Over 23 years old |
Application mode | 13—Transfer |
Application mode | 14—Change in course |
Application mode | 15—Technological specialization diploma holders |
Application mode | 16—Change in institution/course |
Application mode | 17—Short cycle diploma holders |
Application mode | 18—Change in institution/course (International) |
Attribute | Values |
---|---|
Course | 1—Biofuel Production Technologies |
Course | 2—Animation and Multimedia Design |
Course | 3—Social Service (evening attendance) |
Course | 4—Agronomy |
Course | 5—Communication Design |
Course | 6—Veterinary Nursing |
Course | 7—Informatics Engineering |
Course | 8—Equiniculture |
Course | 9—Management |
Course | 10—Social Service |
Course | 11—Tourism |
Course | 12—Nursing |
Course | 13—Oral Hygiene |
Course | 14—Advertising and Marketing Management |
Course | 15—Journalism and Communication |
Course | 16—Basic Education |
Course | 17—Management (evening attendance) |
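Attribute | Values |
---|---|
Previous qualification | 1—Secondary education |
Previous qualification | 2—Higher education—bachelor’s degree |
Previous qualification | 3—Higher education—degree |
Previous qualification | 4—Higher education—master’s degree |
Previous qualification | 5—Higher education—doctorate |
Previous qualification | 6—Frequency of higher education |
Previous qualification | 7—12th year of schooling—not completed |
Previous qualification | 8—11th year of schooling—not completed |
Previous qualification | 9—Other—11th year of schooling |
Previous qualification | 10—10th year of schooling |
Previous qualification | 11—10th year of schooling—not completed |
Previous qualification | 12—Basic education 3rd cycle (9th/10th/11th year) or equivalent |
Previous qualification | 13—Basic education 2nd cycle (6th/7th/8th year) or equivalent |
Previous qualification | 14—Technological specialization course |
Previous qualification | 15—Higher education—degree (1st cycle) |
Previous qualification | 16—Professional higher technical course |
Previous qualification | 17—Higher education—master’s degree (2nd cycle) |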
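Attribute | Values |
---|---|
Mother’s qualification / Father’s qualification | 1—Secondary Education—12th Year of Schooling or Equivalent |
Mother’s qualification / Father’s qualification | 2—Higher Education—bachelor’s degree |
Mother’s qualification / Father’s qualification | 3—Higher Education—degree |
Mother’s qualification / Father’s qualification | 4—Higher Education—master’s degree |
Mother’s qualification / Father’s qualification | 5—Higher Education—doctorate |
Mother’s qualification / Father’s qualification | 6—Frequency of Higher Education |
Mother’s qualification / Father’s qualification | 7—12th Year of Schooling—not completed |
Mother’s qualification / Father’s qualification | 8—11th Year of Schooling—not completed |
Mother’s qualification / Father’s qualification | 9—7th Year (Old) |
Mother’s qualification / Father’s qualification | 10—Other—11th Year of Schooling |
Mother’s qualification / Father’s qualification | 11—2nd year complementary high school course |
Mother’s qualification / Father’s qualification | 12—10th Year of Schooling |
Mother’s qualification / Father’s qualification | 13—General commerce course |
Mother’s qualification / Father’s qualification | 14—Basic Education 3rd Cycle (9th/10th/11th Year) or Equivalent |
Mother’s qualification / Father’s qualification | 15—Complementary High School Course |
Mother’s qualification / Father’s qualification | 16—Technical-professional course |
Mother’s qualification / Father’s qualification | 17—Complementary High School Course—not concluded |
Mother’s qualification / Father’s qualification | 18—7th year of schooling |
Mother’s qualification / Father’s qualification | 19—2nd cycle of the general high school course |
Mother’s qualification / Father’s qualification | 20—9th Year of Schooling—not completed |
Mother’s qualification / Father’s qualification | 21—8th year of schooling |
Mother’s qualification / Father’s qualification | 22—General Course of Administration and Commerce |
Mother’s qualification / Father’s qualification | 23—Supplementary Accounting and Administration |
Mother’s qualification / Father’s qualification | 24—Unknown |
Mother’s qualification / Father’s qualification | 25—Cannot read or write |
Mother’s qualification / Father’s qualification | 26—Can read without having a 4th year of schooling |
Mother’s qualification / Father’s qualification | 27—Basic education 1st cycle (4th/5th year) or equivalent |
Mother’s qualification / Father’s qualification | 28—Basic Education 2nd Cycle (6th/7th/8th Year) or equivalent |
Mother’s qualification / Father’s qualification | 29—Technological specialization course |
Mother’s qualification / Father’s qualification | 30—Higher education—degree (1st cycle) |
Mother’s qualification / Father’s qualification | 31—Specialized higher studies course |
Mother’s qualification / Father’s qualification | 32—Professional higher technical course |
Mother’s qualification / Father’s qualification | 33—Higher Education—master’s degree (2nd cycle) |
Mother’s qualification / Father’s qualification | 34—Higher Education—doctorate (3rd cycle) |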
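Attribute | Values |
---|---|
Mother’s occupation / Father’s occupation | 1—Student |
Mother’s occupation / Father’s occupation | 2—Representatives of the Legislative Power and Executive Bodies, Directors, Directors and Executive Managers |
Mother’s occupation / Father’s occupation | 3—Specialists in Intellectual and Scientific Activities |
Mother’s occupation / Father’s occupation | 4—Intermediate Level Technicians and Professions |
Mother’s occupation / Father’s occupation | 5—Administrative staff |
Mother’s occupation / Father’s occupation | 6—Personal Services, Security and Safety Workers, and Sellers |
Mother’s occupation / Father’s occupation | 7—Farmers and Skilled Workers in Agriculture, Fisheries, and Forestry |
Mother’s occupation / Father’s occupation | 8—Skilled Workers in Industry, Construction, and Craftsmen |
Mother’s occupation / Father’s occupation | 9—Installation and Machine Operators and Assembly Workers |
Mother’s occupation / Father’s occupation | 10—Unskilled Workers |
Mother’s occupation / Father’s occupation | 11—Armed Forces Professions |
Mother’s occupation / Father’s occupation | 12—Other Situation |
Mother’s occupation / Father’s occupation | 13—(blank) |
Mother’s occupation / Father’s occupation | 14—Armed Forces Officers |
Mother’s occupation / Father’s occupation | 15—Armed Forces Sergeants |
Mother’s occupation / Father’s occupation | 16—Other Armed Forces personnel |
Mother’s occupation / Father’s occupation | 17—Directors of administrative and commercial services |
Mother’s occupation / Father’s occupation | 18—Hotel, catering, trade, and other services directors |
Mother’s occupation / Father’s occupation | 19—Specialists in the physical sciences, mathematics, engineering, and related techniques |
Mother’s occupation / Father’s occupation | 20—Health professionals |
Mother’s occupation / Father’s occupation | 21—Teachers |
Mother’s occupation / Father’s occupation | 22—Specialists in finance, accounting, administrative organization, and public and commercial relations |
Mother’s occupation / Father’s occupation | 23—Intermediate level science and engineering technicians and professions |
Mother’s occupation / Father’s occupation | 24—Technicians and professionals of intermediate level of health |
Mother’s occupation / Father’s occupation | 25—Intermediate level technicians from legal, social, sports, cultural, and similar services |
Mother’s occupation / Father’s occupation | 26—Information and communication technology technicians |
Mother’s occupation / Father’s occupation | 27—Office workers, secretaries in general, and data processing operators |
Mother’s occupation / Father’s occupation | 28—Data, accounting, statistical, financial services, and registry-related operators |
Mother’s occupation / Father’s occupation | 29—Other administrative support staff |
Mother’s occupation / Father’s occupation | 30—Personal service workers |
Mother’s occupation / Father’s occupation | 31—Sellers |
Mother’s occupation / Father’s occupation | 32—Personal care workers and the like |
Mother’s occupation / Father’s occupation | 33—Protection and security services personnel |
Mother’s occupation / Father’s occupation | 34—Market-oriented farmers and skilled agricultural and animal production workers |
Mother’s occupation / Father’s occupation | 35—Farmers, livestock keepers, fishermen, hunters and gatherers, and subsistence |
Mother’s occupation / Father’s occupation | 36—Skilled construction workers and the like, except electricians |
Mother’s occupation / Father’s occupation | 37—Skilled workers in metallurgy, metalworking, and similar |
Mother’s occupation / Father’s occupation | 38—Skilled workers in electricity and electronics |
Mother’s occupation / Father’s occupation | 39—Workers in food processing, woodworking, and clothing and other industries and crafts |
Mother’s occupation / Father’s occupation | 40—Fixed plant and machine operators |
Mother’s occupation / Father’s occupation | 41—Assembly workers |
Mother’s occupation / Father’s occupation | 42—Vehicle drivers and mobile equipment operators |
Mother’s occupation / Father’s occupation | 43—Unskilled workers in agriculture, animal production, and fisheries and forestry |
Mother’s occupation / Father’s occupation | 44—Unskilled workers in extractive industry, construction, manufacturing, and transport |
Mother’s occupation / Father’s occupation | 45—Meal preparation assistants |
Mother’s occupation / Father’s occupation | 46—Street vendors (except food) and street service providers |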
Attribute | Values |
---|---|
Gender | 1—male |
Gender | 0—female |
Attribute | Values |
---|---|
Daytime/evening attendance | 1—daytime |
Daytime/evening attendance | 0—evening |
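Attribute | Values |
---|---|
Displaced / Educational special needs / Debtor / Tuition fees up to date / Scholarship holder / International | 1—yes |
Displaced / Educational special needs / Debtor / Tuition fees up to date / Scholarship holder / International | 0—no |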
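Because every attribute in Appendix A is stored as a numeric code, a reader working with the dataset will usually want to map the codes back to their labels. The sketch below shows one way to do that with pandas, using the Marital status and Gender codes listed above. The helper `decode` is hypothetical, and the column names are assumed to match the attribute names in this appendix; the published CSV may spell them differently.

```python
import pandas as pd

# Codebooks copied from the Appendix A tables.
MARITAL_STATUS = {
    1: "Single",
    2: "Married",
    3: "Widower",
    4: "Divorced",
    5: "Facto union",
    6: "Legally separated",
}
GENDER = {1: "male", 0: "female"}


def decode(df: pd.DataFrame, codebooks: dict) -> pd.DataFrame:
    """Return a copy of df with coded columns replaced by their labels."""
    out = df.copy()
    for column, mapping in codebooks.items():
        # Codes missing from the mapping become NaN, which makes them easy to spot.
        out[column] = out[column].map(mapping)
    return out


# Toy rows (invented for the example).
sample = pd.DataFrame({"Marital status": [1, 5, 2], "Gender": [0, 1, 0]})
decoded = decode(sample, {"Marital status": MARITAL_STATUS, "Gender": GENDER})
print(decoded)
```

Keeping the codebooks as plain dictionaries leaves the original numeric columns untouched, so the coded version can still be fed to a model while the decoded copy is used for reporting.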
Class of Attribute | Attribute | Type |
---|---|---|
Demographic data | Marital status | Numeric/discrete |
Demographic data | Nationality | Numeric/discrete |
Demographic data | Displaced | Numeric/binary |
Demographic data | Gender | Numeric/binary |
Demographic data | Age at enrollment | Numeric/discrete |
Demographic data | International | Numeric/binary |
Socioeconomic data | Mother’s qualification | Numeric/discrete |
Socioeconomic data | Father’s qualification | Numeric/discrete |
Socioeconomic data | Mother’s occupation | Numeric/discrete |
Socioeconomic data | Father’s occupation | Numeric/discrete |
Socioeconomic data | Educational special needs | Numeric/binary |
Socioeconomic data | Debtor | Numeric/binary |
Socioeconomic data | Tuition fees up to date | Numeric/binary |
Socioeconomic data | Scholarship holder | Numeric/binary |
Macroeconomic data | Unemployment rate | Numeric/continuous |
Macroeconomic data | Inflation rate | Numeric/continuous |
Macroeconomic data | GDP | Numeric/continuous |
Academic data at enrollment | Application mode | Numeric/discrete |
Academic data at enrollment | Application order | Numeric/ordinal |
Academic data at enrollment | Course | Numeric/discrete |
Academic data at enrollment | Daytime/evening attendance | Numeric/binary |
Academic data at enrollment | Previous qualification | Numeric/discrete |
Academic data at the end of 1st semester | Curricular units 1st sem (credited) | Numeric/discrete |
Academic data at the end of 1st semester | Curricular units 1st sem (enrolled) | Numeric/discrete |
Academic data at the end of 1st semester | Curricular units 1st sem (evaluations) | Numeric/discrete |
Academic data at the end of 1st semester | Curricular units 1st sem (approved) | Numeric/discrete |
Academic data at the end of 1st semester | Curricular units 1st sem (grade) | Numeric/continuous |
Academic data at the end of 1st semester | Curricular units 1st sem (without evaluations) | Numeric/discrete |
Academic data at the end of 2nd semester | Curricular units 2nd sem (credited) | Numeric/discrete |
Academic data at the end of 2nd semester | Curricular units 2nd sem (enrolled) | Numeric/discrete |
Academic data at the end of 2nd semester | Curricular units 2nd sem (evaluations) | Numeric/discrete |
Academic data at the end of 2nd semester | Curricular units 2nd sem (approved) | Numeric/discrete |
Academic data at the end of 2nd semester | Curricular units 2nd sem (grade) | Numeric/continuous |
Academic data at the end of 2nd semester | Curricular units 2nd sem (without evaluations) | Numeric/discrete |
Target | Target | Categorical |
Attribute | Mean | Median | Dispersion | Min. | Max. |
---|---|---|---|---|---|
Marital status | 1.180 | 1 | 0.510 | 1 | 6 |
Nationality | 1.250 | 1 | 1.390 | 1 | 21 |
Displaced | 0.548 | 1 | 0.907 | 0 | 1 |
Gender | 0.352 | 0 | 1.358 | 0 | 1 |
Age at enrollment | 23.130 | 20 | 0.320 | 17 | 70 |
International | 0.025 | 0 | 6.262 | 0 | 1 |
Attribute | Mean | Median | Dispersion | Min. | Max. |
---|---|---|---|---|---|
Father’s qualification | 16.460 | 14 | 0.670 | 1 | 34 |
Mother’s qualification | 12.320 | 13 | 0.730 | 1 | 29 |
Father’s occupation | 7.820 | 8 | 0.620 | 1 | 46 |
Mother’s occupation | 7.320 | 6 | 0.550 | 1 | 32 |
Educational special needs | 0.012 | 0 | 9.260 | 0 | 1 |
Debtor | 0.114 | 0 | 2.792 | 0 | 1 |
Tuition fees up to date | 0.881 | 1 | 0.368 | 0 | 1 |
Scholarship holder | 0.248 | 0 | 1.739 | 0 | 1 |
Attribute | Mean | Median | Dispersion | Min. | Max. |
---|---|---|---|---|---|
Unemployment rate | 11.566 | 11.100 | 0.230 | 7.600 | 16.200 |
Inflation rate | 1.228 | 1.400 | 1.126 | −0.800 | 3.700 |
GDP | 0.002 | 0.320 | 1152.820 | −4.100 | 3.500 |
Attribute | Mean | Median | Dispersion | Min. | Max. |
---|---|---|---|---|---|
Application mode | 6.890 | 8 | 0.770 | 1 | 18 |
Application order | 1.730 | 1 | 0.760 | 1 | 9 |
Course | 9.900 | 10 | 0.440 | 1 | 17 |
Daytime/evening attendance | 0.891 | 1 | 0.350 | 0 | 1 |
Previous qualification | 2.530 | 1 | 1.570 | 1 | 17 |
Attribute | Mean | Median | Dispersion | Min. | Max. |
---|---|---|---|---|---|
Curricular units 1st sem (credited) | 0.710 | 0 | 3.320 | 0 | 20 |
Curricular units 1st sem (enrolled) | 6.270 | 6 | 0.400 | 0 | 26 |
Curricular units 1st sem (evaluations) | 8.300 | 8 | 0.500 | 0 | 45 |
Curricular units 1st sem (approved) | 4.710 | 5 | 0.660 | 0 | 26 |
Curricular units 1st sem (grade) | 10.641 | 12.286 | 0.455 | 0.000 | 18.875 |
Curricular units 1st sem (without evaluations) | 0.140 | 0 | 5.020 | 0 | 12 |
Attribute | Mean | Median | Dispersion | Min. | Max. |
---|---|---|---|---|---|
Curricular units 2nd sem (credited) | 0.540 | 0 | 3.540 | 0 | 19 |
Curricular units 2nd sem (enrolled) | 6.230 | 6 | 0.350 | 0 | 23 |
Curricular units 2nd sem (evaluations) | 8.060 | 8 | 0.490 | 0 | 33 |
Curricular units 2nd sem (approved) | 4.440 | 5 | 0.680 | 0 | 20 |
Curricular units 2nd sem (grade) | 10.230 | 12.200 | 0.509 | 0.000 | 18.571 |
Curricular units 2nd sem (without evaluations) | 0.150 | 0 | 5.010 | 0 | 12 |
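The Dispersion column is not defined in this section. For the binary attributes, the reported values are close to the coefficient of variation (standard deviation divided by the mean), and the short check below tests that reading. It is an inference from the numbers in the tables, not a definition given by the authors; the function name `binary_cv` is hypothetical, and the population standard deviation is assumed.

```python
import math


def binary_cv(mean: float) -> float:
    """Coefficient of variation of a 0/1 variable with the given mean.

    For a 0/1 variable the population variance is mean * (1 - mean).
    """
    return math.sqrt(mean * (1.0 - mean)) / mean


# (mean, reported dispersion) pairs copied from the tables above.
reported = {
    "Displaced": (0.548, 0.907),
    "Gender": (0.352, 1.358),
    "International": (0.025, 6.262),
    "Debtor": (0.114, 2.792),
    "Tuition fees up to date": (0.881, 0.368),
    "Scholarship holder": (0.248, 1.739),
    "Daytime/evening attendance": (0.891, 0.350),
}

relative_errors = {}
for name, (mean, dispersion) in reported.items():
    estimate = binary_cv(mean)
    relative_errors[name] = abs(estimate - dispersion) / dispersion
    print(f"{name}: estimated {estimate:.3f}, reported {dispersion:.3f}")
```

The small remaining gaps are what one would expect from the means being rounded to three decimals. If the assumption holds, it also explains the very large value for GDP, whose mean is close to zero.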
Attribute | Center | Median | Dispersion | Min. | Max. |
---|---|---|---|---|---|
Target | Graduate | | 1.02 | | |
Feature | Collinearity with | Pearson |
---|---|---|
Curricular units 1st sem (credited) | Curricular units 2nd sem (credited) | 0.9448 |
Curricular units 1st sem (credited) | Curricular units 1st sem (enrolled) | 0.7743 |
Curricular units 1st sem (enrolled) | Curricular units 2nd sem (enrolled) | 0.9426 |
Curricular units 1st sem (enrolled) | Curricular units 1st sem (approved) | 0.7691 |
Curricular units 1st sem (enrolled) | Curricular units 2nd sem (credited) | 0.7537 |
Nationality | International | 0.9117 |
Curricular units 1st sem (approved) | Curricular units 2nd sem (approved) | 0.9040 |
Curricular units 1st sem (approved) | Curricular units 2nd sem (enrolled) | 0.7338 |
Curricular units 1st sem (grade) | Curricular units 2nd sem (grade) | 0.8372 |
Curricular units 1st sem (evaluations) | Curricular units 2nd sem (evaluations) | 0.7789 |
Curricular units 2nd sem (approved) | Curricular units 2nd sem (grade) | 0.7608 |
Mother’s occupation | Father’s occupation | 0.7240 |
Curricular units 2nd sem (enrolled) | Curricular units 2nd sem (approved) | 0.7033 |
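A table like the one above can be produced by computing the Pearson correlation matrix and keeping the feature pairs whose absolute coefficient exceeds a cutoff. The sketch below assumes a cutoff of 0.7, chosen only because the smallest coefficient listed is 0.7033; the source does not state the threshold in this section. The function `collinear_pairs` and the toy values are hypothetical.

```python
import pandas as pd


def collinear_pairs(df: pd.DataFrame, threshold: float = 0.7):
    """List (feature_a, feature_b, r) for pairs with |Pearson r| >= threshold."""
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    # Visit only the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 4)))
    # Strongest correlations first, as in the table.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Toy data (invented): two nearly proportional columns and one unrelated column.
toy = pd.DataFrame({
    "Curricular units 1st sem (enrolled)": [1, 2, 3, 4, 5, 6],
    "Curricular units 2nd sem (enrolled)": [2, 4, 6, 8, 10, 13],
    "Age at enrollment": [21, 19, 21, 19, 21, 19],
})

pairs = collinear_pairs(toy)
for a, b, r in pairs:
    print(f"{a} | {b} | {r}")
```

On the real dataset, the same call would be applied to the numeric feature columns only, with the target column excluded.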
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Realinho, V.; Machado, J.; Baptista, L.; Martins, M.V. Predicting Student Dropout and Academic Success. Data 2022, 7, 146. https://doi.org/10.3390/data7110146