A Novel Predictive Modeling for Student Attrition Utilizing Machine Learning and Sustainable Big Data Analytics
Abstract
1. Introduction
2. Literature Review
- Integration of Big Data and Machine Learning for Real-Time Monitoring: Unlike many studies that focus solely on predictive modeling, this research integrates big data analytics with machine learning to enable real-time monitoring and interventions. This approach not only identifies at-risk students but also provides a framework for timely and personalized interventions by educational authorities.
- Focus on Early-Year Subjects: By identifying early-year subjects such as Mechanics and Materials, Design of Machine Elements, and Instrumentation and Control as critical factors influencing attrition, the study highlights the importance of early intervention. This focus on the longitudinal impact of specific subjects provides actionable insights for curriculum designers and educators.
- Systematic Exploration and Hyperparameter Optimization: The study conducts preliminary trials to refine machine learning models, establish evaluation standards, and optimize hyperparameters systematically. This rigorous approach ensures the robustness and reliability of the predictive model.
- Application of Random Forest Algorithm: The use of the random forest algorithm, known for its high prediction accuracy and ability to handle large datasets with many features, is another key contribution. The study justifies the selection of this algorithm and demonstrates its effectiveness in reducing overfitting and improving prediction accuracy.
2.1. Course Restructuring
2.2. Early Engagement
3. Materials and Methods
- Elective courses are not taken into consideration.
- For students who retake a module, only the final grade is recorded, meaning that failing grades are not registered on this datasheet.
- No failed grades are registered.
- A student who dropped out may have attempted further courses and failed; because failed grades are not registered, the data do not show that the student attempted those modules.
- Students who were given exemptions were assigned a B grade in the datasheet to reflect the average performance of the course.
- The cohort only considered students enrolled from Year 1, with entry requirements of A-levels, university foundation programs, or equivalent.
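The dataset rules above can be expressed as a short preprocessing sketch. The column names (`grade`, `entry_route`, `is_elective`) and the `"EX"` exemption marker are illustrative assumptions for this sketch, not the study's actual schema:

```python
import pandas as pd

# Illustrative student records; column names are assumed for this sketch.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "grade": ["A", "EX", "C+", "B-"],          # "EX" marks an exemption
    "entry_route": ["A-levels", "Foundation", "Direct Year 2", "A-levels"],
    "is_elective": [False, False, False, True],
})

# Rule: elective courses are not considered.
df = df[~df["is_elective"]]

# Rule: exempted modules are recorded as a B grade.
df.loc[df["grade"] == "EX", "grade"] = "B"

# Rule: cohort includes only Year-1 entrants via A-levels or foundation programs.
df = df[df["entry_route"].isin(["A-levels", "Foundation"])]

print(df[["student_id", "grade"]])
```

Failed grades never enter the datasheet in the first place, so no filtering step is needed for them here.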
- Data Completeness: We assume that the datasets from educational institutions are complete and accurately reflect student performance and demographics. This is crucial as the model’s accuracy depends significantly on the quality of input data. In practice, we mitigate the risk of incomplete data by applying data imputation techniques and liaising with educational institutions to understand and fill gaps in data collection processes.
- General Education Framework: The model presupposes a relatively uniform educational structure within regions being analyzed. This assumption allows us to generalize the predictive factors across different institutions within the same educational system. We validate this assumption through a preliminary analysis of educational systems and curricula before model deployment.
- Consistency in Course Impact: We posit that certain courses have a more pronounced impact on student attrition rates across different institutions. This is based on historical data showing consistent patterns of student performance in key subjects that correlate with dropout rates. To ensure this assumption holds, we continuously update and recalibrate our model as new data becomes available, ensuring it reflects the most current educational trends.
- Student Behavior Consistency: The model assumes that student behavior and its impacts on attrition are consistent over time. While this may not capture new emerging trends immediately, the model includes mechanisms for periodic reassessment to integrate new behavioral patterns and external factors affecting student engagement and success.
- Socioeconomic Factors: It is assumed that socioeconomic factors influencing student attrition are similar within the data sample. This assumption allows us to apply the model across similar demographic groups but requires careful consideration when applying the model to regions with differing socioeconomic landscapes. We justify this by conducting localized studies to understand the socioeconomic dynamics before applying the model in a new region.
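The imputation mentioned under Data Completeness might look like the following sketch, assuming mean imputation for numeric features and mode imputation for categorical ones; the paper does not commit to a specific technique, and the column names are illustrative:

```python
import pandas as pd

# Toy records with gaps; column names are illustrative assumptions.
df = pd.DataFrame({
    "gpa": [3.2, None, 2.8, 3.9],
    "nationality": ["SG", "MY", None, "SG"],
})

# Mean imputation for numeric features.
df["gpa"] = df["gpa"].fillna(df["gpa"].mean())

# Mode (most frequent value) imputation for categorical features.
df["nationality"] = df["nationality"].fillna(df["nationality"].mode()[0])

print(df)
```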
3.1. Justification for the Use of Random Forest and Parameter Selection
3.2. Data Cleaning Process
- Conversion and Cleaning: The raw data, initially in Excel format, were converted to CSV to standardize the data input process in Python. This step was crucial as it ensured compatibility with the Pandas library for subsequent manipulations.
- Error Checking and Noise Reduction: The dataset was meticulously checked for errors such as misspellings, incorrect punctuation, and inconsistent spacing, which could lead to inaccuracies in the model. Irrelevant data points, such as unnecessary identifiers, were removed to streamline the dataset and focus the model on relevant features.
- Feature Selection: Relevant features were selected, including academic performance, demographic information, and course engagement metrics, and used to predict student dropout. The random forest algorithm was chosen in part for its ability to handle feature selection implicitly, identifying the most influential factors during model training.
- Handling of Missing Data: Missing data were addressed through a data cleaning process, which included filling in gaps where information was incomplete. The exact methods for handling missing data (e.g., mean imputation, interpolation, or removal of incomplete records) are also discussed.
- Normalization and Encoding: Numerical normalization and categorical encoding were applied to standardize the data for machine learning. Numerical normalization scaled features such as academic grades onto comparable ranges, though the exact technique (e.g., min–max scaling or z-score normalization) was not specified. Categorical data, such as student demographics and course names, were encoded into numerical values for processing by the random forest model.
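Taken together, the normalization and encoding steps could be sketched as follows. Min–max scaling and label encoding are assumptions for illustration only, since the study does not specify which techniques were used, and the feature names are invented:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Toy data; feature names are illustrative, not the study's schema.
df = pd.DataFrame({
    "exam_score": [45.0, 70.0, 88.0, 95.0],
    "nationality": ["SG", "MY", "ID", "SG"],
})

# Min-max scaling (one possible choice; the paper does not name the technique).
scaler = MinMaxScaler()
df["exam_score"] = scaler.fit_transform(df[["exam_score"]]).ravel()

# Label-encode the categorical column so the random forest can process it.
encoder = LabelEncoder()
df["nationality"] = encoder.fit_transform(df["nationality"])

print(df)
```

After this step every feature is numeric and on a comparable scale, which is the form the random forest expects.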
3.3. Model Training and Testing
- Demographic Information: Understanding the socioeconomic and cultural background of students helps tailor the predictive capabilities.
- Academic Performance Data: Access to comprehensive performance metrics across various subjects is vital to identify at-risk students early.
- Institutional Data: Information on the educational system’s structure, including course offerings and academic policies, is necessary for contextual adaptation.
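A minimal training-and-testing sketch with the random forest, using synthetic stand-in data in place of the real demographic, academic, and institutional features described above; the 80/20 split and 100 trees are illustrative defaults, not the study's reported settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the encoded student features; real inputs would be
# grades, demographics, and course-engagement metrics.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy "dropout" label

# Hold out a test set to estimate generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.2f}")

# Feature importances surface the most influential factors, which is how the
# forest implicitly performs feature selection.
print(model.feature_importances_)
```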
4. Results and Discussion
4.1. Introduction
4.2. Effect of Test Size vs. Training Size
4.3. Average Grades of Each Module
4.4. In-Depth Analysis of Subjects per Classification
4.5. Cohort Performance Analysis
4.6. Cross-Validation of Results for Accuracy
4.7. Cross-Validation Techniques
4.8. Longitudinal Effect on Student Attrition
4.9. Pros and Cons of Using Machine Learning
4.9.1. Advantages of Machine Learning
- Enhanced Predictive Accuracy: ML algorithms are capable of processing and learning from vast amounts of data, detecting complex patterns that are not apparent through manual analysis or simpler models. This capacity significantly improves prediction accuracy compared to traditional methods, which often rely on fewer variables and assume linear relationships.
- Automation and Efficiency: ML automates the analysis of large data sets, reducing the reliance on manual data handling and analysis. This can lead to significant time savings and resource efficiency, which is particularly valuable in educational settings where early detection of at-risk students can lead to timely and effective interventions.
- Scalability: Unlike traditional methods that may become cumbersome as dataset sizes increase, ML algorithms excel at scaling with data, maintaining their effectiveness across larger and diverse datasets.
4.9.2. Limitations of Machine Learning
- Risk of False Positives: One of the significant challenges with ML is the risk of generating false positives—incorrectly predicting that a student may drop out. This can lead to unnecessary interventions, potentially wasting resources and adversely affecting the student involved.
- Data Privacy and Security: The need for substantial data to train ML models raises concerns about privacy and data security, especially when sensitive student information is involved. Ensuring the integrity and security of this data is paramount but can be resource-intensive.
- Complexity and Resource Requirements: ML models, particularly those like random forest or neural networks, are complex and require significant computational resources and expertise to develop, maintain, and interpret. This may pose barriers for institutions without sufficient technical staff or infrastructure.
4.9.3. Comparison with Other Methods
4.9.4. Justification for Discussing Pros and Cons
- Data Collection: We detailed the process of gathering data from institutional databases, including student demographics, academic records, and survey responses.
- Feature Selection: We employed statistical methods and domain expertise to select relevant features, such as socioeconomic background, psychological factors, and academic performance indicators.
- Machine Learning Algorithms: We utilized random forests due to their robustness and ability to handle large datasets with missing values. Other algorithms considered include Support Vector Machines and Neural Networks.
- Hyperparameter Optimization: We implemented grid search and cross-validation techniques to fine-tune hyperparameters, ensuring optimal model performance and avoiding overfitting.
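The grid search with cross-validation described above can be sketched as follows. The parameter grid shown is illustrative, since the study's actual search space is not listed, and synthetic data stands in for the student dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the student dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative grid; the study's real search space is not specified.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation guards against overfitting
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

Each candidate configuration is scored on held-out folds, so the selected hyperparameters reflect generalization performance rather than training fit.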
5. Conclusions
6. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
| NO. | GRADE |
|---|---|
| 0 | A+ |
| 16 | A+ |
| 13 | A− |
| 26 | B+ |
| 48 | B |
| 37 | B− |
| 12 | C+ |
| 7 | C |
| 38 | C− |
| Classification | A | B | C |
|---|---|---|---|
| Computer Science | 27 | 71 | 99 |
| Electrical and Electronic Engineering | 36 | 65 | 96 |
| General Engineering | 45 | 68 | 84 |
| Management and Operations | 39 | 64 | 95 |
| Materials | 35 | 58 | 98 |
| Mathematics | 51 | 55 | 100 |
| Mechanics | 30 | 56 | 72 |
| Thermofluids | 36 | 69 | 92 |
| Thesis | 41 | 74 | 82 |
| COHORT | SD | Mean |
|---|---|---|
| 3 | 2.18 | B |
| 6 | 2.19 | B+ |
| 4 | 2.19 | B |
| 11 | 2.41 | B |
| 5 | 2.41 | C+ |
| 10 | 2.42 | B |
| 2 | 2.45 | B |
| 8 | 2.74 | B− |
| 1 | 2.78 | B− |
| 7 | 2.80 | B |
| 12 | 2.89 | B |
| 9 | 2.94 | B |
| 13 | 3.05 | B |
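Reporting a numeric standard deviation alongside a letter-grade mean implies a letter-to-point mapping. The mapping and scale below are assumed for illustration only; the paper's actual grade-point scale is not given:

```python
import statistics

# Assumed letter-grade -> grade-point mapping (illustrative, not the study's scale).
GRADE_POINTS = {"A+": 5.0, "A": 5.0, "A-": 4.5, "B+": 4.0, "B": 3.5,
                "B-": 3.0, "C+": 2.5, "C": 2.0, "C-": 1.5}

def cohort_stats(grades):
    """Mean and population standard deviation of a cohort on the point scale."""
    points = [GRADE_POINTS[g] for g in grades]
    return statistics.mean(points), statistics.pstdev(points)

mean, sd = cohort_stats(["B", "B+", "A-", "C+", "B-"])
print(f"mean={mean:.2f}, sd={sd:.2f}")
```

The numeric mean would then be mapped back to the nearest letter grade for reporting, as in the table above.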
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kok, C.L.; Ho, C.K.; Chen, L.; Koh, Y.Y.; Tian, B. A Novel Predictive Modeling for Student Attrition Utilizing Machine Learning and Sustainable Big Data Analytics. Appl. Sci. 2024, 14, 9633. https://doi.org/10.3390/app14219633
Kok CL, Ho CK, Chen L, Koh YY, Tian B. A Novel Predictive Modeling for Student Attrition Utilizing Machine Learning and Sustainable Big Data Analytics. Applied Sciences. 2024; 14(21):9633. https://doi.org/10.3390/app14219633
Chicago/Turabian Style: Kok, Chiang Liang, Chee Kit Ho, Leixin Chen, Yit Yan Koh, and Bowen Tian. 2024. "A Novel Predictive Modeling for Student Attrition Utilizing Machine Learning and Sustainable Big Data Analytics" Applied Sciences 14, no. 21: 9633. https://doi.org/10.3390/app14219633
APA Style: Kok, C. L., Ho, C. K., Chen, L., Koh, Y. Y., & Tian, B. (2024). A Novel Predictive Modeling for Student Attrition Utilizing Machine Learning and Sustainable Big Data Analytics. Applied Sciences, 14(21), 9633. https://doi.org/10.3390/app14219633