Predictive Analysis of Students’ Learning Performance Using Data Mining Techniques: A Comparative Study of Feature Selection Methods
Abstract
1. Introduction
1.1. Learning Management System and Students’ Academics Analytics
1.2. Complexity of the Learning Process and the Role of Machine Learning
1.3. Research Objective
1.4. Research Contribution
2. Literature Review
2.1. Risk Prediction in Student Performance Employing Machine Learning
2.2. Student Dropout
3. Methodology
3.1. Dataset Description
3.2. Data Preprocessing
3.3. Performance Evaluators for Student Learning Analytics in Academia (SLAIA)
3.3.1. Final Results
3.3.2. Final Weighted Score
4. Feature Engineering Methods
4.1. For Regression
4.2. For Classification
4.3. Machine Learning Model and Evaluation Measures
5. Experiments and Analysis
5.1. Feature Engineering for Regression
5.1.1. Lasso Regression
5.1.2. Boruta Feature Selection
5.2. Feature Engineering for Classification
Random Forest Importance (RFI)
5.3. Recursive Feature Elimination (RFE)
5.4. Student Learning Analysis Using Regression
5.4.1. Lasso Regression
5.4.2. Boruta Feature Selection
5.5. Student Learning Analysis Using Classification
5.5.1. Random Forest Importance (RFI)
5.5.2. Recursive Feature Elimination (RFE)
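The four selection routes named above (Lasso, Boruta, RFI, and RFE) can be illustrated with a minimal sketch. The data, parameter values, and use of a single synthetic regression task are assumptions for brevity (the study applies RFI and RFE to the classification target); it relies on scikit-learn plus the third-party boruta package, not the study's own code.

```python
# Minimal sketch of the four feature-selection routes named above.
# Synthetic data and parameter values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso
from boruta import BorutaPy  # third-party package: pip install Boruta

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# 1) Lasso: features whose coefficients shrink exactly to zero are dropped.
lasso = Lasso(alpha=0.1).fit(X, y)
lasso_keep = np.flatnonzero(lasso.coef_)

# 2) Boruta: tests real features against shuffled "shadow" copies.
boruta = BorutaPy(RandomForestRegressor(n_jobs=-1, random_state=0),
                  n_estimators="auto", random_state=0).fit(X, y)
boruta_keep = np.flatnonzero(boruta.support_)

# 3) Random Forest Importance (RFI): rank by impurity-based importances.
rfi = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
rfi_keep = np.argsort(rfi.feature_importances_)[::-1][:5]

# 4) Recursive Feature Elimination (RFE): iteratively drop the weakest feature.
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=0),
          n_features_to_select=5).fit(X, y)
rfe_keep = np.flatnonzero(rfe.support_)

print("Lasso:", lasso_keep, "\nBoruta:", boruta_keep,
      "\nRFI:", rfi_keep, "\nRFE:", rfe_keep)
```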
6. Discussion
Risks and Considerations in Applying Machine Learning to Educational Analytics
- Computational efficiency: if some models are substantially faster to train or require less memory, this could be a deciding factor.
- Model complexity: simpler models are generally preferable if performance metrics are very close, as they are easier to interpret and less likely to overfit.
- Domain-specific criteria: specific requirements of the educational context may make one type of error (e.g., a false positive vs. a false negative) more costly than the other.
- Statistical significance: testing whether the differences in MAE and RMSE are statistically significant would give a more definitive answer; a sketch of such a test follows this list.
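As a sketch of the last point, the code below runs a Wilcoxon signed-rank test on per-fold MAE values of two regressors. The models, synthetic data, and fold count are assumptions for demonstration, not the study's actual setup.

```python
# Minimal sketch: paired significance test on per-fold errors of two regressors.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

def fold_mae(model, X, y, cv):
    """Collect one MAE per fold so the two models can be compared pairwise."""
    maes = []
    for train_idx, test_idx in cv.split(X):
        model.fit(X[train_idx], y[train_idx])
        maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    return np.array(maes)

# A fixed random_state makes cv.split produce identical folds for both models.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
mae_a = fold_mae(LinearRegression(), X, y, cv)
mae_b = fold_mae(RandomForestRegressor(n_estimators=100, random_state=0), X, y, cv)

# Wilcoxon signed-rank test: non-parametric, suited to small paired samples.
stat, p_value = wilcoxon(mae_a, mae_b)
print(f"mean MAE A={mae_a.mean():.3f}, B={mae_b.mean():.3f}, p={p_value:.4f}")
```

A paired t-test (scipy.stats.ttest_rel) is the parametric alternative when the fold-wise errors are approximately normal.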
7. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References
The dataset comprises seven related tables; their features are summarized below.

| S. No. | Feature Category | Feature Name | Description |
|---|---|---|---|
| 1 | Courses | code module | Unique identifier assigned to a module |
| | | code presentation | Identifier of the presentation (module run) |
| | | length | Duration of the module presentation |
| 2 | Assessments | code module | Module to which the assessment belongs |
| | | code presentation | Presentation to which the assessment belongs |
| | | id assessment | Unique identifier assigned to the assessment |
| | | assessment type | Type of assessment being conducted |
| | | date | Final submission deadline of the assessment |
| | | weight | Weight of the assessment, as a percentage |
| 3 | VLE | id site | Unique identifier assigned to a piece of material |
| | | code module | Module to which the material belongs |
| | | code presentation | Presentation to which the material belongs |
| | | activity type | Role of the material within the module |
| | | week from | Week from which the material is intended to be used |
| | | week to | Week until which the material is intended to be used |
| 4 | Student Information | code module | Module in which the student is enrolled |
| | | code presentation | Presentation in which the student is enrolled |
| | | id student | Unique identifier of the student |
| | | gender | Student's gender |
| | | region | Geographic region in which the student resided during the module presentation |
| | | highest education | Highest education level attained by the student on entry to the module presentation |
| | | imd band | Index of Multiple Deprivation band of the student's place of residence |
| | | age band | Age band to which the student belongs |
| | | no of prev attempts | Number of previous attempts the student has made at this module |
| | | studied credits | Total credits of the modules the student is currently studying |
| | | disability | Whether the student has declared a disability |
| | | final result | Student's final result in the module presentation |
| 5 | Student Registration | code module | Module for which the student registered |
| | | code presentation | Presentation for which the student registered |
| | | id student | Unique identifier of the student |
| | | date registration | Date on which the student registered for the module presentation |
| | | date unregistration | Days between the student's registration and unregistration from the module |
| 6 | Student Assessment | id assessment | Unique identifier of the assessment |
| | | id student | Unique identifier of the student |
| | | date submitted | Date on which the student submitted the assessment |
| | | is banked | Flag indicating that the result was transferred from a previous presentation |
| | | score | Student's score on the assessment |
| 7 | Student VLE | code module | Module to which the interaction log belongs |
| | | code presentation | Presentation to which the interaction log belongs |
| | | id student | Unique identifier of the student |
| | | id site | Identifier of the VLE material |
| | | date | Date of the student's interaction with the material |
| | | sum click | Number of the student's interactions with the material on that day |
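These tables are relational, keyed by module, presentation, and student identifiers. As a minimal sketch of assembling them into one modelling frame (file and column names assume OULAD-style CSV exports with underscores, which is an assumption here):

```python
# Minimal sketch: merge the relational tables above into one modelling frame.
import pandas as pd

students = pd.read_csv("studentInfo.csv")        # Student Information table
vle_logs = pd.read_csv("studentVle.csv")         # Student VLE clickstream
scores = pd.read_csv("studentAssessment.csv")    # Student Assessment table
assessments = pd.read_csv("assessments.csv")     # Assessments metadata

# Aggregate the clickstream to one engagement figure per student and module run.
clicks = (vle_logs
          .groupby(["code_module", "code_presentation", "id_student"],
                   as_index=False)["sum_click"]
          .sum()
          .rename(columns={"sum_click": "total_clicks"}))

# Attach assessment weights so a weighted score can be derived per student.
scored = scores.merge(assessments[["id_assessment", "weight"]], on="id_assessment")
scored["weighted_score"] = scored["score"] * scored["weight"] / 100.0
weighted = scored.groupby("id_student", as_index=False)["weighted_score"].sum()

# Final frame: demographics + engagement + weighted score + final_result label.
df = (students
      .merge(clicks, on=["code_module", "code_presentation", "id_student"],
             how="left")
      .merge(weighted, on="id_student", how="left"))
print(df[["id_student", "total_clicks", "weighted_score", "final_result"]].head())
```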
Hyperparameter settings of the regression models:

| S. No. | Model Name | Hyperparameter Details |
|---|---|---|
| 1 | Linear Regression | fit_intercept = True, normalize = False, n_jobs = None, positive = False |
| 2 | Support Vector Regressor | C = 1.0, kernel = "rbf", degree = 3, gamma = "scale", coef0 = 0.0 |
| 3 | Random Forest Regressor | n_estimators = 100, criterion = "mse", max_depth = None |
| 4 | Gradient Boosted | loss = "ls", n_estimators = 100, max_depth = 3, learning_rate = 0.1 |
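For illustration, the sketch below instantiates these four regressors with the tabled settings in scikit-learn. Newer scikit-learn releases renamed criterion = "mse" and loss = "ls" to "squared_error" and removed LinearRegression's normalize argument, so the current spellings are used; this is an adaptation, not the study's exact code.

```python
# Minimal sketch instantiating the regression models with the tabled settings.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

regressors = {
    "Linear Regression": LinearRegression(fit_intercept=True, n_jobs=None,
                                          positive=False),
    "Support Vector Regressor": SVR(C=1.0, kernel="rbf", degree=3,
                                    gamma="scale", coef0=0.0),
    # criterion="mse" in the table corresponds to "squared_error" today.
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100,
                                                     criterion="squared_error",
                                                     max_depth=None),
    # loss="ls" in the table corresponds to "squared_error" today.
    "Gradient Boosted": GradientBoostingRegressor(loss="squared_error",
                                                  n_estimators=100,
                                                  max_depth=3,
                                                  learning_rate=0.1),
}
# Each model is then fit on the engineered feature matrix, e.g.:
# regressors["Linear Regression"].fit(X_train, y_train)
```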
Hyperparameter settings of the classification models (the ensemble model's details were not recoverable from the source):

| S. No. | Model Name | Hyperparameter Details |
|---|---|---|
| 1 | Ensemble Model | |
| 2 | XGBoost | max_depth = 4, alpha = 2, n_estimators = 50, objective = "binary:logistic" |
| 3 | Neural Networks | hidden_layer_sizes = 100, activation = "relu", alpha = 0.2, max_iter = 6, learning_rate = "constant" |
| 4 | Decision Tree | criterion = "entropy", max_depth = 7, splitter = "best", min_samples_split = 2 |
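A minimal sketch of the tabled classifiers, assuming scikit-learn and the xgboost Python package; since the ensemble row's composition is missing above, only models 2 to 4 are shown.

```python
# Minimal sketch instantiating the classification models with the tabled settings.
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

classifiers = {
    # "alpha" in the table maps to XGBoost's L1 penalty, reg_alpha.
    "XGBoost": XGBClassifier(max_depth=4, reg_alpha=2, n_estimators=50,
                             objective="binary:logistic"),
    # max_iter=6 follows the table; in practice it triggers a convergence
    # warning and would normally be set far higher.
    "Neural Networks": MLPClassifier(hidden_layer_sizes=(100,),
                                     activation="relu", alpha=0.2,
                                     max_iter=6, learning_rate="constant"),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=7,
                                            splitter="best",
                                            min_samples_split=2),
}
# Each can then be fit and scored on the engineered features, e.g.:
# classifiers["Decision Tree"].fit(X_train, y_train).score(X_test, y_test)
```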