Article

Sustainable e-Learning by Data Mining—Successful Results in a Chilean University

by Aurora Sánchez 1, Cristian Vidal-Silva 2,*, Gabriela Mancilla 1, Miguel Tupac-Yupanqui 3 and José M. Rubio 4

1 Department of Administration, Universidad Católica del Norte, Angamos 0610, Antofagasta 1270709, Chile
2 Faculty of Engineering, School of Videogame Development and Virtual Reality Engineering, University of Talca, Talca 3460000, Chile
3 EAP, Ingeniería de Sistemas e Informática, Universidad Continental, Huancayo 12000, Peru
4 Escuela de Computación e Informática, Facultad de Ingeniería, Ciencia y Tecnología, Universidad Bernardo O’Higgins, Santiago 8370993, Chile
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(2), 895; https://doi.org/10.3390/su15020895
Submission received: 11 October 2022 / Revised: 10 December 2022 / Accepted: 20 December 2022 / Published: 4 January 2023

Abstract

People are increasingly open to online education, mainly because it removes the distance and time barriers of in-person education. This type of education is sustainable at all levels, and its relevance has increased even more during the pandemic. Consequently, educational institutions are saving large volumes of data containing relevant information about their operations, but they do not know why students succeed or fail. The Knowledge Discovery in Databases (KDD) process could help address this challenge by extracting innovative models that identify the main patterns and factors that could affect the success of students in online education programs. This work uses the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology to analyze data from the Distance Education Center of the Universidad Católica del Norte (DEC-UCN) from 2000 to 2018. CRISP-DM was chosen because it represents a proven process that integrates multiple methodologies to provide an effective meta-process for data knowledge projects. DEC-UCN is one of the first centers to implement online learning in Chile, and this study analyses 18,610 records from this period. The study applies data mining, the most critical KDD phase, to find hidden data patterns and identify the variables associated with students’ success in online learning (e-learning) programs. This study found that the main variables explaining student success in e-learning programs are age, gender, degree study, educational level, and locality.

1. Introduction

Current advances in education and technology make it easier for people to develop competencies in defined areas at home [1]. As [2] highlight, online learning is a model that has revolutionized education thanks to the inclusion of Information and Communication Technologies (ICTs) and the growth of Educational Data Mining (EDM). Educational institutions pay attention to this revolution in order to use new methodologies in the educational process [3]. Multiple studies evaluate the success of online learning technology platforms, mainly based on the DeLone and McLean information systems success model, to measure and assess the success and sustainability of electronic learning systems [4,5]. However, despite the rapid growth of online learning and EDM, institutions that offer online courses face many problems, and the variables that impact student success in distance education are still unknown.
Tools for identifying behavioral data patterns and factors for the success of online learning already exist [6]. This study addresses the gap regarding the variables of student success in e-learning programs by adapting the data mining methodology CRISP-DM (Cross-Industry Standard Process for Data Mining) to discover those variables [7]. E-learning could readily meet the needs, features, and requirements of potential students who select this modality of study [8,9,10], even more so during the pandemic [11]. To our knowledge, no previous research in South American countries identifies determinants of student success in e-learning programs.
Knowledge Discovery in Databases (KDD), commonly known as data mining [12,13], is a process for pattern discovery and predictive modeling in large databases [14]. KDD makes extensive use of data mining methods, automated techniques, and algorithms for pattern recognition and for identifying hidden patterns in e-learning environments [15]. Characteristically, data mining uses machine learning methods developed in the domain of artificial intelligence [16]. Data mining uses statistical, mathematical, artificial intelligence, and machine learning techniques to extract and identify pertinent information and related knowledge hidden in large volumes of raw data [17]. Technically, data mining is the process of finding correlations or patterns among thousands of fields in large databases [15]. Data mining finds these patterns and relationships using data analysis tools, model-building techniques, and machine learning [18,19].
Data mining comprises various techniques for pre-processing, analyzing, and interpreting data. Most researchers in the area agree that we could organize them into pattern recognition and machine learning. Pattern recognition aims to identify implicit objects and relations, and machine learning techniques are mainly applied to extract generalized knowledge from data for use in prediction tasks. Researchers can use classification techniques in data mining to predict group membership for data occurrences. Consequently, data mining involves more than collecting and managing data because it includes analysis and prediction. Classification techniques allow the processing of a wider variety of data than regression, and they are growing in popularity. There is a great variety of algorithms for classification purposes. Scheuer and Mclaren [20] propose a model to identify the most influential factors that predict student academic performance. They predict students’ passing or failing status by considering and defining their academic performance (high, medium, or low).
Educational Data Mining (EDM) is concerned with developing, investigating, and applying machine learning, data mining, and statistical methods to detect patterns in extensive collections of data from educational institutions that would otherwise be impossible to analyze using traditional computing techniques [2,10,21]. In this sense, in recent years, the use of deep learning techniques has emerged in EDM. Hence, developing data mining competencies represents a research area. For example, the work of [22] presents positive experiences in enhancing knowledge acquisition about data mining via a game-based approach. Regarding the implementation of EDM systems, the work of Almaiah et al. [23] discusses traditional issues and the success of using modern programming languages.

Problem Statement, Goal, and Contributions

In EDM, the data of interest is not limited to individual student interactions in an educational system. We can also consider administrative and demographic data such as gender, age, and grades for discovering patterns. As [2] discusses, EDM applications exist to find educational patterns such as cognitive skills, motivational effects, and social emotions. The work of [24,25,26] presents successful results of EDM regarding administrative issues of e-learning systems and their effects on students’ performance, factors that influence the use and success of mobile learning, and critical challenges and factors that determine students’ success in pandemic times. However, no study has yet reported the use of EDM to discover patterns in students’ success in online education in South American countries. This work poses the following research question: Can the use of data mining tools in education allow the identification of student success patterns in e-learning programs? To answer it, the primary goal of this article is to determine the variables associated with student success in e-learning programs by using the CRISP-DM methodology [7] with data from a university in Chile, a developing country. This study analyses the causes of success or failure from the set of student variables, considering both demographic and performance features.
The main contributions of this paper are the following:
  • First, this article identifies potential variables for success or failure in e-learning programs, not only academic factors, through a systematic literature review.
  • Second, this article defines a repeatable data mining application for identifying students’ success patterns in e-learning environments using a large set of data. This analysis was not feasible with other methods.
  • Third, this article provides a utilization example of multi-year historical data starting when e-learning programs began being a phenomenon in Chile and other countries (the year 2000). Other institutions in the region could repeat this application.
The remainder of this paper is organized as follows. Section 2 defines the main concepts of e-learning, CRISP-DM, the data mining process, and previous data mining in education experiences. Section 3 describes the applied methodology and case study data: we define the main steps of the data mining process, the data source, concepts, and expected results. After that, Section 4 highlights obtained results to validate our hypothesis. Section 5 describes the usefulness of this work for a similar context, overall, for online educational institutions and programs concerning what variables are relevant to consider. The paper concludes with a discussion of future work in Section 6.

2. e-Learning and Data Mining

The e-learning process is characterized by the recording of most of the traditional learning process variables, which range from student entry data to the efficiency and ease of use of the applied platforms [1]. The large volume of data in the e-learning process provides the opportunity to analyze that data using knowledge discovery tools. The KDD (Knowledge Discovery in Databases) process looks for hidden patterns in the large volumes of data that information systems usually store. Those patterns can be high-value information for the decision-making process in organizations [27]. Figure 1 [14,28] illustrates the KDD process. Data mining is an essential KDD phase that applies algorithms to find hidden behavior patterns in the data [29].
The application of data mining techniques has two primary purposes: building models and detecting patterns [30]. The model building seeks to produce a summary of the data set to identify and describe the main characteristics. Pattern detection seeks to identify small deviations from the norm to detect unusual behavior patterns by discovering patterns and rules and searches for content. When it is not possible to build models for the data set, you can look for behavior patterns. Pattern and rule discovery seeks frequent combinations and associations of attributes found in database transactions (for example, products purchased together). Techniques based on association rules usually address that issue.
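The "products purchased together" idea can be made concrete with the two standard association-rule measures, support and confidence. The following is a minimal sketch only: the transactions and the function names are hypothetical illustrations and are not part of the study's data or tooling.

```python
# Hypothetical transactions (not from the study): each set is one purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)


# Rule "bread -> milk": how often both occur, and how often milk follows bread.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

An association-rule miner would enumerate many candidate itemsets and keep only those whose support and confidence exceed chosen thresholds.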

2.1. CRISP-DM

The CRISP-DM method is one of the most efficient methodologies for developing projects that apply data mining [31,32]. The objective of CRISP-DM is to provide a common vocabulary, methodology, and tools for data mining activities. CRISP-DM is organized into six phases, from general to specific tasks:
  • Business Understanding Phase: The first phase, the analysis of the problem, includes understanding the project’s objectives and requirements from a business or institutional perspective.
  • Data Comprehension Phase: The second phase, data analysis, includes the initial data collection and the identification of the quality of the data.
  • Data Preparation Phase: This phase includes general data selection tasks for applying modeling techniques (variables and samples), data cleaning, generation of additional variables, integration of different data sources, and format changes.
  • Modeling Phase: In this phase, selecting the most appropriate modeling techniques takes place to generate and evaluate the model. The parameters used in the model generation depend on the characteristics of the data.
  • Evaluation Phase: In the evaluation phase, the model is evaluated, not from the data point of view, but for fulfilling the problem’s success criteria. If the generated model is valid based on the success established in the first phase, the model is exploited.
  • Implementation Phase: At this stage, in addition to the implementation of the model, the results must be presented and documented understandably, to achieve an increase in knowledge.
Figure 2 [33] illustrates CRISP-DM stages.

2.2. Data Mining Techniques

This research applied five techniques, naive Bayes, random forest, AdaBoost, decision trees (J48), and neural networks, which are recognized as successful algorithms for classification purposes [34,35]. Some of the main characteristics of those classifiers are:
  • Naive Bayes: It is based on Bayes’ theorem with an assumption of independence among predictors. It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Naive Bayes is widely applied to text classification, and it is mainly used for clustering and classification based on conditional probabilities [36].
  • Random Forest: This classifier combines prediction trees, each of which maintains a hierarchical division of the underlying data space. In this hierarchical division, partitions are created that are more skewed in terms of their distribution of terms [37].
  • AdaBoost: The name is short for Adaptive Boosting, and it is a meta-algorithm. This algorithm maintains a distribution, or set of weights, over the training set. Initially, all weights are set equally, but on each round the weights of incorrectly classified samples are increased so that the weak learner focuses on these samples. AdaBoost is noted for its ability to minimize the error and maximize the margin with respect to the features [38].
  • Decision Trees (J48): The J48 algorithm builds a decision tree that classifies the class attribute based on the input attributes. The algorithm is based on the C4.5 algorithm developed by Quinlan [39]. The algorithm uses a greedy search method to create decision trees and allows changing different parameters to obtain a better classification accuracy [40].
  • Neural Networks: The development of the neural network uses a non-linear optimization model. Unlike the results and parameters provided by other analyses, its output is not easy to interpret. The classifier used in this research is a multilayer perceptron, which builds a neural network in the form of a cascade with one or more hidden layers [41].
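The reweighting step described for AdaBoost can be sketched in a few lines. This is an illustrative sketch of the classic binary formulation with labels in {-1, +1}, not necessarily the exact AdaBoostM1 implementation used in the study; the function name and the toy labels are assumptions.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One reweighting round of binary AdaBoost (labels in {-1, +1}).

    Returns the weak learner's vote weight (alpha) and the new sample weights.
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the weak learner under the current distribution.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples (t * p == -1) grow; correct ones (t * p == 1) shrink.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # renormalize to sum to 1


# Toy example: five samples, all weights equal at the start.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]  # hypothetical weak learner; wrong on the second sample
weights = [1 / len(y_true)] * len(y_true)

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.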

2.3. Data Mining in Education

E-learning is the result of the adaptation and use of information technologies in education [42,43,44]. The works of Wani et al. [45,46,47] highlight the importance of e-learning in higher education using a virtual training environment for the development of professional competencies. The success of e-learning systems is associated with different variables, such as the information technology and the program’s modality. Multiple studies evaluate the success of e-learning technology platforms, most of which use the information systems success model of DeLone and McLean [4,48,49,50,51,52]. Despite the rapid growth of e-learning, institutions face problems in teaching courses in that modality. The variables that impact student success in e-learning are still unknown in countries such as Chile and in many other developing countries. Tools such as data mining, which permit identifying behavioral data patterns, could identify factors for e-learning success.
Higher distance education permits developing competencies regarding the current market demands without geographical, economic, and time barriers [53,54]. In this way, quality e-learning programs would contribute to developing countries such as Chile. For example, Cidral et al. [55] reviewed the success of e-learning platforms in Brazil. That study concluded that users’ satisfaction with program content quality and easy-to-use interfaces are critical issues for success. Several authors have studied the success or failure of students in e-learning programs, such as [56], who analyzed the causes of online university dropouts in a systematic analysis of the literature; they defined a classification of the involved variables: student, institution, teachers, media, degree of social, and academic integration. Those authors indicate that knowledge of personal, social, and demographic characteristics can be essential to predict student success and failure in e-learning programs.
Based on the data mining process of [20], Figure 3 shows our proposed model to predict student academic performance. Decision tree and random forest are well-known classification data mining techniques, whereas CHAID ID3 represents a version of the Chi-square Automatic Interaction Detection algorithm, one of the oldest decision tree algorithms, and RMS stands for Root Mean Square [34].

3. Methodology

This study seeks to determine the success of the online learning modality provided by the Distance Education Center of the Universidad Católica del Norte (DEC-UCN), using data mining to support a case analysis of the initial conditions of students in educational programs.
This study worked with data on the admission and final results of DEC-UCN students between 2000 and 2018. The total number of students was 12,264. The study stages were developed from the CRISP-DM model to analyze the database information and apply the corresponding tools. We used particular data mining techniques and algorithms, such as decision trees, descriptive statistics, and neural networks. The computational tool used was SPSS Statistics 22 [57]. The benefit of this tool is that it makes data mining decision making easy to understand.

3.1. Institution Background

The origin of DEC-UCN dates back to 1982, in the form of programs leading to a professional degree. The DEC-UCN was instituted in 1996. Located in Antofagasta, the capital city of Chile’s second region, the DEC-UCN is under the office of the university’s academic vice-rector. The pedagogical model of the DEC-UCN is in harmony with the PE-UCN design (PEdagogical model at the Universidad Católica del Norte). In other words, it is a pedagogical model that is accountable to the institutional mission, supported by education in human values, based on training by competencies, and attentive to the constant changes in society. This academic unit develops continuous training spaces through 100% online programs, using an education model focused on technologies to support distance education. One of DEC-UCN’s biggest problems is the lack of knowledge about the success variables of students who choose to study in a distance program.

3.2. Selection and Understanding of Data

This study required the data of the students enrolled between January 2000 and December 2018, stored on a local server (LICANCABUR) with restricted access. This server provides services to the Oracle Developer 2000 database system called ANTEC, the official database of the DEC-UCN. Other complementary data are stored in files and printed matter kept by the managers of the different areas. We obtained 18,610 DEC-UCN records, exported from the ANTEC database system through SQL (Structured Query Language) queries that generated an Excel file with the requested data. Regarding the data source, DEC-UCN offered online education mainly for technical majors in the areas of management, business, and computer programming, with 430, 558, and 296 students, respectively. Since the records span several years, some students appear in more than one record.
In the ANTEC database system, we used the ANTEC browser to perform the necessary SQL queries. We selected the tables STUDENT, STUDENT_ADDRESS, PROGRAM, and STUDENT_PROGRAM to obtain the personal data and the prior academic status that a student reached in a given program. Excluding the remaining data does not affect the sample, since they did not contain reliable information for the investigation. Figure 4 presents an extract of the relational model of the ANTEC system.
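An extraction of this kind can be sketched with an in-memory SQLite database. Only the four table names come from the text; every column name, the join keys, and the sample rows below are hypothetical, and SQLite stands in for the Oracle-based ANTEC system, which is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: only the table names are taken from the study.
conn.executescript("""
CREATE TABLE STUDENT (rut TEXT PRIMARY KEY, sex TEXT, birth_date TEXT, education TEXT);
CREATE TABLE STUDENT_ADDRESS (rut TEXT, commune TEXT, city TEXT);
CREATE TABLE PROGRAM (program_id INTEGER PRIMARY KEY, name TEXT,
                      registration_date TEXT, program_type TEXT);
CREATE TABLE STUDENT_PROGRAM (rut TEXT, program_id INTEGER,
                              academic_status TEXT, final_grade REAL);

-- Made-up rows for illustration only.
INSERT INTO STUDENT VALUES ('1-9', 'F', '1980-05-01', 'Secondary');
INSERT INTO STUDENT_ADDRESS VALUES ('1-9', 'Antofagasta', 'Antofagasta');
INSERT INTO PROGRAM VALUES (1, 'Total Quality Management', '2005-03-01', 'Diploma');
INSERT INTO PROGRAM VALUES (2, 'Integrated Management', '1999-03-01', 'Diploma');
INSERT INTO STUDENT_PROGRAM VALUES ('1-9', 1, 'Graduated', 5.8);
INSERT INTO STUDENT_PROGRAM VALUES ('1-9', 2, 'Eliminated', 3.1);
""")

# Join the four tables and keep registrations whose year lies in the study period.
query = """
SELECT s.rut, s.sex, a.city, p.name, p.program_type,
       sp.academic_status, sp.final_grade
FROM STUDENT s
JOIN STUDENT_ADDRESS a ON a.rut = s.rut
JOIN STUDENT_PROGRAM sp ON sp.rut = s.rut
JOIN PROGRAM p ON p.program_id = sp.program_id
WHERE CAST(strftime('%Y', p.registration_date) AS INTEGER) BETWEEN ? AND ?
"""
rows = conn.execute(query, (2000, 2018)).fetchall()
for row in rows:
    print(row)
```

The same student can appear in several result rows, one per program, which matches the remark that some students appear in more than one record.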

3.3. Data Preparation

We applied the necessary changes to the files with the SPSS Statistics 22 tool due to its presentation capabilities that make the result more understandable to the end-user.
  • Data selection and cleaning: first, we selected the data attributes, considering the objective and data quality problems. As a result, we selected the following tables and attributes: (i) STUDENT (RUT of the student, sex, date of birth, profession, nationality, marital status, education); (ii) STUDENT_ADDRESS (RUT of the student, commune, city); (iii) PROGRAM (name of the program, date of registration, type of program); (iv) STUDENT_PROGRAM (RUT of the student, academic situation, final grade of the program). For the analysis, we applied the following filter: students who enrolled in the programs between 2000 and 2018, that is, (enrollment date >= 2000) and (enrollment date <= 2018).
  • Data Quality: a problem presented by the data is the amount of missing data in some attributes, such as the program name. We decided to keep the records with unknown values because eliminating them would exclude rows with valid values in the target set. Moreover, we categorized the data because the applied analysis techniques (classification) mainly use categorical data to facilitate their construction and interpretation.
  • Construction of the Data: to fulfill the project’s objective, an attribute called region was created, which derives from the attribute commune and city. The region attribute takes the value corresponding to the region that the commune and city belong to. When the data selection and construction process was complete, the changes were saved to the files for their use later in the modeling stage. The new files are in SPSS format.
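The three preparation steps above can be sketched as follows. The study performed them in SPSS Statistics 22; this Python version is only an illustration, and the field names, the sample records, the "Unknown" placeholder, and the commune-to-region lookup are all assumptions.

```python
from datetime import date

# Hypothetical commune -> region lookup; the study's actual mapping is not given.
REGION_BY_COMMUNE = {
    "Antofagasta": "Antofagasta",
    "Iquique": "Tarapacá",
    "Punta Arenas": "Magallanes",
}

# Made-up records for illustration only.
records = [
    {"rut": "1-9", "commune": "Iquique", "program_name": "Integrated Management",
     "enrollment_date": date(2004, 3, 1)},
    {"rut": "2-7", "commune": "Punta Arenas", "program_name": None,
     "enrollment_date": date(2018, 8, 1)},
    {"rut": "3-5", "commune": "Antofagasta", "program_name": "Psychopedagogy",
     "enrollment_date": date(1999, 3, 1)},
]


def prepare(records, first_year=2000, last_year=2018):
    """Filter by enrollment year, keep rows with missing values, derive region."""
    prepared = []
    for rec in records:
        # Selection: keep enrollments inside the study period (inclusive).
        if not first_year <= rec["enrollment_date"].year <= last_year:
            continue
        row = dict(rec)
        # Quality: missing values are kept as an explicit category, not dropped.
        row["program_name"] = rec["program_name"] or "Unknown"
        # Construction: derive the region attribute from the commune.
        row["region"] = REGION_BY_COMMUNE.get(rec["commune"], "Unknown")
        prepared.append(row)
    return prepared


prepared = prepare(records)
for row in prepared:
    print(row["rut"], row["program_name"], row["region"])
```

Keeping missing values as their own category preserves the other valid attributes of the row, which is the trade-off the Data Quality step describes.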

3.4. Modeling

We carried out the data modeling for the DEC-UCN at a global level. In this study, the classification model predicts the student profile associated with success in programs with an online learning modality. Considering the research and results of [58,59,60], we applied the AdaBoostM1 and J48 decision tree algorithms along with the naive Bayes and random forest algorithms for classification. The classification model takes as its dependent variable “state”, a categorical variable, with the category “Graduated” as the highest level of success of a student in the model. From this perspective, academic success is measured by whether students graduated from a program they started.
For the formulation of the model, we applied neural networks to identify relationships between the variables and determine their importance concerning the target variable. For constructing the decision trees, we initially used two algorithms: (i) the C5.0 algorithm, which presents rules that allow a clearer understanding of the generative partitioning; (ii) the CHAID algorithm, which, from a statistical point of view (based on the significance of the chi-square test), constructs the trees by comparing the categories and merging those that do not present differences in their results. Subsequently, a decision tree algorithm is selected based on the results obtained (case prediction) and the analysis of the construction of the tree itself.
To assess predictive accuracy, the study built a confusion matrix for each algorithm, from which we calculated Precision, Recall, F1, Accuracy, and the Matthews correlation coefficient (MCC). Table 1 defines the procedure and characteristics of those measures [61].
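The chi-square test that CHAID relies on compares observed counts in a contingency table with the counts expected under independence. The sketch below computes the Pearson statistic and its degrees of freedom for a made-up table; it illustrates the test only and is not the SPSS CHAID procedure used in the study.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of observed counts; all expected counts
    are assumed to be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows are two categories of a predictor,
# columns are (graduated, eliminated).
table = [[30, 10],
         [20, 40]]
statistic, df = chi_square(table)
print(f"chi-square={statistic:.3f}, df={df}")
```

A large statistic relative to its degrees of freedom indicates that the categories differ in outcome, so a CHAID-style tree would keep them as separate branches; categories with a small statistic would be merged.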
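The measures named above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are invented for illustration and are not the study's results, and the function name is an assumption.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from a binary confusion matrix.

    tp/fp/fn/tn are the counts of true positives, false positives,
    false negatives, and true negatives. Assumes tp + fp > 0 and tp + fn > 0.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Matthews correlation coefficient: +1 perfect, 0 chance level, -1 inverse.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}


# Hypothetical counts, with "Graduated" as the positive class.
metrics = classification_metrics(tp=60, fp=20, fn=15, tn=55)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC is useful alongside accuracy because it stays near zero when a classifier does little better than chance, even if one class dominates the data.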

4. Results

The statistical results are the behavior patterns that influence students’ success in online learning modality and their failures. The programs with the largest number of students are Human Resources Administration, Environmental Management, Family Medication, Psychopedagogy, Total Quality Management, Integrated Management, and Educational Orientation (see Table 2).
Initially, we present the analysis of the data using decision trees. This analysis shows that the first level of the tree identified the variable “type of programs” as the main predictor of student success at the DEC-UCN, from left to right, from nodes 20 to 22. The type of program with the highest percentage of graduates is Continuous Improvement. With a p-value of 0.001, a chi-square of 66.4, and 8 degrees of freedom (df), we can observe that students belonging to the Metropolitan, Magallanes, Tarapacá, and Bio Bio areas obtained the highest percentage of graduated students with 76.8%, followed by the Aysén and Los Lagos areas with 67.9%. Figure 5 illustrates the mentioned results.
The second type of program with the highest percentage of graduate students is Training and Technical Courses with 53.8%. In this program, students with non-university professions obtained a larger portion of qualifications than students with university professions from the art and health science areas. Students prefer this program due to its high degree percentage compared to professional and technical courses (see Figure 6).
Students with the highest percentage of elimination belong to undergraduate degree programs. Figure 7 depicts a total of 853 students in undergraduate programs. With a p-value of 0.002, a chi-square of 22.251, and 3 degrees of freedom (df), students who completed primary and secondary school but not professional or higher-level technician studies show the highest elimination trend: 362 of 454, that is, 79.7% of this group. In the same context, we also distinguish students who completed higher education studies, who present an elimination trend of 66.2% (264 of 399 students).
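The percentages in this paragraph can be checked directly from the reported counts. The short sketch below only recomputes figures already stated in the text; the group labels are paraphrases chosen for illustration.

```python
# (eliminated, total) counts as reported for undergraduate programs.
groups = {
    "secondary school only": (362, 454),
    "completed higher education": (264, 399),
}

# Elimination rate per group, as a percentage rounded to one decimal place.
rates = {
    name: round(100 * eliminated / total, 1)
    for name, (eliminated, total) in groups.items()
}
total_students = sum(total for _, total in groups.values())

for name, rate in rates.items():
    print(f"{name}: {rate}% eliminated")
print(f"students in undergraduate programs: {total_students}")
```

The two groups add up to the 853 undergraduate students shown in Figure 7, and each rate is relative to its own group rather than to all undergraduate students.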
The analysis of the data using neural networks gave additional information about the main variables that could predict the success of students in e-learning programs. Table 3 shows that, when using neural networks to identify the variables, the model yields 60.8% correct predictions. The model in Figure 8 showed that the determinant factors for academic success for all programs, from the highest to the lowest, are age, program code, profession, scholarity, type of program, region, and finally sex, according to the student’s final academic situation in the most successful programs, which gives a reasonable first approximation regarding the topic.
The study also compared the performance of the classification algorithms used, as defined in the research model. These results indicated that AdaBoostM1 and naive Bayes were the algorithms with the lowest performance. Table 4 shows that their precision, recall, and F-measure indicators were comparatively low. The AdaBoostM1 algorithm achieved a correct classification of 62.15%, compared to 61.7% for naive Bayes. Their MCC values are also close to zero (0.118 and 0.007), so their prediction is not much better than chance. Their ROC values are also quite close to 0.5, which is not an indication of good prediction. The J48 decision tree and random forest algorithms had the best results. The random forest algorithm stands out as the one with the best result, with 64.5% of correctly classified instances, achieving the best prediction of graduate students. In addition, this algorithm obtains the best MCC value, indicating a better relationship between the observed data and the prediction. The ROC value is also a sign of its good performance, with a value of 65.2%, well above the rest of the algorithms.

5. Discussion

Advances in technology now permit the massive application of data mining. As Soria-Barreto et al. [66] remark, computing tools and technologies permit more effective e-learning. In this research, we aimed to identify variables for student success or failure in the e-learning programs at DEC-UCN by applying the CRISP-DM methodology, which is one of the most widely used tools in this research field. We identified factors that determined student success in online programs through the decision tree and neural network techniques. Those results contribute to a greater understanding of the factors behind the topical issue of distance education in Chile. Our study identified the types of programs with the greatest success in terms of the student’s final academic situation and the programs with the greatest failure. The programs with the greatest failure are undergraduate and bachelor degrees, which require more time and dedication for their completion. The number of programs without a degree continues to increase because of their short duration.
This study is highly relevant for e-learning programs because it draws on the database of the oldest online program in Chile. The database contained student records from 2000 to 2018 inclusive; that is, 18,610 records in nineteen years. We highlight that our results found variables that determine the success and failure of students. Our study established that student success and failure largely depend on age, sex, previous education, job, and region. Understanding each program’s academic success factors is decisive for the selection of students and the dissemination of the programs. These results support the organization’s know-how to establish policies for disseminating programs and retaining students in online learning modalities. The variables found are relevant for online education in Chile and neighboring countries because educational institutions can consider those variables to organize their programs.

6. Conclusions, Recommendation, and Future Work

This study showed that data mining techniques are essential for discovering educational data patterns. Educational institutions can apply the described data mining techniques to analyze their data. Because we used data from a university in a developing country, institutions in countries such as Chile could use results from the presented techniques. Regarding them, we draw the following main conclusions:
  • The use of educational data mining, particularly the CRISP-DM methodology, greatly contributes to systematization and efficiency in identifying patterns in the data of distance education. The study allowed us to systematize the data in various sources and formats of the distance education platform in the institution under study (DEC-UCN) and provide valuable information for future analyses in this context.
  • Data mining tools can present more significant advantages than purely statistical tools since they are exploratory, allowing working with different dimensions of the same problem. It is also essential to highlight the possibility and flexibility of these analysis tools to allow us to work with categorical and numerical variables in the same analysis.
  • The performance analysis of the different classification algorithms indicated that the random forest and decision tree algorithms were the ones that allowed a better prediction of results and, therefore, identified the variables that could better explain the performance of students in e-learning programs. The decision tree proved to be a beneficial tool to find relationships between variables unidentified by previously used analysis tools, mainly because a decision tree uses techniques less restrictive than statistics. Those techniques do not require, for example, conditions of data normality and are tolerant of noise in the data.
In the case of the DEC-UCN, the study results will allow the organization to focus its admission efforts on retaining students who are potentially more exposed and prone to dropping out. We are currently applying big data techniques alongside data mining for pattern discovery in order to compare their results and determine the best approach.
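As an illustrative aside, the entropy-based splitting that lets a decision tree handle categorical variables without distributional assumptions can be sketched in a few lines. The sketch below computes plain information gain; C4.5-style trees [39] use a normalized variant (gain ratio), so this is a simplification and not the procedure used in the study. The records, the attribute names (`age_group`, `region`), and the outcome labels are hypothetical and are not taken from the DEC-UCN database.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one categorical attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy records (assumption): age_group separates the outcomes,
# region does not.
students = [
    {"age_group": "under_30", "region": "north", "outcome": "removed"},
    {"age_group": "under_30", "region": "south", "outcome": "removed"},
    {"age_group": "under_30", "region": "north", "outcome": "removed"},
    {"age_group": "under_30", "region": "south", "outcome": "removed"},
    {"age_group": "30_plus", "region": "north", "outcome": "certified"},
    {"age_group": "30_plus", "region": "south", "outcome": "certified"},
    {"age_group": "30_plus", "region": "north", "outcome": "certified"},
    {"age_group": "30_plus", "region": "south", "outcome": "certified"},
]

gains = {
    attribute: information_gain(students, attribute, "outcome")
    for attribute in ("age_group", "region")
}
# The attribute with the largest gain would be chosen for the first split.
best_attribute = max(gains, key=gains.get)
```

No normality check appears anywhere in the computation: the split criterion only counts class labels within each category, which is the property the conclusion above refers to.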

Author Contributions

Formal analysis, M.T.-Y. and J.M.R.; Investigation, A.S., C.V.-S. and G.M.; Data curation, M.T.-Y. and J.M.R.; Project administration, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data is part of the UCN database.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Coman, C.; Țîru, L.G.; Meseșan-Schmitz, L.; Stanciu, C.; Bularca, M.C. Online teaching and learning in higher education during the coronavirus pandemic: Students’ perspective. Sustainability 2020, 12, 10367. [Google Scholar] [CrossRef]
  2. Koedinger, K.R.; D’Mello, S.; McLaughlin, E.A.; Pardos, Z.A.; Rosé, C.P. Data mining and education. WIREs Cogn. Sci. 2015, 6, 333–353. [Google Scholar] [CrossRef] [PubMed]
  3. Asín, A.; Peinado, J.; Jurado, P. La sociedad del conocimiento y las TICs: Una inmejorable oportunidad para el cambio docente. In Pixel-Bit: Revista de Medios y Educación Nº 34; Universidad de Sevilla: Seville, Spain, 2009; pp. 179–204. ISSN 1133-8482. [Google Scholar]
  4. Delone, W.H.; McLean, E.R. The DeLone and McLean Model of Information Systems Success: A Ten-Year Update. J. Manag. Inf. Syst. 2003, 19, 9–30. [Google Scholar]
  5. Alsabawy, A.; Cater-Steel, A.; Soar, J. A Model to Measure E-Learning Systems Success. Meas. Organ. Inf. Syst. Success New Technol. Pract. 2012, 39, 293–317. [Google Scholar] [CrossRef] [Green Version]
  6. Herrera, M.; Ruiz, S.; Romagnano, M.R.; Ganga, L.; Lund, M.I.; Torres, E. Aplicando métodos y técnicas de la ciencia de los datos a datos universitarios. In Proceedings of the XXI Workshop de Investigadores en Ciencias de la Computación WICC 2019, Universidad Nacional de San Juan, San Jose, Argentina, 21 October 2019. [Google Scholar]
  7. Martínez-Plumed, F.; Contreras-Ochando, L.; Ferri, C.; Hernández Orallo, J.; Kull, M.; Lachiche, N.; Ramírez Quintana, M.J.; Flach, P.A. CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Trans. Knowl. Data Eng. 2019, 33, 3048–3061. [Google Scholar] [CrossRef] [Green Version]
  8. Hussin, W.N.T.W.; Harun, J.; Shukor, N.A. A Review on the Classification of Students’ Interaction in Online Social Collaborative Problem-based Learning Environment: How Can We Enhance the Students’ Online Interaction? Univ. J. Educ. Res. 2019, 7, 125–134. [Google Scholar] [CrossRef]
  9. Fukuzawa, S.; Cahn, J. Technology in problem-based learning: Helpful or hindrance? Int. J. Inf. Learn. Technol. 2019, 36, 66–76. [Google Scholar] [CrossRef]
  10. Valverde-Berrocoso, J.; Garrido-Arroyo, M.d.C.; Burgos-Videla, C.; Morales-Cevallos, M.B. Trends in educational research about e-learning: A systematic literature review (2009–2018). Sustainability 2020, 12, 5153. [Google Scholar] [CrossRef]
  11. Ocaña, J.M.; Morales-Urrutia, E.K.; Pérez-Marín, D.; Pizarro, C. Can a learning companion be used to continue teaching programming to children even during the COVID-19 pandemic? IEEE Access 2020, 8, 157840–157861. [Google Scholar] [CrossRef]
  12. Palacios, C.A.; Reyes-Suárez, J.A.; Bearzotti, L.A.; Leiva, V.; Marchant, C. Knowledge Discovery for Higher Education Student Retention Based on Data Mining: Machine Learning Algorithms and Case Study in Chile. Entropy 2021, 23, 485. [Google Scholar] [CrossRef]
  13. Gao, P.; Wu, W.; Yang, Y. Discovering Themes and Trends in Digital Transformation and Innovation Research. J. Theor. Appl. Electron. Commer. Res. 2022, 17, 1162–1184. [Google Scholar] [CrossRef]
  14. Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. From data mining to knowledge discovery in databases. AI Mag. 1996, 17, 37. [Google Scholar]
  15. Nájera, A.B.U.; de la Calleja Mora, J. Brief review of educational applications using data mining and machine learning. Redie. Rev. Electrón. Investig. Educ. 2017, 19, 84–96. [Google Scholar]
  16. Cummins, M.R. Nonhypothesis-driven research: Data mining and knowledge discovery. In Clinical Research Informatics; Springer: Berlin/Heidelberg, Germany, 2019; pp. 341–356. [Google Scholar]
  17. Sugiyarti, E.; Jasmi, K.A.; Basiron, B.; Huda, M.; Shankar, K.; Maseleno, A. Decision support system of scholarship grantee selection using data mining. Int. J. Pure Appl. Math. 2018, 119, 2239–2249. [Google Scholar]
  18. Witten, I.H.; Frank, E. Data mining: Practical machine learning tools and techniques with Java implementations. ACM Sigmod Rec. 2002, 31, 76–77. [Google Scholar] [CrossRef]
  19. Ngo, T. Data mining: Practical machine learning tools and technique, by ian h. witten, eibe frank, mark a. hell. ACM SIGSOFT Softw. Eng. Notes 2011, 36, 51–52. [Google Scholar] [CrossRef]
  20. Scheuer, O.; McLaren, B.M. Educational data mining. Encycl. Sci. Learn. 2012, 1075, 1079. [Google Scholar]
  21. Hernández-Blanco, A.; Herrera-Flores, B.; Tomás, D.; Navarro-Colorado, B. A systematic review of deep learning approaches to educational data mining. Complexity 2019, 2019, 1306039. [Google Scholar] [CrossRef]
  22. Cengiz, M.; Birant, K.U.; Yildirim, P.; Birant, D. Development of an interactive game-based learning environment to teach data mining. Int. J. Eng. Educ. 2017, 33, 1598–1617. [Google Scholar]
  23. Almaiah, M.A.; Almulhem, A. A conceptual framework for determining the success factors of e-learning system implementation using Delphi technique. J. Theor. Appl. Inf. Technol. 2018, 96, 5962–5976. [Google Scholar]
  24. Almaiah, M.A.; Alyoussef, I.Y. Analysis of the effect of course design, course content support, course assessment and instructor characteristics on the actual use of E-learning system. IEEE Access 2019, 7, 171907–171922. [Google Scholar] [CrossRef]
  25. Almaiah, M.A.; Alismaiel, O.A. Examination of factors influencing the use of mobile learning system: An empirical study. Educ. Inf. Technol. 2019, 24, 885–909. [Google Scholar] [CrossRef]
  26. Almaiah, M.A.; Al-Khasawneh, A.; Althunibat, A. Exploring the critical challenges and factors influencing the E-learning system usage during COVID-19 pandemic. Educ. Inf. Technol. 2020, 25, 5261–5280. [Google Scholar] [CrossRef] [PubMed]
  27. Hendrickx, T.; Cule, B.; Meysman, P.; Naulaerts, S.; Laukens, K.; Goethals, B. Mining Association Rules in Graphs Based on Frequent Cohesive Itemsets. In Proceedings of the Advances in Knowledge Discovery and Data Mining; Cao, T., Lim, E.P., Zhou, Z.H., Ho, T.B., Cheung, D., Motoda, H., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 637–648. [Google Scholar]
  28. Moro, S.; Cortez, P.; Laureano, R. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology; EUROSIS-ETI: Ostend, Belgium, 2011. [Google Scholar]
  29. Ghazal, M.M.; Hammad, A. Application of knowledge discovery in database (KDD) techniques in cost overrun of construction projects. Int. J. Constr. Manag. 2022, 22, 1632–1646. [Google Scholar] [CrossRef]
  30. Hand, D.J.; Smyth, P.; Mannila, H. Principles of Data Mining; MIT Press: Cambridge, MA, USA, 2001. [Google Scholar]
  31. Dåderman, A.; Rosander, S. Evaluating Frameworks for Implementing Machine Learning in Signal Processing: A Comparative Study of CRISP-DM, SEMMA and KDD; KTH, School of Electrical Engineering and Computer Science (EECS): Stockholm, Sweden, 2018. [Google Scholar]
  32. Wiemer, H.; Drowatzky, L.; Ihlenfeldt, S. Data Mining Methodology for Engineering Applications (DMME)—A Holistic Extension to the CRISP-DM Model. Appl. Sci. 2019, 9, 2407. [Google Scholar] [CrossRef] [Green Version]
  33. Wirth, R.; Hipp, J. CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, Manchester, UK, 11–13 April 2000; Volume 1, pp. 29–39. [Google Scholar]
  34. Phyu, T.N. Survey of classification techniques in data mining. In Proceedings of the International Multiconference of Engineers and Computer Scientists, London, UK, 1–3 July 2009; Volume 1. [Google Scholar]
  35. Soofi, A.A.; Awan, A. Classification techniques in machine learning: Applications and issues. J. Basic Appl. Sci. 2017, 13, 459–465. [Google Scholar] [CrossRef]
  36. Mahesh, B. Machine learning algorithms-a review. Int. J. Sci. Res. (IJSR) 2020, 9, 381–386. [Google Scholar]
  37. Phan, T.N.; Kuch, V.; Lehnert, L.W. Land Cover Classification using Google Earth Engine and Random Forest Classifier—The Role of Image Composition. Remote Sens. 2020, 12, 2411. [Google Scholar] [CrossRef]
  38. Hameed, K.; Chai, D.; Rassau, A. A sample weight and adaboost cnn-based coarse to fine classification of fruit and vegetables at a supermarket self-checkout. Appl. Sci. 2020, 10, 8667. [Google Scholar] [CrossRef]
  39. Quinlan, J. C4.5: Programs for Machine Learning; Ebrary online; Elsevier Science: Amsterdam, The Netherlands, 2014. [Google Scholar]
  40. Badawi, S.A.Q.; Takruri, M.; Albadawi, Y.; Khattak, M.A.K.; Nileshwar, A.K.; Mosalam, E. Four Severity Levels for Grading the Tortuosity of a Retinal Fundus Image. J. Imaging 2022, 8, 258. [Google Scholar] [CrossRef]
  41. Chaves, L.; Marques, G. Data mining techniques for early diagnosis of diabetes: A comparative study. Appl. Sci. 2021, 11, 2218. [Google Scholar] [CrossRef]
  42. Martínez-Cerdá, J.F.; Torrent-Sellens, J.; González-González, I. Socio-technical e-learning innovation and ways of learning in the ICT-space-time continuum to improve the employability skills of adults. Comput. Hum. Behav. 2020, 107, 105753. [Google Scholar] [CrossRef]
  43. Pozón-López, I.; Kalinic, Z.; Higueras-Castillo, E.; Liébana-Cabanillas, F. A multi-analytical approach to modeling of customer satisfaction and intention to use in Massive Open Online Courses (MOOC). Interact. Learn. Environ. 2020, 28, 1003–1021. [Google Scholar] [CrossRef]
  44. Gilar-Corbi, R.; Pozo-Rico, T.; Castejón, J.L. Desarrollando la Inteligencia Emocional en Educación Superior: Evaluación de la Efectividad de un Programa en tres Países; Universidad Nacional de Educación a Distancia (España): Madrid, Spain, 2019. [Google Scholar]
  45. Wani, H.A. The relevance of e-learning in higher education. ATIKAN 2013, 3. [Google Scholar]
  46. Meskhi, B.; Ponomareva, S.; Ugnich, E. E-learning in higher inclusive education: Needs, opportunities and limitations. Int. J. Educ. Manag. 2019, 33, 424–437. [Google Scholar] [CrossRef]
  47. Saqr, M.; Alamro, A. The role of social network analysis as a learning analytics tool in online problem based learning. BMC Med. Educ. 2019, 19, 160. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  48. Al-Fraihat, D.; Joy, M.; Sinclair, J. Evaluating E-learning systems success: An empirical study. Comput. Hum. Behav. 2020, 102, 67–86. [Google Scholar] [CrossRef]
  49. Romi, I.M. A Model for e-Learning Systems Success: Systems, Determinants, and Performance; Palestine Polytechnic University: Hebron, Palestinian, 2017. [Google Scholar]
  50. Hayashi, A.; Chen, C.; Ryan, T.; Wu, J. The role of social presence and moderating role of computer self efficacy in predicting the continuance usage of e-learning systems. J. Inf. Syst. Educ. 2020, 15, 5. [Google Scholar]
  51. Damabi, M.; Firoozbakht, M.; Ahmadyan, A. A Model for Customers Satisfaction and Trust for Mobile Banking Using DeLone and McLean Model of Information Systems Success. J. Soft Comput. Decis. Support Syst. 2018, 5, 21–28. [Google Scholar]
  52. Donovan, E.; Guzman, I.R.; Adya, M.; Wang, W. A Cloud Update of the DeLone and McLean Model of Information Systems Success. J. Inf. Technol. Manag. 2018, 29, 23–34. [Google Scholar]
  53. Németh, T. How to back up Modules with blended learning The e-Learning platform of FAME. Prosperitas 2019, 6, 102–141. [Google Scholar] [CrossRef]
  54. Radha, S.; Michael Mariadhas, J.; Subramani, A.; Akbar Jan, N. Role of e-learning and digital media resources in employability of management students. Online J. Distance Educ. e-Learn. 2019, 7, 116–123. [Google Scholar]
  55. Cidral, W.A.; Oliveira, T.; Di Felice, M.; Aparicio, M. E-learning success determinants: Brazilian empirical study. Comput. Educ. 2018, 122, 273–290. [Google Scholar] [CrossRef]
  56. García Aretio, L. El problema del abandono en estudios a distancia. Respuestas desde el Diálogo Didáctico Mediado. RIED. Rev. Iberoam. Educ. Distancia 2019, 22, 245–270. [Google Scholar] [CrossRef]
  57. Weinberg, S.L.; Abramowitz, S.K. Statistics Using IBM SPSS: An Integrative Approach, 3rd ed.; Cambridge University Press: Cambridge, CA, USA, 2016. [Google Scholar]
  58. Li, M.; Xu, H.; Deng, Y. Evidential Decision Tree Based on Belief Entropy. Entropy 2019, 21, 897. [Google Scholar] [CrossRef] [Green Version]
  59. Zhao, L.; Lee, S.; Jeong, S.P. Decision Tree Application to Classification Problems with Boosting Algorithm. Electronics 2021, 10, 1903. [Google Scholar] [CrossRef]
  60. Chiu, Y.P. Social Recommendations for Facebook Brand Pages. J. Theor. Appl. Electron. Commer. Res. 2021, 16, 71–84. [Google Scholar] [CrossRef]
  61. Hossin, M.; Sulaiman, M.N. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1. [Google Scholar]
  62. Nhu, V.H.; Janizadeh, S.; Avand, M.; Chen, W.; Farzin, M.; Omidvar, E.; Shirzadi, A.; Shahabi, H.; Clague, J.; Jaafari, A.; et al. Gis-based gully erosion susceptibility mapping: A comparison of computational ensemble data mining models. Appl. Sci. 2020, 10, 2039. [Google Scholar] [CrossRef] [Green Version]
  63. Tsiakmaki, M.; Kostopoulos, G.; Kotsiantis, S.; Ragos, O. Implementing AutoML in educational data mining for prediction tasks. Appl. Sci. 2019, 10, 90. [Google Scholar] [CrossRef] [Green Version]
  64. Chicco, D.; Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med. Inform. Decis. Mak. 2020, 20, 1–16. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  65. Jiménez-Valverde, A. Insights into the area under the receiver operating characteristic curve (AUC) as a discrimination measure in species distribution modelling. Glob. Ecol. Biogeogr. 2012, 21, 498–507. [Google Scholar] [CrossRef]
  66. Soria-Barreto, K.; Ruiz-Campo, S.; Al-Adwan, A.S.; Zuniga-Jara, S. University students intention to continue using online learning tools and technologies: An international comparison. Sustainability 2021, 13, 13813. [Google Scholar] [CrossRef]
Figure 1. KDD methodology process.
Figure 2. The CRISP-DM methodology process.
Figure 3. Proposed model to predict the most influential factors of students at risk.
Figure 4. Excerpt from the relational model of the ANTEC database system.
Figure 5. Decision Tree I: Program with the highest percentage of graduates.
Figure 6. Decision Tree II: Second program with the highest percentage of graduates.
Figure 7. Decision Tree III: Program with the highest percentage of students eliminated.
Figure 8. Determinant factors for academic success.
Table 1. Data mining metrics.
| Metric | Definition |
| --- | --- |
| Precision | Measures the positive patterns that are correctly predicted out of all patterns predicted as positive [62]. |
| Recall | Measures the fraction of positive patterns that are correctly classified [62]. |
| Accuracy | Measures the ratio of correct predictions over the total number of instances evaluated [63]. |
| F1 | Harmonic mean of the recall and precision values [41]. |
| Matthews correlation coefficient (MCC) | Measure that is not affected by class imbalance in the dataset. MCC is a correlation coefficient between observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 is no better than a random prediction, and −1 indicates complete disagreement between prediction and observation [64]. |
| ROC curve | Graphic representation of the relationship between the true-positive and false-positive rates of the classifier. The area under the ROC curve (AUC) provides a way to evaluate which model is better on average. A model is considered to discriminate better than chance if the curve lies above the diagonal of no discrimination, i.e., if the AUC is higher than 0.5 [65]. |
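The definitions in Table 1 can be written out as a short sketch for the binary case. The function name `binary_metrics` and the example counts are assumptions made for illustration; they are not values from the study.

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Compute the Table 1 metrics from binary confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    accuracy = _ratio(tp + tn, tp + fp + fn + tn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    mcc = _ratio(
        tp * tn - fp * fn,
        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts (assumption), chosen only to exercise the formulas.
example = binary_metrics(tp=40, fp=10, fn=20, tn=30)
perfect = binary_metrics(tp=5, fp=0, fn=0, tn=5)
inverted = binary_metrics(tp=0, fp=5, fn=5, tn=0)
```

The `perfect` and `inverted` cases reach the two ends of the MCC range described in Table 1.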
Table 2. List of educational programs with the largest number of students.
| Program | Frequency | Percentage |
| --- | --- | --- |
| Human Resources Management | 1996 | 10.7 |
| Environmental Management | 1617 | 8.7 |
| Family Mediation | 1506 | 8.1 |
| Psychopedagogy | 1499 | 8.0 |
| Total Quality Management Total | 1169 | 6.3 |
| Integrated Management: Quality, Environment, and Safety | 1009 | 5.4 |
| Educational Orientation | 894 | 4.8 |
| Higher Education | 648 | 3.5 |
| Education and Professional Technical High School Teacher | 545 | 2.9 |
| Primary Education Teacher with a Minor in NB1 and NB2 | 509 | 2.7 |
| Family Counseling | 507 | 2.7 |
| Behavioral Management Techniques Applied to Children and Adolescents | 499 | 2.7 |
| Education and Primary Education Teacher | 467 | 2.5 |
| Criminal Procedural Law: “Accusatory System or Oral Trial” | 446 | 2.4 |
| Communication and Language Disorder | 418 | 2.2 |
| Educational Administration | 402 | 2.2 |
| Administration of Technical-Pedagogical Units | 388 | 2.1 |
| Preparation and Evaluation of Investment Projects | 326 | 1.7 |
| Minor in Language and Communication for Teachers of the Second Cycle of Language and Communication | 292 | 1.6 |
| Minor in Education in Mathematics for Teachers of the Second Cycle of Basic General Education | 247 | 1.3 |
| Degree in Education and Primary Education Teacher | 237 | 1.3 |
| Management in Corporate Communication | 179 | 1.0 |
| Continuous Improvement | 173 | 0.9 |
| Higher level in Executive Secretariat | 154 | 0.8 |
| Mathematics Education for Primary Education Teachers | 145 | 0.8 |
| Pedagogical Management for Higher Level Technical Training | 129 | 0.7 |
| Formulation and evaluation of projects | 98 | 0.5 |
| Others | 2180 | 11.6 |
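Table 3. Importance grade of variables in the global program type using neural networks.
| Sample | Observed | Predicted Act | Predicted Rem | Predicted Abn | Predicted Trans | Predicted Grad | Predicted Cert | Correct % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training | Active (Act) | 0 | 2 | 0 | 0 | 0 | 34 | 0.0% |
| Training | Removed (Rem) | 0 | 6102 | 0 | 0 | 0 | 1140 | 84.3% |
| Training | Abnegated (Abn) | 0 | 426 | 0 | 0 | 0 | 20 | 0.0% |
| Training | Transferred (Trans) | 0 | 17 | 0 | 0 | 0 | 3 | 0.0% |
| Training | Graduated (Grad) | 0 | 302 | 0 | 0 | 0 | 201 | 0.0% |
| Training | Certified (Cert) | 0 | 2906 | 0 | 0 | 0 | 1720 | 37.2% |
| Training | Overall % | 0.0% | 75.8% | 0.0% | 0.0% | 0.0% | 24.2% | 60.8% |
| Testing | Active (Act) | 0 | 2 | 0 | 0 | 0 | 16 | 0.0% |
| Testing | Removed (Rem) | 0 | 2565 | 0 | 0 | 0 | 479 | 84.3% |
| Testing | Abnegated (Abn) | 0 | 181 | 0 | 0 | 0 | 8 | 0.0% |
| Testing | Transferred (Trans) | 0 | 13 | 0 | 0 | 0 | 0 | 0.0% |
| Testing | Graduated (Grad) | 0 | 103 | 0 | 0 | 0 | 97 | 0.0% |
| Testing | Certified (Cert) | 0 | 1300 | 0 | 0 | 0 | 753 | 36.7% |
| Testing | Overall % | 0.0% | 75.0% | 0.0% | 0.0% | 0.0% | 24.5% | 60.2% |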
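The "Correct %" column and the overall row of Table 3 can be recomputed from the cell counts. The sketch below uses the training-partition counts as read from Table 3 (rows are observed classes, columns are predicted classes); the function names are illustrative assumptions.

```python
CLASSES = ["Act", "Rem", "Abn", "Trans", "Grad", "Cert"]

# Training partition of Table 3: one row per observed class,
# one column per predicted class, in the order given by CLASSES.
TRAINING = [
    [0, 2, 0, 0, 0, 34],       # Active
    [0, 6102, 0, 0, 0, 1140],  # Removed
    [0, 426, 0, 0, 0, 20],     # Abnegated
    [0, 17, 0, 0, 0, 3],       # Transferred
    [0, 302, 0, 0, 0, 201],    # Graduated
    [0, 2906, 0, 0, 0, 1720],  # Certified
]


def per_class_correct(matrix):
    """Fraction of each observed class that was predicted correctly."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


def overall_accuracy(matrix):
    """Share of all instances that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def predicted_share(matrix):
    """Share of all instances assigned to each predicted class."""
    total = sum(sum(row) for row in matrix)
    return [sum(row[j] for row in matrix) / total for j in range(len(matrix[0]))]


correct_by_class = dict(zip(CLASSES, per_class_correct(TRAINING)))
accuracy = overall_accuracy(TRAINING)
share_by_class = dict(zip(CLASSES, predicted_share(TRAINING)))
```

Under these counts, only the Removed and Certified columns receive any predictions, which is why the other four classes show 0.0% correct.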
Table 4. Classification results of applied algorithms.
| Algorithm | TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AdaBoostM1 | 0.622 | 0.532 | 0.593 | 0.622 | 0.574 | 0.118 | 0.543 | 0.559 |
| Naïve Bayes | 0.617 | 0.616 | 0.548 | 0.617 | 0.475 | 0.007 | 0.535 | 0.554 |
| Random Forest | 0.645 | 0.435 | 0.634 | 0.645 | 0.636 | 0.222 | 0.652 | 0.658 |
| Tree J48 | 0.643 | 0.494 | 0.625 | 0.643 | 0.607 | 0.184 | 0.604 | 0.609 |
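To make the comparison in Table 4 explicit, the following sketch ranks the four algorithms by a chosen metric. The dictionary simply transcribes the MCC and ROC-area columns of Table 4; the helper name `rank_by` is an assumption made for illustration.

```python
# MCC and ROC area per algorithm, transcribed from Table 4.
RESULTS = {
    "AdaBoostM1": {"mcc": 0.118, "roc_area": 0.543},
    "Naïve Bayes": {"mcc": 0.007, "roc_area": 0.535},
    "Random Forest": {"mcc": 0.222, "roc_area": 0.652},
    "Tree J48": {"mcc": 0.184, "roc_area": 0.604},
}


def rank_by(results, metric):
    """Return algorithm names sorted from best to worst on one metric."""
    return sorted(results, key=lambda name: results[name][metric], reverse=True)


ranking_by_mcc = rank_by(RESULTS, "mcc")
ranking_by_roc = rank_by(RESULTS, "roc_area")

# Algorithms whose ROC area exceeds the 0.5 chance level.
better_than_chance = [
    name for name, scores in RESULTS.items() if scores["roc_area"] > 0.5
]
```

Both rankings place random forest first and the J48 decision tree second, in line with the conclusion drawn from the performance analysis.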
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
