Article

Predictive Modeling of Student Dropout in MOOCs and Self-Regulated Learning

by Georgios Psathas 1,*, Theano K. Chatzidaki 2 and Stavros N. Demetriadis 1,*

1 School of Informatics, Aristotle University of Thessaloniki, 541 24 Thessaloniki, Greece
2 Department of Economics, University of Macedonia, 156 Egnatia Street, 546 36 Thessaloniki, Greece
* Authors to whom correspondence should be addressed.
Computers 2023, 12(10), 194; https://doi.org/10.3390/computers12100194
Submission received: 31 July 2023 / Revised: 8 September 2023 / Accepted: 19 September 2023 / Published: 27 September 2023
(This article belongs to the Special Issue Recent Advances in Computer-Assisted Learning)

Abstract
The primary objective of this study is to examine the factors that contribute to the early prediction of dropout in Massive Open Online Courses (MOOCs) in order to identify and support at-risk students. We utilize data from MOOCs of fixed duration with a guided study pace. The dataset exhibits class imbalance, and we apply oversampling techniques to balance the data and avoid biased prediction. We examine the predictive performance of five classic classification machine learning (ML) algorithms under four different oversampling techniques and various evaluation metrics. Additionally, we explore the influence of self-reported self-regulated learning (SRL) data provided by students, as well as various other prominent MOOC features, as potential indicators for early stage dropout prediction. The research questions focus on (1) the performance of the classic classification ML models, assessed with various evaluation metrics, before and after different methods of oversampling, (2) which self-reported data may constitute crucial predictors of dropout propensity, and (3) the effect of the SRL factor on dropout prediction performance. The main conclusions are: (1) prominent predictors, including employment status, frequency of chat tool usage, prior subject-related experience, gender, education, and willingness to participate, exhibit remarkable efficacy in achieving high to excellent recall performance, particularly when specific combinations of algorithms and oversampling methods are applied, (2) the self-reported SRL factor, combined with other easily provided/self-reported features, performed well as a predictor in terms of recall when the LR and SVM algorithms were employed, and (3) it is crucial to test diverse machine learning algorithms and oversampling methods in predictive modeling.

1. Introduction

This study aims to examine the factors that contribute to the early prediction of MOOC dropout, in order to identify at-risk students and facilitate their successful completion through supportive interventions. Emphasis is placed on the SRL factor, which is often overlooked in MOOC dropout prediction studies. The motivation for this research lies in the persistent issue of high dropout rates in MOOCs and in investigating the role of the SRL factor in addressing this problem, given that students with higher SRL indices have been shown to achieve better learning outcomes. Additionally, the study explores how employing different oversampling techniques to address the class imbalance problem affects prediction performance. Therefore, we used a variety of oversampling techniques and algorithms to compare prediction performance and identify the best combinations for the specific dataset.

1.1. Massive Open Online Courses (MOOCs)

MOOCs are a distance education model established in 2008 by George Siemens. Many MOOC platforms offer an educational design with video lectures, announcements, forums, and assessments (quizzes, assignments, etc.) [1,2]. Some MOOCs allow students to progress at their own pace, while others follow a predetermined schedule [3]. The acquisition of completion certificates serves as a motivation for many students [4]. Most students fail to complete MOOCs successfully, even if they intend to do so [5]. The challenges of MOOCs include the absence of a supporting and guiding instructor [6] and limited social interaction between teachers and students [7], but the most critical challenge is the high dropout rate observed within the MOOC environment [3]. In the literature, dropout rate percentages are cited as 93.5% [8] or 91–93% [9].
Many studies have extensively explored the phenomenon of dropout rates in MOOC courses. Several researchers have reported very low completion rates, falling below 10% and even as low as 5% [3,10,11,12]. Although these figures may appear alarming, such an assessment is based on the assumption that enrollment in a MOOC is comparable to enrolling in a traditional course, which is not always the case: the intentions of individuals enrolling in a MOOC differ, as some seek professional development while others pursue simple information or entertainment [11]. However, it is essential to acknowledge some positive exceptions to the high dropout rates. For instance, programming MOOCs have demonstrated retention rates above 60% [3,13].
Identifying and exploring factors directly influencing the attrition of students from MOOCs will enable researchers and educators to examine novel strategies and techniques to enhance students’ persistence and successful course completion. Dalipi et al. [14] have categorized these factors contributing to high dropout rates into those associated with students (such as lack of motivation, poor time management, inadequate background knowledge, and skills) and those related to MOOCs (course design, lack of interactions, hidden costs).
Ihantola et al. [3] conducted a study to investigate the attrition rates of students in MOOCs with flexible versus strict scheduling. The findings revealed that students enrolled in MOOCs with flexible scheduling were more likely to drop out early compared to those in MOOCs with rigid schedules. In the strictly scheduled MOOC, approximately 17% of students abandoned the course within the first week, while the corresponding rate for the flexible MOOC was 50%. However, after the initial week, the dropout behavior between the two versions of the MOOC became nearly similar.
Furthermore, the researchers observed that both versions of MOOCs had students who completed all computer programming assignments within a week but did not continue further, possibly due to perceiving a heavy workload. Therefore, the authors suggest that identifying the profiles of students who benefit from each type of MOOC could lead to novel, more effective methods of organizing and grading courses. Previous studies have also identified that the lack of a sense of community and ineffective social interactions and collaborations contribute to the high attrition rates in MOOCs [7,14,15].
Hone and El Said [16] also investigated factors influencing retention in MOOCs and found that 32.2% of students successfully completed their preferred courses, a rate surpassing the average completion rate. The main driver for their completion was the satisfaction derived from the course content, which was perceived as unique and not readily available elsewhere. However, non-completers identified several reasons for their discontinuation, including feelings of isolation due to inadequate communication channels, perceived complexity and technical difficulties of the courses, and a lack of engagement. A related study by Zhang [17] explored how to enhance MOOC attractiveness by aligning courses with students’ regulatory foci. The observations made in this study indicate that students with promotion-focused mindsets were more influenced by advocates emphasizing gains and positive outcomes, while prevention-focused students responded better to advocates stressing the avoidance of losses.

1.2. Prediction of Dropout and the SRL Factor

According to Gardner and Brooks (2018) [2], regarding the statistical models used to map features to predictions, supervised learning techniques are used far more extensively in predictive student modeling in MOOCs than unsupervised approaches, as student dropout/stopout is easily observable. In supervised learning, models are trained using labelled training data and, on the basis of those data, predict the output. We indicatively present popular techniques for MOOC learner modeling that have shown very good empirical performance in large-scale MOOC modeling studies (e.g., Dass et al., 2021 [18]). LR and SVM are among the most used, while NB, kNN, and DT appear less frequently in surveys [2]. No consistent pattern has been found that distinguishes a single algorithm over the others. According to Herrmannova et al., 2015 [19], each model captures different properties of the input data, and the results are complementary.
Self-regulated learning (SRL) is a complex multidimensional phenomenon often described by a set of individual cognitive, social, metacognitive, and behavioral processes embedded in a cyclical model. Students with limited application of SRL strategies do not perform well [1,20,21,22]. Zimmerman [23] proposed a cyclical model consisting of three interrelated phases of the learning process. In the forethought phase, self-regulated learners set learning goals and design the strategy for their learning. This is followed by the performance and control phase, during which self-regulated learners employ strategies to process the learning material. They seek help when needed, manage their time, structure their environment, and monitor their learning processes. In the third phase of self-reflection, self-regulated learners evaluate their performance and adjust their strategies to achieve their learning goals [1,18,22,24,25].
Research using questionnaires has shown positive correlations between the mentioned SRL activity and the completion of MOOCs [24]. Despite the limitations of self-reported data [26], the large sample size enhances their usefulness. On the other hand, the use of trace data in measuring SRL has increased, but their interpretation remains challenging [26]. Jansen et al. [24] propose the combined use of SRL data from traces and questionnaires. Timely SRL support interventions in MOOCs should be considered significant pedagogical tools that contribute to achieving positive outcomes for students [20,21].
The features of MOOCs, such as internet-based massiveness, openness, and flexible learning, create a unique blend of a large number of learners, making the prediction of learner success (as well as providing support based on these predictions) particularly challenging. Several researchers have developed prediction models by employing machine learning (ML) algorithms [27] and adopting supervised, unsupervised, and semi-supervised architectures [28]. Deep learning methods are also utilized for predicting dropout. For instance, Moreno-Marcos et al. [29] applied a combination of random forest (RF), generalized linear model (GLM), support vector machines (SVM), and decision trees (DT). Feng et al. [10] utilized logistic regression (LR), support vector machine with a linear kernel (SVM), random forest (RF), gradient boosting decision tree (GBDT), and a three-layer deep neural network (DNN) for their analysis.
Diverse features, even in limited quantities, provide a more comprehensive, multidimensional view of learners and can improve the quality of predictive models. Collecting additional data, especially during the initial weeks of a course, enhances prediction performance [2,28]. For successful timely interventions, predictive models need to be transferable, meaning they perform well on new course iterations by utilizing historical data [30,31,32]. Some researchers have examined specific aspects of SRL and observed their impact on predicting success, including goal setting and strategic planning [6], student-programmed plans [33], and the combination of self-reported SRL strategies with patterns of interaction sequences, demographic features, and intentions [34]. For example, Kizilcec et al. (2017) [6] investigated which specific SRL strategies predict the attainment of personal course goals and how they map onto specific interactions with MOOCs’ online content. To form a longitudinal account of SRL, the authors combined learners’ self-reported SRL strategies and characteristics, achievement data, and records of individual engagement with the course content. Using multiple linear and logistic regression modeling, Maldonado-Mahauad et al. (2018) [34] concluded that specific self-reported SRL strategies (“goal setting”, “strategic planning”, “elaboration”, and “help seeking”), complex behavioral data from the MOOC platform such as meaningful online activity sequence patterns, self-reported prior experience, level of interest in the MOOC’s assessments, and total time spent online are among the factors that contribute to the prediction of MOOC learners’ success.

1.3. Handling Imbalanced Classes in Dropout Prediction

Due to the very high rates of student dropout, class imbalance is a common phenomenon in MOOCs and can result in a bias toward the majority class [35], while the class of interest typically has fewer samples [36]. Although it is strongly recommended to pay attention to this issue when training an ML model for dropout prediction [37,38], according to Nagrecha et al., 2017 [35], very few studies in the MOOC domain report addressing the imbalance problem (e.g., [39]). Barandela et al. (2004) [40] suggest oversampling the minority class when the imbalance is severe and undersampling the majority class in the opposite case. Most MOOC domain studies that handle imbalance report employing SMOTE oversampling techniques (e.g., [18,41]).
Effective techniques to improve the performance of classifiers in the presence of a class imbalance in training data [42] include data-level approaches such as undersampling, oversampling, and their combinations [43,44]. In the context of this study, the majority class consists of successful students, and this asymmetry can lead a classifier to primarily predict that students do not drop out of the MOOC, which is an undesirable situation as students at risk are incorrectly identified as non-dropouts, resulting in them not receiving the necessary support intervention. Therefore, oversampling is preferred in this study to avoid neglecting important samples from the majority class [44,45]. To address the problem of overfitting in random oversampling, where minority class samples are repeated to increase their size [46], Chawla et al. [47] proposed the synthetic minority over-sampling technique (SMOTE). SMOTE generates synthetic samples along the line segments between minority class instances and a specified number, k (typically k = 5), of their nearest neighbors, thereby expanding the decision region [44]. This technique has found widespread use [36]. The newly generated synthetic samples contribute essential information to the minority class, mitigating misclassification issues of these samples [46].
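To make the interpolation step concrete, the following minimal NumPy/scikit-learn sketch (not the authors’ code) generates one synthetic sample on the line segment between a minority instance and a randomly chosen one of its k nearest minority neighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, i, k=5, seed=0):
    """Generate one synthetic sample for the i-th minority-class instance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors are requested because the query point itself is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min[i].reshape(1, -1))
    neighbor = X_min[rng.choice(idx[0][1:])]   # one of the k nearest minority neighbors
    lam = rng.uniform(0.0, 1.0)                # random position on the connecting line segment
    return X_min[i] + lam * (neighbor - X_min[i])

# Example: X_minority is an (n_samples, n_features) array holding only dropout-class rows
# synthetic = smote_sample(X_minority, i=0)
```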
Efforts to improve SMOTE include Borderline-SMOTE and ADASYN techniques. Han et al. [44] proposed the Borderline-SMOTE technique, which performs oversampling only on instances near the decision boundary between the two classes, as they are more likely to be misclassified, aiming to create a clearer decision boundary [36]. The technique has two variations: Borderline-SMOTE1, which only uses minority class samples as neighbors, and Borderline-SMOTE2, which additionally uses samples from the majority class to generate synthetic data [36]. As a variation of the SMOTE technique, He et al. [48] proposed ADASYN (adaptive synthetic sampling method), which generates more synthetic data for the instances that are more difficult to learn [49]. Improved predictive performance of classification models through the combined use of PCA and resampling techniques has been reported (e.g., [42]).
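All four oversamplers discussed here are available in the imbalanced-learn package; the sketch below is an assumption about tooling (the paper does not name the oversampling library) and instantiates them with k = 5 neighbors, the setting later reported in the Discussion:

```python
# Sketch assuming the imbalanced-learn package; the study does not specify its oversampling library.
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

samplers = {
    "SMOTE":             SMOTE(k_neighbors=5, random_state=42),
    "Borderline-SMOTE1": BorderlineSMOTE(kind="borderline-1", k_neighbors=5, random_state=42),
    "Borderline-SMOTE2": BorderlineSMOTE(kind="borderline-2", k_neighbors=5, random_state=42),
    "ADASYN":            ADASYN(n_neighbors=5, random_state=42),
}

# X_train, y_train: imbalanced training features and dropout labels (illustrative names)
# balanced = {name: s.fit_resample(X_train, y_train) for name, s in samplers.items()}
```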
Brandt and Lanzén (2021) [49] examined the performance of SMOTE and ADASYN using different classification models and evaluation metrics and found that both techniques improved performance in most cases. The authors conclude that none of the pre-processing methods they examined consistently outperformed the others, in line with the literature comparing imbalance-handling techniques, which reports a variety of results and notes that such pre-processing may even decrease performance. Therefore, in our study, we examine different oversampling methods in addition to the SMOTE commonly used in the MOOC domain.

1.4. Rationale and Research Questions

Based on the above theoretical background, this study investigates the early prediction of dropout in the MOOC “Programming in Python for Non-Programmers”, taking into consideration primarily the SRL (self-regulated learning) factor, after applying various oversampling techniques. For this purpose, we employ five classic predictive models (NB, LR, SVM, DT, kNN) [50], commonly used in the literature when comparing oversampling techniques for class imbalance [36,37,42,46], and we consider data from student questionnaires regarding their SRL, as well as other typical data easily provided for this purpose, aiming to identify at-risk students as early as possible and support them with appropriate interventions to complete the course.
The research questions addressed in this study are as follows:
  • What is the performance of the five classic models (NB, LR, SVM, DT, kNN) in predicting dropout before and after oversampling for various oversampling methods?
  • What appear to be the important predictors for students’ MOOC dropout and why?
  • What is the impact of the SRL factor, as determined by self-reported student data, on the performance of dropout prediction?

2. Materials and Methods

2.1. Context

We deployed the xMOOC “Programming in Python for Non-Programmers” that was offered in the Greek language during two time periods: 16 November 2020 to 20 December 2020 (referred to as MOOC1), and 22 February 2021 to 28 March 2021 (referred to as MOOC2). This initiative was a collaboration between the Department of Informatics at AUTH, EKETA, and GUnet, as part of the European-funded project “colMOOC: Integration of Dialogic Agents and Learning Analytics in MOOCs” (https://colmooc.eu/). It had a duration of 5 weeks with specific start and end dates and included 5 thematic modules, one per week. The course curriculum comprised various components aimed at augmenting the learning experience, encompassing video lectures, mini-quizzes, weekly quizzes, weekly assignments, and chat activities. We integrated chat-based collaborative activities with the support of a conversational agent [51], aiming to facilitate productive peer dialogue [52]. These chat activities involved students engaging in dyadic collaboration to collectively respond to open-ended questions with a shared goal, and they were encouraged to interact with a preconfigured conversational agent during their discussions. To foster individual accountability, support for students’ SRL, and awareness of peers’ progress, a learning analytics module was provided, enabling students to reflect on their own and their partner’s performance and compare metrics with other students who completed the same activity, thereby allowing for potential adjustments in learning strategies to enhance their performance and SRL skills. During the second week, an email was disseminated to notify students about the existence of the learning analytics module.
Participants had the flexibility to study the educational material and complete assessments and activities at their own pace within the duration of the MOOC. A certificate of completion was granted to participants with a minimum score of 60%, based on weekly quizzes, assignments, collaborative activities, and a final quiz. After enrollment, students were required to complete a questionnaire, which was essential for receiving the certificate (provided they met the grading criteria). This questionnaire, available only during the first week, collected demographic data, participation intention (intended degree of engagement with the educational material), programming and Python experience, prior MOOC experience, and the students’ self-regulated learning (SRL) profile.

2.2. Description of the Data Set

The initial dataset comprises 1543 rows and 232 columns, consisting of data from students in MOOC1 (1324 rows) and MOOC2 (219 rows). Our focus was on actively engaged students, defined as those who registered and remained active until the conclusion of the first week, completed the introductory questionnaire, and submitted the first week’s programming assignment. This subset consisted of 935 students for MOOC1 and 134 students for MOOC2 (Table 1).
According to a study by Stein and Allione [53], completing an assessment within the first week is a significant predictor of retention in a MOOC. The successful completion rates for these specific subsets of students in the two MOOCs were 88.40% and 87.30%, respectively. It is important to note that there is an imbalance in the class distribution between students who completed the MOOC and those who did not.

2.3. Procedure

Using the Python/scikit-learn ecosystem, we fitted predictive models on the data from MOOC1 by applying the NB, LR, SVM, DT, and kNN algorithms. We then evaluated the performance of these models in predicting dropout on the MOOC2 dataset. Default settings were used for all algorithms; in a case study by Gardner and Brooks [2], parameter tuning showed minimal impact compared to the effect of different features or algorithms. The dependent variable DROPOUT categorizes students into the classes 1 (dropout) and 0 (non-dropout) as follows: those who participated in less than 10% of the activities are considered dropouts. The approach followed in this study to investigate the research questions involves the following steps (a brief setup sketch is given after the list):
  • Retrieval of data.
  • Preprocessing and cleaning of the data.
  • Selection of significant features.
  • Normalization, dimensionality reduction using PCA, and oversampling.
  • Training and testing the models using stratified 10-fold cross-validation.
  • Comparative evaluation of the models for different oversampling techniques.
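As a compact illustration of two study-specific choices in this setup (the DROPOUT labeling rule and the default-configured classifiers), the sketch below uses scikit-learn defaults; the data frame and column names are hypothetical placeholders, not taken from the study’s code:

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def label_dropout(df: pd.DataFrame) -> pd.Series:
    """DROPOUT = 1 if the student participated in less than 10% of the activities, else 0."""
    ratio = df["activities_done"] / df["activities_total"]   # hypothetical column names
    return (ratio < 0.10).astype(int)

# The five classic classifiers, all left at their default settings as in the study.
models = {
    "NB":  GaussianNB(),
    "LR":  LogisticRegression(),
    "SVM": SVC(),
    "DT":  DecisionTreeClassifier(),
    "kNN": KNeighborsClassifier(),
}
```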

2.3.1. Data Retrieval

The student data from the 2 MOOCs, the potential independent variables (Table 2), and the dependent variable were retrieved.

2.3.2. Preprocessing and Data Cleaning

The dataset consists of students who completed the intro questionnaire, submitted the first week’s programming assignment, and received the general intervention (control group). Missing values were handled by deleting the corresponding students from the dataset.
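A minimal pandas sketch of this filtering and row-deletion step is given below; the file and column names are illustrative assumptions, not taken from the study’s data:

```python
import pandas as pd

df = pd.read_csv("mooc1_students.csv")        # hypothetical export of the MOOC1 data

cohort = df[
    (df["completed_intro_questionnaire"] == 1)
    & (df["submitted_week1_assignment"] == 1)
    & (df["group"] == "control")              # students who received the general intervention
]
cohort = cohort.dropna(axis=0, how="any")     # delete students (rows) with missing values
```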

2.3.3. Feature Selection

Different features are likely to incorporate different aspects of the same information. On the other hand, there may be interdependence among features, and their coexistence in predictive modeling should be avoided. In this study, we select as candidate predictors the features, or combinations of features, of interest, in order to arrive at subsets of significant features that contribute to predictive performance. The results of filtering techniques that evaluate and rank candidate features are taken into account. It is common for variables to be ranked differently by heterogeneous filters, but ultimately a stronger subset of features is identified [54]. Various statistical tests can be used to determine the features with the strongest relationship with the output variable. We apply the SelectKBest filter method from the scikit-learn library, using chi-square for feature ranking (Figure 1a). Additionally, we assess the importance of each feature using tree-based classifiers (Figure 1b), utilizing the built-in extra-trees classifier. The higher the score of a feature, the more significant or relevant it is to the output variable [55].
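A sketch of both ranking approaches with scikit-learn follows; it assumes X is a pandas DataFrame of non-negative encoded features and y the DROPOUT labels:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import ExtraTreesClassifier

# Chi-square filter ranking (as in Figure 1a); chi2 requires non-negative feature values.
selector = SelectKBest(score_func=chi2, k="all").fit(X, y)
chi2_ranking = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

# Tree-based importances (as in Figure 1b) via the built-in extra-trees classifier.
forest = ExtraTreesClassifier(random_state=42).fit(X, y)
tree_ranking = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)

print(chi2_ranking.head(10))
print(tree_ranking.head(10))
```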

2.3.4. Normalization, Dimension Reduction with PCA and Oversampling

The principal component analysis (PCA) technique in feature extraction aims to reduce the dimensionality of the dataset by identifying and selecting features with higher variance, which carry significant information, especially when dealing with noisy data that can adversely affect model accuracy [36]. Holland [56] suggests determining the number of principal components to retain based on their collective ability to explain a predetermined percentage of the total variance, typically set at 90% in this context. A crucial prerequisite before employing PCA is data normalization, as emphasized by Mourdi et al. [26]. In our study, we ensured data remained on the same scale through scaling, adjusting only the value range. Our analytical pipeline involved the sequential application of normalization, PCA for dimensionality reduction, and oversampling, a commonly employed practice in the literature (e.g., [57]), to overcome imbalance and avoid neglecting important samples from the majority class.
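The sequential pre-processing described here could be expressed with imbalanced-learn’s Pipeline, which applies the oversampler only when fitting (i.e., to training folds). The scaler and final classifier below are illustrative choices, since the study does not name them explicitly:

```python
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),                          # normalization: same scale, adjusted value range
    ("pca", PCA(n_components=0.90)),                    # keep components explaining 90% of total variance
    ("smote", SMOTE(k_neighbors=5, random_state=42)),   # oversampling, applied only during fitting
    ("clf", LogisticRegression()),                      # any of the five classifiers can be plugged in here
])
```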

2.3.5. Training and Testing

Data splitting into training and testing sets mimics the application of the model to new data and evaluates its performance [27]. We train the models using the MOOC1 dataset and apply the stratified 10-fold cross-validation technique, which is suitable for classification problems with class imbalance (e.g., [36]).
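A sketch of this training scheme follows, assuming pipe is the scaling/PCA/oversampling/classifier pipeline sketched in the previous subsection and X_mooc1, y_mooc1 are placeholder names for the MOOC1 features and labels:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)   # stratified 10-fold CV
recall_cv = cross_val_score(pipe, X_mooc1, y_mooc1, cv=cv, scoring="recall")
print("mean recall:", recall_cv.mean(), "+/-", recall_cv.std())
```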

2.3.6. Prediction Performance Evaluation of Models

Model evaluation is a crucial component of predictive modeling research in the context of learning analytics. Suitable performance metrics for the field (e.g., [58,59]) include accuracy, precision, recall, and F1 score, as well as the confusion matrix (e.g., [44]). Due to the high number of dropouts in MOOCs, minimizing the number of false negatives (FN) [27] and focusing on the recall metric, which summarizes how well students at risk (i.e., the positive class) are predicted, is critical to ensure that they are not missed. A low recall indicates many missed positive predictions.
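For the final evaluation on the held-out MOOC2 data, these metrics can be computed as in the sketch below (variable names are placeholders for the MOOC1 training data and the MOOC2 validation data):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

pipe.fit(X_mooc1, y_mooc1)          # train on MOOC1
y_pred = pipe.predict(X_mooc2)      # predict dropout on MOOC2

print("accuracy :", accuracy_score(y_mooc2, y_pred))
print("precision:", precision_score(y_mooc2, y_pred))
print("recall   :", recall_score(y_mooc2, y_pred))    # share of actual dropouts that are caught
print("f1       :", f1_score(y_mooc2, y_pred))
print(confusion_matrix(y_mooc2, y_pred))               # rows: true class, columns: predicted class
```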

3. Results

Results on the validation data of MOOC2 before and after oversampling are presented in Table 3 and Table 4, respectively; Table 4 lists only the combinations reaching recall values ≥0.8. In addition to recall, various other popular evaluation metrics are also presented. Comparing Table 3 and Table 4, we observe that oversampling significantly improves the performance of the models in terms of recall. The kNN algorithm achieves the highest recall performance when the data are imbalanced. All algorithms used appear in Table 4, but kNN appears more frequently. Performance on the programming assignment at the end of the first module appears to be a significant predictor of MOOC completers rather than of dropouts, as it scores high in precision rather than recall. In the context of imbalanced data, it is essential to recognize that high accuracy values do not necessarily translate into superior predictive performance of a model. Through the process of oversampling, we identified Employment and UseOfChatTools as the most significant predictors associated with the risk of DROPOUT, as evidenced by their respective recall values. Specifically, both features demonstrate a weak negative correlation with DROPOUT (Spearman’s correlation coefficient = −0.09, p = 0.007, and −0.03, p = 0.341, respectively).
However, it is noteworthy that not all oversampling techniques employed yielded high recall values for the same algorithm and feature set, underscoring the importance of exploring alternative oversampling approaches. For example, when the UseOfChat feature and SMOTE or Borderline-SMOTE1 were used, the recall value on the MOOC2 validation data was 0.11 with NB or LR, while the use of ADASYN or Borderline-SMOTE2 yielded a corresponding recall value of 1.0. When the Intention to Participate and Education features were combined with Borderline-SMOTE2, the recall value on the MOOC2 validation data was 0.47 with LR, while the use of ADASYN yielded a corresponding recall value of 1.0. Similarly, this observation extends to the choice of algorithms, necessitating the exploration and incorporation of diverse algorithms to achieve improved model performance.
Table 5 also presents validation data of MOOC2 but focuses specifically on the prediction performance of the SRL factor in terms of recall. We observe in Table 5 that the SRL factor, combined with other self-reported features, seems a good predictor in terms of recall when LR and SVM algorithms are used.

4. Discussion

We utilize MOOC data to examine the dropout predictive performance of five classic classification ML algorithms under different oversampling techniques (ADASYN, SMOTE, Borderline-SMOTE1, and Borderline-SMOTE2, for k = 5 neighbors) and evaluation metrics. We also investigate the impact on performance of self-reported SRL data provided by students and of other popular MOOC features. Using the Python/scikit-learn (version 0.24.1) ecosystem, we perform predictive modeling on the data from MOOC1 by applying the NB, LR, SVM, DT, and kNN algorithms. We evaluate the performance of these models in predicting dropout on another similar dataset (MOOC2).
Oversampling may significantly improve the performance of the models in terms of recall. kNN on the Employment feature and NB, LR on UseOfChatTools achieve excellent recall performance, the latter under particular oversampling techniques. Therefore, employing different oversampling methods and classic ML algorithms is useful to compare performance and identify the best combinations for the specific dataset. Intention to Participate, Education, and Experience features (MOOC, Python, Programming) contribute to predictive performance. SRL’s contribution to MOOC dropout prediction performance emerges with LR and SVM algorithms in terms of recall (maximum value according to Table 5: 0.77). SRL factor as a sole feature yielded moderate recall values (0.59), but, combined with other, easily provided data, performs better.
Research Question 1: The datasets of the two Massive Open Online Courses (MOOCs) exhibited significant class imbalance, necessitating the utilization of oversampling techniques to enhance the predictive capabilities of the algorithms. Notably, the combination of principal component analysis (PCA) and an oversampling method emerged as the most influential factor contributing to the overall performance of all algorithms when employing readily available features such as employment status, frequency of chat tool usage, gender, education level, prior experience, and intentions. An intriguing observation from our findings is that no discernible pattern emerged favoring one oversampling method over the others, in line with the literature comparing imbalance-handling techniques. This suggests that the choice of the appropriate oversampling technique may depend on the specific characteristics of the dataset under investigation. Consequently, we advocate for the adoption of diverse oversampling methods and algorithms to better accommodate the heterogeneity inherent in different datasets.
Our investigation further revealed that all algorithms demonstrated commendable performance in terms of recall values. Particularly, the k-nearest neighbors algorithm exhibited exceptional recall for employment-related features, while Naive Bayes and logistic regression excelled in capturing the frequency of chat tool usage. These outcomes indicate the efficacy of the selected algorithms in identifying potential dropout cases, especially concerning specific predictive aspects.
In conclusion, our study contributes valuable insights into the performance of classic predictive models in the context of dropout prediction, shedding light on the interplay between oversampling techniques and algorithm selection. Our findings are consistent with the literature comparing imbalance-handling methods, which reports a variety of results in prediction performance, indicating the need to examine different imbalance techniques (e.g., [49]). By embracing a diverse range of oversampling methods and algorithms, researchers can capitalize on the strengths of different approaches and improve the accuracy and robustness of their predictive models for dropout identification in MOOCs.
Research Question 2: The discussion of the second research question, which aims to identify the important predictors of students’ MOOC dropout, yielded significant findings through the incorporation of readily available student features. Notably, self-reported employment status emerged as a crucial indicator of dropout when the kNN (k-nearest neighbors) algorithm was used, and UseOfChat when the LR (logistic regression) and NB (Naive Bayes) algorithms were used, resulting in high prediction performance, particularly in terms of recall. In addition, the NB algorithm exhibited enhanced performance when specific oversampling methods were applied, indicating its sensitivity to the data distribution and the potential to be more effective under certain conditions.
Moreover, the utilization of easily accessible student profile attributes, including experiences, gender, education level, and intention to participate, also proved to be valuable in predicting MOOC dropout rates. These attributes contributed to commendable recall rates when employing various traditional machine learning classification algorithms in conjunction with appropriate oversampling methodologies.
The significance of these findings lies in their practical implications for MOOC platform providers and educators. By identifying the key predictors of dropout, interventions can be designed and implemented to mitigate attrition rates effectively. For instance, targeted support and tailored interventions can be offered to students with specific employment statuses or chat tool usage patterns, thereby fostering a supportive learning environment and increasing the likelihood of course completion.
Research Question 3: The present study aimed to investigate the impact of the self-regulated learning (SRL) factor, as determined through self-reported student data, on the performance of dropout prediction. Our investigation focused on addressing the third research question, which sought to explore the potential contribution of the SRL factor to the predictive accuracy of dropout prediction models.
The discussion of our research findings indicates that the self-reported student self-regulated learning (SRL) factor, when combined with other widely recognized self-reported features, has demonstrated considerable predictive power in terms of recall when utilized with the LR and SVM algorithms. The inclusion of the SRL factor as a predictor offers valuable insights into the dynamics of dropout prediction.

5. Conclusions

This study addresses the challenging problem of early MOOC dropout prediction, which has implications for student retention and success, taking into account the significant issue of mitigating class imbalance when training an ML model for dropout prediction. We explored a variety of oversampling techniques and classic supervised ML algorithms to compare their performance using various evaluation metrics and to identify the best combinations for the specific dataset. We also examined the factors that contribute to the early prediction of MOOC dropout in order to identify and support at-risk students. We investigated the self-reported SRL factor, as well as various other prominent MOOC features, as potential indicators for early stage dropout prediction.
The main conclusions are: (1) prominent predictors, including employment status, frequency of chat tool usage, prior subject-related experience, gender, education, and willingness to participate, exhibit remarkable efficacy in achieving high to excellent recall performance, particularly when specific combinations of algorithms and oversampling methods are applied, (2) the self-reported SRL factor, combined with other easily provided/self-reported features, performed well as a predictor in terms of recall when the LR and SVM algorithms were employed, and (3) it is crucial to test diverse classification machine learning algorithms and oversampling methods in predictive modeling.
Researchers in the MOOC domain are encouraged to consider the imbalance issue thoughtfully at the pre-processing stage and to employ different imbalance mitigation techniques, even within the same general category (in our study, oversampling), since our results confirm the variety of outcomes reported in the literature comparing imbalance-handling techniques in the context of MOOC dropout prediction.
Our results also suggest that the self-reported SRL factor contributes significantly to the accuracy of dropout prediction models. However, there is room for further enhancement and refinement of predictive capabilities by delving deeper into the identification of the SRL factor. For instance, complementing self-reported data with trace data could yield more precise and comprehensive insights into students’ self-regulated learning behaviors.

6. Limitations and Future Research

The present study acknowledges certain limitations that warrant consideration. Firstly, the algorithms’ hyperparameters were employed with default settings, and we did not explore their optimization for the specific dataset. Consequently, there exists potential for further investigation into fine-tuning the algorithms’ performance to enhance predictive accuracy and robustness.
It is crucial to recognize that this research constitutes a case study centered on a specific Massive Open Online Course (MOOC) offering. The investigation focused on the subject of “Python for non-programmers” and was conducted in a particular language. As a result, generalizability beyond this specific context should be approached with caution, and caution must be exercised when extending the findings to other MOOCs or diverse academic domains.
Another important consideration is that the data obtained from questionnaires primarily provide insights into students’ perceptions regarding their utilization of self-regulated learning (SRL) strategies. While valuable for understanding student perspectives, such self-reported data may not fully capture the actual SRL behaviors demonstrated by the students. Particularly noteworthy is the fact that the introductory questionnaire was administered early in the MOOC, potentially rendering the responses more reflective of behavioral intentions rather than realized SRL practices.
While the current findings contribute valuable insights into the predictive potential of the self-reported SRL factor, this research could be extended in several ways. To overcome this limitation and gain a more accurate understanding of the SRL factor, future research may consider employing a combination of trace data and self-reported information. The use of a shorter questionnaire, possibly targeting specific, effective strategies (dimensions) of SRL in the MOOC context, could be investigated and contribute to adaptive scaffolding research. This comprehensive approach could provide a more nuanced and reliable assessment of students’ self-regulatory practices throughout the duration of the course. Future studies might explore the integration of additional factors, such as socio-economic background and prior academic performance, to further enhance the accuracy and robustness of dropout prediction models. Additionally, employing more advanced machine learning techniques (e.g., deep learning), employing a variety of imbalance mitigation techniques, as well as exploring the use of longitudinal data, could lead to more sophisticated and precise predictions.
Notwithstanding these limitations, the present study offers valuable insights into the role of the self-reported SRL factor within the specific context of the investigated MOOC. It provides a foundation for further explorations in the realm of dropout prediction and the design of tailored interventions to support student success in online learning environments. As the field of educational data mining and learning analytics progresses, more advanced techniques and larger datasets can be leveraged to refine and expand upon these findings, thus fostering advancements in the understanding and application of SRL strategies for enhanced educational outcomes.

Author Contributions

Methodology, T.K.C.; Writing—original draft, G.P. and T.K.C.; Writing—review & editing, G.P., T.K.C. and S.N.D.; Supervision, G.P. and S.N.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hsu, S.Y. An Experimental Study of Self-Regulated Learning Strategies Application in MOOCs. Ph.D. Thesis, Teachers College, Columbia University, New York, NY, USA, 2021. [Google Scholar]
  2. Gardner, J.; Brooks, C. Student success prediction in MOOCs. User Model. User-Adapt. Interact. 2018, 28, 127–203. [Google Scholar]
  3. Ihantola, P.; Fronza, I.; Mikkonen, T.; Noponen, M.; Hellas, A. Deadlines and MOOCs: How Do Students Behave in MOOCs with and without Deadlines. In Proceedings of the 2020 IEEE Frontiers in Education Conference (FIE), Uppsala, Sweden, 21–24 October 2020; IEEE: Piscateville, NJ, USA, 2020; pp. 1–9. [Google Scholar]
  4. Chuang, I.; Ho, A. HarvardX and MITx: Four years of open online courses-fall 2012-summer 2016. 2016. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2889436 (accessed on 1 June 2023).
  5. Kizilcec, R.F.; Schneider, E. Motivation as a lens to understand online learners: Toward data-driven design with the OLEI scale. ACM Trans. Comput.-Hum. Interact. (TOCHI) 2015, 22, 1–24. [Google Scholar]
  6. Kizilcec, R.F.; Pérez-Sanagustín, M.; Maldonado, J.J. Self-regulated learning strategies predict learner behavior and goal attainment in Massive Open Online Courses. Comput. Educ. 2017, 104, 18–33. [Google Scholar]
  7. Zheng, S.; Rosson, M.B.; Shih, P.C.; Carroll, J.M. Designing MOOCs as interactive places for collaborative learning. In Proceedings of the Second (2015) ACM Conference on Learning@ Scale, Vancouver, BC, Canada, 14–18 March 2015; pp. 343–346. [Google Scholar]
  8. Jordan, K. Initial trends in enrolment and completion of massive open online courses. Int. Rev. Res. Open Distrib. Learn. 2014, 15, 133–160. [Google Scholar] [CrossRef]
  9. Peng, D.; Aggarwal, G. Modeling mooc dropouts. Entropy 2015, 10, 1–5. [Google Scholar]
  10. Feng, W.; Tang, J.; Liu, T.X. Understanding dropouts in MOOCs. Proc. AAAI Conf. Artif. Intell. 2019, 33, 517–524. [Google Scholar]
  11. Eriksson, T.; Adawi, T.; Stöhr, C. “Time is the bottleneck”: A qualitative study exploring why learners drop out of MOOCs. J. Comput. High. Educ. 2017, 29, 133–146. [Google Scholar] [CrossRef]
  12. Reich, J. MOOC completion and retention in the context of student intent. EDUCAUSE Rev. Online 2014.
  13. Lepp, M.; Luik, P.; Palts, T.; Papli, K.; Suviste, R.; Säde, M.; Tõnisson, E. MOOC in programming: A success story. In Proceedings of the International Conference on e-Learning, Belgrade, Serbia, 28–29 September 2017; pp. 138–147. [Google Scholar]
  14. Dalipi, F.; Imran, A.S.; Kastrati, Z. MOOC dropout prediction using machine learning techniques: Review and research challenges. In Proceedings of the 2018 IEEE Global Engineering Education Conference (EDUCON), Santa Cruz de Tenerife, Spain, 17–20 April 2018; IEEE: Piscateville, NJ, USA, 2018; pp. 1007–1014. [Google Scholar]
  15. Zheng, S.; Rosson, M.B.; Shih, P.C.; Carroll, J.M. Understanding student motivation, behaviors and perceptions in MOOCs. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work and Social Computing, Vancouver, BC, Canada, 13–18 March 2015; pp. 1882–1895. [Google Scholar]
  16. Hone, K.S.; El Said, G.R. Exploring the factors affecting MOOC retention: A survey study. Comput. Educ. 2016, 98, 157–168. [Google Scholar]
  17. Zhang, J. Can MOOCs be interesting to students? An experimental investigation from regulatory focus perspective. Comput. Educ. 2016, 95, 340–351. [Google Scholar]
  18. Dass, S.; Gary, K.; Cunningham, J. Predicting student dropout in self-paced MOOC course using random forest model. Information 2021, 12, 476. [Google Scholar] [CrossRef]
  19. Herrmannova, D.; Hlosta, M.; Kuzilek, J.; Zdrahal, Z. Evaluating weekly predictions of at-risk students at the open university: Results and issues. In Proceedings of the EDEN 2015 Annual Conference Expanding Learning Scenarios: Opening out the Educational Landscape, Barcelona, Spain, 9–12 June 2015. [Google Scholar]
  20. Callan, G.L.; Longhurst, D.; Ariotti, A.; Bundock, K. Settings, exchanges, and events: The SEE framework of self-regulated learning supportive practices. Psychol. Sch. 2021, 58, 773–788. [Google Scholar] [CrossRef]
  21. Sebesta, A.J.; Bray Speth, E. How should I study for the exam? Self-regulated learning strategies and achievement in introductory biology. CBE—Life Sci. Educ. 2017, 16, ar30. [Google Scholar] [CrossRef] [PubMed]
  22. Zimmerman, B.J. Self-efficacy: An essential motive to learn. Contemp. Educ. Psychol. 2000, 25, 82–91. [Google Scholar] [CrossRef] [PubMed]
  23. Zimmerman, B.J. Investigating self-regulation and motivation: Historical background, methodological developments, and future prospects. Am. Educ. Res. J. 2008, 45, 166–183. [Google Scholar] [CrossRef]
  24. Jansen, R.S.; van Leeuwen, A.; Janssen, J.; Conijn, R.; Kester, L. Supporting learners’ self-regulated learning in Massive Open Online Courses. Comput. Educ. 2020, 146, 103771. [Google Scholar] [CrossRef]
  25. Zimmerman, B. Becoming learner: Self-regulated overview. Theory Into Pract. 2002, 41, 64–70. [Google Scholar] [CrossRef]
  26. Winne, P.H. Learning analytics for self-regulated learning. In Handbook of Learning Analytics; SOLAR, Society for Learning Analytics and Research: New York, NY, USA, 2017; pp. 241–249. [Google Scholar]
  27. Cunningham, J.A. Predicting Student Success in a Self-Paced Mathematics MOOC. Ph.D. Thesis, Arizona State University, Tempe, AZ, USA, 2017. [Google Scholar]
  28. Mourdi, Y.; Sadgal, M.; El Kabtane, H.; Fathi, W.B. A machine learning-based methodology to predict learners’ dropout, success or failure in MOOCs. Int. J. Web Inf. Syst. 2019, 15, 489–509. [Google Scholar] [CrossRef]
  29. Moreno-Marcos, P.M.; Munoz-Merino, P.J.; Maldonado-Mahauad, J.; Perez-Sanagustin, M.; Alario-Hoyos, C.; Kloos, C.D. Temporal analysis for dropout prediction using self-regulated learning strategies in self-paced MOOCs. Comput. Educ. 2020, 145, 103728. [Google Scholar] [CrossRef]
  30. Kuzilek, J.; Zdrahal, Z.; Fuglik, V. Student success prediction using student exam behaviour. Future Gener. Comput. Syst. 2021, 125, 661–671. [Google Scholar] [CrossRef]
  31. Wan, H.; Liu, K.; Yu, Q.; Gao, X. Pedagogical intervention practices: Improving learning engagement based on early prediction. IEEE Trans. Learn. Technol. 2019, 12, 278–289. [Google Scholar] [CrossRef]
  32. Kuzilek, J.; Hlosta, M.; Herrmannova, D.; Zdrahal, Z.; Vaclavek, J.; Wolff, A. OU Analyse: Analysing at-risk students at The Open University. Learn. Anal. Rev. 2015, LAK15-1, 1–16. [Google Scholar]
  33. Yeomans, M.; Reich, J. Planning prompts increase and forecast course completion in massive open online courses. In Proceedings of the Seventh International Learning Analytics and Knowledge Conference, Vancouver, BC, Canada, 13–17 March 2017; pp. 464–473. [Google Scholar]
  34. Maldonado-Mahauad, J.; Pérez-Sanagustín, M.; Kizilcec, R.F.; Morales, N.; Munoz-Gama, J. Mining theory-based patterns from Big Data: Identifying self-regulated learning strategies in Massive Open Online Courses. Comput. Hum. Behav. 2018, 80, 179–196. [Google Scholar] [CrossRef]
  35. Nagrecha, S.; Dillon, J.Z.; Chawla, N.V. MOOC dropout prediction: Lessons learned from making pipelines interpretable. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, 3–7 April 2017; pp. 351–359. [Google Scholar]
  36. Bajer, D.; Zonć, B.; Dudjak, M.; Martinović, G. Performance analysis of SMOTE-based oversampling techniques when dealing with data imbalance. In Proceedings of the 2019 International Conference on Systems, Signals and Image Processing (IWSSIP), Osijek, Croatia, 5–7 June 2019; IEEE: Piscateville, NJ, USA, 2019; pp. 265–271. [Google Scholar]
  37. Buraimoh, E.; Ajoodha, R.; Padayachee, K. Importance of Data Re-Sampling and Dimensionality Reduction in Predicting Students’ Success. In Proceedings of the 2021 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), Kuala Lumpur, Malaysia, 12–13 June 2021; IEEE: Piscateville, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  38. Fei, M.; Yeung, D.Y. Temporal models for predicting student dropout in massive open online courses. In Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic City, NJ, USA, 14–17 November 2015; IEEE: Piscateville, NJ, USA, 2015; pp. 256–263. [Google Scholar]
  39. Al-Shabandar, R.; Hussain, A.; Laws, A.; Keight, R.; Lunn, J.; Radi, N. Machine learning approaches to predict learning outcomes in Massive open online courses. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: Piscateville, NJ, USA, 2017; pp. 713–720. [Google Scholar]
  40. Barandela, R.; Valdovinos, R.M.; Sánchez, J.S.; Ferri, F.J. The imbalanced training sample problem: Under or over sampling? In Proceedings of the Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshops, SSPR 2004 and SPR 2004, Lisbon, Portugal, 18–20 August 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 806–814. [Google Scholar]
  41. Mulyani, E.; Hidayah, I.; Fauziati, S. Dropout prediction optimization through smote and ensemble learning. In Proceedings of the 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Yogyakarta, Indonesia, 5–6 December 2019; IEEE: Piscateville, NJ, USA, 2019; pp. 516–521. [Google Scholar]
  42. Revathy, M.; Kamalakkannan, S.; Kavitha, P. Machine Learning based Prediction of Dropout Students from the Education University using SMOTE. In Proceedings of the 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 January 2022; IEEE: Piscateville, NJ, USA, 2022; pp. 1750–1758. [Google Scholar]
  43. Mduma, N.; Kalegele, K.; Machuve, D. Machine learning approach for reducing students dropout rates. Int. J. Adv. Comput. Res. 2019, 9. https://doi.org/10.19101/IJACR.2018.839045. Available online: https://www.researchgate.net/publication/333016151_Machine_Learning_Approach_for_Reducing_Students_Dropout_Rates (accessed on 1 June 2023).
  44. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887. [Google Scholar]
  45. Rahman, M.M.; Davis, D.N. Addressing the class imbalance problem in medical datasets. Int. J. Mach. Learn. Comput. 2013, 3, 224. [Google Scholar] [CrossRef]
  46. Shelke, M.S.; Deshmukh, P.R.; Shandilya, V.K. A review on imbalanced data handling using undersampling and oversampling technique. Int. J. Recent Trends Eng. Res 2017, 3, 444–449. [Google Scholar]
  47. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. Smote: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  48. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; IEEE: Piscateville, NJ, USA, 2008; pp. 1322–1328. [Google Scholar]
  49. Brandt, J.; Lanzén, E. A comparative review of SMOTE and ADASYN in imbalanced data classification. Dissertation, 2021. Available online: https://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-432162 (accessed on 1 June 2023).
  50. Brooks, C.; Thompson, C. Predictive modelling in teaching and learning. In Handbook of Learning Analytics; SOLAR, Society for Learning Analytics and Research: New York, NY, USA, 2017; pp. 61–68. [Google Scholar]
  51. Demetriadis, S.; Tegos, S.; Psathas, G.; Tsiatsos, T.; Weinberger, A.; Caballé, S.; Dimitriadis, Y.; Sánchez, G.E.; Papadopoulos, M.; Karakostas, A. Conversational agents as group-teacher interaction mediators in MOOCs. In Proceedings of the 2018 Learning With MOOCS (LWMOOCS), Madrid, Spain, 26–28 September 2018; pp. 43–46. [Google Scholar]
  52. Tegos, S.; Demetriadis, S.; Papadopoulos, P.M.; Weinberger, A. Conversational agents for academically productive talk: A comparison of directed and undirected agent interventions. Int. J. Comput.-Support. Collab. Learn. 2016, 11, 417–440. [Google Scholar] [CrossRef]
  53. Stein, R.M.; Allione, G. Mass attrition: An analysis of drop out from a Principles of Microeconomics MOOC; PIER Working Paper Archive 14-031; Penn Institute for Economic Research, Department of Economics, University of Pennsylvania: Philadelphia, PA, USA, 2014. [Google Scholar]
  54. Haq, A.U.; Zhang, D.; Peng, H.; Rahman, S.U. Combining multiple feature-ranking techniques and clustering of variables for feature selection. IEEE Access 2019, 7, 151482–151492. [Google Scholar] [CrossRef]
  55. Shohag, S.I.; Bakaul, M. A Machine Learning Approach to Detect Student Dropout at University. Int. J. Adv. Trends Comput. Sci. Eng. 2021, 10. [Google Scholar]
  56. Holland, S.M. Principal Components Analysis (PCA); Department of Geology, University of Georgia: Athens, GA, USA, 2008; pp. 30602–32501. [Google Scholar]
  57. Mulla, G.A.; Demir, Y.; Hassan, M. Combination of PCA with SMOTE Oversampling for Classification of High-Dimensional Imbalanced Data. Bitlis Eren Üniversitesi Fen Bilim. Derg. 2021, 10, 858–869. [Google Scholar] [CrossRef]
  58. Umer, R.; Susnjak, T.; Mathrani, A.; Suriadi, S. Prediction of students’ dropout in MOOC environment. Int. J. Knowl. Eng. 2017, 3, 43–47. [Google Scholar] [CrossRef]
  59. Pelánek, R. Metrics for Evaluation of Student Models. J. Educ. Data Min. 2015, 7, 1–19. [Google Scholar]
Figure 1. Ranking of features (a) using chi-square test; (b) using decision tree.
Table 1. Dataset.

|          | MOOC1 | MOOC2 |
| Enrolled | 1324  | 219   |
| Started  | 935   | 134   |
| %        | 0.70  | 0.61  |
Table 2. The variables used in the predictive models.

Performance_A1 (Grade in programming assignment at the end of the first module). Type: Continuous, discretized. Values: 1–10.

Gender. Type: Categorical. Values:
  • Male
  • Female

Age. Type: Categorical. Values:
  • <17
  • 18–25
  • 26–35
  • 36–45
  • 46–55
  • 56+

Employment (Are you employed?). Type: Categorical. Values:
  • Yes
  • No

Studies. Type: Categorical. Values:
  • Doctorate
  • Master’s degree
  • Higher Educational Institution/Higher Technological Educational Institution/Technological Educational Institution
  • Vocational Training Institution
  • High School (upper secondary education) Diploma
  • Gymnasium (lower secondary education) Diploma
  • Primary School Diploma
  • None of the above

MOOC Experience (What is your prior experience in MOOCs?). Type: Categorical. Values:
  • It is my first time attending a MOOC
  • I have attended 1–3 MOOCs in the past
  • I have attended more than 3 MOOCs in the past

Programming Experience (Previous experience in any programming language). Type: Categorical. Values:
  • None, complete beginner
  • Basic programming knowledge
  • Moderate programming knowledge
  • Advanced programming knowledge

Python Experience (Previous experience in Python). Type: Categorical. Values:
  • None, complete beginner
  • Basic programming knowledge
  • Moderate programming knowledge
  • Advanced programming knowledge

UseOfChatTools (Frequency of use of chat tools like Skype, Messenger, or Google Hangouts). Type: Categorical. Values:
  • I do not use such tools
  • Occasionally (e.g., once a week)
  • Moderately (e.g., 2–3 times a week)
  • Frequently (e.g., more than 3 times a week)
  • Daily

Intention_to_Participate (Which of the following best matches your intended way of participating in the course?). Type: Categorical. Values:
  • I intend to attend the entire course, complete assignments/tasks, and obtain the certificate of successful completion
  • I intend to attend the entire course and complete assignments/tasks, but I am not interested in obtaining the certificate of successful completion at the moment
  • I intend to attend the course and complete assignments/tasks to the extent that I can, without being particularly interested in its completion
  • I plan to review the course material at the beginning and decide later on whether and how much I will attend
  • I intend to watch only the educational videos of the course and I am not interested in the assignments/tasks
  • I registered out of curiosity to obtain access to the material; I am currently unsure whether and how much I will attend
  • None of the above

SRL START num (SRL profile in the intro questionnaire). Type: Continuous. Values: 1–7.
Table 3. MOOC2 (test) data, imbalanced.

| Features | Alg. | Roc_auc | Acc. | Prec. | Rec. | f1 |
| SRL START num | LR | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| SRL START num | DT | 0.50 | 0.79 | 0.13 | 0.11 | 0.13 |
| SRL START num, Employment | LR | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| SRL START num, Employment | DT | 0.55 | 0.78 | 0.20 | 0.24 | 0.22 |
| Intention_to_Participate | kNN | 0.56 | 0.72 | 0.18 | 0.35 | 0.24 |
| SRL START num, Intention_to_Participate | LR | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| UseOfChatTools | kNN | 0.59 | 0.69 | 0.07 | 0.35 | 0.12 |
| Intention_to_Participate, SRL START num, Employment, Age | DT | 0.54 | 0.77 | 0.18 | 0.24 | 0.20 |
| Intention_to_Participate, SRL START num, Employment, Age | kNN | 0.63 | 0.89 | 0.56 | 0.29 | 0.39 |
| Performance_A1 | NB | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| Performance_A1 | LR | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| Performance_A1 | SVM | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| Performance_A1 | DT | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| Performance_A1 | kNN | 0.56 | 0.88 | 0.67 | 0.12 | 0.20 |
| SRL START num, Performance_A1 | LR | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| Intention_to_Participate, Performance_A1, SRL START num | LR | 0.49 | 0.89 | 0.00 | 0.00 | 0.00 |
| Intention_to_Participate, Performance_A1, SRL START num | SVM | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| Intention_to_Participate, Performance_A1, SRL START num | DT | 0.52 | 0.77 | 0.16 | 0.18 | 0.17 |
| Intention_to_Participate, Performance_A1, SRL START num, Age | kNN | 0.58 | 0.88 | 0.60 | 0.17 | 0.27 |
| Intention_to_Participate, Performance_A1, SRL START num, Employment | NB | 0.46 | 0.80 | 0.00 | 0.00 | 0.00 |
| Intention_to_Participate, Performance_A1, SRL START num, Employment | LR | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| Intention_to_Participate, Performance_A1, SRL START num, Employment | SVM | 0.50 | 0.87 | 0.00 | 0.00 | 0.00 |
| Intention_to_Participate, Performance_A1, SRL START num, Employment | DT | 0.44 | 0.72 | 0.04 | 0.06 | 0.05 |
| Intention_to_Participate, Performance_A1, SRL START num, Employment | kNN | 0.51 | 0.85 | 0.20 | 0.06 | 0.09 |
| Intention_to_Participate, Performance_A1, Programming Experience | kNN | 0.59 | 0.80 | 0.26 | 0.29 | 0.28 |
| Intention_to_Participate, Performance_A1, SRL START num, Programming Experience | DT | 0.59 | 0.80 | 0.26 | 0.29 | 0.28 |

With gray shading we distinguish the 2 highest values of the recall and roc_auc metrics and the corresponding application algorithm.
Table 4. MOOC2 (test) data, balanced. Contains recall values >0.8.

| Features | Alg. | Oversampling Method, k = 5 | Roc_auc | Acc. | Prec. | Rec. | f1 |
| Employment | kNN | ADASYN, SMOTE, Borderline-SMOTE1, Borderline-SMOTE2 | 0.50 | 0.13 | 0.13 | 1.00 | 0.23 |
| UseOfChatTools | NB | Borderline-SMOTE2 | 0.59 | 0.28 | 0.15 | 1.00 | 0.26 |
| UseOfChatTools | NB, LR | ADASYN | 0.50 | 0.13 | 0.13 | 1.00 | 0.23 |
| Programming experience, Gender | kNN | SMOTE | 0.54 | 0.25 | 0.14 | 0.94 | 0.24 |
| Intention, Education | LR | ADASYN | 0.59 | 0.37 | 0.16 | 0.89 | 0.26 |
| Python experience, MOOC experience | kNN | Borderline-SMOTE2 | 0.56 | 0.32 | 0.14 | 0.89 | 0.25 |
| Python experience, MOOC experience | kNN | ADASYN | 0.53 | 0.27 | 0.14 | 0.88 | 0.23 |
| Python experience, MOOC experience | kNN | SMOTE | 0.56 | 0.32 | 0.14 | 0.88 | 0.25 |
| Python experience, MOOC experience | kNN | Borderline-SMOTE1 | 0.55 | 0.29 | 0.14 | 0.88 | 0.24 |
| Education | SVM | SMOTE, Borderline-SMOTE1 | 0.57 | 0.38 | 0.15 | 0.82 | 0.25 |
| Python experience, Gender | NB | Borderline-SMOTE1 | 0.57 | 0.38 | 0.15 | 0.82 | 0.25 |
| Python experience, Gender | LR | Borderline-SMOTE1 | 0.57 | 0.37 | 0.15 | 0.82 | 0.25 |
| Python experience, Gender | SVM | Borderline-SMOTE2 | 0.57 | 0.37 | 0.15 | 0.82 | 0.25 |
| Python experience, Gender | SVM | SMOTE | 0.56 | 0.37 | 0.15 | 0.82 | 0.25 |
| Python experience, Gender | DT | SMOTE | 0.58 | 0.40 | 0.15 | 0.82 | 0.26 |
| Python experience, Gender | DT, kNN | Borderline-SMOTE2 | 0.58 | 0.40 | 0.15 | 0.82 | 0.26 |
| Programming experience, Gender | NB, SVM, DT | Borderline-SMOTE1 | 0.51 | 0.27 | 0.13 | 0.82 | 0.22 |
| Python experience, Gender, Intention | NB | SMOTE | 0.64 | 0.51 | 0.18 | 0.82 | 0.30 |
| Python experience, Gender, Intention | LR | Borderline-SMOTE1 | 0.57 | 0.37 | 0.15 | 0.82 | 0.25 |
| MOOC experience, Gender | kNN | Borderline-SMOTE1 | 0.62 | 0.46 | 0.17 | 0.82 | 0.28 |
| MOOC experience, Gender | LR | Borderline-SMOTE2 | 0.61 | 0.46 | 0.17 | 0.82 | 0.28 |
Table 5. SRL factor performance on MOOC2 (test) data, balanced.

| Features | Alg. | Oversampling Method, k = 5 | Roc_auc | Acc. | Prec. | Rec. | f1 |
| SRL, MOOC experience, UseOfChat | LR | ADASYN | 0.61 | 0.50 | 0.17 | 0.77 | 0.28 |
| SRL, MOOC experience | LR | ADASYN, Borderline-SMOTE1 | 0.64 | 0.59 | 0.19 | 0.71 | 0.30 |
| SRL, MOOC experience | LR | SMOTE | 0.63 | 0.57 | 0.19 | 0.71 | 0.29 |
| SRL, Education | SVM | ADASYN | 0.61 | 0.54 | 0.17 | 0.71 | 0.28 |
| SRL, Programming experience | DT | SMOTE | 0.66 | 0.66 | 0.22 | 0.65 | 0.33 |
| SRL, Programming experience | NB | Borderline-SMOTE2 | 0.65 | 0.65 | 0.21 | 0.65 | 0.32 |
| SRL, Intention | LR | Borderline-SMOTE2 | 0.62 | 0.60 | 0.19 | 0.65 | 0.29 |
| SRL | kNN | SMOTE | 0.61 | 0.62 | 0.19 | 0.59 | 0.28 |
| SRL | LR | ADASYN, SMOTE | 0.57 | 0.55 | 0.16 | 0.59 | 0.25 |
| SRL | LR | Borderline-SMOTE1 | 0.57 | 0.56 | 0.16 | 0.59 | 0.25 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
