1. Introduction
One of the most serious problems facing academic institutions is the high rate of student failure. Several studies have provided evidence that student failure is influenced by an interaction between several decisive factors throughout the academic process, suggesting that the risk of dropping out is configured from a group of variables, rather than a single variable [
1,
2,
3].
The problem of higher education attrition has been extensively researched [
4]. In 1973, Tinto and Cullen defined two categories of dropping out: leaving the college of registration, and failing to obtain any degree [
5]. Both definitions cover a wide range of concerns, including economics. As governments make substantial investments in public and community colleges, they need to measure how much they spend on students who drop out during the first year. During the 2008–2009 academic year, U.S. taxpayers spent more than USD 900 million on full-time, degree-seeking community college students who dropped out during their first year [
6]. Between 2003 and 2008, the U.S. invested nearly USD 6.2 billion in colleges and universities to educate students who did not return for a second year. State governments gave more than USD 1.4 billion and the federal government gave more than USD 1.5 billion in grants to students who did not return for a second year [
7].
In Europe, many countries do not systematically monitor higher education success rates. In a study of 35 European countries, only 12 regularly report indicators related to completion. Even fewer countries report on retention rates, dropout rates, or time-to-degree. The available cross-country comparative statistics must be interpreted with care due to differences in underlying definitions, context, and institutional arrangements across higher education systems [
8]. Norway is among the countries that monitor higher education data most closely; according to a report published on 20 August 2020, 67.5% of new students in Norway completed a degree within 8 years [
9].
In countries that belong to the OECD, 12% of students who enter a full-time bachelor’s program, on average, leave the tertiary system before beginning their second year of study. This share increases to 20% by the end of the program’s theoretical duration and to 24% three years later. In all countries with available data, women have higher completion rates than men in BA programs [
10].
In Latin America, the performance of the higher education system has been disappointing. On average, around half of citizens aged 25–29 have not established a career. Only Mexico and Peru have a completion rate near that of the United States (65 percent). In Colombia, around 37 percent of students who begin a BA program drop out of the higher education system altogether; this rises to around 53 percent among students who begin short-cycle programs. Around 36 percent of all dropouts leave university at the end of their first year, in contrast to approximately 15 percent of student dropouts in the United States. Despite the concentration of dropouts at the start of their college careers, almost 30 percent of all dropouts leave the system after four years [
11]. According to a national survey of employment, unemployment, and underemployment in Ecuador, there are 7,780,767 people aged between 25 and 64 years, of whom only 20% have been to university. It is therefore highly probable that the remaining 80% includes students who have dropped out of college.
In view of the previous data, we consider that the study of the student dropout phenomenon is a fundamental aspect for the improvement of higher education, and with this study, we intend to answer some questions, such as: what socioeconomic and academic variables influence dropout during the first levels of higher education? Can a predictive model be defined for the identification of these factors in order to improve university student retention and decrease dropout rates?
2. Literature Review
The academic and socio-economic backgrounds of students with lower socio-economic status can negatively affect their ability to remain at university [
1,
2,
3]. In various studies, factors including a lack of prior academic preparation and economic and financial difficulties have been shown to potentially cause students to drop out of the higher education system [
12,
13,
14]. However, these are not the only factors affecting dropout rates; the problem must be considered as a phenomenon affecting the entire higher education system in which endogenous and exogenous factors converge within the same system.
According to Donoso et al. [
15], the retention of students in university education is a broad phenomenon, related to access and higher education selection policies, which reflect the fact that some high-school graduates do not possess the skills, conditions, capacities, aptitudes, or competences to continue their university studies. Recent studies have concluded that the college-dropout issue generally arises during the early years of an individual’s career [
16]; Tinto [
17] highlights two critical periods when the risk of desertion is higher than usual. The first critical period is the admissions process, when a student first accesses the university. The second critical period occurs during the first semesters spent in university, when the student begins the process of social and academic adaptation. Bean [
18] points out that dropout is not only due to academic variables but can also be explained by psychosocial, environmental, and socialization factors. Authors such as Chen and DesJardins [
19,
20] argue that dropout is based on a cost-benefit decision, highlighting the impact of student benefits in said decision. Braxton et al. [
21] and Kuh [
22] point out that dropout depends on the quality of teaching and the student’s learning experience. When students enter higher education, they present their own family and personal characteristics; they must find ways to fit in with the reality of the institutional social system. The organization of higher education institutions has an impact on the individual and his or her socialization and satisfaction [
23]. For this reason, a student’s social relations with classmates, teachers, and the social environment are vitally important [
24]; it is essential to achieve a balance between social adaptation and social support [
8,
24].
In this context, greater attention must be paid to retention rates during the first stages of education. This is vitally important for international evaluations, which reflect the capacity and effectiveness of institutions in retaining students, given that the highest dropout rates occur early [
25]. Factors that can lead to early desertion include performance, but also region of origin, age, and year of admission. Economic factors are not necessarily of vital importance [
26].
During the last two decades, there has been an increase in empirical studies of variables associated with performance in higher education; Schneider [
4] conducted a systematic review of the literature, including 38 meta-analyses based on almost 2 million students. The review set out to answer this important question: ‘What characteristics of students, teachers and instruction are strongly associated with better learning outcomes?’ The key variables that explain academic performance appear to fall into two groups: (a) instructional variables, namely social interaction, stimulating meaningful learning, evaluation and feedback, presenting information clearly, technology employment, and extracurricular training programs; and (b) student variables, namely intelligence and previous achievements, strategies, motivation, personality, and context.
Several studies have attempted to identify the factors that influence dropout rates. In this context, factors such as monthly family income, type of school, type of housing, and even gender have been identified as factors that influence the student-dropout phenomenon [
12,
27]. A Romanian study showed that academic satisfaction at the beginning of a course was a significant predictor of dropout intention [
28]. In South Korea [
29] and in Latin America, Amo et al. [
28] found that getting a job and receiving a student scholarship were two major factors that reduced the university dropout rate. Other factors to consider include the financial situation [
8], class attendance [
24] and study time [
24,
30]; for example, students who combine part-time study with work are more likely to drop out of higher education [
30].
Gallegos et al. [
26] found that the student’s region of origin, grades, and scholarships are variables that affect the probability of dropping out. Similarly, the level of income, parents’ education, and the type of secondary school attended were statistically significant in explaining dropout.
Álvarez [
31] classified the factors that cause students to drop out as follows: (a) personal factors (motivational factors, psychological or emotional factors, student expectations, health problems, age, lack of discipline, etc.); (b) academic factors (lack of academic aptitude, lack of vocational guidance, poor choice of career or institution, poor academic performance, poor previous training, deficiencies in academic programs, etc.); and (c) socio-economic factors (precarious economic-social situation, institutional reasons, etc.).
Lin, Yu, and Chen [
32] used a probit model to analyze the factors that predict retention among first-year higher education students. They found that the grade point average (GPA) obtained in secondary education was one of the predictive factors, along with class rank, the size of the secondary school of origin, and gender (women are less likely). They also found very interesting results regarding possible interventions to reverse the dropout process: programs that include orientation or remedial English courses, on-campus jobs, and on-campus residency have a positive impact on retention.
At the same time, governments and universities have proposed affirmative action policies to help students overcome difficulties associated with these factors. Identifying these factors and analyzing their influence on student academic performance is an important process, which can help to identify at-risk students early and to take corrective actions during the educational process [
33,
34]. Mathematical modeling and data-mining algorithms have been used to identify factors that influence education-related phenomena.
Logistic regression is a traditional predictive method that is often used in the educational field, especially when the predictor variables are continuous. The model is based on calculating the probability that a categorical variable will take a certain value from a set of values, given the predictor variables. During model training, regression coefficients analogous to those of linear regression are estimated. It is therefore necessary to verify the fulfillment of statistical assumptions to guarantee the validity of the model [
35].
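As a minimal sketch of the calculation described above, a logistic model passes a linear combination of the predictors through the logistic function to obtain a probability. The function name `dropout_probability` and all coefficient and predictor values below are hypothetical; they are not the estimates reported in this study.

```python
import math

def dropout_probability(intercept, coefficients, predictors):
    """Return P(y = 1 | x) under a logistic regression model."""
    # Linear predictor: b0 + b1*x1 + ... + bk*xk
    z = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    # The logistic function maps the linear predictor onto the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and a hypothetical standardized student record
p = dropout_probability(0.5, [-0.8, 0.3], [1.2, 0.0])
print(f"Estimated probability: {p:.3f}")
```

A student would then be assigned to the positive class when the estimated probability exceeds a chosen cut-off, commonly 0.5.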
By contrast, Artificial Neural Networks (ANN) constitute a predictive model based on machine learning, which allows researchers to explore and model functional relationships between variables that cannot be established using traditional statistical methods. Neural networks have been used to solve prediction and classification problems in various areas of knowledge, generating special interest in the field of education, especially in relation to student performance modeling and variables that influence the educational process [
35,
36,
37].
It is essential to study this phenomenon because there is a pressing need to reduce the dropout figures, at both the national and international level. The main objective of this paper was to present a dropout prediction model in order to identify the academic and socio-economic factors that cause students to drop out. This model will allow university authorities to identify possible dropout cases early and to establish policies that support these vulnerable students and reduce the dropout rates.
4. Results and Discussion
The results of the logistic regression are presented below, in
Table 2 and
Table 3.
Of the four variables considered to model dropping out of the leveling course through logistic regression, only application grade and regime had statistical importance at a significance level of 5% (the p-value for each term tests the null hypothesis that the coefficient is equal to zero, which implies that it has no effect on the modeled variable). In other words, neither the vulnerability index nor the type of leveling course provided useful information for the model. On the other hand, the residuals were distributed symmetrically, which is why they could be considered normally distributed.
Table 4 shows the confusion matrix of the logistic regression model, obtained by cross-validation. The correctly classified instances were 1525 (true positives) and 0 (true negatives), while the misclassified instances were 571 (false positives) and 1 (false negative). From this information, a classification accuracy of 0.727 was obtained (1525/2097); although this is a relatively high value, it is not an adequate indicator of the overall performance of the model, since the specificity (true-negative rate) of the model is zero. In other words, the model classifies practically all students as having dropped out of the leveling course, since, according to the original data, a student is more likely than not to have dropped out (72.8% of cases).
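The arithmetic behind these figures can be reproduced with a short sketch. The four counts are those reported above for Table 4, with dropping out as the positive class; the function name `confusion_metrics` is only illustrative.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Summary indicators derived from the four cells of a confusion matrix."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,      # share of correct classifications
        "sensitivity": tp / (tp + fn),      # true-positive rate
        "specificity": tn / (tn + fp),      # true-negative rate
        "prevalence": (tp + fn) / total,    # share of actual positives
    }

# Counts reported for the logistic regression model
m = confusion_metrics(tp=1525, tn=0, fp=571, fn=1)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because the accuracy (0.727) is essentially equal to the prevalence of dropout (0.728), the model adds nothing over always predicting the majority class.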
Table 5 presents the training stage results of the artificial neural network. During the training stage, twenty-four models were trained with different combinations of neurons in the hidden layer and activation functions, while the number of neurons in the input and output layers remained constant and equal to 4 and 1, respectively. Thus, according to Equation (1), in the training stage, the range of neurons in the hidden layer was between 3.7 and 8, and since the number of neurons must be an integer, the actual range is between 4 and 8.
It is observed that the simplest model, with 4–3–1 architecture (4 neurons in the input layer, 3 neurons in the hidden layer, 1 neuron in the output layer) and an identity activation function, presents a relatively high and comparable performance with respect to the other models. However, the number of neurons in the hidden layer is outside the range established in the methodology.
The model with 4–7–1 architecture and a logistic activation function and the model with 4–6–1 architecture and a ReLU activation function present comparable performances. Although the classification accuracy of the 4–6–1 ReLU model is slightly higher (by less than 1%), the area under the ROC curve of the 4–7–1 logistic model is slightly higher (by less than 1%). For this reason, and because the area under the ROC curve is the most significant indicator of model performance, the model with 4–7–1 architecture and a logistic activation function was selected as the optimized model.
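To make the 4–7–1 notation concrete, the following sketch computes one forward pass through a network with 4 inputs, 7 hidden neurons, and 1 output. The weights are random placeholders rather than the fitted values, and the use of a logistic function at the output neuron is an assumption, since only the hidden-layer activation is named here.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a single-hidden-layer network with logistic activations."""
    # Each hidden neuron applies the logistic function to its weighted input sum
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # The output neuron combines the hidden activations into one score in (0, 1)
    return logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Hypothetical random weights for a 4-7-1 network (not the trained model)
rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
b_hidden = [rng.uniform(-1, 1) for _ in range(7)]
w_out = [rng.uniform(-1, 1) for _ in range(7)]
b_out = rng.uniform(-1, 1)

# Hypothetical encoded record with four input variables
score = forward([0.2, -1.0, 1.0, 0.0], w_hidden, b_hidden, w_out, b_out)
print(f"Network output: {score:.3f}")
```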
Figure 1 shows the results of the Garson test, applied to the neural network model with 4–7–1 architecture and a logistic activation function.
All variables have a relative importance greater than 5%, the most important being the vulnerability index. The reference category for the regime variable was Sierra and for the leveling course variable was technical degree.
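For readers unfamiliar with this kind of importance measure, the sketch below implements one commonly cited formulation of Garson's algorithm for a network with a single hidden layer and a single output. The exact variant used to produce Figure 1 is not stated here, and the weights are random placeholders, so the printed values do not reproduce the reported importances.

```python
import random

def garson_importance(w_in, w_out):
    """Relative importance of each input.

    w_in[i][j] is the weight from input i to hidden neuron j;
    w_out[j] is the weight from hidden neuron j to the output.
    """
    n_inputs = len(w_in)
    raw = [0.0] * n_inputs
    for j, v in enumerate(w_out):
        # Total absolute weight arriving at hidden neuron j
        column_total = sum(abs(w_in[i][j]) for i in range(n_inputs))
        if column_total == 0.0:
            continue  # a disconnected hidden neuron contributes nothing
        for i in range(n_inputs):
            # Share of neuron j attributable to input i, scaled by |output weight|
            raw[i] += abs(w_in[i][j]) / column_total * abs(v)
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical weights for a network with 4 inputs and 7 hidden neurons
rng = random.Random(1)
w_in = [[rng.uniform(-1, 1) for _ in range(7)] for _ in range(4)]
w_out = [rng.uniform(-1, 1) for _ in range(7)]
importance = garson_importance(w_in, w_out)
print([round(r, 3) for r in importance])
```

The importances sum to 1, so they can be read as percentages; because the calculation uses absolute weights, it does not indicate the direction of each variable's effect.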
Table 6 shows the confusion matrix of the neural network model obtained by cross-validation. From this information, a classification accuracy of 0.768 was obtained, a relatively high performance value with a precision value of 0.796.
Table 7 compares the models’ performance indicators;
Figure 2 shows the ROC curve for each model.
The area under the neural network ROC curve is greater than 0.5, while the corresponding value for logistic regression is less than 0.5. This value (0.5) constitutes an important theoretical reference since, according to the literature, a model with an AUC of 0.5 or lower performs no better than random classification, that is, without considering interactions between variables [
35]. For this reason, it can be stated that the ANN model performs better than logistic regression.
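The role of 0.5 as a reference value follows from the rank interpretation of the AUC: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch with hypothetical labels and scores:

```python
def roc_auc(labels, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical data: 1 = dropped out, 0 = remained
labels = [1, 0, 1, 0]
informative = roc_auc(labels, [0.8, 0.6, 0.4, 0.2])    # three of four pairs ranked correctly
uninformative = roc_auc(labels, [0.5, 0.5, 0.5, 0.5])  # identical scores for everyone
print(informative, uninformative)
```

A model that assigns every student the same score obtains exactly 0.5, which is why values at or below this level indicate no useful discrimination.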
At this point, it is important to note that the vulnerability index, which had no statistical importance in the logistic regression model, had the greatest relative importance in the neural network model. This contrast may show that machine-learning methods can find relationships between variables that are rarely found using traditional methods.
When the classification accuracy of the two models is compared, the values are similar. However, as the corresponding results indicate, the logistic regression model cannot correctly classify students who did not drop out of the leveling course. This shows that the performance of a model cannot be evaluated solely by its classification accuracy; if it were, there would be no notable differences between the logistic regression and the neural network. Nevertheless, as mentioned before, when the two models are compared by the area under their corresponding ROC curves, the ANN performs much better than the logistic regression, and, for this reason, the ANN model should be applied to predict dropout in the leveling course. Although the neural network model presents a false-positive rate of 0.627, this contrasts with a false-negative rate of 0.085. This result is extremely important because the model will be used to carry out interventions with students at risk of dropping out. From an academic point of view, it is far better to carry out interventions on students who do not need them (false positives) than to fail to identify students who do need interventions (false negatives).
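The reported error rates can be checked against the reported accuracy and precision with a short calculation. The sketch assumes that the neural network was evaluated on the same 2097 instances as the logistic regression, that is, 1526 students who dropped out and 571 who did not (from the counts given for Table 4); the resulting cell counts are approximations derived from rounded rates, not the entries of Table 6.

```python
def rates_to_summary(n_pos, n_neg, fpr, fnr):
    """Approximate confusion-matrix cells and indicators from class sizes and error rates."""
    tp = n_pos * (1.0 - fnr)   # dropouts correctly flagged
    fn = n_pos * fnr           # dropouts missed by the model
    fp = n_neg * fpr           # non-dropouts flagged unnecessarily
    tn = n_neg * (1.0 - fpr)   # non-dropouts correctly left alone
    return {
        "accuracy": (tp + tn) / (n_pos + n_neg),
        "precision": tp / (tp + fp),
        "missed_dropouts": fn,
        "unneeded_interventions": fp,
    }

# Class sizes assumed from Table 4; rates as reported for the neural network
s = rates_to_summary(n_pos=1526, n_neg=571, fpr=0.627, fnr=0.085)
for name, value in s.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the sketch yields an accuracy of about 0.767 and a precision of about 0.796, in line with the reported values, with roughly 130 at-risk students missed and roughly 358 flagged unnecessarily.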
The classification accuracy of the ANN model did not exceed 80%. This apparently low performance is explained, on the one hand, from a theoretical perspective: dropout is a multifactorial phenomenon, and it is impossible to include all the factors and their corresponding information in a model, so every model will misclassify some cases [
35,
43]. On the other hand, Helal [
42] and Yang [
45] agree that predictions made in very early stages of the educational process are inaccurate; therefore, it is advisable to build and apply different models at different educational stages. This is evidenced, for example, when comparing the classification accuracy of the ANN model of this study (0.768) with the results obtained by [
46], who modeled multilayer perceptrons (a more complex type of ANN) to predict the dropout in the pre-registration, first semester and first year stages. In this sense, the classification accuracy of the model applied in the pre-registration stage, analogous to the application stage of the ANN model built in this study, was 0.667; however, the classification accuracies of the models applied in the first semester and the first year were greater than 0.95. This suggests that the model of the present study could be adapted and applied at different stages so that more factors that influence dropout rates could be included.
However, despite the fact that the ANN model proposed in this study makes predictions at the stage in which students are just entering university, its application is an advantage in terms of carrying out early academic interventions, since dropout rates are higher during the first years of university [
47]. Furthermore, the ANN model may be periodically optimized since the variables used in this study are generated in each academic semester; in this way, the student information generated during each semester will be incorporated into the partition data for training the model.
From a practical point of view, a neural network is considered a black box in the sense that, although the results are easily interpretable, the interactions between the variables and the activation functions are highly complex. Therefore, the neural network model presented in this study does not directly determine the profile of a student at risk of dropping out. However, based on the results of the model, some approximations can be achieved by applying traditional techniques to determine the profile of a dropout student.
Table 8 shows the results of a
t-test, comparing the vulnerability index and application grades of students who did and did not drop out. The vulnerability index and application grade both reveal statistically significant differences between students who dropped out of the leveling course and those who did not. Overall, the values of vulnerability index and application grades of students who remained in the leveling course are higher than the values for students who dropped out.
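As an illustration of this kind of comparison, the sketch below computes a two-sample t statistic for two groups of application grades. Welch's unequal-variance form is used here as an assumption, since the variant applied in Table 8 is not stated, and the grades are invented for the example.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical application grades (not data from this study)
remained = [850, 870, 900, 880]
dropped = [800, 820, 790, 810]
t_stat, dof = welch_t(remained, dropped)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A positive t statistic indicates that the first group has the higher mean; the p-value would be obtained from the t distribution with the computed degrees of freedom.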
Table 9 shows the results of an independence test of the regime and type of leveling course, comparing students who did and did not drop out. It is clear that the regime and type of leveling course are not independent of the dropout variable. An analysis of the proportions of each category shows that students enrolled in the Costa regime and leveling course for a technical degree were more likely to drop the leveling course.
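An independence test of this kind is commonly carried out with Pearson's chi-square statistic on a contingency table; whether this is the test behind Table 9 is an assumption. The sketch uses invented counts, and the p-value helper is valid only for one degree of freedom (a 2×2 table).

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts: rows = two regimes, columns = dropped out / remained
stat, df = chi_square_independence([[20, 10], [10, 20]])
p = p_value_df1(stat)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p:.4f}")
```

A p-value below the chosen significance level leads to rejecting the hypothesis that the two categorical variables are independent.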
The results achieved in this study agree with those of other investigations in that among the risk factors that influence dropout rates are those of a socio-economic and academic nature [
48]. It is possible to categorize the variables into one of these types of factors.
First, the application grade in
Table 8 is a variable that shows students’ previous knowledge in mathematics, language, natural sciences, and social sciences; therefore, it is an academic factor. As explained above, the vulnerability index is an indicator of a student’s relative socio-economic vulnerability; therefore, it is a socio-economic factor.
On the other hand, the regime variable shows the academic period in which students enter the leveling course. When students graduate from high school, they should enter the EPN in the Sierra regime. However, students whose application grade was not high enough to get a place in the EPN take the admission test again and, if their application score is high enough, have the possibility of entering in the Costa regime. In this context, the regime is an academic factor. Most of the students who enter in the Costa regime are those who did not enter the leveling course in the process corresponding to their first application; they are students with a lower academic level. Most of the students who enter in the Sierra regime are those who entered the EPN in the process corresponding to their first application.
Regarding the leveling course type variable, the application grade of engineering sciences and administrative sciences students is higher than that of technical degree students (in this study, the mean application grade of the engineering sciences and administrative sciences students was 863.26, while that of the technical degree students was 790.66). The leveling course type variable is therefore an academic factor.
The problem of university dropouts has economic repercussions, since if students drop out, the money that has been allocated to their fees and scholarships ends up not being used. González and Uribe [
49] indicated that 23.5% of the expenditure invested by the state in higher education is lost with desertion. In this way, if the student drops out, the investment made by the state or by private entities in those students cannot be recovered. Likewise, not having university studies is related to a greater probability of suffering unemployment, since most of the unemployed in Ecuador, as previously mentioned, do not have university studies. Therefore, university dropout becomes a very important problem, both for the individual and for the university institution, which in order to improve its effectiveness should try to reverse this dropout problem.
5. Conclusions
The present study has considered variables related to academic and socio-economic factors influencing dropout. In the model proposed in this study, the vulnerability index is a socio-economic factor, whereas the application grade, the regime, and the leveling course type are academic factors.
In comparing the logistic regression and neural network models, this study concludes that the neural network model offers superior performance. The area under the ROC curve of the logistic regression model is less than 0.5, meaning that this model does not improve on the baseline (random) classification level and cannot capture the interactions between variables.
The optimized neural network model has 4–7–1 architecture and a logistic activation function. This model has a classification accuracy of 0.768 and an area under the ROC curve of 0.795. Although the neural network model has a false-positive rate of 0.627, it is offset by a false-negative rate of 0.085. This result is extremely important because the model will be used to carry out interventions with students at risk of dropping out. From an academic point of view, it is far better to carry out interventions with students who do not need them (false positives) than to fail to identify students who do need them (false negatives).
From running the optimized model and performing a statistical analysis of the variables considered when developing it, we conclude that students who are likely to drop out of the leveling course have a profile that includes a low application grade, a low vulnerability index, and enrolment in both the Costa regime and the leveling course for the technical degree. However, it is not possible to establish the threshold for the quantitative variables (application grade and vulnerability index) at which a student risks dropping out. The results of the model are the product of a complex interaction between the four variables. One way of optimizing the results of the neural network is to model a multilayer perceptron, a more complex neural network, with more hidden layers and hyperparameters. However, this modeling falls outside the scope of this work. This model could be implemented by the Admissions Unit of Escuela Politécnica Nacional to identify potential dropout cases early, establishing, in coordination with the university authorities, policies that support such students. In this way, the dropout and failure rates could be reduced.
Considering the entire process of developing this study, it is essential to reiterate that students drop out as a consequence of combinations of variables. We must reconsider the relationship between those variables and levels of social integration, since following the logic of [
17], students who reach higher education shape their own social integration based on situations that involve rewards. This presents institutes of higher education with a task: they must implement strategies that generate a sense of belonging and motivate students to remain in university. Likewise, given the fact that previous academic performance, attendance, active class participation, dedication to studying, and other activities have an impact on the dropout phenomenon, this study advises universities to implement actions that promote student persistence and support affirmative-action students. Institutes of higher education should create support and guidance programs for students at a higher risk of dropping out, implementing strategies that address socio-economic and psychological factors, as well as institutional and academic issues.
Early identification of students at risk of dropping out will allow authorities and university teachers to take actions with the aim of reducing the dropout rate. Martínez [
50] points out that the preventive actions that stand out for strengthening retention include tutoring programs, mentoring, preparatory courses, first-year seminars, remedial courses, curricular learning communities, learning support services, and the use of technology to make teaching more flexible and motivate students. In the case of the EPN, it is suggested to define the access process for students who are at risk of dropping out. The application of diagnostic tests and questionnaires would make it possible to determine the academic, emotional, and socioeconomic profile of these students, with the aim of identifying the areas in which interventions should be carried out. Students with academic deficiencies should take remedial courses in the subjects in which they have deficiencies, as well as tutoring programs to support their learning process. Students living in poverty should receive institutional scholarships and be regularly monitored by the Student Welfare Directorate. Additionally, peer mentoring programs should be established, in which higher-level students provide accompaniment to lower-level students. All these actions together will allow students at risk of dropping out to be adequately integrated into university life, thereby reducing university dropout rates.
One limitation of this study is the fact that it has analyzed the specific case of the EPN, an engineering university that receives approximately 2% of the total student intake of all universities in Ecuador. The model can easily be applied by universities that share the same variables as the EPN. For other universities, however, a new model with new variables must be generated. A future study could standardize the variables and generate a model that could be used at the national level.
The study was developed under conditions different from the current ones; the data used in this research correspond to the face-to-face teaching modality. The COVID-19 pandemic is expected to increase school dropout, as millions of students have been affected, both by the closure of schools around the world and by the global economic recession. According to the World Bank Group [
51], by the end of April, schools had been closed in 180 countries, and 85% of students around the world were not attending school; in addition, the world economy was projected to shrink by 3% in 2020. Dropouts will increase, and many of these students will probably drop out of school forever. The highest dropout rate will be concentrated in students belonging to vulnerable groups, as they are forced to abandon their studies due to lack of economic and technological resources, thus widening the existing gap in education. These events delay the achievement of the sustainable development goal of ensuring inclusive, equitable and quality education for all by 2030.