1. Introduction
The abilities and knowledge of each student are evaluated during their studies with grades or percentages that indicate how well the student is performing. This evaluation yields the grade point average (GPA), according to which students tend to be labeled as good, bad, talented or lazy. Students’ learning outcomes and achievements during their university studies can be influenced by many factors, such as learning patterns, talents, interpersonal relationships, motivation and many others. A student’s results can subsequently affect life after graduation and can even help with finding a job. However, academic results are not always decisive: for many employers, a candidate’s experience and practical skills are much more important. Not surprisingly, many people with weaker academic results or without a university degree are very successful in their lives.
The school system in Slovakia has several levels. The first level is pre-school education and concerns children aged 5. This is followed by elementary school (primary education). After completing 9 years of primary school, students choose a secondary school. There are three types of secondary schools in Slovakia: grammar school, secondary professional school (SOS) and vocational school (SOU). Vocational schools are mostly intended for students who do not achieve good results in primary school. Other students choose between a professional school and a grammar school. Students who have a clear idea of their future profession can choose a specific field and attend a secondary professional school. Since most 15-year-old students do not have a clear idea of their future profession, they opt for a grammar school. Grammar school is a type of secondary school that develops general knowledge and skills in various fields such as languages, mathematics, natural sciences and humanities. The best students from primary school usually continue in grammar schools. In secondary schools, students are graded 1–5, with 1 being the best and 5 the worst. Students are compared on the basis of their average grade across subjects. After graduating from high school, students can continue at a university in Slovakia or abroad. Many Slovak students continue their studies at a university in the Czech Republic. Slovak higher education has three levels: bachelor’s, master’s and doctoral studies. The evaluation system at Slovak universities is based on the ECTS scale, which runs from grade A (best) to grade FX (unsuccessful). The student receives credits for successful completion of a course (grades A–E). To obtain a degree, a student needs to obtain a certain number of credits (e.g., 180 credits for a bachelor’s degree).
At universities, the evaluation of students is carried out according to the GPA, which is based on grades A–E as follows: A = 1.0, B = 1.5, C = 2.0, D = 2.5, E = 3.0, FX = 4.0.
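As an illustration of this grade scale, the following minimal sketch converts letter grades to points and averages them. The function name `gpa` and the optional `credits` weighting are assumptions made for the example; the text does not state whether the GPA is weighted by credits, so the default is a plain average.

```python
# Grade-to-point mapping exactly as stated in the text.
GRADE_POINTS = {"A": 1.0, "B": 1.5, "C": 2.0, "D": 2.5, "E": 3.0, "FX": 4.0}


def gpa(grades, credits=None):
    """Return the average grade point (lower is better).

    `credits` is a hypothetical optional weight per course; when omitted,
    every course counts equally.
    """
    if not grades:
        raise ValueError("at least one grade is required")
    points = [GRADE_POINTS[g] for g in grades]
    if credits is None:
        return sum(points) / len(points)
    if len(credits) != len(points):
        raise ValueError("grades and credits must have the same length")
    return sum(p * c for p, c in zip(points, credits)) / sum(credits)
```

For example, `gpa(["A", "B", "C"])` averages 1.0, 1.5 and 2.0.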
Our research had two main objectives:
The first main objective was to identify the factors that influence students’ learning results. With the findings from statistical analysis, schools could identify what makes some students more successful than others. These findings could be used to increase the success level of all students in the faculty. To meet this goal, we defined 40 hypotheses and evaluated them using statistical hypothesis testing.
The second main objective was to predict the GPA of university students. The resulting prediction models could be useful for the management of faculties or universities, e.g., in the process of allocating elective courses, in the process of admission to universities or in the process of identifying excellent students.
It is relatively common in the field of education to examine the factors that influence learning outcomes through correlation analysis. However, although Becker et al. [1] point out the importance of predictive analytics in the field of education, predicting learning outcomes is not entirely common in this field. Below are some studies that deal with prediction in the field of education.
Linear regression models are common applications of predictive analytics in the field of education. The study by Aydin [2], for example, is dedicated to predicting students’ academic success. For this purpose, the author uses a model of multiple linear regression, i.e., the model uses multiple independent variables. Although the model identified factors that are important in predicting academic success (e.g., stress, time pressure, classroom communication), it was able to explain only 16% of the variability of the dependent variable. Another way to apply linear regression in education is reported by Esmat and Pitts [3]. The authors predict the success of students in an undergraduate exercise science program. To accomplish this, they create a model of multiple linear regression, which is estimated using the ordinary least squares method. Based on the results of the regression analysis, they identify the factors that are the best predictors in the required major courses. They see the potential of their procedures in “when examining methods to improve retention of students, progression, minimizing repeat attempts at courses, and improving graduation rates” [3]. Some authors use hierarchical regression models. Huberts et al. [4] present a three-level hierarchical regression model for student grades to predict student success or failure. To estimate the parameters of their model, they use Bayesian estimation; more specifically, they use Markov Chain Monte Carlo (MCMC) methods based on the Gibbs sampling procedure. To evaluate their model, they compare it with a benchmark model, a simple one-level linear regression model. Krumrei-Mancuso et al. [5] construct hierarchical linear regression models to predict first-year college student success via psychological factors. Using their models, they investigate the effect of the CLEI scales on predicting GPA. Tinajero et al. [6] predict the academic success of Spanish university students using hierarchical regression models. As the independent factor, they chose perceived social support. The following were considered as dependent variables: GPA for the first year at university, GPA for the third year and change in GPA over time.
Although machine learning applications in this field are also known (e.g., Bir and Ahn [7] use logistic regression models to identify factors that influence students’ persistence and to predict it), there are not many such studies; most focus on statistical methods. In our research, we construct machine learning models to predict GPA.
The paper is divided into eight chapters. In the first section, we review related studies and discuss which factors affect the learning outcomes of students. In the second section, we present a brief theoretical background on the methods we use later in the paper. Section 3 describes the data and presents the results: we identify factors influencing the GPA and we present our prediction models, which are able to predict a student’s GPA. Section 4 discusses the results and Section 5 summarizes the paper.
1.1. Literature Review
Two major factors influencing a student’s success can be motivation and whether the student has a talent for learning. While some students may not need to make a great effort to achieve good academic results, others need to work hard to achieve them. Apart from these, however, there are many other factors that influence academic success. In this section, we present studies related to the factors that influence academic performance.
1.1.1. Psychological Factors of Academic Success
We have identified several research studies that consider psychological aspects to be significant in student success. The authors of [5,8,9] state that the decisive factor influencing a student’s success is motivation and examine whether students are motivated to gain new experiences and to become sufficiently qualified. They focus on students who are still motivated to learn and examine the factors that are part of their motivational strategies. The authors claim that motivation, support and access to education are among the most important factors influencing studies. They also study whether students’ academic goals and their satisfaction with life are related.
In their study, Han et al. [10] created two models to find out what affects a student’s grades. One model involved only academic training and the other one also included non-cognitive factors such as motivation and a sense of belonging. The second model described the success of students better. Based on the results, one can claim that non-cognitive factors also affect the student’s success.
Studies by Basto et al. [11] and Oreški et al. [12] examined what causes students’ academic failure. The authors found that reluctance to study, moving, lack of sleep, age and status have a major impact.
Several studies [13,14] have shown that metacognitive strategies (thought processes) have a significant impact on students’ academic performance. Other factors that affect student performance are social interaction with other participants and the online environment. Moreover, setting goals has an indirect impact on academic achievement. The authors also confirmed that motivation and student satisfaction are positively associated with student results.
In a study by Burger et al. [15], successful students were mainly associated with motivated balance and effective study behavior. This study confirmed the importance of understanding and accurately solving the problem of student success.
1.1.2. Study Factors of Academic Success
According to a study by Novaková et al. [16], there is a relationship between the percentage of students’ participation in lectures throughout the semester, the results of the written part of the exam and the overall results of the exam.
Shulruf et al. [17] and Bir and Ahn [7] examined the influence of school factors on student success. They found that organizational factors (the way students are taught) have an impact on success. Other important factors were the type of secondary education, the ability to cope with academic work and satisfaction with academic life. The authors sought to improve students’ persistence and academic performance. The same issue was addressed in the study by Sustekova et al. [18], which examined the impact of the type of secondary school on university results in the subject of computer science. The authors found that the type of high school affects students’ results.
In a study by Esmat and Pitts [3], the authors studied student success according to how a student progresses in their studies and stays in school. The aim of the study was to examine the admission process for the university program. They examined the students over a period of six years. The authors found that students who attended preparatory courses before the start of their studies had better prospects of succeeding during their studies than students who had not attended them. A similar issue was investigated in a study by Oppenheimer et al. [19], whose authors found that students who completed summer preparatory programs had higher success rates than those who did not.
Mitra et al. [20] examined a set of factors that are the basis for success. Factors included learning style and learning analytics along with demographic and academic backgrounds. Predicting student success based on these factors had 95% accuracy.
Bou-Sospedra et al. [21] examined aspects of students’ effective learning from the perspective of students, teachers and the family. Every group preferred a different learning style to increase student success. The relationship between different learning strategies and the student’s success in the exam was also addressed in a study by Nettekoven et al. [22]. The authors found that success is influenced by quantitative factors (number of solved exercises) as well as qualitative factors (teaching materials used).
According to Pechac and Slantcheva-Durst [23], coaching is a promising approach to student support and is also linked to student success. The authors examined specific factors of coaching. The study by Xhomar [24] dealt with the lecturer’s support and the individual work of the student and their influence on academic results. The paper states that the individual work of the student influences their success, while the work of the lecturer does not.
In a study by Moravec et al. [25], the authors investigated the impact of the use of e-learning teaching tools on student outcomes. The results of students who had an e-learning tool available were compared with those who did not. The authors found that the use of such tools improves student outcomes.
In a study by Huberts et al. [4], the authors predicted student success based on academic results at high school. They found that the most important factors influencing success were study materials, attendance and the education of parents.
Goegan and Daniels [26] examined students’ academic achievement, namely average grades, knowledge and skills and overall satisfaction. They found that students’ academic abilities had an impact on academic averages but not on overall satisfaction.
Gurr et al. [27] dealt with the creation of successful schools. They studied how leaders influence the development of the school. The authors state that the way the school is run is an important factor for the success of students.
1.1.3. Sociological Factors of Academic Success
Veselina et al. [28] state that collaboration and an effective learning environment are important factors that influence and can increase success. They also state that the competences students acquire in the areas of communication and cooperation are important.
Oreški et al. [12] state that the academic success, failure and early school leaving of current students and graduates, and the age, status and position of students at enrollment proved to be the most important factors influencing students’ success.
In a study by Ackerman-Barger et al. [29], the authors describe a model of collective influence to increase students’ academic achievement. As part of the study, workshops were organized where students had the opportunity to collaborate with other academic organizations, colleagues and stakeholders. These workshops included active learning exercises, expert lectures, group discussion and structured event planning.
Dam [30] investigated the role of the family in student success. Students who have no family problems and are supported by their families were compared with students who have some family problems. The author found that the success of students who do not have family problems is higher. Based on these results, one can claim that the family influences students’ success during their studies.
According to a study by Tinajero et al. [6], social support is a key factor influencing the academic performance of university students. Data were obtained from students during the first and third years of study. The study examined students’ perceptions of social support and their academic results.
A study by Schmidt [31] says that friendship is an important factor in studying. Creating study groups is a good way to strengthen and expand the learning process. The study also discusses creating study rooms and spaces where students would have the opportunity to get to know each other, develop relationships and create a group identity. These results are also confirmed by a study by Bipp et al. [32], where the authors report that antisocial students achieve worse learning outcomes and have a higher risk of dropping out.
According to a study by Skendzic et al. [33], it is possible to determine the influence of social networks such as Facebook on student evaluation and results. The authors state that there is a negative correlation between the time spent on the social network and the student’s success.
Marbouti et al. [34] describe in their study the factors influencing the success of computer engineering students. Students’ grade averages and factors such as dormitory life, form of study and degree of study were examined. Another important factor was whether the student works while studying. Surprisingly, the authors found that if the student was working, their average might be better.
Anderson et al. [35] describe in their study whether a transfer to another school has an effect on a student’s success. The aim was to ensure a successful transition to university study programs. They also found that it was beneficial for students to participate in solving real problems and learn to work in teams.
A study by Gansemer-Topf et al. [36] examined the success of weaker students. The authors found that students with at least one parent who has a degree, women and students who are socially and academically involved have a better chance of obtaining a degree.
In his study, Aydin [2] also examines, in addition to study factors, emotional, social and cognitive development. He found that students’ success is influenced by stress, time constraints and classroom communication.
The aim of the study by Nunez et al. [37] was to identify the factors contributing to the success of students and analyze their academic results from the first year at the school.
2. Materials and Methods
Machine learning as a research discipline in the field of artificial intelligence has emerged in recent years. According to the method of learning, we distinguish between supervised learning and unsupervised learning. As we focused on supervised learning in this research, we briefly present the supervised methods we later used to create prediction models.
2.1. Multiple Linear Regression
Linear regression is a supervised regression method for predicting a quantitative response. The main goal of linear regression is to describe the dependence between variables [38]. In our study, we used multiple linear regression. Multiple linear regression, which is an extension of simple linear regression, models a dependent variable $Y$ that depends not on one but on several independent variables $X_1, \dots, X_p$. We therefore extended the model of simple linear regression by several variables. Let us have $p$ different regressors; then the multiple linear regression model is defined as:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon,$$
where $X_j$ represents the value of the $j$th predictor, $\beta_0, \beta_1, \dots, \beta_p$ are coefficients and $\varepsilon$ is the error term. We interpret the value of the coefficient $\beta_j$ as the effect on $Y$ of a one-unit increase in $X_j$ while all other variables remain unchanged. The residual sum of squares (RSS), which is used to optimize the model, can be defined over the $n$ observations as
$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,$$
where $\hat{y}_i$ is the predicted value of the dependent variable and $y_i$ is the actual value of the dependent variable for the $i$th observation. We find the values of the coefficients $\beta_0, \beta_1, \dots, \beta_p$ using the least squares method so that the sum of the squares of the error estimate is minimal [38].
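The least squares fit described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `fit_ols`, `predict` and `rss` are hypothetical, the coefficients are obtained by solving the normal equations with Gauss-Jordan elimination, and no claim is made that the study used this implementation.

```python
def fit_ols(X, y):
    """Return least-squares coefficients [b0, b1, ..., bp] for rows X and targets y."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    k = len(rows[0])
    # Augmented normal equations (A^T A | A^T y).
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(beta, x):
    """Predicted value: intercept plus the weighted sum of the predictors."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def rss(beta, X, y):
    """Residual sum of squares between actual and predicted values."""
    return sum((t - predict(beta, x)) ** 2 for x, t in zip(X, y))
```

With toy data generated exactly from y = 1 + 2·x1 − x2, the fit should recover those three coefficients up to rounding and give an RSS close to zero.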
2.2. Decision Trees
A decision tree is a universal method of supervised machine learning. Decision trees can be used for regression as well as classification problems. In other words, decision trees can model both numerical and categorical dependent variables [39]. The principle of this method is the nonlinear division of the data space using splits. Each inner node of the tree contains a condition that splits the data into further nodes. At the bottom of the tree there are leaf nodes that contain the response, i.e., the value of the dependent variable. There are several algorithms for constructing decision trees, e.g., CART, C4.5, SPRINT, ID3 or SLIQ [40]. The decision tree has many advantages, e.g., the selection of significant independent variables is performed automatically, i.e., unimportant variables do not affect the result. Multicollinearity also has no effect on the quality of the output.
One should bear in mind that the decision tree should not be too large. Otherwise, overfitting can occur. In other words, the tree would have a high accuracy rate on the training set, but a low accuracy rate on the test data. We can prevent overfitting by limiting the depth of the decision tree or by pruning it [38].
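The effect of limiting the tree depth can be illustrated with a small regression tree written from scratch. The sketch below is a simplified CART-style procedure (greedy binary splits that minimize the squared error) and is not the implementation used in the study; the names `build_tree`, `predict`, `tree_depth` and `training_sse` are hypothetical.

```python
def _sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(X, y, max_depth, depth=0):
    """Grow a regression tree, stopping at max_depth or at a pure node."""
    mean = sum(y) / len(y)
    if depth >= max_depth or len(set(y)) == 1:
        return {"value": mean}
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every observed value except the largest.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [v for row, v in zip(X, y) if row[j] <= t]
            right = [v for row, v in zip(X, y) if row[j] > t]
            score = _sse(left) + _sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:  # no predictor can separate the rows
        return {"value": mean}
    _, j, t = best
    li = [i for i, row in enumerate(X) if row[j] <= t]
    ri = [i for i, row in enumerate(X) if row[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([X[i] for i in li], [y[i] for i in li], max_depth, depth + 1),
        "right": build_tree([X[i] for i in ri], [y[i] for i in ri], max_depth, depth + 1),
    }


def predict(tree, x):
    """Follow the split conditions down to a leaf and return its value."""
    while "value" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]


def tree_depth(tree):
    """Number of split levels below this node."""
    if "value" in tree:
        return 0
    return 1 + max(tree_depth(tree["left"]), tree_depth(tree["right"]))


def training_sse(tree, X, y):
    """Squared error of the tree on the data it was grown from."""
    return sum((t - predict(tree, x)) ** 2 for x, t in zip(X, y))
```

A tree grown without an effective depth limit reproduces the training targets exactly when all rows are distinct, which is the overfitting behaviour described above; a small `max_depth` trades training accuracy for a simpler model.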
2.3. Random Forest
Like decision trees, random forests can be used for classification as well as for regression problems. Random forests eliminate some problems of decision trees, e.g., tree instability. According to Breiman [41], a random forest is defined as a set of $k$ trees whose regression (or classification) functions can be expressed in the form
$$h(x, \Theta_1), \dots, h(x, \Theta_k),$$
where $x$ is the vector of predictor values, $\Theta_1, \dots, \Theta_k$ are independent, identically distributed random vectors and $k$ is the number of trees in the forest [42].
As stated before, random forests contain improvements over decision trees, e.g., the training data for each tree are a bootstrap sample drawn from the whole dataset. The observations in this sample are used to grow the tree, while the observations left out of the sample serve as a test set for estimating the error; these estimates are called out-of-bag estimates. The accuracy of random forests is increased by allowing the trees to grow to a great depth without pruning, while the variance is kept tolerable by combining the results of the trees. To reduce the correlation between the trees and to avoid overfitting, a random selection of $m$ from the $M$ predictors available is performed. The tree then looks for the best splits only among the variables in this set of $m$ predictors. Another advantage of random forests is that they work on smaller datasets as well as on sets that contain many predictors. Moreover, random forests are quite easy to train and to tune. Random forests can be used to solve many problems such as classification and prediction, measuring the significance of variables and the effect of variables on prediction, clustering or detection of outliers [42].
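The two sources of randomness described in this section, bootstrap sampling of observations and random selection of candidate predictors, can be sketched as follows. The function names are hypothetical and the tree-growing step is omitted; in Breiman's formulation the predictor subset is redrawn at every split, which the helper below would support by being called once per split.

```python
import random


def bootstrap_split(n, rng):
    """Draw n row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_predictors(M, m, rng):
    """Pick m distinct predictor indices out of M as split candidates."""
    return sorted(rng.sample(range(M), m))


def forest_predict(tree_predictions):
    """A regression forest averages the outputs of its individual trees."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_split(79, rng)  # 79 = number of responses in the survey
candidates = candidate_predictors(M=9, m=3, rng=rng)  # M and m are illustrative values
```

The out-of-bag rows of each tree can then be used to estimate its error without setting aside a separate test set.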
3. Results
3.1. Data
To collect data, a questionnaire was created. The main objective of the questionnaire was to find out what factors influence the success of students, in order to identify what could help students achieve a better GPA. The respondents for our research were students of the Faculty of Management and Informatics at the University of Zilina. We focused on students in the study program of informatics in the third year, who already had experience with studying at the faculty. Students answered questions about their learning outcomes, satisfaction with their studies and also various questions focused on factors such as the number of hours of sleep, the method of learning, extracurricular activities, etc. The structure of the questionnaire was as follows:
Demographic questions: The first part contained questions regarding basic information about each student, e.g., age, gender, in which year they are currently, whether they have completed their studies, field of study, etc.
Psychological questions: The second part focused on psychological factors that could affect the GPA of students. These questions mainly concerned motivation, e.g., why the student decided to study and why they chose the given school, faculty and field of study. We also tried to find out whether the student enjoys the field of study and whether they are satisfied with the choice.
Study questions: We included questions about the student’s grade average at high school and their grade average at university in completed years. This part also included questions such as whether the student learns on an ongoing basis, what study materials they use to study for an exam or how often they attend lectures and seminars. There were questions about students’ views on online learning as well.
Sociological questions: This part contained questions related to the student’s socialization. These questions included, e.g., how many siblings students have, how many members the household in which they live has, whether they study with their classmates or alone and whether they are more comfortable working in groups in the seminars.
Questions about the faculty: The last part focused on factors related to the faculty. There were questions about what grades the student had from specific subjects, but also questions about their opinion on the study. In addition, this section included questions about watching videos on YouTube to help students in their studies.
Data collection was performed through Google Forms and lasted approximately two weeks in January 2021. Students were addressed via the social network Facebook. The questionnaire was sent to Facebook groups, which included current students of the faculty but also graduates. A total of 79 responses were recorded.
3.2. Exploratory Data Analysis
Most respondents were aged 23 years. As for gender, 71% of respondents were men and 29% were women. Since the questionnaire was intended mainly for students in the field of informatics, 91% of the respondents were students in the field of informatics and only a few were students in the fields of management and computer engineering. Surprisingly, 57% of respondents did not live in a dormitory.
3.2.1. Main Findings from the Exploratory Data Analysis
Findings from the psychological part:
- Most students decided to study at university mainly because of employment and better earnings in future professions.
- Most students chose their field of study at the faculty due to job prospects and good earnings.
- In total, 24% of respondents fully agreed with the statement: “My field of study is fun and fulfilling to me” and 44% tended to agree. Altogether, only 9% disagreed.
- Asked whether they would choose their field of study again, most students (78% in total) would do so, while only 22% would choose another field. This may be because the given field of study is not fun for them, or they find it too difficult.
Findings from the sociological part:
- In total, 40% of respondents spend two to three hours a day on social networks and 23% spend one to two hours a day on social networks. Altogether, 86% of students spend between one and four hours a day on social networks.
- Among students, the most popular leisure activity is watching movies and TV series (80% of students do it). More than half of students said they play computer games and read books in their leisure time. Less popular activities include playing board games or engaging in group sports. Surprisingly, 48% of respondents play individual sports. In total, 4% of respondents said they did not have any free time.
Findings from the study section:
- In total, 51% of the surveyed students said they study only before the final exam or midterm exam. Furthermore, 47% stated they study continuously throughout the semester. Only 2% of the surveyed students stated they do not study at all.
- The study materials that students most often use to obtain the best understanding of the topic are materials from previous years (87%), their own notes or materials from classmates. Students also often use the internet or YouTube videos to study. Only 35% of students use teacher materials and only 20% use scripts and textbooks.
- The participation of students in seminars is much higher than participation in lectures. While 77% of respondents always attend all seminars, only 23% always attend all lectures. As many as 53% of respondents attend only lectures of important subjects.
Findings from the distance education section:
- During the current situation, students and teachers have had to get used to teaching online. The following findings relate to teaching through so-called distance learning:
  ○ In total, 45% of students found distance learning more comfortable.
  ○ In total, 52% of respondents had approximately the same grades during distance learning; 20% improved slightly and 15% slightly worsened.
  ○ As many as 90% of students considered the possibility of recording seminars and lectures to be a great advantage of online study.
  ○ Most students (82%) lacked social contact with classmates during online studying.
  ○ Another big disadvantage was that 62% of respondents were not able to concentrate at home.
3.2.2. Main Findings from the Structured Exploratory Analysis
Subsequently, a structured exploratory analysis was performed. The data were divided into several groups. First, we divided the dataset into men and women. Next, we divided the respondents into groups according to their average grade. Finally, we divided the students into grammar school graduates and secondary vocational school graduates.
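A group-wise comparison of this kind amounts to computing, for each group, the share of respondents who gave a particular answer. The sketch below shows one way to do this; the function name `share_by_group` and the record fields in the usage example are hypothetical and do not reproduce the actual questionnaire data.

```python
from collections import defaultdict


def share_by_group(records, group_key, predicate):
    """Percentage of records in each group for which predicate(record) is true."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        hits[group] += 1 if predicate(rec) else 0
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Hypothetical toy records, for illustration only.
toy = [
    {"gender": "M", "enjoys_field": True},
    {"gender": "M", "enjoys_field": False},
    {"gender": "F", "enjoys_field": True},
]
shares = share_by_group(toy, "gender", lambda r: r["enjoys_field"])
```

The same helper can be reused with `group_key` set to the average grade or to the type of secondary school.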
Findings from the psychological part:
- Students from secondary vocational schools stated the acquisition of new knowledge as a reason for studying at a university more often (81% of respondents) than grammar school students (64% of respondents).
- In total, 73% of men study due to new knowledge and 86% of men study for better earnings. A greater percentage of women (43%) than men (27%) study because of work experience.
- In total, 81% of grammar school students and 69% of secondary vocational school students chose their field of study because of future employment; however, most secondary vocational school students (81%) chose their field of study because they enjoy it.
- More women (83%) than men (75%) chose the field of study due to future employment.
- In total, 67% of respondents with an average grade A would choose their field of study again. Furthermore, 63% of respondents with an average grade E would probably choose the field of study again.
- In total, 65% of women and 70% of men state that they totally agree or rather agree with the claim that they enjoy their field of study.
Findings from the sociological part:
- Students with an average grade of A or C spend the least time on social networks.
- In total, 39% of men spend one to two hours per day on social networks, while most women (39%) spend two to three hours a day on social networks.
- In total, 83% of respondents with an average grade A spend their free time reading books. On the other hand, 75% of respondents with an average grade E spend their free time playing computer games. The most popular activity among all respondents is watching movies and TV series.
Findings from the study section:
- More women have an average grade B (26%) compared to only 16% of men.
- In total, 12% of students from secondary vocational schools have an average grade A compared to only 6% of grammar school students; 25% of grammar school students have an average grade B and only 12% of respondents from a secondary vocational school have an average grade B.
- Grammar school students achieve slightly better results in mathematical subjects, e.g., Mathematical Analysis 1 and Probability and Statistics, than vocational school students. Students from secondary vocational schools, on the other hand, are slightly better in other subjects such as Algorithms and Data Structures.
- Students with an average grade A, unlike other students, more often use teacher materials, scripts and textbooks.
- Students with an average grade A had 100% attendance in all seminars. The attendance of other students was also relatively high.
- As for lectures, most students with an average grade A (67%) attend all lectures. Other students mostly attend only the lectures of important subjects.
Findings from the online study section:
- During distance learning, students with an average grade A still have approximately the same grades.
- During distance learning, the grades of 55% of men remained roughly the same. In total, 13% of women and only 2% of men have significantly better grades.
3.3. Confirmatory Data Analysis
Prior to confirmatory and correlation analysis, we preprocessed all data. The answers to the question of gender and boarding school were replaced by the values 0 and 1. For questions where the answers expressed a degree of agreement or satisfaction, the answers were replaced by a scale of 1–5 or 1–4. If the student did not write the number, this answer was replaced by the average of all values. If the student wrote or selected a range of values from the options, the used value was the arithmetic mean of these values. All data that contained the answer, e.g., “More than 5” or “5 and more” have been replaced by 5.
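The replacement rules described above can be illustrated with a short sketch. It is written in Python rather than the R used in the study, and the helper names `to_number` and `preprocess` are hypothetical. It assumes that ranges are written with two numbers (e.g., "2-3") and that open-ended answers contain a single number (e.g., "More than 5").

```python
import re
from statistics import mean


def to_number(answer):
    """Convert one raw answer to a float, or return None if it contains no number."""
    numbers = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", answer)]
    if not numbers:
        return None  # missing value, filled in later by preprocess()
    # A single number ("4", "More than 5", "5 and more") is used as it is;
    # a range ("2-3") is replaced by the arithmetic mean of its bounds.
    return mean(numbers)


def preprocess(column):
    """Apply to_number to a column and replace missing answers by the column mean."""
    values = [to_number(a) for a in column]
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


# Hypothetical answers to a question such as "How many cups of coffee do you drink per day?"
cleaned = preprocess(["2-3", "More than 5", "", "4"])
```

Mean imputation keeps every respondent in the sample, which matters with a small dataset, at the cost of slightly reducing the variance of the imputed variable.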
3.4. Correlation Data Analysis
In this section, we addressed the first main objective of our research: to identify the factors that affect the GPA variable. We defined 40 hypotheses, which were subsequently tested using statistical hypothesis testing. The hypotheses were defined as follows:
H1: There is a significant dependency between GPA and grade of study
H2: There is a significant dependency between GPA and the question: “The perspective of my field of study is more important to me than whether I enjoy my field”
H3: There is a significant dependency between GPA and whether the student considers their field of study as fun and fulfilling
H4: There is a significant dependency between GPA and whether the student would choose the same field of study again
H5: There is a significant dependency between GPA and whether the student would choose the same faculty for their university studies again
H6: There is a significant dependency between GPA and whether the student studies continuously during the whole semester
H7: There is a significant dependency between GPA and how often they attend seminars
H8: There is a significant dependency between GPA and how often they attend lectures
H9: There is a significant dependency between GPA and whether the student considers online studying to be more comfortable
H10: There is a significant dependency between GPA and whether the student’s grades improved after switching to distance learning
H11: There is a significant dependency between GPA and grade from the subject Algorithms and Data Structures 1
H12: There is a significant dependency between GPA and grade from the subject Database Systems
H13: There is a significant dependency between GPA and grade from the subject Informatics 2
H14: There is a significant dependency between GPA and grade from the subject Mathematical Analysis 1
H15: There is a significant dependency between GPA and grade from the subject Discrete Optimization
H16: There is a significant dependency between GPA and grade from the subject Probability and Statistics
H17: There is a significant dependency between GPA and how much videos from Probability and Statistics on YouTube helped the student understand the topic
H18: There is a significant dependency between GPA and whether the student is male or female
H19: There is a significant dependency between GPA and whether they live in a dormitory
H20: There is a significant dependency between GPA and GPA at high school
H21: There is a significant dependency between GPA and type of high school
H22: There is a significant dependency between GPA and whether they study with their dormitory roommates
H23: There is a significant dependency between GPA and preparatory courses they attended before studying at the faculty
H24: There is a significant dependency between GPA and the extent to which the student uses consultations with the teacher
H25: There is a significant dependency between GPA and a type of seminar work that suits the student
H26: There is a significant dependency between GPA and part of the day with their best focus
H27: There is a significant dependency between GPA and how they rate the course in math practice
H28: There is a significant dependency between GPA and how they rate the course in programming practice
H29: There is a significant dependency between GPA and number of watched videos where the teacher from the faculty explained topics from Algebra and from Probability and Statistics
H30: There is a significant dependency between GPA and age of a respondent
H31: There is a significant dependency between GPA and number of siblings of a respondent
H32: There is a significant dependency between GPA and number of household members
H33: There is a significant dependency between GPA and GPA in the first year of university studies
H34: There is a significant dependency between GPA and GPA in the second year of university studies
H35: There is a significant dependency between GPA and GPA at high school
H36: There is a significant dependency between GPA and number of people the student usually studies with
H37: There is a significant dependency between GPA and number of hours the student spends on social networks per day
H38: There is a significant dependency between GPA and number of cups of coffee the student drinks per day
H39: There is a significant dependency between GPA and number of hours of sport during the week
H40: There is a significant dependency between GPA and number of hours the student sleeps per day.
To determine which variables affect the dependent variable (student’s GPA), we performed the correlation analysis. We later used this information to construct linear regression models. To determine the correlation between the dependent variable and all independent variables, the data were divided into groups:
The first group was categorical ordinal variables, which were variables expressing the degree of agreement and satisfaction or variables that depend on order, such as students’ grades in individual subjects.
The second group consisted of variables that were categorical nominal. For these variables, no answer was more valuable than the other, e.g., gender or type of high school.
The third group was a group of numerical variables. All variables of numeric type were located here.
For each type of variable, a different statistical test was used to determine the level of dependency. When testing the dependency between the dependent variable and the independent variable, the Shapiro–Wilk test was first used to determine if the dependent variable had a normal distribution. Depending on whether the dependent variable had a normal distribution, we used a parametric test or a nonparametric test. A function was created in the R language that returned the p-value of this test. We accepted the null hypothesis of a normal data distribution if the p-value was higher than the level of significance (in our case, α = 0.05). In all cases, nonparametric tests were used to determine the dependency.
3.4.1. Correlation between Numerical Variable and Ordinal Variable
Pearson’s correlation test was used to determine the correlation between the numerical dependent variable and the categorical ordinal variables. In this test, the null hypothesis stated that there was no dependence between the variables and the alternative hypothesis stated that there was a significant dependence.
A function was created in the R language. The input parameter was a dataset, in which the first variable was dependent, and all the other variables were ordinal independent variables. The function calculated the Pearson’s test between the dependent variable and each independent ordinal variable. The function returned a table with the p-values of the test and the column name of the variable. If the p-value was smaller than the significance level (α = 0.05), we rejected the null hypothesis, and hence, the independent variable statistically significantly affected the students’ learning outcomes. In Table 2 we see that the dependency proved to be statistically significant for the following variables: pFriAgain, sContinuousLearning, sLectures, fAaDS1, fDS, fInf2, fMatA1, fDO, fPS.
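A function of the kind described above might look roughly like the following sketch. It is written in Python, not the R used in the study, and it is not the authors' code. As a stated simplification, the p-value is approximated with the Fisher z-transformation and the normal distribution, whereas R's `cor.test` uses an exact t-test; the names `pearson_r`, `approx_p_value` and `correlation_table` and the toy data are hypothetical.

```python
import math
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value via the Fisher z-transformation (needs n > 3)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite for |r| = 1
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def correlation_table(dataset, alpha=0.05):
    """dataset: dict of columns; the first column is the dependent variable.

    Returns one row (name, r, p, significant) per independent variable.
    """
    names = list(dataset)
    dependent = dataset[names[0]]
    rows = []
    for name in names[1:]:
        r = pearson_r(dependent, dataset[name])
        p = approx_p_value(r, len(dependent))
        rows.append((name, r, p, p < alpha))
    return rows


# Toy data: six hypothetical respondents, not values from the survey.
toy = {
    "GPA": [1.2, 1.8, 2.1, 2.6, 3.0, 3.4],
    "sLectures": [5, 4, 4, 3, 2, 1],
    "unrelated": [3, 1, 4, 1, 5, 2],
}
table = correlation_table(toy)
```

With only a handful of observations the normal approximation is rough, so the exact test is preferable for reporting; the sketch is meant only to show the structure of the loop and the decision rule.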
3.4.2. Correlation between Numerical and Categorical Nominal Variable
To determine the correlation between the numerical dependent variable and the categorical nominal variables, we used the ANOVA test. In the R language, we used the aov() function to perform this test. The null hypothesis stated that there was no dependency between the variables. If the p-value was less than α = 0.05, there was a dependence between the variables. Table 3 shows the results. A statistically significant dependence was found between the GPA and the variables fConsultations and fVideosAlgPS.
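To make the ANOVA test more concrete, the following sketch computes the F statistic of a one-way ANOVA by hand. It is an illustration in Python, not the `aov()` call used in the study, and the function name `anova_f` and the example groups are hypothetical. The p-value itself requires the F distribution, which is not part of the Python standard library, so the sketch stops at the statistic and its degrees of freedom.

```python
from statistics import mean


def anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: one list of GPA values per category of the nominal variable.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical GPA values for three categories of a nominal variable.
f_stat, df1, df2 = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value means that the differences between the group means are large relative to the spread within the groups, which is what leads to a small p-value.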
3.4.3. Correlation between Numerical and Numerical Variable
Pearson’s correlation test was most appropriate to determine the correlation between the numerical dependent variable and the numerical independent variables. As seen in Table 4, we rejected the null hypothesis and accepted the alternative hypothesis (there is a significant correlation) for the following variables: sAverageFirst, sAverageSecond, sAverageHighSchool, sGroupSize, sSocialNetworks, and sSleep.
3.5. Supervised Machine Learning Models
The second main objective of our research was to predict the GPA using other variables and factors. To meet this objective, supervised machine learning models were created. We wanted the models to achieve the highest possible accuracy for predicting the GPA. We therefore implemented several regression prediction models based on linear regression, decision tree and random forest. The partial task was to identify variables that statistically significantly affect the GPA through a causal relationship in our models.
3.5.1. Validation
To verify the accuracy of the models, we used ex-post testing as well as cross-validation. The ex-post validation was implemented as follows: we randomly divided the data into a training set and a test set. The training set contained the data on which the model was trained. On the test set, the model predicted the value of the dependent variable for observations it had not seen during training.
To ensure greater objectivity, we also performed cross-validation. This way, we evaluated our models based on the average results from five different validation sets and avoided the bias that would be likely if we used only one validation set. The cross-validation was implemented as follows: the training set was divided into five folds, on which cross-validation was performed. We then recorded the results from every cross-validation iteration and calculated the final error value as the average of all five errors (James et al., 2013). Besides higher objectivity, another purpose of cross-validation was to find optimal values of hyperparameters in the decision tree and random forest models. For linear regression models, the purpose was to determine the general, more objective accuracy of our models. For cross-validation, 60 observations were used (randomly selected with a seed of 1000 in R), i.e., there were 12 observations in each fold. The remaining 19 observations were used for final ex-post testing on the test set. Finally, a model evaluation was performed.
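The split described above (79 observations, of which 60 go into five folds of 12 and 19 are held out for the final test) can be sketched as follows. This is a Python illustration, not the R code used in the study: the seed value mirrors the one mentioned in the text, but Python's random number generator differs from R's, so the actual assignment of observations will not match. The function names are hypothetical.

```python
import random


def split_indices(n_total=79, n_cv=60, k=5, seed=1000):
    """Shuffle row indices, then return (folds, test) with k equal folds."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    cv_part, test = indices[:n_cv], indices[n_cv:]
    fold_size = n_cv // k
    folds = [cv_part[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    return folds, test


def cv_pairs(folds):
    """Yield (train, validation) index lists, holding out one fold at a time."""
    for i, validation in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


folds, test = split_indices()
for train, validation in cv_pairs(folds):
    pass  # fit a model on `train`, compute the error on `validation`
```

Each iteration trains on 48 observations and validates on 12; the reported cross-validation error is the average over the five iterations.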
As for data manipulation, after dividing the data, we created two datasets. The first dataset contained all variables except for the variables that contained multiple responses. The second dataset included variables that were found in the previous chapter to affect students’ learning outcomes.
3.5.2. Model Evaluation
Although very often regression models (especially linear regression) are evaluated indirectly, by defining the model only by statistically significant variables, we decided to implement direct evaluation of our models. For this purpose, we used error metrics. In other words, the statistical significance of the variables in the created prediction models was not important to us, but the main criterion for evaluating the models was the predictive accuracy. The evaluation of the models therefore consisted of a comparison between the actual value and the value predicted by our models. We used the accuracy metrics based on residual characteristics. We calculated the residuals as follows:
$$e_i = \hat{y}_i - y_i$$
This error was defined as the difference between the predicted value $\hat{y}_i$ and the actual value $y_i$. Using residuals, we then calculated the residual accuracy metrics. We used mean square error (MSE) and mean absolute percentage error (MAPE) metrics to determine the accuracy of our created models. We calculated the MSE as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$
where $\hat{y}_i$ was the predicted value, $y_i$ was the actual value and $n$ was the number of observations in the defined set of observations. We calculated MAPE as follows:
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$$
We always calculated the MSE and MAPE error on both the training set and the test set. During cross-validation, MAPE was a criterion function on the validation set, according to which we optimized our models. We considered the best model to be the model in which the MAPE metric was the lowest on the validation set.
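The two error metrics can be written down in a few lines. The sketch below uses Python rather than R and assumes the standard definitions: the residual is the predicted value minus the actual value, MSE is the mean of the squared residuals, and MAPE is the mean absolute residual relative to the actual value, expressed in percent. The function names and example values are hypothetical.

```python
def residuals(actual, predicted):
    """Residual of each observation: predicted value minus actual value."""
    return [p - a for a, p in zip(actual, predicted)]


def mse(actual, predicted):
    """Mean square error."""
    errors = residuals(actual, predicted)
    return sum(e * e for e in errors) / len(errors)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = residuals(actual, predicted)
    return 100 / len(errors) * sum(abs(e / a) for e, a in zip(errors, actual))


# Two hypothetical students with actual GPAs 2.0 and 4.0.
example_mse = mse([2.0, 4.0], [2.2, 3.6])
example_mape = mape([2.0, 4.0], [2.2, 3.6])
```

MAPE is scale-free, which makes it convenient as a selection criterion, while MSE penalizes large individual errors more strongly.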
3.5.3. Created Models
We implemented several multi-aspect supervised machine learning models. The aim of these models was to use the widest possible range of variables for the most accurate prediction of the student’s GPA. Since our goal was to make the most accurate predictions of the student’s GPA, we decided to evaluate the models not from a statistical but from a predictive point of view. As stated before, we therefore implemented the so-called direct validation of our models through error metrics.
Hyperparameters and independent model variables were optimized for the best results. The objective of the model optimization was to minimize the error. In the case of linear regression models, the optimization was performed by selecting suitable variables (feature selection). For decision tree and random forest models, a loop was used that gradually changed the value of the minbucket or nodesize hyperparameter during the cross-validation phase. After cross-validation, the final validation was performed with the hyperparameter values for which the lowest average percentage error was recorded.
Linear Regression Models
We created several linear regression models, using different approaches to feature selection. In all models, the dependent variable was the student’s GPA, which we modelled using the other, independent variables. All mentioned models were created in the R programming language. As stated before, for higher credibility of our results, the linear regression models were also validated through cross-validation. Finally, the model was trained on the training set data and validated on the final validation set data.
First, a LIN1 linear regression model was created. This model predicted the dependent variable using such independent variables where a correlation with the dependent variable from correlation data analysis was identified. The LIN2 linear regression model was created almost in the same way; however, multicollinearity between independent variables was removed. The Variance Inflation Factor was used to identify these variables. Subsequently, all variables with a VIF value of more than 5 were removed. In this way, the variables fVideosAlgPS and sAverageSecond were removed. We also used standard feature selection procedure for selecting variables. We used forward regression as well as backward regression. In forward regression, the variables were added to the linear regression model gradually, and then the model, which was the best in terms of the AIC criterion, was selected. During backward regression, the variables were gradually removed from the model. In Table 5 we can see the results of our linear regression models.
As seen in Table 5, all models achieved good results. From the results of cross-validation, on average, model LIN4 (implemented through backward regression) was the most accurate on validation sets.
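The VIF-based filtering used for LIN2 can be illustrated with a small sketch. The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all other predictors. The code below is a self-contained Python illustration with ordinary least squares solved through the normal equations; it is not the R code used in the study, the helper names are hypothetical, and the toy predictors `x1`, `x2` and `x3` are invented for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 of an OLS regression of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def vif(predictors):
    """predictors: dict name -> values. Returns dict name -> variance inflation factor."""
    scores = {}
    for name, values in predictors.items():
        others = [v for k, v in predictors.items() if k != name]
        scores[name] = 1 / (1 - r_squared(values, others))
    return scores


def drop_high_vif(predictors, threshold=5.0):
    """Keep only predictors whose VIF does not exceed the threshold."""
    scores = vif(predictors)
    return {k: v for k, v in predictors.items() if scores[k] <= threshold}


# Toy predictors: x2 is almost a copy of x1, x3 is unrelated to both.
toy = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1.1, 1.9, 3.1, 3.9, 5.1, 5.9, 7.1, 7.9],
    "x3": [1, -1, -1, 1, 1, -1, -1, 1],
}
kept = drop_high_vif(toy)
```

Removing every variable above the threshold in a single pass, as in this sketch, drops both members of a collinear pair; an alternative design removes only the variable with the highest VIF and recomputes the scores before each further removal.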
Decision Tree Models
A regression decision tree (RDT) was also used to predict the dependent variable. Two regression decision tree models were created. The first regression decision tree (RDT1) used all variables as inputs. The second model (RDT2) used as inputs only the variables that were found to statistically significantly affect the student’s GPA.
In general, in the decision tree model, the accuracy of the prediction is affected by the minbucket parameter, which sets the minimum number of observations in a terminal node. To obtain the best results, we tested the value of minbucket experimentally using cross-validation. We performed 15 experiments, testing minbucket values from 1 to 15. Using the results from five-fold cross-validation, we selected the best value for the minbucket parameter (i.e., the value where the average MAPE error on the five cross-validation sets was minimal).
Table 6 shows the results of the cross-validation of the RDT1 model—the best results were achieved at minbucket = 3.
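The tuning loop described above can be sketched as follows. The sketch is in Python, not R, and it makes two loudly stated simplifications: a one-split regression stump on a single predictor stands in for a full regression tree, and the folds are formed by simple interleaving rather than random assignment. The function names and the toy data are hypothetical; only the idea of looping over minbucket values from 1 to 15 and keeping the value with the lowest average MAPE follows the text.

```python
def fit_stump(x, y, minbucket):
    """Fit a one-split regression stump; each leaf keeps at least `minbucket` points."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = None
    for i in range(minbucket, n - minbucket + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        mean_left, mean_right = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - mean_left) ** 2 for v in left)
               + sum((v - mean_right) ** 2 for v in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, mean_left, mean_right)
    if best is None:  # no admissible split: always predict the overall mean
        overall = sum(y) / n
        return lambda v: overall
    _, threshold, mean_left, mean_right = best
    return lambda v: mean_left if v < threshold else mean_right


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 / len(actual) * sum(abs((p - a) / a) for a, p in zip(actual, predicted))


def cv_mape(x, y, minbucket, k=5):
    """Average validation MAPE over k interleaved folds."""
    folds = [list(range(i, len(x), k)) for i in range(k)]
    errors = []
    for validation in folds:
        held_out = set(validation)
        train = [i for i in range(len(x)) if i not in held_out]
        model = fit_stump([x[i] for i in train], [y[i] for i in train], minbucket)
        errors.append(mape([y[i] for i in validation],
                           [model(x[i]) for i in validation]))
    return sum(errors) / k


def tune_minbucket(x, y, candidates=range(1, 16)):
    """Return the candidate with the lowest average MAPE and the full result table."""
    table = {mb: cv_mape(x, y, mb) for mb in candidates}
    best = min(table, key=table.get)
    return best, table


# Toy data: a step function, so small leaves are enough to capture the pattern.
x = list(range(1, 31))
y = [1.5 if v <= 15 else 3.0 for v in x]
best_minbucket, results = tune_minbucket(x, y)
```

On data this small, a large minbucket can make every split inadmissible, so the model falls back to predicting the mean; this is one reason why the optimal values reported for small samples tend to be low.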
In a similar way, the minbucket optimization was performed for the RDT2 model. The regressors for RDT2 were only the variables correlated with the dependent variable. Based on the results of another five-fold cross-validation procedure, the optimal value was determined to be minbucket = 2.
Table 7 shows the results from five-fold-cross-validation of the RDT2 model.
Random Forest Models
Finally, we created random forest models. To implement random forest models, we used the randomForest library in R. We also defined the number of trees using the ntree parameter. In our experiments, we set the number of trees ntree = 200. In addition, we also tested the nodesize hyperparameter, which had the same function as the minbucket parameter. Based on performed experiments, we determined the most appropriate nodesize value. We constructed two random forest models. Model RF1 used as inputs all independent variables, while the model RF2 used as inputs only variables that were statistically significantly correlated with the dependent variable.
Table 8 shows the results of performed five-fold-cross-validation for the RF1 model. The theoretical assumption, that the number of observations in the node does not play a significant role in the random forest, was confirmed.
In addition to the RF1 random forest model, we also created the RF2 random forest model. This time, the inputs contained only variables that were found to significantly affect the student’s GPA. We performed another five-fold cross-validation on the training set and experimentally found the best value of the nodesize parameter.
Table 9 shows the results of the cross-validation for RF2 model.
In cross-validation, the lowest percentage error of the RF2 model was achieved with the parameter nodesize = 2. This percentage error in the test set was slightly higher than in model RF1.
3.5.4. Comparative Analysis
The comparative analysis clearly summarizes the results from the previous sections. The procedure of the final comparison of models was as follows: The implemented models were optimized using cross-validation. In the case of linear regression models, the optimization concerned the appropriate selection of independent variables. For the decision tree and random forest models, the optimization concerned the selection of the minbucket and nodesize parameters.
Table 10 summarizes the results of the cross-validation. In decision trees and random forest models, only the models with the optimized value of hyperparameter are stated.
The models RF1, RF2 and the backward linear regression model LIN4 achieved the best generalization accuracy on the cross-validation sets. In other words, these models achieved, on average, the lowest error rate when tested on multiple validation sets composed of data not used for training (i.e., out-of-sample validation). The worst performers were the VIF-based models, which used fewer variables.
Since cross-validation was used to find the optimal hyperparameter values, especially in the random forest and decision tree models, each model was then trained once more on the full training set (60 observations) and tested on the test set, which included 19 observations. This was to show the true predictive power of the implemented models on our specific test set.
Table 11 shows the results on the final test set.
The RDT2 tree model was more accurate than the RDT1 model, and its percentage error in the final testing was 9%. A good sign is that on the final test set, none of the models had an error higher than 10%. The linear regression model implemented through backward regression was the most accurate. The other linear regression models were also very accurate.
4. Discussion
The assumption that the random forest model is generally a better predictor than the decision tree model was confirmed. The random forest model generates many trees that are not mutually correlated, and this contributes to results that are more reliable and ultimately more accurate than simple decision trees.
As stated in the previous section, we decided to construct models with all independent variables. The reason for that was that the quantification of decision tree and random forest models does not presuppose nonlinearity of independent variables. An interesting finding is that the random forest and decision tree models achieved comparable results even when the input dataset contained all independent variables, not only those identified in the correlation analysis as statistically significantly correlated with the dependent variable. In other words, the regression decision tree that used all variables in the prediction was almost as accurate as the tree that used only variables with a statistical dependence. However, the RDT1 model, which used the full set of independent variables, including those for which no significant correlation was found, delivered slightly worse MAPE results on the cross-validation sets. This could be due to the decision tree creation algorithm: when selecting a suitable split variable, some statistically insignificant variables reach a local optimum and to some extent worsen the overall predictive power of the model. For this reason, we recommend choosing only correlated variables when constructing decision tree models. A similar situation occurred when quantifying random forest models. In this case, based on several experiments, the choice between correlated variables only and all variables as potential inputs to the random forest model had a negligible effect on the accuracy of the model.
4.1. Significant Factors Influencing the GPA
Based on the correlation analysis, we identified the following significant factors influencing the student’s GPA:
Motivation to study again at the faculty (variable: pFriAgain, answer to the question: Would you choose the faculty for your university studies again?)
Regular studying during the whole semester (variable: sContinuousLearning, answer to the question: Do you study continuously during the whole semester?)
Lecture attendance (variable: sLectures, answer to the question: How often do you attend lectures?)
Grade from the subject Algorithms and Data Structures 1 (variable: fADS1, answer to the question: What was your grade in the subject Algorithms and Data Structures 1?)
Grade from the subject Database Systems (variable: fDS, answer to the question: What was your grade in the subject Database Systems?)
Grade from the subject Informatics 2 (variable: fInf2, answer to the question: What was your grade in the subject Informatics 2?)
Grade from the subject Mathematical Analysis 1 (variable: fMatA1, answer to the question: What was your grade in the subject Mathematical Analysis 1?)
Grade from the subject Discrete Optimization (variable: fDO, answer to the question: What was your grade in the subject Discrete Optimization?)
Grade from the subject Probability and Statistics (variable: fPaS, answer to the question: What was your grade in the subject Probability and Statistics?)
Use of consultation during studies (variable: fConsultations, answer to the question: To what extent do you use consultations with the teacher?)
Watching specific teaching videos on YouTube (variable: fVideosAlgPS, answer to the question: Have you watched videos on YouTube, where the teacher from the faculty explains topics from Algebra and from Probability and Statistics? How many have you seen?)
GPA in the first year of college (variable: sAverageFirst, answer to the question: What is your GPA in the first year of college?)
GPA in the second year of college (variable: sAverageSecond, answer to the question: What is your GPA in the second year of college?)
GPA at high school (variable: sAverageHighSchool, answer to the question: What is your GPA at high school?)
Group of students a student usually studies with (variable: sGroupSize, answer to the question: How big is the group of students you usually study with?)
Number of hours in a day a student uses social networks (variable: sSocialNetworks, answer to the question: How many hours a day do you spend on social media? (Facebook, Instagram, YouTube))
Number of hours a student sleeps per day (variable: sSleep, answer to the question: How many hours a day do you sleep?)
In our models, the variables sAverageFirst and sAverageSecond were most often identified as the determinants of the overall GPA, i.e., the student’s GPA is predicted most accurately when the model contains the GPA the student achieved in the completed years. The GPA that the student achieved in high school also proved to be significant, as well as the grades from specific subjects that the student has completed at the faculty. In addition, in the implementation section, we also implemented two models (RDT1 and RF1), which included all the variables that were available, i.e., even those where no statistically significant correlation was found with the GPA variable. An interesting finding is that the RDT1 decision tree, quantified with all independent variables, also identified as significant two variables that were not statistically correlated with the GPA. Specifically, these were the number of household members (dHouseholdMembers) and the number of cups of coffee a student drinks in a day (sCoffee).
It is no surprise that not all variables identified in the correlation analysis proved to be statistically significant in the quantified models. This could be because some of these independent variables were mutually correlated. In such a case, it was sufficient to include only one of the two mutually correlated variables in the model to explain the dependent variable.
4.2. Research Limitations
Our research has several limitations. The first limitation is the subjectivity of our results. Even though random forest models are generally considered more accurate than linear regression, in our case linear regression models made more accurate predictions on the final test set than the random forest or decision tree models. This result is unusual, and we believe there could be two reasons for it. The first reason may be a coincidence: linear regression models were simply lucky and better captured the data from the specific test set. As there was only one test set, there is some probability that linear regression could achieve higher accuracy on this set. The second reason may be that the final test set contained data that could be modelled by a linear function. Since linear regression is a linear model, it could model linear data better than decision tree or random forest models. To reduce the subjectivity of our results, in addition to standard ex-post testing, we also performed five-fold cross-validation. In this procedure we divided the training data into five folds, which gave five datasets for training and five datasets for validation of our models, and we then calculated the average error on the validation sets. The results from cross-validation differed from the results of the ex-post testing. While the random forest and decision tree models achieved the best results in cross-validation, this was not the case in the final test. This means that the decision tree and the random forest models are better in general, but this may not be the case for a specific test set that has been randomly selected from all data. In this one case, the linear regression models achieved better results.
Nevertheless, based on its higher objectivity, we would recommend choosing a random forest as the default model for GPA prediction on different datasets. Its generalization ability was better than that of linear regression and decision tree models.
The second limitation is the method selection. We are aware that there are many more methods for regression problems in machine learning, of which we used only some. We selected some basic methods (linear regression, decision trees) as well as more advanced methods (random forest). There is a chance that with other supervised machine learning methods we could achieve higher accuracy. For example, we could use deep neural networks and support vector machines or conditional random field models. However, we believe that even with our selected methods the accuracy was relatively high, and the models were able to model the GPA.
The width of our data (number of variables) is the third limitation. Even though we collected quite a number of variables from our respondents, we believe there are other variables (factors) which influence the GPA significantly. Unfortunately, it is not possible to collect all possible data from respondents. However, we believe that by obtaining additional information from students we could improve our models and make the prediction of the average more accurate. For example, these new variables could include grades from specific school-leaving subjects, the success rate percentage in the written part of the school-leaving examination or grades from several subjects that the student has successfully completed at the faculty.
The fourth limitation of our research is the size of our dataset (number of observations). Our sample consisted of answers of 79 respondents (university students). It is obvious that the results would not be the same if our dataset was different. However, it is questionable whether the increased size of our dataset would change our results. We tried to generalize our findings on the whole population by using statistical hypothesis testing and using more validation sets using cross-validation procedure.
Composition of our dataset is the fifth research limitation. We sampled students of the Faculty of Management Science and Informatics and most of our respondents were the students of the field Informatics. It is possible that if we had respondents who are studying different fields of study, the results could be different. However, this is only a hypothesis.
Finally, we believe that by extending our dataset by unstructured data we could achieve higher accuracy. For example, we could use face images of students, use convolutional neural network and then improve the accuracy of our models. Moreover, we could use some sort of text data from our students and use text analytics to improve our models.
5. Conclusions
The evaluation of student’s abilities and knowledge is defined by the GPA. According to it, we consider students to be good, bad, talented or lazy. Students’ learning outcomes and their achievements are influenced by many factors, such as learning patterns, talents, interpersonal relationships, motivation and many others. If we were able to identify factors which influence the GPAs of specific students, we could nudge students for better results. This effect would have many positive benefits for both the school and the student. Therefore, the first main objective of our research was to identify the factors that influence students’ learning results. Moreover, if we were able to predict GPAs, management of schools could optimize the functioning of the university, e.g., in the enrolling process for optional subjects or in the admission process to the university. Therefore, the second main objective of our research was to predict the GPA of students.
We reviewed the literature to assess the current state of research. Analyzing studies by many authors, we identified factors that influence students’ results and divided them into psychological, sociological and study factors. Using these findings, we designed a questionnaire and collected data from students of the Faculty of Management Science and Informatics at the University of Zilina.
To become familiar with the data, we first used the basic and structured exploratory analysis. We compared the answers of different groups of respondents and tested the differences in means between the groups using statistical tests. For this purpose, we used the Shapiro–Wilk test to check normality and the non-parametric Mann–Whitney U-test to compare the groups. To meet our first research objective (to identify the factors that influence students’ learning results), we performed correlation analysis, in which we examined whether each factor had a statistically significant relationship with the dependent variable, the student’s GPA. We used the Pearson test and the ANOVA test. Based on the correlation analysis, we identified the factors with a statistically significant relationship with the GPA. The identified factors were as follows: motivation to study again at the faculty, regular studying during the whole semester, lecture attendance, grade from the subject Algorithm and Data Structures 1, grade from the subject Algorithm and Database Systems, grade from the subject Informatics 2, grade from the subject Mathematical Analysis 1, grade from the subject Discrete Optimization, use of consultation during studies, watching specific teaching videos on YouTube, GPA in the first year of college, GPA in the second year of college, GPA at high school, group of students a student usually studies with, number of hours in a day a student uses social networks, number of hours a student sleeps per day, number of household members, number of cups of coffee a student drinks in a day.
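As an illustration of the correlation step described above, the following minimal sketch computes the Pearson correlation coefficient between one factor and the GPA and estimates its significance. It is not the code used in the study, which was carried out in R: it is written in plain Python, the function names `pearson_r` and `permutation_p_value` are hypothetical, the sample values are made up, and a permutation test stands in for the parametric Pearson test.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the null hypothesis of no correlation.

    This is a stand-in for the parametric Pearson test: y is shuffled
    repeatedly and we count how often |r| is at least as large as observed.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up example values (not data from the study).
sleep_hours = [5, 6, 6, 7, 7, 8, 8, 9]
gpa = [2.9, 2.6, 2.7, 2.2, 2.4, 1.9, 2.0, 1.6]

print("r =", round(pearson_r(sleep_hours, gpa), 3))
print("p =", round(permutation_p_value(sleep_hours, gpa), 4))
```

A factor would be flagged as statistically significant when the p-value falls below the chosen significance level (for example 0.05, an assumed threshold here).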
To meet our second main objective (to predict the GPAs of students), we implemented supervised machine learning models in the R programming language. The assumption was that these models would be able to predict the GPA of a student studying at the faculty from the other, independent variables. We created 10 models using linear regression, decision trees and random forests. One random forest model and one decision tree model used all variables as inputs. Based on the MAPE metric on the test set, the linear regression model created by the backward regression procedure provided the best results. However, the random forest model RF1 achieved the best average accuracy on the five validation sets in the cross-validation procedure; in other words, it generalized best to new data. Therefore, we recommend a random forest as a starting model for modeling learning outcomes based on other independent variables.
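The evaluation scheme described above (MAPE averaged over five validation sets) can be sketched as follows. This is a hedged illustration rather than the study's implementation: it uses plain Python instead of R, a one-predictor least-squares line stands in for the linear regression, decision tree and random forest models, the helper names `mape`, `fit_line` and `cross_validated_mape` are hypothetical, and the data are made up.

```python
import random


def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (stand-in model)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_mape(x, y, k=5, seed=0):
    """Average MAPE over k validation folds (k-fold cross-validation)."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint validation sets
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        predictions = [intercept + slope * x[i] for i in fold]
        scores.append(mape([y[i] for i in fold], predictions))
    return sum(scores) / k


# Made-up example: high-school GPA as the single predictor of university GPA.
high_school_gpa = [1.1, 1.3, 1.4, 1.6, 1.8, 2.0, 2.1, 2.3, 2.5, 2.8]
university_gpa = [1.4, 1.5, 1.8, 1.7, 2.1, 2.2, 2.0, 2.6, 2.7, 2.9]

print("mean validation MAPE (%):",
      round(cross_validated_mape(high_school_gpa, university_gpa), 2))
```

Comparing this averaged validation error across candidate models, rather than the error on a single test split, is what distinguishes the two rankings reported above.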