Article

Evaluation of Students’ Learning Engagement in Online Classes Based on Multimodal Vision Perspective

College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(1), 149; https://doi.org/10.3390/electronics13010149
Submission received: 31 October 2023 / Revised: 23 December 2023 / Accepted: 26 December 2023 / Published: 29 December 2023
(This article belongs to the Section Artificial Intelligence)

Abstract:
Evaluating student engagement in online classrooms makes it possible to alert distracted learners in a timely manner, effectively improving classroom learning efficiency. Based on data from online classroom scenarios, a cascaded analysis network model integrating gaze estimation, facial expression recognition, and action recognition is constructed to recognize student attention and grade engagement levels, thereby assessing the level of student engagement in online classrooms. Comparative experiments with the LRCN and C3D network models, among others, demonstrate the effectiveness of the cascaded analysis network model, whose engagement evaluations are more accurate than those of the other models. The proposed method compensates for the shortcomings of single-method evaluation models in detecting student engagement in the classroom.

1. Introduction

Online education provides ubiquitous learning opportunities for learners, making the learning process more learner-centered. However, it has been found in practice that most online learning platforms have issues with high dropout rates and low completion rates [1,2]. Jordan [3] studied the learning situation of courses on various MOOC (massive open online courses) platforms and found that the average completion rate of MOOC courses was only 15%, and could only reach 40% at most. By analyzing the learning behavior data of approximately 80,000 participants in six MOOCs offered by Peking University on Coursera, Jiang et al. [4] also found that learners had a rapid decrease in course engagement in the early stage and a gentler decrease in the later stage, resulting in a low completion rate. Enhancing the completion rate of online learning, reducing dropout rates, and improving the effectiveness of online learning have become concerns for many researchers. Several studies have found a positive relationship between learners’ learning outcomes and the degree of their participation in learning activities [5,6], and learners with lower engagement are likely to drop out [7]. Therefore, there is an urgent need to identify learners’ learning engagement and to provide timely interventions for those who are not highly engaged in order to ensure learning effectiveness. At the same time, a precise assessment of learners’ online learning conditions can help promote the iterative development of various online teaching platforms, providing technical support to create a personalized and intelligent teaching environment.
Learning engagement refers to the amount of physical and psychological energy that learners devote to learning [8] and reflects the psychological state of learners’ emotional and cognitive activities in course study, as well as the behavioral participation they invest in the course [9]. According to [10], learning engagement can be divided into three dimensions: emotional involvement, cognitive involvement, and behavioral involvement. Behavioral involvement is the basic dimension of learning engagement and is the carrier of emotional and cognitive involvement [11]. Traditional methods for evaluating learning engagement mainly include self-reporting, experience sampling, teacher grading, interviews, and observation. These methods are based on measuring and recording student performance, use frequency data analysis as their basic technique, and often rely on pen-and-paper observation records or small-scale learning log data, which greatly limits their efficiency and credibility. Moreover, because teachers and students are separated in time and space, class sizes are large, and teacher supervision is weakened, traditional evaluation methods are not well suited to the online learning environment. In recent years, artificial intelligence technology has made significant progress in recognizing human emotions and behaviors [12,13], and there is broad agreement in the industry on the importance of applying AI technology to smart education; the recognition of learning engagement should be one of its most important application scenarios. Based on this, a multimodal vision-based method for assessing student learning engagement in online classes is proposed, which compensates for the deficiencies of single evaluation models in detecting student learning participation in the classroom.

2. Related Work

Since the 1980s, student learning engagement has become an important topic in the field of educational research. Part of the reason for researchers’ interest in learning engagement in the early days was concern about high dropout rates and the fact that 25–60% of students reported feeling bored and disengaged in class [14,15]. Evaluating students’ learning engagement can assist teachers in understanding students’ participation in learning, so as to intervene in a timely manner, help students reflect on their own learning, and promote their active participation in learning.
Methods based on computer vision can measure learners’ engagement by studying cues such as gestures, postures, eye movements, and facial expressions. These cues can be perceived by external observers and are the basis on which teachers adjust their teaching behavior in traditional classroom settings. Whitehill et al. [16] manually labeled the expressions of 7574 frames from the HBCU dataset and 16,722 frames from the UC dataset as disengaged, pretending to be engaged, engaged, and highly engaged, and then used Boost (BF), SVM (Gabor), and MLR (CERT) methods for automatic recognition of learning engagement; their experimental results showed that machine learning methods recognize learning engagement with high accuracy. Grafsgaard et al. [17] also used the Computer Expression Recognition Toolbox (CERT) to analyze facial movements during computer-mediated tutoring and showed that upper facial movements can predict learning engagement, frustration, and learning ability. Monkaresi et al. [18] used a Kinect face tracker and heart rate to detect learners’ engagement in educational activities, applying updatable Naive Bayes, Bayesian network, K-means clustering, Rotation Forest, and Dagging classifiers for decision-level and feature-level fusion; they found that accuracy based on facial expressions is higher than that based on heart rate. Zhang et al. [19] used cameras to capture students’ facial images while simultaneously recording mouse movement data, and reported that the recognition accuracy of learning engagement using both expression and mouse data was 94.6%, compared with 91.5% using expression data alone. Zhan [20] combined expression recognition and eye-tracking technology to construct an intelligent-agent-based emotion and cognition recognition model for remote learners; by iteratively coupling eye tracking with expression monitoring, the model improves the recognition accuracy of remote learners’ learning status and strengthens the agent’s emotional and cognitive support for learners. With the successful application of deep learning in the field of computer vision, some researchers have also used it for student learning engagement detection. Alkabbany et al. [21] proposed a real-time automatic measurement method for student engagement by analyzing students’ behaviors and emotions. Zhang et al. [22] studied classroom learning engagement recognition by analyzing learners’ facial information in mixed scenes.
In real-world scenarios, it is difficult to have learners wear expensive measurement devices such as eye trackers during classroom learning. Cameras offer convenience, timeliness, and richness in data collection, and the widespread use of facial recognition attests to the reliability and accuracy of image data. Capturing data via cameras does not intrude on learners, much as teachers observe learners’ participation in instructional activities without interrupting them. Images extracted from cameras can be used to analyze learners’ gaze, facial expressions, and classroom behavior, and thus determine their participation in the class. A person’s gaze can provide crucial clues for analyzing attention, intent, and motivation. Liu et al. [23] explored the relationship between head posture and human attention. Chen et al. [24] used single-image head posture to project students’ gaze onto the video image of the teacher’s lecture, enabling a visual analysis of students’ learning attention. Singh et al. [25] also estimated people’s head posture through deep learning and used it to judge attention. These studies indicate that learners’ attention can be mapped through image-based analysis of head posture.
Zhou et al. [26] pointed out that “learning emotion is an important factor affecting students’ cognitive processing and learning outcomes”. Xu et al. [27] likewise found that recognizing learning emotions from facial expressions is highly usable. Based on control-value theory, Loderer et al. [28] conducted a detailed study of 186 articles on learning emotions published from 1965 to 2018, concluding that emotion is an important driving factor for learning and can reflect learners’ engagement. Zhao et al. [29] examined students’ learning engagement in the classroom from the perspective of classroom behavior. Alkabbany et al. [30] recognized classroom engagement through the analysis of students’ facial key points, head posture, eye gaze, and learning features.
Therefore, using a camera as the data collection device and applying deep learning methods to recognize and analyze learners’ various biological feature data (face, expression, behavior, and posture), we aim to explore the intrinsic link between learners’ physiological and behavioral characteristics and their learning engagement. This work seeks to establish a recognition method for student learning engagement based on multimodal visual information, promoting the automated, intelligent, and fine-grained evaluation of learning engagement.

3. Proposed Method

3.1. Hierarchical Analysis Framework for Online Classroom Student Engagement

We obtain the learner’s class video data through the built-in camera of the learner’s device or a peripheral camera (placed centrally above the computer screen). Our proposed method evaluates classroom learning engagement by analyzing multimodal visual information such as gaze, facial expressions, and classroom behavior during the learning process. We apply a hierarchical strategy in our evaluation method. First, we perform person detection and identity recognition on the input data; if no person is present or the detected face is not the learner themselves, the segment is judged as not engaged. Second, we estimate the learner’s gaze: the resulting yaw and pitch angles are passed through a function, and if the function value exceeds the set threshold, the learner is considered to be facing the screen; otherwise, they are not. Facing the screen is a prerequisite for paying attention, and facial expression recognition then determines whether the learner is engaged in learning. When the learner is not facing the screen, they are not necessarily disengaged; writing and reading, for example, are expressions of engagement in learning. Therefore, our method recognizes the learner’s actions when they are not facing the screen to judge whether they are engaged. The overall research plan is shown in Figure 1.
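As a minimal sketch of this cascade, the Python snippet below assumes the three models’ outputs (gaze angles, expression label, action label) are already available for a segment; the function names, label strings, and thresholds are illustrative rather than the authors’ implementation, and the screen-focus ranges are those derived later in Section 4.2.

```python
from typing import Optional

ENGAGED_EXPRESSIONS = {"happy", "neutral", "surprised", "confused"}  # positive attitude (Table 1)
ENGAGED_ACTIONS = {"writing", "reading"}                             # positive attitude (Table 2)

def facing_screen(yaw_deg: float, pitch_deg: float) -> bool:
    """Screen-focus test using the gaze ranges derived in Section 4.2."""
    return -22.5 <= yaw_deg <= 22.5 and -25.0 <= pitch_deg <= 0.0

def judge_engagement(identity_ok: bool,
                     gaze: Optional[tuple],      # (yaw, pitch) in degrees, None if no face found
                     expression: Optional[str],  # FER output when facing the screen
                     action: Optional[str]) -> str:
    """Cascade decision for one video segment, given the three models' outputs."""
    if gaze is None or not identity_ok:
        return "not engaged"                     # nobody present, or not the enrolled learner
    yaw, pitch = gaze
    if facing_screen(yaw, pitch):
        return "engaged" if expression in ENGAGED_EXPRESSIONS else "not engaged"
    return "engaged" if action in ENGAGED_ACTIONS else "not engaged"

# Example: learner looking at the screen with a neutral expression
print(judge_engagement(True, (5.0, -10.0), "neutral", None))   # -> "engaged"
```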

3.2. The Gaze Estimation Model

The direction of human visual attention is highly correlated with head posture and gaze direction. Langton et al. [31] found that, in most cases, a person’s direction of attention can be obtained by analyzing the angle of their head posture. In the short-distance setting of online classrooms, however, gaze estimation is a better way to represent the direction of attention. In the online learning environment, gaze direction is the key factor in judging whether learners are engaged. When learners are listening to lectures, the yaw and pitch angles of their gaze stay within a certain range; when learners are engaged, their line of sight should be on the screen. If the gaze deviates from the screen for a long time, it may be a sign of disengagement.
The L2CS-Net [32] model has made significant progress in the field of gaze estimation. It uses a convolutional neural network (CNN), with ResNet50 as the backbone, to effectively handle complex visual data, predict the yaw and pitch gaze angles separately, and improve prediction accuracy. The structure of this model is shown in Figure 2A.
Rather than directly regressing the two angles, L2CS applies two different loss functions to each angle, combining the regression and classification tasks to obtain a precise value. Specifically, for each of the yaw and pitch angles, the angle range is divided into 90 bins at 4-degree intervals, which yields a class label for each angle. First, the classification loss is computed using cross-entropy. The bin label obtained from the classification is then converted back to angle data: the Softmax output is multiplied by the corresponding angles and summed to produce the predicted angle, for which MSE (mean squared error) is used as the regression loss. The two losses are balanced by the parameter α. The loss function is defined as:
$$\mathrm{loss}_{Gaze} = -\sum_{i} y_i \log p_i + \alpha \cdot \frac{1}{N} \sum_{i=0}^{N} \left( y - p \right)^2 , \tag{1}$$
where p is the predicted value, y is the ground-truth value, and α is the regression coefficient.
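As a rough PyTorch illustration of this combined objective (not the authors’ code), the sketch below bins one angle into 90 classes of 4°, applies cross-entropy on the bin logits, recovers a continuous angle as the Softmax-weighted sum of bin centers, and adds an α-weighted MSE term; the bin range of [−180°, 180°) is an assumption and should be adapted to the dataset used.

```python
import torch
import torch.nn.functional as F

def l2cs_angle_loss(logits, angle_gt_deg, alpha=1.0):
    """Combined classification + regression loss for one gaze angle (yaw or pitch).

    logits:        (B, 90) raw scores over 4-degree bins
    angle_gt_deg:  (B,) ground-truth angle in degrees
    Assumes the bins cover [-180, 180) degrees.
    """
    bin_centers = torch.arange(90, device=logits.device) * 4.0 - 180.0 + 2.0  # (90,)

    # Classification target: index of the bin containing the ground-truth angle
    bin_target = ((angle_gt_deg + 180.0) // 4.0).clamp(0, 89).long()
    cls_loss = F.cross_entropy(logits, bin_target)

    # Continuous prediction: Softmax-weighted sum of bin centers, penalized with MSE
    angle_pred = (F.softmax(logits, dim=1) * bin_centers).sum(dim=1)
    reg_loss = F.mse_loss(angle_pred, angle_gt_deg)

    return cls_loss + alpha * reg_loss

# Example with random logits for a batch of 8 samples
loss = l2cs_angle_loss(torch.randn(8, 90), torch.rand(8) * 90 - 45, alpha=2.0)
```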

3.3. The Proposed Facial Expression Recognition Model

Deep learning has achieved significant results in computer vision tasks and has made great strides in facial expression recognition. However, in an online learning environment, the facial images of learners present issues such as interclass differences, image blur, pose changes, and occlusions. These challenges cause the recognition accuracy of existing algorithms to be generally low. Furthermore, due to privacy concerns, collecting facial expression data from learners in a laboratory-controlled environment is both difficult and time-consuming, resulting in smaller datasets (in the thousands or even hundreds). Therefore, the generalization and practical application capabilities of facial expression analysis in solving real-world problems, such as online learning, require further research and development.
RepVGG [33] is an improved VGG network model. Its core idea is to use structural reparameterization to transform the multipath structure of the training network into the single-path structure of the inference network, ultimately improving the network’s inference efficiency. During training of the FER (facial expression recognition) network, RepVGG introduces direct-connection residual branches and 1 × 1 convolution branches, enhancing the efficiency of training. The initial feature size input into the RepVGG network model is C × H × W (where C is the number of channels, and H and W are the height and width of the frame image, respectively). The model extracts features through several convolutional modules composed of multipath branches, including 3 × 3 convolution + BN, 1 × 1 convolution + BN, and residual connection + BN, each embedded with a max pooling layer and the ReLU activation function. The max pooling layer reduces redundant spatial information from the convolution operations, while integrating the BN normalization into the convolution layer reduces the number of model layers, decreases the model’s dependency on parameter initialization, and enhances the performance of the network model.
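To illustrate the multi-branch training-time block described above, the simplified PyTorch module below (a sketch, not the official RepVGG implementation) sums a 3 × 3 convolution + BN branch, a 1 × 1 convolution + BN branch, and an identity BN branch before the ReLU; at inference time these branches can be fused into a single 3 × 3 convolution by structural reparameterization (fusion code omitted here).

```python
import torch
import torch.nn as nn

class RepVGGStyleBlock(nn.Module):
    """Training-time multi-branch block: 3x3 conv+BN, 1x1 conv+BN, identity BN."""
    def __init__(self, channels):
        super().__init__()
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch_id = nn.BatchNorm2d(channels)   # residual connection + BN
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch3x3(x) + self.branch1x1(x) + self.branch_id(x))

# Example: a 64-channel feature map of size 56x56
out = RepVGGStyleBlock(64)(torch.randn(1, 64, 56, 56))
```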
Our FER method uses RepVGG as the backbone network and pretrains it on the FER2013 facial expression recognition dataset. When applied to the student engagement evaluation model, the Multi-Task Cascaded Convolutional Network (MTCNN) is first used to obtain the face cropping box. The backbone network outputs data features and pre-classification results, which are fed into an MLP for further training. Finally, the network is optimized using a weighted classification cross-entropy (Softmax) loss function. The structure of this model is shown in Figure 2B.

3.4. The Proposed Action Recognition Model

Action recognition research focuses on the actions of individuals in video clips and must extract not only the spatial features of human actions but also features in the time dimension. When video data are used for action recognition, 3D convolutional neural networks can extract action information from multiple consecutive video frames. The 3D CNN is an extension of the 2D convolutional neural network that computes features over both the spatial and temporal dimensions and can make good use of the sequential information in the video. The 3D convolution is achieved by stacking multiple consecutive frames and applying a cube-like convolution kernel to capture the action features in the temporal information.
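As a brief, generic PyTorch illustration (not the proposed network) of how a cube-like kernel operates on stacked frames, a 3D convolution takes a clip tensor of shape (batch, channels, time, height, width) and convolves jointly over the temporal and spatial axes:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)        # 16 stacked RGB frames of size 112x112
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)  # cube-like kernel over T, H, W
print(conv3d(clip).shape)                      # torch.Size([1, 64, 16, 112, 112])
```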
ResNet [34] is a classic network widely used in image classification. Our method uses the Inflated 3D ConvNet idea to convert ResNet into a 3D convolutional neural network, enabling ResNet to be applied to video data. Compared with image-based convolutional neural networks, it takes continuous video data as input and incorporates spatial information at different time steps, allowing video features to be extracted more completely.
For video data, temporal features should change neither too quickly nor too slowly along the time dimension, so the spatio-temporal receptive field is very important for model construction. If the receptive field in the time dimension exceeds that in the spatial dimension, the edge features of different objects may merge, damaging early feature extraction. Conversely, if the spatial receptive field exceeds the temporal one, the dynamic features of the scene may not be captured well. Therefore, in our method, the max pooling in the first stage does not pool over time and keeps the spatial scale unchanged, so as to better extract spatio-temporal features. The structure of this model is shown in Figure 2C.
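A minimal sketch of such a first stage is shown below, assuming an I3D-style inflated ResNet stem (the 3 × 7 × 7 stem kernel and its strides are assumptions, not the authors’ exact configuration); the first max pooling uses a temporal kernel and stride of 1 so no pooling is performed along time, and a stride of 1 spatially so the spatial scale is also preserved.

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    # Inflated 7x7 stem convolution: the kernel also spans the time dimension
    nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3), bias=False),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    # First-stage max pooling: no pooling along time, spatial scale unchanged (stride 1)
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=1, padding=(0, 1, 1)),
)

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)
print(stem(clip).shape)                   # torch.Size([1, 64, 16, 56, 56])
```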
The second to fifth stages are composed of Res1. and Res2. submodules, with the structural details of the two submodules shown in Figure 3.

4. Design of Experiments

4.1. Datasets

Since there is currently no publicly available online classroom student behavior dataset that includes action information, we collected our own video dataset with 50 undergraduate students as research subjects: 28 males and 22 females, all aged between 18 and 22. These 50 learners were in a normal online classroom learning state and turned on video recording to capture their online classroom behavior, which was used to construct the online classroom student behavior dataset. Figure 4 shows part of the dataset.
Each class session lasts 100 min. Each student’s online learning video was preprocessed by removing footage outside of class time and trimming the original videos. Every 10 s was treated as one segment for engagement analysis, yielding 30,000 video segments in total. We invited four experts to annotate the data: the teacher who taught the class when the data were collected and three teachers who taught the same course. Each video segment was annotated by two teachers; if the two teachers disagreed about a segment, a third teacher provided an additional annotation, and if all four teachers’ annotations differed, they annotated the segment collectively. Each sample’s label was recorded in the format “expression–action–engagement level”, with the engagement level following the classification scheme of Gupta et al. [35]. After the data were labeled, seven common classroom actions were identified (listening to lectures, writing, reading, eating, looking around, sleeping, and playing with mobile phones), along with six expressions (surprise, confusion, calm, happiness, tiredness, and boredom), completing the construction of the dataset. Figure 5 shows the distribution of the number of samples across the four engagement levels in the dataset, where level 0 indicates very low learner engagement and level 3 indicates very high engagement.
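The segmentation and labeling convention can be sketched as follows; this is illustrative code only, and the frame rate, example label string, and helper name are assumptions rather than details reported in the paper.

```python
# Illustrative sketch of the 10 s segment and label convention (fps is an assumption).
FPS = 25
SEGMENT_SECONDS = 10
FRAMES_PER_SEGMENT = FPS * SEGMENT_SECONDS   # 250 frames per 10 s segment

def parse_label(label: str):
    """Split an annotation of the form 'expression–action–engagement level'."""
    expression, action, level = label.split("–")
    return expression, action, int(level)

print(parse_label("calm–listening to lectures–3"))   # ('calm', 'listening to lectures', 3)
```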

4.2. Analysis of Student Engagement Based on Gaze Estimation

The degree of students’ engagement in online classrooms can be determined by analyzing whether their gaze is focused on the display area of the learning device, which indicates the direction of their attention. When students listen to lectures through online learning devices, the center of the screen is considered the focal point of their vision. The angle swept by the gaze from this central point to the left and right edges of the screen is termed the yaw angle, and the angle swept up and down towards the screen’s edges is the pitch angle.
As shown in Figure 6, when a student in an online classroom focuses on a point within the screen, such as $P_1$ or $P_2$, their level of learning engagement can be judged by their facial expression. Conversely, when a student focuses outside the screen for a long time, such as at $P_3$, their level of learning engagement needs to be determined by recognizing their behavior.
Students at university mainly use laptop computers for online learning. Based on this, with the center point of the upper boundary of the display area (where the camera sits) as the origin, the width of the display area as $W$, the height of the display area as $H$, and the distance between the student’s head and the learning device as $L$, the gaze boundary angles can be determined as shown in Formulas (2)–(5):
$$\theta_{PitchUp} = 0, \tag{2}$$
$$\theta_{PitchDown} = -\arctan\frac{H}{L}, \tag{3}$$
$$\theta_{YawLeft} = -\arctan\frac{W}{2L}, \tag{4}$$
$$\theta_{YawRight} = \arctan\frac{W}{2L}. \tag{5}$$
After collecting statistics on the learning devices and learning environments of current online students, and standardizing the display area of the learning device and the distance between the device and the person, the distance $L$ is 72 cm, the height of the display area $H$ is 33.5 cm, and the width of the display area $W$ is 59.6 cm. The range of gaze angles can then be determined: [0°, −25°] in the pitch direction and [−22.5°, 22.5°] in the yaw direction. That is, a student whose gaze angles fall within this range is focused on the display area; otherwise, their gaze has left the display area. In this way, the attention in each video segment can be calibrated based on the proportion of time the gaze stays on the display area of the learning device. The state detection results of the gaze estimation method are shown in Figure 7.
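Using the geometry above, a small Python sketch with the standardized measurements from this section computes the boundary angles and checks whether a predicted gaze falls on the display area; the sign conventions follow the ranges stated in the text, and the helper name is illustrative.

```python
import math

# Standardized setup from Section 4.2 (centimetres)
L = 72.0     # distance between the student's head and the device
H = 33.5     # height of the display area
W = 59.6     # width of the display area

pitch_down = -math.degrees(math.atan(H / L))        # about -25 degrees
yaw_half   =  math.degrees(math.atan(W / (2 * L)))  # about 22.5 degrees

def on_screen(yaw_deg: float, pitch_deg: float) -> bool:
    """True if the estimated gaze angles fall inside the display area."""
    return -yaw_half <= yaw_deg <= yaw_half and pitch_down <= pitch_deg <= 0.0

print(f"pitch: [{pitch_down:.1f}, 0], yaw: [{-yaw_half:.1f}, {yaw_half:.1f}]")
print(on_screen(10.0, -12.0), on_screen(35.0, -5.0))  # True False
```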

4.3. Analysis of Student Engagement Based on Facial Expression Recognition

The renowned psychologist Ekman proposed that joy, sadness, fear, anger, surprise, and disgust constitute the basic human emotions, and that many other emotions can be expressed through different combinations of them. Learning emotions refer to the positive or negative emotions students express during the learning process, which reflect the student’s interest in and experience of learning.
Emotional investment is a crucial component of the learning process. Positive emotions such as happiness, concentration, and curiosity not only indicate a high degree of learner engagement but also aid in improving the learner’s performance; negative emotions like boredom and distraction not only suggest less learner engagement but also predict a drop in the learner’s performance. Although a confused expression may show that the teaching content is inconsistent with the learner’s prior cognition, it precisely demonstrates that the learner is engaged in learning. Referencing the research findings of D’Mello et al. [36] and Loderer et al. [28], we provide a mapping relationship between learning emotions and learning engagement, as shown in Table 1.
When students appear happy or calm during the learning process, it can be assumed that they are attentively listening to the lecture; when students display expressions of surprise or confusion, it is believed that they are more interested in the learning content, that is, they are more engaged in learning; when the expression is tired or bored, the student may be in a state of disengagement from learning. The effect diagram of student facial expression analysis is shown in Figure 8.
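A compact way to encode the Table 1 mapping in code is a simple lookup; this is only a sketch, and the label strings mirror the expression classes used in this paper.

```python
# Sketch of the Table 1 mapping from recognized expression to learning attitude.
EXPRESSION_TO_ATTITUDE = {
    "neutral":   "positive",
    "surprised": "positive",
    "confused":  "positive",
    "happy":     "positive",
    "tired":     "negative",
    "bored":     "negative",
}

print(EXPRESSION_TO_ATTITUDE["confused"])  # 'positive': confusion still indicates engagement
```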

4.4. Analysis of Student Engagement Based on Action Recognition

At present, many researchers distinguish the learning engagement reflected in classroom behavior along two dimensions: engagement and disengagement. Engagement is a positive learning behavior characterized by effort, attention, and persistence, whereas disengagement is a negative state often manifested as passivity, lack of initiative, and giving up. Drawing on the S-T analysis method and research results such as [29], this study statistically selects the main behaviors of online classroom students when they are not looking at the screen and provides a preliminary mapping between online classroom learning behavior and learning engagement, as shown in Table 2.
The engagement of a learner in the natural listening state can already be judged through gaze estimation and facial expression recognition, so this state does not need to be considered in the action-based engagement judgment. The visualization results for the other behavior classes are shown in Figure 9.
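Analogously to the expression mapping, the Table 2 behavior mapping can be written as a small lookup (a sketch using the action labels from the dataset):

```python
# Sketch of the Table 2 mapping from recognized off-screen behavior to learning attitude.
ACTION_TO_ATTITUDE = {
    "writing":                   "positive",
    "reading":                   "positive",
    "eating":                    "negative",
    "looking around":            "negative",
    "sleeping":                  "negative",
    "playing with mobile phone": "negative",
}

print(ACTION_TO_ATTITUDE["reading"])  # 'positive'
```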

4.5. Online Classroom Student Engagement Rating

Considering the strong subjectivity in rating students’ engagement, different experts may give significantly different engagement scores for the same student behavior. Therefore, in light of the content above, we summarize the approach of evaluating students’ learning participation by rating the engagement level of students in online classrooms. Following [35], we define four engagement levels, from level 0 to level 3 in increasing order. When no student appears in front of the camera, engagement is assumed to be absent. The rating criteria for the other engagement detection indicators are shown in Figure 10.

5. Experiments and Results Analysis

5.1. Additional Datasets

Due to the continuity of the gaze point of the human eye, it is difficult to mark the gaze angles of each student through expert judgment. Therefore, we use the MPIIGaze public dataset to train the gaze estimation model.

5.2. Experiment Details

The experiments use the PyTorch 1.12 deep learning framework and are conducted on 4 HYGON Z100SM DCUs. All three models are trained with stochastic gradient descent (SGD), with a momentum of 0.9 and a weight decay of 1 × 10⁻⁵. The gaze estimation model is trained on the public dataset for 60 epochs with a batch size of 32. Both the facial expression recognition model and the action recognition model are trained on our self-constructed online classroom engagement dataset for 80 epochs with a batch size of 32.
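The optimizer configuration described above corresponds to a PyTorch setup like the sketch below; the dummy model, random data, and learning rate are assumptions for illustration only (the paper does not report the learning rate), while the momentum, weight decay, epoch count, and batch size follow the text.

```python
import torch
import torch.nn as nn

# Dummy stand-in for one of the three networks; the real models are described in Section 3.
model = nn.Linear(10, 4)
criterion = nn.CrossEntropyLoss()

# SGD with momentum 0.9 and weight decay 1e-5, as used for all three models.
# The learning rate of 0.01 is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)

for epoch in range(80):                        # 80 epochs (60 for the gaze model)
    inputs = torch.randn(32, 10)               # batch size 32 (random data for illustration)
    targets = torch.randint(0, 4, (32,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```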
To verify the effectiveness of the proposed methods, we conduct comparative experiments with current advanced methods. As can be seen in Table 3, L2CS-Net outperforms the other gaze estimation methods, achieving state-of-the-art performance with mean angular errors of only 3.96° and 3.92° when α is set to 1 and 2, respectively.
Table 4 presents the accuracy of the facial expression recognition model proposed in this article, along with EfficientFace, EAC, and HSEmotion, on our dataset. It can be seen that our method achieves the highest accuracy.
Table 5 presents the comparative experimental results of different action recognition methods on our dataset, showing that our proposed action recognition method likewise achieves the highest accuracy.
From the three tables above, it can be concluded that the gaze estimation, facial expression recognition, and action recognition models we used all show significant improvements compared to the classical state-of-the-art (SOTA) models. Moreover, the three methods achieve high precision in their respective tasks, which is sufficient to support the process of assessing student engagement levels in online classrooms.
Deep learning is widely used in video classification, and our constructed online classroom student behavior dataset includes learning engagement level labels. Since action recognition based on video data is itself a type of video classification, while verifying the feasibility of our process for assessing student engagement levels in online classrooms, we also evaluated how accurately video classification models can directly grade the student engagement level. The experimental results are shown in Table 6.
As can be seen, our proposed model shows a significant improvement over the other models, achieving an accuracy of 83.7%. Moreover, when directly grading the engagement level of students in online classrooms, the accuracies of the LRCN, C3D, and Two Stream networks decrease significantly compared with their action recognition results. This is because the features involved in learning engagement are more extensive than those in action recognition: different videos at the same engagement level can present different features, which inevitably leads to a decrease in accuracy.
To observe the performance of our framework at various engagement levels in an intuitive way, Figure 11 shows the confusion matrix for the classification of engagement levels at various grades.
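Such a confusion matrix can be produced from the predicted and ground-truth engagement levels with a few lines; this is a generic scikit-learn sketch, and the short label arrays here are purely illustrative.

```python
from sklearn.metrics import confusion_matrix

# Illustrative ground-truth and predicted engagement levels (0-3) for a handful of segments.
y_true = [3, 2, 3, 1, 0, 2, 3, 1]
y_pred = [3, 2, 2, 1, 0, 2, 3, 0]

print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3]))
```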

5.3. Discussion of the Experimental Results

For educators, the assessment of students’ engagement in online classes is based on multiple aspects of behavior. Therefore, evaluation using computer vision methods also requires combining multiple features of online student behavior, which is why our proposed method is more effective than traditional end-to-end deep learning methods.
Learning behavior is a continuous and multifaceted process. Standalone gaze estimation, facial expression recognition, and action recognition methods all ignore this multifaceted nature. Although they can roughly predict the state of students in online classrooms, they cannot accurately detect whether students are focused on learning, so they perform poorly in predicting student engagement. Likewise, because learning behavior integrates multiple features such as gaze, facial expression, and action, traditional end-to-end deep learning methods are not very effective and are unable to accurately judge the engagement level of students.
The method we proposed fully utilizes various feature information of online classroom student behavior data, and the accuracy of grading student learning engagement reaches 83.7%. It is suitable for analyzing online classroom scenarios and can be used to identify fatigued and hyperactive learning behaviors. Our model can be used to provide feedback on students’ learning conditions, thereby improving the quality of teaching.

6. Conclusions

Based on gaze estimation, facial expression recognition, and action recognition, this study built a cascade analysis network model to analyze the engagement of students in online classrooms. We not only considered the continuity and complexity of learning behaviors but also achieved good results on the constructed dataset of online classroom student learning behaviors. Furthermore, by comparison with other relevant algorithms, we verified the effectiveness of the proposed method in grading student engagement in online classroom scenarios, with the accuracy of engagement level recognition reaching 83.7%. In online teaching practice, by analyzing students’ engagement in the online classroom through the cascade analysis network model constructed in this study, teachers can understand students’ engagement throughout the teaching period in a timely manner. This helps teachers adjust their teaching plans and improve their teaching schemes, providing a basis for reference and helping to enhance the efficiency of online classroom teaching.

Author Contributions

Conceptualization, Y.Q.; experiments, L.Z.; data curation, X.H. and A.L.; writing—original draft preparation, L.Z.; writing—review and editing, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62267007 and Gansu Provincial Department of Education Industrial Support Plan Project under Grant 2022CYZC-16.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, X.; Wang, D. On the development process and main characteristic of MOOC. Mod. Educ. Technol. 2013, 23, 5–10. [Google Scholar]
  2. Reich, J.; Ruipérez-Valiente, J.A. The MOOC Pivot. Science 2019, 363, 130–131. [Google Scholar] [CrossRef] [PubMed]
  3. Jordan, K. MOOC Completion Rates: The Data. Available online: http://www.katyjordan.com/MOOCproject.html (accessed on 5 August 2023).
  4. Jiang, Z.; Zhang, Y.; Li, X. Learning behavior analysis and prediction based on MOOC data. J. Comput. Res. Dev. 2015, 52, 614–628. [Google Scholar]
  5. Hughes, M.; Salamonson, Y.; Metcalfe, L. Student Engagement Using Multiple-Attempt ‘Weekly Participation Task’ Quizzes with Undergraduate Nursing Students. Nurse Educ. Pract. 2020, 46, 102803. [Google Scholar] [CrossRef] [PubMed]
  6. Brozina, C.; Knight, D.B.; Kinoshita, T.; Johri, A. Engaged to Succeed: Understanding First-Year Engineering Students’ Course Engagement and Performance Through Analytics. IEEE Access 2019, 7, 163686–163699. [Google Scholar] [CrossRef]
  7. Wang, W.; Guo, L.; He, L.; Wu, Y.J. Effects of Social-Interactive Engagement on the Dropout Ratio in Online Learning: Insights from MOOC. Behav. Inf. Technol. 2019, 38, 621–636. [Google Scholar] [CrossRef]
  8. Oh, C.; Roumani, Y.; Nwankpa, J.K.; Hu, H.-F. Beyond Likes and Tweets: Consumer Engagement Behavior and Movie Box Office in Social Media. Inf. Manag. 2017, 54, 25–37. [Google Scholar] [CrossRef]
  9. Sun, Y.; Ni, L.; Zhao, Y.; Shen, X.; Wang, N. Understanding Students’ Engagement in MOOCs: An Integration of Self-determination Theory and Theory of Relationship Quality. Br. J. Educ. Technol. 2019, 50, 3156–3174. [Google Scholar] [CrossRef]
  10. Fredricks, J.A.; Blumenfeld, P.C.; Paris, A.H. School Engagement: Potential of the Concept, State of the Evidence. Rev. Educ. Res. 2004, 74, 59–109. [Google Scholar] [CrossRef]
  11. Wu, F.; Zhang, Q. Learning behavioral engagement: Definition, analysis framework and theoretical model. China Educ. Technol. 2018, 372, 35–41. [Google Scholar]
  12. Zhang, F.; Zhang, T.; Mao, Q.; Xu, C. Geometry Guided Pose-Invariant Facial Expression Recognition. IEEE Trans. Image Process. 2020, 29, 4445–4460. [Google Scholar] [CrossRef]
  13. Zhang, H. The Literature Review of Action Recognition in Traffic Context. J. Vis. Commun. Image Represent. 2019, 58, 63–66. [Google Scholar] [CrossRef]
  14. Larson, R.W.; Richards, M.H. Boredom in the Middle School Years: Blaming Schools versus Blaming Students. Am. J. Educ. 1991, 99, 418–443. [Google Scholar] [CrossRef]
  15. Shernoff, D.J.; Csikszentmihalyi, M.; Schneider, B.; Shernoff, E.S. Student Engagement in High School Classrooms from the Perspective of Flow Theory. Sch. Psychol. Q. 2003, 18, 158. [Google Scholar] [CrossRef]
  16. Whitehill, J.; Serpell, Z.; Lin, Y.-C.; Foster, A.; Movellan, J.R. The Faces of Engagement: Automatic Recognition of Student Engagementfrom Facial Expressions. IEEE Trans. Affect. Comput. 2014, 5, 86–98. [Google Scholar] [CrossRef]
  17. Grafsgaard, J.F.; Wiggins, J.B.; Boyer, K.E.; Wiebe, E.N.; Lester, J.C. Automatically Recognizing Facial Expression: Predicting Engagement and Frustration. In Proceedings of the 6th International Conference on Educational Data Mining (EDM 2013), Memphis, TN, USA, 6–9 July 2013. [Google Scholar]
  18. Monkaresi, H.; Bosch, N.; Calvo, R.A.; D’Mello, S.K. Automated Detection of Engagement Using Video-Based Estimation of Facial Expressions and Heart Rate. IEEE Trans. Affect. Comput. 2017, 8, 15–28. [Google Scholar] [CrossRef]
  19. Zhang, Z.; Li, Z.; Liu, H.; Cao, T.; Liu, S. Data-Driven Online Learning Engagement Detection via Facial Expression and Mouse Behavior Recognition Technology. J. Educ. Comput. Res. 2020, 58, 63–86. [Google Scholar] [CrossRef]
  20. Zhan, Z.H. An emotional and cognitive recognition model for distance learners based on intelligent agent-the coupling of eye tracking and expression recognition techniques. Mod. Dist. Educ. Res. 2013, 5, 100–105. [Google Scholar]
  21. Alkabbany, I.; Ali, A.M.; Foreman, C.; Tretter, T.; Hindy, N.; Farag, A. An Experimental Platform for Real-Time Students Engagement Measurements from Video in STEM Classrooms. Sensors 2023, 23, 1614. [Google Scholar] [CrossRef]
  22. Zhang, Y.H.; Pan, M.; Zhong, G.C.; Cao, X.M. Learning Engagement Detection Based on Face Dataset in the Mixed Scene. Mod. Educ. Technol. 2021, 31, 84–92. [Google Scholar]
  23. Liu, H.; Nie, H.; Zhang, Z.; Li, Y.-F. Anisotropic Angle Distribution Learning for Head Pose Estimation and Attention Understanding in Human-Computer Interaction. Neurocomputing 2021, 433, 310–322. [Google Scholar] [CrossRef]
  24. Chen, P.; Huangpu, D.P.; Luo, Z.Y.; Li, D.X. Visualization analysis of learning attention based on single-image PnP head posture estimation. J. Commun. 2018, 39, 141–150. [Google Scholar]
  25. Singh, T.; Mohadikar, M.; Gite, S.; Patil, S.; Pradhan, B.; Alamri, A. Attention Span Prediction Using Head-Pose Estimation with Deep Neural Networks. IEEE Access 2021, 9, 142632–142643. [Google Scholar] [CrossRef]
  26. Zhou, J.; Ye, J.M.; Li, C. Multimodal Learning Affective Computing: Motivations, Frameworks, and Recommendations. e-Educ. Res. 2021, 42, 26–32+46. [Google Scholar] [CrossRef]
  27. Xu, X.; Zhao, W.; Liu, H. Research on application and model of emotional analysis in blended learning environment: From perspective of meta-analysis. e-Educ. Res. 2018, 39, 70–77. [Google Scholar] [CrossRef]
  28. Loderer, K.; Pekrun, R.; Lester, J.C. Beyond Cold Technology: A Systematic Review and Meta-Analysis on Emotions in Technology-Based Learning Environments. Learn. Instr. 2020, 70, 101162. [Google Scholar] [CrossRef]
  29. Zhao, C.; Shu, H.; Gu, X. The Measurement and Analysis of Students’ Classroom Learning Behavior Engagement Based on Computer. Mod. Educ. Technol. 2021, 31, 96–103. [Google Scholar]
  30. Alkabbany, I.; Ali, A.; Farag, A.; Bennett, I.; Ghanoum, M.; Farag, A. Measuring student engagement level using facial information. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3337–3341. [Google Scholar] [CrossRef]
  31. Langton, S.R.H.; Watt, R.J.; Bruce, V. Cues to the Direction of Social Attention. Trends Cogn. Sci. 2000, 4, 50–59. [Google Scholar] [CrossRef]
  32. Abdelrahman, A.A.; Hempel, T.; Khalifa, A.; Al-Hamadi, A. L2CS-Net: Fine-grained gaze estimation in unconstrained environments. arXiv 2022, arXiv:2203.03339. [Google Scholar]
  33. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-Style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 13728–13737. [Google Scholar]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las Vegas, NV, USA, 2016; pp. 770–778. [Google Scholar]
  35. Gupta, A.; D’Cunha, A.; Awasthi, K.; Balasubramanian, V. DAiSEE: Towards User Engagement Recognition in the Wild. arXiv 2022, arXiv:1609.01885. [Google Scholar]
  36. D’Mello, S.K.; Craig, S.D.; Graesser, A.C. Multimethod Assessment of Affective Experience and Expression during Deep Learning. Int. J. Learn. Technol. 2009, 4, 165. [Google Scholar] [CrossRef]
  37. Chen, Z.; Shi, B.E. Appearance-based gaze estimation using dilated-convolutions. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 309–324. [Google Scholar]
  38. Cheng, Y.; Zhang, X.; Lu, F.; Sato, Y. Gaze estimation by exploring two-eye asymmetry. IEEE Trans. Image Process. 2020, 29, 5259–5272. [Google Scholar] [CrossRef] [PubMed]
  39. Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10623–10630. [Google Scholar]
  40. Wang, G.; Li, J.; Wu, Z.; Xu, J.; Shen, J.; Yang, W. EfficientFace: An Efficient Deep Network with Feature Enhancement for Accurate Face Detection. Multimed. Syst. 2023, 29, 2825–2839. [Google Scholar] [CrossRef]
  41. Zhang, Y.; Wang, C.; Ling, X.; Deng, W. Learn From All: Erasing Attention Consistency for Noisy Label Facial Expression Recognition Supplementary Material. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 418–434. [Google Scholar]
  42. Savchenko, A.V. HSEmotion: High-Speed Emotion Recognition Library. Softw. Impacts 2022, 14, 100433. [Google Scholar] [CrossRef]
  43. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
  44. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Santiago, Chile, 2015; pp. 4489–4497. [Google Scholar]
  45. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27. [Google Scholar]
Figure 1. Flow chart of the hierarchical framework for online classroom student engagement.
Figure 2. Model of the hierarchical framework for online classroom student engagement. (A) The gaze estimation model. (B) The proposed facial expression recognition model. (C) The proposed action recognition model.
Figure 3. Two detailed submodules of Inflated ResNet. (a) Detailed structure of the Res1. submodule. (b) Detailed structure of the Res2. submodule.
Figure 4. Partial online classroom student behavior dataset.
Figure 5. Distribution of the number of samples at four engagement levels in the dataset.
Figure 6. Online classroom student gaze situation.
Figure 7. Gaze estimation Euler angle prediction visualization. (a) Watching the screen. (b) Not looking at the screen.
Figure 8. Visualization of facial recognition classification results. (a) Surprised. (b) Confused. (c) Happy. (d) Neutral. (e) Tired. (f) Boredom.
Figure 9. Visualization of action recognition classification results. (a) Writing. (b) Reading. (c) Eating. (d) Looking around. (e) Sleeping. (f) Playing with mobile phone.
Figure 10. Visualization of action recognition classification results.
Figure 11. Confusion matrix for the classification of engagement levels at various grades.
Table 1. Mapping of online classroom student learning emotion and attitude towards learning.
Attitude towards Learning | Expression
Positive | Neutral, Surprised, Confused, Happy
Negative | Tired, Boredom
Table 2. Mapping of student behavior and attitude towards learning when not looking at the screen in online classes.
Attitude towards Learning | Behavior
Positive | Writing, Reading
Negative | Eating, Looking around, Sleeping, Playing with mobile phone
Table 3. Comparison of mean angular error between L2CS-Net and SOTA methods.
Methods | Mean Angular Error (MPIIGaze)
Dilated-Net [37] | 4.8°
FAR-Net [38] | 4.3°
CA-Net [39] | 4.1°
L2CS-Net (α = 1) [32] | 3.96°
L2CS-Net (α = 2) [32] | 3.92°
Table 4. Comparison of experimental results for different facial expression recognition methods.
Methods | Accuracy (%)
EfficientFace [40] | 82.2
EAC [41] | 80.8
HSEmotion [42] | 84.7
Ours | 88.4
Table 5. Comparison of experimental results for different action recognition methods.
Methods | Accuracy (%)
LRCN [43] | 82.7
C3D [44] | 85.2
Two Stream [45] | 88.0
Ours | 89.5
Table 6. Comparison of experimental results for different engagement grading methods.
Algorithm | Accuracy (%)
LRCN | 76.8
C3D | 74.9
Two Stream | 80.4
Ours | 83.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
