Article

Evaluation of Students’ Learning Engagement in Online Classes Based on Multimodal Vision Perspective

College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(1), 149; https://doi.org/10.3390/electronics13010149
Submission received: 31 October 2023 / Revised: 23 December 2023 / Accepted: 26 December 2023 / Published: 29 December 2023
(This article belongs to the Section Artificial Intelligence)

Abstract:
Evaluating student engagement in online classrooms makes it possible to alert distracted learners in a timely manner, effectively improving classroom learning efficiency. Based on data from online classroom scenarios, a cascaded analysis network model integrating gaze estimation, facial expression recognition, and action recognition is constructed to recognize student attention and grade engagement levels, thereby assessing the level of student engagement in online classrooms. Comparative experiments with the LRCN and C3D network models, among others, demonstrate the effectiveness of the cascaded analysis network model, whose engagement evaluations are more accurate than those of the other models. The proposed method compensates for the shortcomings of single-method evaluation models in detecting student engagement in the classroom.

1. Introduction

Online education provides ubiquitous learning opportunities for learners, making the learning process more learner-centered. However, it has been found in practice that most online learning platforms have issues with high dropout rates and low completion rates [1,2]. Jordan [3] studied the learning situation of courses on various MOOC (massive open online courses) platforms and found that the average completion rate of MOOC courses was only 15%, and could only reach 40% at most. By analyzing the learning behavior data of approximately 80,000 participants in six MOOCs offered by Peking University on Coursera, Jiang et al. [4] also found that learners had a rapid decrease in course engagement in the early stage and a gentler decrease in the later stage, resulting in a low completion rate. Enhancing the completion rate of online learning, reducing dropout rates, and improving the effectiveness of online learning have become concerns for many researchers. Several studies have found a positive relationship between learners’ learning outcomes and the degree of their participation in learning activities [5,6], and learners with lower engagement are likely to drop out [7]. Therefore, there is an urgent need to identify learners’ learning engagement and to provide timely interventions for those who are not highly engaged in order to ensure learning effectiveness. At the same time, a precise assessment of learners’ online learning conditions can help promote the iterative development of various online teaching platforms, providing technical support to create a personalized and intelligent teaching environment.
Learning engagement refers to the amount of physical and psychological energy that learners devote to learning [8] and reflects the psychological state of learners’ emotional and cognitive activities in course study, as well as the behavioral participation they invest in the course [9]. According to [10], learning engagement can be divided into three dimensions: emotional involvement, cognitive involvement, and behavioral involvement. Behavioral involvement is the basic dimension of learning engagement and is the carrier of emotional and cognitive involvement [11]. Traditional methods for evaluating learning engagement mainly include self-reporting, experience sampling, teacher grading, interviews, and observation. These methods are based on measuring and recording student performance, use frequency data analysis as their basic technique, and often rely on pen-and-paper observation records or small-scale learning log data, which greatly limits their efficiency and credibility. Moreover, because teachers and students are separated in time and space, class sizes are large, and teacher supervision is weakened, traditional evaluation methods are not well suited to the online learning environment. In recent years, artificial intelligence technology has made significant progress in recognizing human emotions and behaviors [12,13], and there is broad agreement in the industry on the importance of applying AI technology to smart education; the recognition of learning engagement should be one of its most important application scenarios. Based on this, a multimodal vision-based method for assessing student learning engagement in online classes is proposed, which compensates for the deficiencies of single evaluation models in detecting student learning participation in the classroom.

2. Related Work

Since the 1980s, student learning engagement has become an important topic in the field of educational research. Part of the reason for researchers’ interest in learning engagement in the early days was concern about high dropout rates and the fact that 25–60% of students reported feeling bored and disengaged in class [14,15]. Evaluating students’ learning engagement can assist teachers in understanding students’ participation in learning, so as to intervene in a timely manner, help students reflect on their own learning, and promote their active participation in learning.
Methods based on computer vision can measure learners’ engagement by studying cues such as gestures, postures, eye movements, and facial expressions. These cues can be perceived by external observers and are the basis on which teachers adjust their teaching behavior in traditional classroom settings. Whitehill et al. [16] manually labeled the expressions of 7574 frames from the HBCU dataset and 16,722 frames from the UC dataset as disengaged, pretending to be engaged, engaged, and highly engaged, and then used Boost (BF), SVM (Gabor), and MLR (CERT) methods for automatic recognition of learning engagement; their experimental results showed that machine learning methods recognize learning engagement with high accuracy. Grafsgaard et al. [17] also used the Computer Expression Recognition Toolbox (CERT) to analyze facial movements during computer-mediated tutoring and showed that upper facial movements can predict learning engagement, frustration, and learning ability. Monkaresi et al. [18] used a Kinect face tracker and heart rate to detect learners’ engagement in educational activities, applying updatable Naive Bayes, Bayesian network, K-means clustering, Rotation Forest, and Dagging classifiers for decision-level and feature-level fusion; they found that accuracy based on facial expressions is higher than that based on heart rate. Zhang et al. [19] used cameras to capture students’ facial images while simultaneously recording mouse movement data, and reported that the recognition accuracy of learning engagement using both expression and mouse data was 94.6%, compared with 91.5% using expression data alone. Zhan [20] combined expression recognition and eye-tracking technology to construct an intelligent-agent-based emotion and cognition recognition model for remote learners; by iteratively coupling eye tracking with expression monitoring, the model improves the recognition accuracy of remote learners’ learning status and strengthens the agent’s emotional and cognitive support for learners. With the successful application of deep learning in the field of computer vision, some researchers have also used it for student learning engagement detection. Alkabbany et al. [21] proposed a real-time automatic measurement method for student engagement by analyzing students’ behaviors and emotions. Zhang et al. [22] studied classroom learning engagement recognition by analyzing learners’ facial information in mixed scenes.
In real-world scenarios, it is difficult to have learners wear expensive measurement devices such as eye trackers during classroom learning. Cameras offer convenience, timeliness, and richness in data collection, and the widespread use of facial recognition attests to the reliability and accuracy of image data. Capturing data via cameras does not intrude on learners, much as teachers observe learners’ participation in instructional activities without interrupting them. Images extracted from cameras can be used to analyze learners’ gaze, facial expressions, and classroom behavior, and thus determine their participation in the class. A person’s gaze can provide crucial clues for analyzing attention, intent, and motivation. Liu et al. [23] explored the relationship between head posture and human attention. Chen et al. [24] used single-image head posture to project students’ gaze onto the video image of the teacher’s lecture, enabling a visual analysis of students’ learning attention. Singh et al. [25] also estimated people’s head posture through deep learning and used it to judge attention. These studies indicate that learners’ attention can be mapped through image-based analysis of head posture.
Zhou et al. [26] pointed out that “learning emotion is an important factor affecting students’ cognitive processing and learning outcomes”. Xu et al. [27] likewise found that recognizing learning emotions from facial expressions is highly usable. Based on control-value theory, Loderer et al. [28] conducted a detailed study of 186 articles on learning emotions published from 1965 to 2018, concluding that emotion is an important driving factor for learning and can reflect learners’ engagement. Zhao et al. [29] examined students’ learning engagement in the classroom from the perspective of classroom behavior. Alkabbany et al. [30] recognized classroom engagement through the analysis of students’ facial key points, head posture, eye gaze, and learning features.
Therefore, using a camera as the data collection device and applying deep learning methods to recognize and analyze learners’ various biological feature data (face, expression, behavior, and posture), we aim to explore the intrinsic link between learners’ physiological and behavioral characteristics and their learning engagement. This work seeks to establish a recognition method for student learning engagement based on multimodal visual information, promoting the automated, intelligent, and fine-grained evaluation of learning engagement.

3. Proposed Method

3.1. Hierarchical Analysis Framework for Online Classroom Student Engagement

We obtain the learner’s class video data through the built-in camera of the learner’s device or a peripheral camera (placed centrally above the computer screen). Our proposed method evaluates classroom learning engagement by analyzing multimodal visual information such as gaze, facial expressions, and classroom behavior during the learning process. We apply a hierarchical strategy in our evaluation method. First, we perform person detection and identity recognition on the input data; if no person is present or the detected face is not the learner themselves, the segment is judged as not engaged. Second, we estimate the learner’s gaze: the resulting yaw and pitch angles are passed through a function, and if the function value exceeds the set threshold, the learner is considered to be facing the screen; otherwise, they are not. Facing the screen is a prerequisite for paying attention, and facial expression recognition then determines whether the learner is engaged in learning. When the learner is not facing the screen, they are not necessarily disengaged; writing and reading, for example, are expressions of engagement in learning. Therefore, our method recognizes the learner’s actions when they are not facing the screen to judge whether they are engaged. The overall research plan is shown in Figure 1.
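As a minimal sketch of this cascade, the Python snippet below assumes the three models’ outputs (gaze angles, expression label, action label) are already available for a segment; the function names, label strings, and thresholds are illustrative rather than the authors’ implementation, and the screen-focus ranges are those derived later in Section 4.2.

```python
from typing import Optional

ENGAGED_EXPRESSIONS = {"happy", "neutral", "surprised", "confused"}  # positive attitude (Table 1)
ENGAGED_ACTIONS = {"writing", "reading"}                             # positive attitude (Table 2)

def facing_screen(yaw_deg: float, pitch_deg: float) -> bool:
    """Screen-focus test using the gaze ranges derived in Section 4.2."""
    return -22.5 <= yaw_deg <= 22.5 and -25.0 <= pitch_deg <= 0.0

def judge_engagement(identity_ok: bool,
                     gaze: Optional[tuple],      # (yaw, pitch) in degrees, None if no face found
                     expression: Optional[str],  # FER output when facing the screen
                     action: Optional[str]) -> str:
    """Cascade decision for one video segment, given the three models' outputs."""
    if gaze is None or not identity_ok:
        return "not engaged"                     # nobody present, or not the enrolled learner
    yaw, pitch = gaze
    if facing_screen(yaw, pitch):
        return "engaged" if expression in ENGAGED_EXPRESSIONS else "not engaged"
    return "engaged" if action in ENGAGED_ACTIONS else "not engaged"

# Example: learner looking at the screen with a neutral expression
print(judge_engagement(True, (5.0, -10.0), "neutral", None))   # -> "engaged"
```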

3.2. The Gaze Estimation Model

The direction of human visual attention is highly correlated with head posture and gaze direction. Langton et al. [31] found that, in most cases, a person’s direction of attention can be obtained by analyzing the angle of their head posture. In the short-distance setting of online classrooms, however, gaze estimation is a better way to represent the direction of attention. In the online learning environment, gaze direction is the key factor in judging whether learners are engaged. When learners are listening to lectures, the yaw and pitch angles of their gaze stay within a certain range; when learners are engaged, their line of sight should be on the screen. If the gaze deviates from the screen for a long time, it may be a sign of disengagement.
The L2CS-Net [32] model has made significant progress in the field of gaze estimation. It uses a convolutional neural network (CNN), with ResNet50 as the backbone, to effectively handle complex visual data, predict the yaw and pitch gaze angles separately, and improve prediction accuracy. The structure of this model is shown in Figure 2A.
Rather than directly regressing the two angles, L2CS applies two different loss functions to each angle, combining the regression and classification tasks to obtain a precise value. Specifically, for each of the yaw and pitch angles, the angle range is divided into 90 bins at 4-degree intervals, which yields a class label for each angle. First, the classification loss is computed using cross-entropy. The bin label obtained from the classification is then converted back to angle data: the Softmax output is multiplied by the corresponding angles and summed to produce the predicted angle, for which MSE (mean squared error) is used as the regression loss. The two losses are balanced by the parameter α. The loss function is defined as:
$$\mathrm{loss}_{Gaze} = -\sum_{i} y_i \log p_i + \alpha \cdot \frac{1}{N} \sum_{i=0}^{N} \left( y - p \right)^2 , \tag{1}$$
where p is the predicted value, y is the ground-truth value, and α is the regression coefficient.
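As a rough PyTorch illustration of this combined objective (not the authors’ code), the sketch below bins one angle into 90 classes of 4°, applies cross-entropy on the bin logits, recovers a continuous angle as the Softmax-weighted sum of bin centers, and adds an α-weighted MSE term; the bin range of [−180°, 180°) is an assumption and should be adapted to the dataset used.

```python
import torch
import torch.nn.functional as F

def l2cs_angle_loss(logits, angle_gt_deg, alpha=1.0):
    """Combined classification + regression loss for one gaze angle (yaw or pitch).

    logits:        (B, 90) raw scores over 4-degree bins
    angle_gt_deg:  (B,) ground-truth angle in degrees
    Assumes the bins cover [-180, 180) degrees.
    """
    bin_centers = torch.arange(90, device=logits.device) * 4.0 - 180.0 + 2.0  # (90,)

    # Classification target: index of the bin containing the ground-truth angle
    bin_target = ((angle_gt_deg + 180.0) // 4.0).clamp(0, 89).long()
    cls_loss = F.cross_entropy(logits, bin_target)

    # Continuous prediction: Softmax-weighted sum of bin centers, penalized with MSE
    angle_pred = (F.softmax(logits, dim=1) * bin_centers).sum(dim=1)
    reg_loss = F.mse_loss(angle_pred, angle_gt_deg)

    return cls_loss + alpha * reg_loss

# Example with random logits for a batch of 8 samples
loss = l2cs_angle_loss(torch.randn(8, 90), torch.rand(8) * 90 - 45, alpha=2.0)
```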

3.3. The Proposed Facial Expression Recognition Model

Deep learning has achieved significant results in computer vision tasks and has made great strides in facial expression recognition. However, in an online learning environment, the facial images of learners present issues such as interclass differences, image blur, pose changes, and occlusions. These challenges cause the recognition accuracy of existing algorithms to be generally low. Furthermore, due to privacy concerns, collecting facial expression data from learners in a laboratory-controlled environment is both difficult and time-consuming, resulting in smaller datasets (in the thousands or even hundreds). Therefore, the generalization and practical application capabilities of facial expression analysis in solving real-world problems, such as online learning, require further research and development.
RepVGG [33] is an improved VGG network model. Its core idea is to use structural reparameterization to transform the multipath structure of the training network into the single-path structure of the inference network, ultimately improving the network’s inference efficiency. During training of the FER (facial expression recognition) network, RepVGG introduces direct-connection residual branches and 1 × 1 convolution branches, enhancing the efficiency of training. The initial feature size input into the RepVGG network model is C × H × W (where C is the number of channels, and H and W are the height and width of the frame image, respectively). The model extracts features through several convolutional modules composed of multipath branches, including 3 × 3 convolution + BN, 1 × 1 convolution + BN, and residual connection + BN, each embedded with a max pooling layer and the ReLU activation function. The max pooling layer reduces redundant spatial information from the convolution operations, while integrating the BN normalization into the convolution layer reduces the number of model layers, decreases the model’s dependency on parameter initialization, and enhances the performance of the network model.
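To illustrate the multi-branch training-time block described above, the simplified PyTorch module below (a sketch, not the official RepVGG implementation) sums a 3 × 3 convolution + BN branch, a 1 × 1 convolution + BN branch, and an identity BN branch before the ReLU; at inference time these branches can be fused into a single 3 × 3 convolution by structural reparameterization (fusion code omitted here).

```python
import torch
import torch.nn as nn

class RepVGGStyleBlock(nn.Module):
    """Training-time multi-branch block: 3x3 conv+BN, 1x1 conv+BN, identity BN."""
    def __init__(self, channels):
        super().__init__()
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch1x1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.branch_id = nn.BatchNorm2d(channels)   # residual connection + BN
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch3x3(x) + self.branch1x1(x) + self.branch_id(x))

# Example: a 64-channel feature map of size 56x56
out = RepVGGStyleBlock(64)(torch.randn(1, 64, 56, 56))
```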
Our FER method uses RepVGG as the backbone network and pretrains it on the FER2013 facial expression recognition dataset. When applied to the student engagement evaluation model, the Multi-Task Cascaded Convolutional Network (MTCNN) is first used to obtain the face cropping box. The backbone network outputs data features and pre-classification results, which are fed into an MLP for further training. Finally, the network is optimized using a weighted classification cross-entropy (Softmax) loss function. The structure of this model is shown in Figure 2B.

3.4. The Proposed Action Recognition Model

Action recognition research focuses on the actions of individuals in video clips and must extract not only the spatial features of human actions but also features in the time dimension. When video data are used for action recognition, 3D convolutional neural networks can extract action information from multiple consecutive video frames. The 3D CNN is an extension of the 2D convolutional neural network that computes features over both the spatial and temporal dimensions and can make good use of the sequential information in the video. The 3D convolution is achieved by stacking multiple consecutive frames and applying a cube-like convolution kernel to capture the action features in the temporal information.
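As a brief, generic PyTorch illustration (not the proposed network) of how a cube-like kernel operates on stacked frames, a 3D convolution takes a clip tensor of shape (batch, channels, time, height, width) and convolves jointly over the temporal and spatial axes:

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)        # 16 stacked RGB frames of size 112x112
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)  # cube-like kernel over T, H, W
print(conv3d(clip).shape)                      # torch.Size([1, 64, 16, 112, 112])
```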
ResNet [34] is a classic network widely used in image classification. Our method uses the Inflated 3D ConvNet idea to convert ResNet into a 3D convolutional neural network, enabling ResNet to be applied to video data. Compared with image-based convolutional neural networks, it takes continuous video data as input and incorporates spatial information at different time steps, allowing video features to be extracted more completely.
For video data, temporal features should change neither too quickly nor too slowly along the time dimension, so the spatio-temporal receptive field is very important for model construction. If the receptive field in the time dimension exceeds that in the spatial dimension, the edge features of different objects may merge, damaging early feature extraction. Conversely, if the spatial receptive field exceeds the temporal one, the dynamic features of the scene may not be captured well. Therefore, in our method, the max pooling in the first stage does not pool over time and keeps the spatial scale unchanged, so as to better extract spatio-temporal features. The structure of this model is shown in Figure 2C.
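A minimal sketch of such a first stage is shown below, assuming an I3D-style inflated ResNet stem (the 3 × 7 × 7 stem kernel and its strides are assumptions, not the authors’ exact configuration); the first max pooling uses a temporal kernel and stride of 1 so no pooling is performed along time, and a stride of 1 spatially so the spatial scale is also preserved.

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    # Inflated 7x7 stem convolution: the kernel also spans the time dimension
    nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3), bias=False),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    # First-stage max pooling: no pooling along time, spatial scale unchanged (stride 1)
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=1, padding=(0, 1, 1)),
)

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)
print(stem(clip).shape)                   # torch.Size([1, 64, 16, 56, 56])
```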
The second to fifth stages are composed of Res1. and Res2. submodules, with the structural details of the two submodules shown in Figure 3.

4. Design of Experiments

4.1. Datasets

Since there is currently no publicly available online classroom student behavior dataset that includes action information, we collected our own video dataset with 50 undergraduate students as research subjects: 28 males and 22 females, all aged between 18 and 22. These 50 learners were in a normal online classroom learning state and turned on video recording to capture their online classroom behavior, which was used to construct the online classroom student behavior dataset. Figure 4 shows part of the dataset.
Each class session lasts 100 min. Each student’s online learning video was preprocessed by removing footage outside of class time and trimming the original videos. Every 10 s was treated as one segment for engagement analysis, yielding 30,000 video segments in total. We invited four experts to annotate the data: the teacher who taught the class when the data were collected and three teachers who taught the same course. Each video segment was annotated by two teachers; if the two teachers disagreed about a segment, a third teacher provided an additional annotation, and if all four teachers’ annotations differed, they annotated the segment collectively. Each sample’s label was recorded in the format “expression–action–engagement level”, with the engagement level following the classification scheme of Gupta et al. [35]. After the data were labeled, seven common classroom actions were identified (listening to lectures, writing, reading, eating, looking around, sleeping, and playing with mobile phones), along with six expressions (surprise, confusion, calm, happiness, tiredness, and boredom), completing the construction of the dataset. Figure 5 shows the distribution of the number of samples across the four engagement levels in the dataset, where level 0 indicates very low learner engagement and level 3 indicates very high engagement.
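The segmentation and labeling convention can be sketched as follows; this is illustrative code only, and the frame rate, example label string, and helper name are assumptions rather than details reported in the paper.

```python
# Illustrative sketch of the 10 s segment and label convention (fps is an assumption).
FPS = 25
SEGMENT_SECONDS = 10
FRAMES_PER_SEGMENT = FPS * SEGMENT_SECONDS   # 250 frames per 10 s segment

def parse_label(label: str):
    """Split an annotation of the form 'expression–action–engagement level'."""
    expression, action, level = label.split("–")
    return expression, action, int(level)

print(parse_label("calm–listening to lectures–3"))   # ('calm', 'listening to lectures', 3)
```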

4.2. Analysis of Student Engagement Based on Gaze Estimation

The degree of students’ engagement in online classrooms can be determined by analyzing whether their gaze is focused on the display area of the learning device, which indicates the direction of their attention. When students listen to lectures through online learning devices, the center of the screen is considered the focal point of their vision. The angle swept by the gaze from this central point to the left and right edges of the screen is termed the yaw angle, and the angle swept up and down towards the screen’s edges is the pitch angle.
As shown in Figure 6, when a student in an online classroom focuses on a point within the screen, such as $P_1$ or $P_2$, their level of learning engagement can be judged by their facial expression. Conversely, when a student focuses outside the screen for a long time, such as at $P_3$, their level of learning engagement needs to be determined by recognizing their behavior.
Students at university mainly use laptop computers for online learning. Based on this, with the center point of the upper boundary of the display area (where the camera sits) as the origin, the width of the display area as $W$, the height of the display area as $H$, and the distance between the student’s head and the learning device as $L$, the gaze boundary angles can be determined as shown in Formulas (2)–(5):
$$\theta_{PitchUp} = 0, \tag{2}$$
$$\theta_{PitchDown} = -\arctan\frac{H}{L}, \tag{3}$$
$$\theta_{YawLeft} = -\arctan\frac{W}{2L}, \tag{4}$$
$$\theta_{YawRight} = \arctan\frac{W}{2L}. \tag{5}$$
After collecting statistics on the learning devices and learning environments of current online students, and standardizing the display area of the learning device and the distance between the device and the person, the distance $L$ is 72 cm, the height of the display area $H$ is 33.5 cm, and the width of the display area $W$ is 59.6 cm. The range of gaze angles can then be determined: [0°, −25°] in the pitch direction and [−22.5°, 22.5°] in the yaw direction. That is, a student whose gaze angles fall within this range is focused on the display area; otherwise, their gaze has left the display area. In this way, the attention in each video segment can be calibrated based on the proportion of time the gaze stays on the display area of the learning device. The state detection results of the gaze estimation method are shown in Figure 7.
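Using the geometry above, a small Python sketch with the standardized measurements from this section computes the boundary angles and checks whether a predicted gaze falls on the display area; the sign conventions follow the ranges stated in the text, and the helper name is illustrative.

```python
import math

# Standardized setup from Section 4.2 (centimetres)
L = 72.0     # distance between the student's head and the device
H = 33.5     # height of the display area
W = 59.6     # width of the display area

pitch_down = -math.degrees(math.atan(H / L))        # about -25 degrees
yaw_half   =  math.degrees(math.atan(W / (2 * L)))  # about 22.5 degrees

def on_screen(yaw_deg: float, pitch_deg: float) -> bool:
    """True if the estimated gaze angles fall inside the display area."""
    return -yaw_half <= yaw_deg <= yaw_half and pitch_down <= pitch_deg <= 0.0

print(f"pitch: [{pitch_down:.1f}, 0], yaw: [{-yaw_half:.1f}, {yaw_half:.1f}]")
print(on_screen(10.0, -12.0), on_screen(35.0, -5.0))  # True False
```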

4.3. Analysis of Student Engagement Based on Facial Expression Recognition

The renowned psychologist Ekman proposed that joy, sadness, fear, anger, surprise, and disgust constitute the basic human emotions, and that many other emotions can be expressed through different combinations of them. Learning emotions refer to the positive or negative emotions students express during the learning process, which reflect the student’s interest in and experience of learning.
Emotional investment is a crucial component of the learning process. Positive emotions such as happiness, concentration, and curiosity not only indicate a high degree of learner engagement but also aid in improving the learner’s performance; negative emotions like boredom and distraction not only suggest less learner engagement but also predict a drop in the learner’s performance. Although a confused expression may show that the teaching content is inconsistent with the learner’s prior cognition, it precisely demonstrates that the learner is engaged in learning. Referencing the research findings of D’Mello et al. [36] and Loderer et al. [28], we provide a mapping relationship between learning emotions and learning engagement, as shown in Table 1.
When students appear happy or calm during the learning process, it can be assumed that they are attentively listening to the lecture; when students display expressions of surprise or confusion, it is believed that they are more interested in the learning content, that is, they are more engaged in learning; when the expression is tired or bored, the student may be in a state of disengagement from learning. The effect diagram of student facial expression analysis is shown in Figure 8.
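A compact way to encode the Table 1 mapping in code is a simple lookup; this is only a sketch, and the label strings mirror the expression classes used in this paper.

```python
# Sketch of the Table 1 mapping from recognized expression to learning attitude.
EXPRESSION_TO_ATTITUDE = {
    "neutral":   "positive",
    "surprised": "positive",
    "confused":  "positive",
    "happy":     "positive",
    "tired":     "negative",
    "bored":     "negative",
}

print(EXPRESSION_TO_ATTITUDE["confused"])  # 'positive': confusion still indicates engagement
```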

4.4. Analysis of Student Engagement Based on Action Recognition

At present, many researchers distinguish the learning engagement reflected in classroom behavior along two dimensions: engagement and disengagement. Engagement is a positive learning behavior characterized by effort, attention, and persistence, whereas disengagement is a negative state often manifested as passivity, lack of initiative, and giving up. Drawing on the S-T analysis method and research results such as [29], this study statistically selects the main behaviors of online classroom students when they are not looking at the screen and provides a preliminary mapping between online classroom learning behavior and learning engagement, as shown in Table 2.
The engagement of a learner in the natural listening state can already be judged through gaze estimation and facial expression recognition, so this state does not need to be considered in the action-based engagement judgment. The visualization results for the other behavior classes are shown in Figure 9.
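Analogously to the expression mapping, the Table 2 behavior mapping can be written as a small lookup (a sketch using the action labels from the dataset):

```python
# Sketch of the Table 2 mapping from recognized off-screen behavior to learning attitude.
ACTION_TO_ATTITUDE = {
    "writing":                   "positive",
    "reading":                   "positive",
    "eating":                    "negative",
    "looking around":            "negative",
    "sleeping":                  "negative",
    "playing with mobile phone": "negative",
}

print(ACTION_TO_ATTITUDE["reading"])  # 'positive'
```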

4.5. Online Classroom Student Engagement Rating

Considering the strong subjectivity in rating students’ engagement, different experts may give significantly different engagement scores for the same student behavior. Therefore, in light of the content above, we summarize the approach of evaluating students’ learning participation by rating the engagement level of students in online classrooms. Following [35], we define four engagement levels, from level 0 to level 3 in increasing order. When no student appears in front of the camera, engagement is assumed to be absent. The rating criteria for the other engagement detection indicators are shown in Figure 10.

5. Experiments and Results Analysis

5.1. Additional Datasets

Due to the continuity of the gaze point of the human eye, it is difficult to mark the gaze angles of each student through expert judgment. Therefore, we use the MPIIGaze public dataset to train the gaze estimation model.

5.2. Experiment Details

The experiments use the PyTorch 1.12 deep learning framework and are conducted on 4 HYGON Z100SM DCUs. All three models are trained with stochastic gradient descent (SGD), with a momentum of 0.9 and a weight decay of 1 × 10⁻⁵. The gaze estimation model is trained on the public dataset for 60 epochs with a batch size of 32. Both the facial expression recognition model and the action recognition model are trained on our self-constructed online classroom engagement dataset for 80 epochs with a batch size of 32.
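The optimizer configuration described above corresponds to a PyTorch setup like the sketch below; the dummy model, random data, and learning rate are assumptions for illustration only (the paper does not report the learning rate), while the momentum, weight decay, epoch count, and batch size follow the text.

```python
import torch
import torch.nn as nn

# Dummy stand-in for one of the three networks; the real models are described in Section 3.
model = nn.Linear(10, 4)
criterion = nn.CrossEntropyLoss()

# SGD with momentum 0.9 and weight decay 1e-5, as used for all three models.
# The learning rate of 0.01 is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)

for epoch in range(80):                        # 80 epochs (60 for the gaze model)
    inputs = torch.randn(32, 10)               # batch size 32 (random data for illustration)
    targets = torch.randint(0, 4, (32,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```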
To verify the effectiveness of the proposed methods, we conduct comparative experiments with current advanced methods. As can be seen in Table 3, L2CS-Net outperforms the other gaze estimation methods, achieving state-of-the-art performance with mean angular errors of only 3.96° and 3.92° when α is set to 1 and 2, respectively.
Table 4 presents the accuracy of the facial expression recognition model proposed in this article, along with EfficientFace, EAC, and HSEmotion, on our dataset. It can be seen that our method achieves the highest accuracy.
Table 5 presents the comparative experimental results of different action recognition methods on our dataset, showing that our proposed action recognition method likewise achieves the highest accuracy.
From the three tables above, it can be concluded that the gaze estimation, facial expression recognition, and action recognition models we used all show significant improvements compared to the classical state-of-the-art (SOTA) models. Moreover, the three methods achieve high precision in their respective tasks, which is sufficient to support the process of assessing student engagement levels in online classrooms.
Deep learning is widely used in video classification, and our constructed online classroom student behavior dataset includes learning engagement level labels. Since action recognition based on video data is itself a type of video classification, while verifying the feasibility of our process for assessing student engagement levels in online classrooms, we also evaluated how accurately video classification models can directly grade the student engagement level. The experimental results are shown in Table 6.
As can be seen, our proposed model shows a significant improvement over the other models, achieving an accuracy of 83.7%. Moreover, when directly grading the engagement level of students in online classrooms, the accuracies of the LRCN, C3D, and Two Stream networks decrease significantly compared with their action recognition results. This is because the features involved in learning engagement are more extensive than those in action recognition: different videos at the same engagement level can present different features, which inevitably leads to a decrease in accuracy.
To observe the performance of our framework at various engagement levels in an intuitive way, Figure 11 shows the confusion matrix for the classification of engagement levels at various grades.
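Such a confusion matrix can be produced from the predicted and ground-truth engagement levels with a few lines; this is a generic scikit-learn sketch, and the short label arrays here are purely illustrative.

```python
from sklearn.metrics import confusion_matrix

# Illustrative ground-truth and predicted engagement levels (0-3) for a handful of segments.
y_true = [3, 2, 3, 1, 0, 2, 3, 1]
y_pred = [3, 2, 2, 1, 0, 2, 3, 0]

print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3]))
```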

5.3. Discussion of the Experimental Results

For educators, the assessment of students’ engagement in online classes is based on multiple aspects of behavior. Therefore, evaluation using computer vision methods also requires combining multiple features of online student behavior, which is why our proposed method is more effective than traditional end-to-end deep learning methods.
Learning behavior is a continuous and multifaceted process. Standalone gaze estimation, facial expression recognition, and action recognition methods all ignore this multifaceted nature. Although they can roughly predict the state of students in online classrooms, they cannot accurately detect whether students are focused on learning, so they perform poorly in predicting student engagement. Likewise, because learning behavior integrates multiple features such as gaze, facial expression, and action, traditional end-to-end deep learning methods are not very effective and are unable to accurately judge the engagement level of students.
The method we proposed fully utilizes various feature information of online classroom student behavior data, and the accuracy of grading student learning engagement reaches 83.7%. It is suitable for analyzing online classroom scenarios and can be used to identify fatigued and hyperactive learning behaviors. Our model can be used to provide feedback on students’ learning conditions, thereby improving the quality of teaching.

6. Conclusions

Based on gaze estimation, facial expression recognition, and action recognition, this study built a cascade analysis network model to analyze the engagement of students in online classrooms. We not only considered the continuity and complexity of learning behaviors but also achieved good results on the constructed dataset of online classroom student learning behaviors. Furthermore, by comparison with other relevant algorithms, we verified the effectiveness of the proposed method in grading student engagement in online classroom scenarios, with the accuracy of engagement level recognition reaching 83.7%. In online teaching practice, by analyzing students’ engagement in the online classroom through the cascade analysis network model constructed in this study, teachers can understand students’ engagement throughout the teaching period in a timely manner. This helps teachers adjust their teaching plans and improve their teaching schemes, providing a basis for reference and helping to enhance the efficiency of online classroom teaching.

Author Contributions

Conceptualization, Y.Q.; experiments, L.Z.; data curation, X.H. and A.L.; writing—original draft preparation, L.Z.; writing—review and editing, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62267007 and Gansu Provincial Department of Education Industrial Support Plan Project under Grant 2022CYZC-16.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, X.; Wang, D. On the development process and main characteristic of MOOC. Mod. Educ. Technol. 2013, 23, 5–10. [Google Scholar]
  2. Reich, J.; Ruipérez-Valiente, J.A. The MOOC Pivot. Science 2019, 363, 130–131. [Google Scholar] [CrossRef] [PubMed]
  3. Jordan, K. MOOC Completion Rates: The Data. Available online: http://www.katyjordan.com/MOOCproject.html (accessed on 5 August 2023).
  4. Jiang, Z.; Zhang, Y.; Li, X. Learning behavior analysis and prediction based on MOOC data. J. Comput. Res. Dev. 2015, 52, 614–628. [Google Scholar]
  5. Hughes, M.; Salamonson, Y.; Metcalfe, L. Student Engagement Using Multiple-Attempt ‘Weekly Participation Task’ Quizzes with Undergraduate Nursing Students. Nurse Educ. Pract. 2020, 46, 102803. [Google Scholar] [CrossRef] [PubMed]
  6. Brozina, C.; Knight, D.B.; Kinoshita, T.; Johri, A. Engaged to Succeed: Understanding First-Year Engineering Students’ Course Engagement and Performance Through Analytics. IEEE Access 2019, 7, 163686–163699. [Google Scholar] [CrossRef]
  7. Wang, W.; Guo, L.; He, L.; Wu, Y.J. Effects of Social-Interactive Engagement on the Dropout Ratio in Online Learning: Insights from MOOC. Behav. Inf. Technol. 2019, 38, 621–636. [Google Scholar] [CrossRef]
  8. Oh, C.; Roumani, Y.; Nwankpa, J.K.; Hu, H.-F. Beyond Likes and Tweets: Consumer Engagement Behavior and Movie Box Office in Social Media. Inf. Manag. 2017, 54, 25–37. [Google Scholar] [CrossRef]
  9. Sun, Y.; Ni, L.; Zhao, Y.; Shen, X.; Wang, N. Understanding Students’ Engagement in MOOCs: An Integration of Self-determination Theory and Theory of Relationship Quality. Br. J. Educ. Technol. 2019, 50, 3156–3174. [Google Scholar] [CrossRef]
  10. Fredricks, J.A.; Blumenfeld, P.C.; Paris, A.H. School Engagement: Potential of the Concept, State of the Evidence. Rev. Educ. Res. 2004, 74, 59–109. [Google Scholar] [CrossRef]
  11. Wu, F.; Zhang, Q. Learning behavioral engagement: Definition, analysis framework and theoretical model. China Educ. Technol. 2018, 372, 35–41. [Google Scholar]
  12. Zhang, F.; Zhang, T.; Mao, Q.; Xu, C. Geometry Guided Pose-Invariant Facial Expression Recognition. IEEE Trans. Image Process. 2020, 29, 4445–4460. [Google Scholar] [CrossRef]
  13. Zhang, H. The Literature Review of Action Recognition in Traffic Context. J. Vis. Commun. Image Represent. 2019, 58, 63–66. [Google Scholar] [CrossRef]
  14. Larson, R.W.; Richards, M.H. Boredom in the Middle School Years: Blaming Schools versus Blaming Students. Am. J. Educ. 1991, 99, 418–443. [Google Scholar] [CrossRef]
  15. Shernoff, D.J.; Csikszentmihalyi, M.; Schneider, B.; Shernoff, E.S. Student Engagement in High School Classrooms from the Perspective of Flow Theory. Sch. Psychol. Q. 2003, 18, 158. [Google Scholar] [CrossRef]
  16. Whitehill, J.; Serpell, Z.; Lin, Y.-C.; Foster, A.; Movellan, J.R. The Faces of Engagement: Automatic Recognition of Student Engagementfrom Facial Expressions. IEEE Trans. Affect. Comput. 2014, 5, 86–98. [Google Scholar] [CrossRef]
  17. Grafsgaard, J.F.; Wiggins, J.B.; Boyer, K.E.; Wiebe, E.N.; Lester, J.C. Automatically Recognizing Facial Expression: Predicting Engagement and Frustration. In Proceedings of the 6th International Conference on Educational Data Mining (EDM 2013), Memphis, TN, USA, 6–9 July 2013. [Google Scholar]
  18. Monkaresi, H.; Bosch, N.; Calvo, R.A.; D’Mello, S.K. Automated Detection of Engagement Using Video-Based Estimation of Facial Expressions and Heart Rate. IEEE Trans. Affect. Comput. 2017, 8, 15–28. [Google Scholar] [CrossRef]
  19. Zhang, Z.; Li, Z.; Liu, H.; Cao, T.; Liu, S. Data-Driven Online Learning Engagement Detection via Facial Expression and Mouse Behavior Recognition Technology. J. Educ. Comput. Res. 2020, 58, 63–86. [Google Scholar] [CrossRef]
  20. Zhan, Z.H. An emotional and cognitive recognition model for distance learners based on intelligent agent-the coupling of eye tracking and expression recognition techniques. Mod. Dist. Educ. Res. 2013, 5, 100–105. [Google Scholar]
  21. Alkabbany, I.; Ali, A.M.; Foreman, C.; Tretter, T.; Hindy, N.; Farag, A. An Experimental Platform for Real-Time Students Engagement Measurements from Video in STEM Classrooms. Sensors 2023, 23, 1614. [Google Scholar] [CrossRef]
  22. Zhang, Y.H.; Pan, M.; Zhong, G.C.; Cao, X.M. Learning Engagement Detection Based on Face Dataset in the Mixed Scene. Mod. Educ. Technol. 2021, 31, 84–92. [Google Scholar]
  23. Liu, H.; Nie, H.; Zhang, Z.; Li, Y.-F. Anisotropic Angle Distribution Learning for Head Pose Estimation and Attention Understanding in Human-Computer Interaction. Neurocomputing 2021, 433, 310–322. [Google Scholar] [CrossRef]
  24. Chen, P.; Huangpu, D.P.; Luo, Z.Y.; Li, D.X. Visualization analysis of learning attention based on single-image PnP head posture estimation. J. Commun. 2018, 39, 141–150. [Google Scholar]
  25. Singh, T.; Mohadikar, M.; Gite, S.; Patil, S.; Pradhan, B.; Alamri, A. Attention Span Prediction Using Head-Pose Estimation with Deep Neural Networks. IEEE Access 2021, 9, 142632–142643. [Google Scholar] [CrossRef]
  26. Zhou, J.; Ye, J.M.; Li, C. Multimodal Learning Affective Computing: Motivations, Frameworks, and Recommendations. e-Educ. Res. 2021, 42, 26–32+46. [Google Scholar] [CrossRef]
  27. Xu, X.; Zhao, W.; Liu, H. Research on application and model of emotional analysis in blended learning environment: From perspective of meta-analysis. e-Educ. Res. 2018, 39, 70–77. [Google Scholar] [CrossRef]
  28. Loderer, K.; Pekrun, R.; Lester, J.C. Beyond Cold Technology: A Systematic Review and Meta-Analysis on Emotions in Technology-Based Learning Environments. Learn. Instr. 2020, 70, 101162. [Google Scholar] [CrossRef]
  29. Zhao, C.; Shu, H.; Gu, X. The Measurement and Analysis of Students’ Classroom Learning Behavior Engagement Based on Computer. Mod. Educ. Technol. 2021, 31, 96–103. [Google Scholar]
  30. Alkabbany, I.; Ali, A.; Farag, A.; Bennett, I.; Ghanoum, M.; Farag, A. Measuring student engagement level using facial information. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3337–3341. [Google Scholar] [CrossRef]
  31. Langton, S.R.H.; Watt, R.J.; Bruce, V. Cues to the Direction of Social Attention. Trends Cogn. Sci. 2000, 4, 50–59. [Google Scholar] [CrossRef]
  32. Abdelrahman, A.A.; Hempel, T.; Khalifa, A.; Al-Hamadi, A. L2CS-Net: Fine-grained gaze estimation in unconstrained environments. arXiv 2022, arXiv:2203.03339. [Google Scholar]
  33. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-Style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 13728–13737. [Google Scholar]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las Vegas, NV, USA, 2016; pp. 770–778. [Google Scholar]
  35. Gupta, A.; D’Cunha, A.; Awasthi, K.; Balasubramanian, V. DAiSEE: Towards User Engagement Recognition in the Wild. arXiv 2022, arXiv:1609.01885. [Google Scholar]
  36. D’Mello, S.K.; Craig, S.D.; Graesser, A.C. Multimethod Assessment of Affective Experience and Expression during Deep Learning. Int. J. Learn. Technol. 2009, 4, 165. [Google Scholar] [CrossRef]
  37. Chen, Z.; Shi, B.E. Appearance-based gaze estimation using dilated-convolutions. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 309–324. [Google Scholar]
  38. Cheng, Y.; Zhang, X.; Lu, F.; Sato, Y. Gaze estimation by exploring two-eye asymmetry. IEEE Trans. Image Process. 2020, 29, 5259–5272. [Google Scholar] [CrossRef] [PubMed]
  39. Cheng, Y.; Huang, S.; Wang, F.; Qian, C.; Lu, F. A coarse-to-fine adaptive network for appearance-based gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10623–10630. [Google Scholar]
  40. Wang, G.; Li, J.; Wu, Z.; Xu, J.; Shen, J.; Yang, W. EfficientFace: An Efficient Deep Network with Feature Enhancement for Accurate Face Detection. Multimed. Syst. 2023, 29, 2825–2839. [Google Scholar] [CrossRef]
  41. Zhang, Y.; Wang, C.; Ling, X.; Deng, W. Learn From All: Erasing Attention Consistency for Noisy Label Facial Expression Recognition Supplementary Material. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 418–434. [Google Scholar]
  42. Savchenko, A.V. HSEmotion: High-Speed Emotion Recognition Library. Softw. Impacts 2022, 14, 100433. [Google Scholar] [CrossRef]
  43. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
  44. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Santiago, Chile, 2015; pp. 4489–4497. [Google Scholar]
  45. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27. [Google Scholar]
Figure 1. Flow chart of the hierarchical framework for online classroom student engagement.
Figure 2. Model of the hierarchical framework for online classroom student engagement. (A) The gaze estimation model. (B) The proposed facial expression recognition model. (C) The proposed action recognition model.
Figure 3. Two detailed submodules of Inflated ResNet. (a) Detailed structure of the Res1. submodule. (b) Detailed structure of the Res2. submodule.
Figure 4. Partial online classroom student behavior dataset.
Figure 5. Distribution of the number of samples at four engagement levels in the dataset.
Figure 6. Online classroom student gaze situation.
Figure 7. Gaze estimation Euler angle prediction visualization. (a) Watching the screen. (b) Not looking at the screen.
Figure 8. Visualization of facial recognition classification results. (a) Surprised. (b) Confused. (c) Happy. (d) Neutral. (e) Tired. (f) Boredom.
Figure 9. Visualization of action recognition classification results. (a) Writing. (b) Reading. (c) Eating. (d) Looking around. (e) Sleeping. (f) Playing with mobile phone.
Figure 10. Visualization of action recognition classification results.
Figure 11. Confusion matrix for the classification of engagement levels at various grades.
Table 1. Mapping of online classroom student learning emotion and attitude towards learning.
Attitude towards Learning | Expression
Positive | Neutral, Surprised, Confused, Happy
Negative | Tired, Boredom
Table 2. Mapping of student behavior and attitude towards learning when not looking at the screen in online classes.
Attitude towards Learning | Behavior
Positive | Writing, Reading
Negative | Eating, Looking around, Sleeping, Playing with mobile phone
Table 3. Comparison of mean angular error between L2CS-Net and SOTA methods.
Methods | Mean Angular Error (MPIIGaze)
Dilated-Net [37] | 4.8°
FAR-Net [38] | 4.3°
CA-Net [39] | 4.1°
L2CS-Net (α = 1) [32] | 3.96°
L2CS-Net (α = 2) [32] | 3.92°
Table 4. Comparison of experimental results for different facial expression recognition methods.
Methods | Accuracy (%)
EfficientFace [40] | 82.2
EAC [41] | 80.8
HSEmotion [42] | 84.7
Ours | 88.4
Table 5. Comparison of experimental results for different action recognition methods.
Methods | Accuracy (%)
LRCN [43] | 82.7
C3D [44] | 85.2
Two Stream [45] | 88.0
Ours | 89.5
Table 6. Comparison of experimental results for different engagement grading methods.
Algorithm | Accuracy (%)
LRCN | 76.8
C3D | 74.9
Two Stream | 80.4
Ours | 83.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
