### *3.1. Feature Extraction*

In a 2D CNN, the convolution kernel performs a sliding-window operation over the two-dimensional space of the input image, which preserves the spatial features of individual video frames. However, a 2D CNN can only receive one frame per convolution. While this helps identify students in an image, recognizing students at varying degrees of engagement requires a sequence of frames to decide. Training a separate convolutional stream for each image would demand considerable computation time, so for sequential video frames we used the TimeDistributed wrapper to capture temporal features. The TimeDistributed wrapper applies the wrapped CNN layers to each time slice of the input, so the spatial features extracted by the convolutional network preserve the temporal structure well. Inspired by VGG16, we extracted features with alternating convolution and pooling layers, which allows for more nonlinear transformations of the data. In the CNN part of the model, we employed small (3 × 3) convolutional kernels for feature extraction, which deepens the network but also reduces the number of parameters and improves the model's generalization ability.
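
As a rough sketch of the per-frame feature extraction described above (the layer widths and depth here are illustrative assumptions, not the paper's exact configuration), the TimeDistributed wrapping of a small VGG-style convolution–pooling stack might look like:

```python
# Sketch of a TimeDistributed CNN feature extractor; layer sizes are
# illustrative assumptions, not the authors' exact architecture.
from tensorflow.keras import layers, models

def build_frame_feature_extractor(seq_len=40, h=80, w=80, c=3):
    inputs = layers.Input(shape=(seq_len, h, w, c))
    # TimeDistributed applies the same 2D conv/pool stack to every frame,
    # so spatial features are extracted per time step while the sequence
    # dimension is preserved for the downstream recurrent layers.
    x = layers.TimeDistributed(
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"))(inputs)
    x = layers.TimeDistributed(layers.MaxPooling2D((2, 2)))(x)
    x = layers.TimeDistributed(
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D((2, 2)))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)  # -> (seq_len, feat_dim)
    return models.Model(inputs, x)
```

The output keeps one feature vector per frame, which is what allows a recurrent layer to model the sequence afterwards.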

### *3.2. Sequence Learning*

Although BiLSTM is mostly used in NLP tasks such as sentence classification, learning engagement is considered a continuous process, so we also used BiLSTM for engagement classification. The BiLSTM network can consider the contextual information of learning engagement because it adds a reverse pass to the traditional LSTM, enabling the network to make assessments based on students' preceding and subsequent learning states. BiLSTM prevents us from classifying the video in real time, but it matches our data annotation scheme better than LSTM does, i.e., using one label to represent a whole 10 s video.
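
A minimal sketch of the BiLSTM classification head, assuming per-frame feature vectors arrive from the CNN stage (the unit counts and `feat_dim` here are illustrative assumptions):

```python
# Sketch of a BiLSTM classification head over per-frame features;
# unit counts are illustrative, not the paper's exact configuration.
from tensorflow.keras import layers, models

def build_bilstm_classifier(seq_len=40, feat_dim=256, n_classes=3):
    inputs = layers.Input(shape=(seq_len, feat_dim))
    # The Bidirectional wrapper runs a forward and a backward LSTM, so
    # the representation reflects both the preceding and the following
    # learning states before the whole 10 s clip is classified.
    x = layers.Bidirectional(layers.LSTM(64))(inputs)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```

Because the backward pass needs the full sequence, the clip must be complete before classification, which is why this design trades real-time operation for context.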

### **4. Experiment**

### *4.1. Experimental Setting*

The experimental environment for this study was configured with an NVIDIA GeForce RTX 3070 8 GB GPU, an Intel i7-11700 CPU, and Windows 10, with Keras as the deep learning framework. The input dimension was 40 × 80 × 80 × 3, where 40 is the length of the input video frame sequence, 80 × 80 is the image resolution, and 3 is the number of channels of the RGB color image. The output was three-dimensional, representing the probability of each of the three engagement levels, and the dimension with the maximum value was taken as the final prediction result.
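
Concretely, taking the dimension with the maximum value amounts to an argmax over the three softmax outputs (the probability values below are made up for illustration):

```python
import numpy as np

# The model emits one probability per engagement level
# (0: low engagement, 1: engagement, 2: high engagement).
# The predicted level is the index of the largest probability.
probs = np.array([0.15, 0.25, 0.60])  # illustrative softmax output
predicted_level = int(np.argmax(probs))  # -> 2 (high engagement)
```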

### *4.2. Evaluation Metrics*

The precision (*P*) and recall (*R*) metrics were used to measure model performance in this experiment, and they were computed as given in Equations (1) and (2), respectively.

$$P = \frac{TP}{TP + FP} \tag{1}$$

$$R = \frac{TP}{TP + FN} \tag{2}$$

where *TP* (True Positives) indicates the number of properly predicted target engagements, *FP* (False Positives) represents the number of mistakenly predicted target engagements, and *FN* (False Negatives) represents the number of target engagements that were not successfully identified.
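
Equations (1) and (2) translate directly into code; the counts below are hypothetical, purely to show the computation:

```python
def precision_recall(tp, fp, fn):
    """Equations (1) and (2): P = TP/(TP+FP), R = TP/(TP+FN)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r

# Hypothetical counts for one engagement class:
p, r = precision_recall(tp=80, fp=20, fn=10)
# p = 0.8 (share of predicted positives that were correct)
# r ≈ 0.889 (share of actual positives that were found)
```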

### *4.3. Experimental Results*

In the same experimental setting, this study compared the performance of different methods on our dataset, and the results are shown in Table 4 and Figure 4. In Figure 4, the rows show the true category of the video, the columns show the model's predicted category, and the values in brackets give the recall (R) for the current category.

**Table 4.** Classification results obtained by BiLRCN, LRCN, ResTcn, C3D, Xception, and SlowFast. The best results are expressed in bold. 0: Low engagement; 1: Engagement; 2: High engagement.


**Figure 4.** Confusion matrix for experimental results, (**a**) BiLRCN, (**b**) LRCN, (**c**) ResTCN, (**d**) C3D, (**e**) Xception, (**f**) SlowFast.

Compared with the other methods, BiLRCN and LRCN achieved higher accuracy, and the precision and recall of BiLRCN in each category exceeded those of LRCN, demonstrating that treating learning engagement as a process and considering its temporal features can effectively improve assessment accuracy. However, there is still room for improvement, which could be due to the following factors: first, no manual features were extracted before the experiment, which increases the noise in the training data; second, the videos in the dataset are all of adult learners, whose outward expressions during learning may be more subdued, which also raises the difficulty of judgment.

When comparing results across engagement levels, predictions skew toward high engagement because the uneven distribution of the dataset makes the model's learning of the different engagement levels similarly imbalanced. The precision and recall of high engagement are significantly higher than those of low engagement and engagement. The high precision means the model judges high engagement more accurately, because students' continuous behavior in the high engagement state varies less than in the other two states. Although high recall may come with false detections, it also implies that the model will catch every conceivable instance of high engagement, making the distinction between high learning engagement (high engagement) and low learning engagement (low engagement, engagement) more pronounced. In practice, teachers' instructional interventions are primarily addressed to students with low learning engagement [36], which makes recall the more appropriate metric for evaluating learning engagement measures. Furthermore, most engagement false detections are toward high rather than low, because students' internal cognitive processes are judged solely through external observation of video data.

### **5. Discussion**

This section explains and discusses the experimental results. We found that automatic recognition of video-based learning engagement does not achieve accuracy as high as classification tasks in other domains. Although the experimental results have been analyzed above, a few points still need to be discussed in comparison with other studies.

### *5.1. Discussion of Experimental Results*

Sample imbalance leads to unbalanced results. Regardless of the method used, the categories with more training samples perform better in the testing stage. While we believe the sample imbalance in the self-built dataset reflects reality, it also shows that misclassifying uncommon samples does not significantly impact the overall precision. Improving precision from an algorithmic perspective therefore tends to focus on the categories with abundant data, which is not always proper from an educational standpoint. For example, Abedi et al. [26] conducted a study on learning engagement recognition based on DAiSEE; although their method showed a good improvement in accuracy, it did not work well for the minority labels. The recall metrics indicate that high engagement is detected more efficiently and accurately in this experiment, and the meaning of recall is more consistent with actual instructional needs, which supports the validity of using recall to evaluate our results.
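
One common algorithmic countermeasure to such imbalance (not used in this paper's experiments; shown only as a sketch) is inverse-frequency class weighting, where rarer engagement levels contribute more to the loss:

```python
from collections import Counter

def balanced_class_weights(labels):
    # Inverse-frequency weights: rarer engagement levels get larger
    # weights so the loss does not simply favor the majority class.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Hypothetical imbalanced label list (0: low, 1: engagement, 2: high):
weights = balanced_class_weights([2] * 60 + [1] * 30 + [0] * 10)
# The minority class 0 receives the largest weight.
```

In Keras, such a dictionary can be passed via the `class_weight` argument of `model.fit`.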

Samples of similar categories are prone to misclassification. Our dataset distinguishes learning engagement levels based on the proportion and timing of engagement cues, which annotators understand but machines do not. The class labels are therefore not fully discrete values, especially between engagement and high engagement, making them difficult for deep learning models to distinguish; as a result, misdetections of 0 (low engagement) and 2 (high engagement) are more likely than of 1 (engagement). Although Bergdahl et al. [37] concluded that engagement lacks intrinsic boundaries, it remains uncertain whether learning engagement should be treated as a discrete or a continuous value. The advantage of labeling learning videos with engagement cues is that engagement levels are treated as discrete values while maintaining continuity, which facilitates subsequent extension studies. To our knowledge, there is no convincing study treating learning engagement as continuous values with labeled annotations and an open-access dataset.

Learning engagement is best assessed as a process-based, comprehensive judgment. The methods that consider temporal features outperform the other approaches, and our method, which focuses more on temporal contextual features, produces the best results. In our annotation work, annotators made judgments based on continuous behavior. In this study, learning engagement was reported in ten-second intervals, though there is no accepted standard for exactly which interval length is most accurate. Video-based learning engagement recognition models are mostly end-to-end, which reduces labor costs but exposes the model to more noise during learning. Although the differences in the sequential performance of the three engagement levels have been noted in earlier sections, they are still slight compared to other study domains. Since learning engagement spans three dimensions (behavioral, affective, and cognitive), it is preferable to consider more fine-grained features for engagement recognition when observable cues from students during online learning are not evident, for example by extracting more manual features from the video or integrating data from other modalities.
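
Segmenting each ten-second interval into the model's 40-frame input can be done by sampling evenly spaced frame indices; the sketch below assumes, for illustration, a 25 fps clip (250 frames per interval), which is not stated in the paper:

```python
def sample_frame_indices(total_frames, n_frames=40):
    # Pick n_frames evenly spaced indices from a clip, e.g. a 10 s
    # video at an assumed 25 fps (250 frames) down to the 40-frame
    # input sequence used by the model.
    step = total_frames / n_frames
    return [min(int(i * step), total_frames - 1) for i in range(n_frames)]

idx = sample_frame_indices(250, 40)  # 40 indices spanning the clip
```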

### *5.2. Discussion of Learning Engagement Application*

Effective and diversified application of learning engagement measures is required. When teaching online, teachers cannot attend to every student's engagement, which calls for aggregating or adapting individual students' online learning engagement. Applying learning engagement efficiently and in diverse ways is also a future research direction for us. At the individual level, the evolution of a student's engagement can be a good indicator of how engaged they are throughout the class, and we can alert teachers to chronically low-engagement students. At the class level, we can aggregate the overall engagement of all students and promptly notify the teacher when class engagement is generally low; the teacher can then increase class interest by selecting simple, attractive learning media or preparing short, clear, easy-to-understand learning materials. In addition, teachers should regularly assess students' status and intervene with students who have lost interest in learning for a long time.

The relationship between learning engagement and learning performance is worth exploring. Before the study, we expected that a student's performance would probably be higher if their engagement status were consistently high and stable. Taking 0082 (student 008 on the second task) and 0032 as examples, 0082's learning engagement was more erratic than 0032's: 0082 was mostly at low engagement (0) and engagement (1), while 0032 was mostly at high engagement (2). As a result, 0082 did not complete the questions, while 0032 answered eight of nine questions correctly. Moreover, among low-performing students, most have low or fluctuating engagement statuses. Some data show the opposite result, and we believe that judging students' internal mental activity solely through external performance is incomplete, owing to the cognitive dimension of learning engagement.

### *5.3. Discussion of Future Development*

We envision a way to recognize learning engagement without recording raw data; its real-world application requires more in-depth research on the automatic recognition of learning engagement.

More comprehensive datasets rely on more accurate automatic identification methods. Comprehensiveness here refers not only to the range of data types but also to the range of research subjects. There was no significant overall change in the adult students' performance during this study, making accurate label annotation more difficult. Research on learning engagement recognition should be expanded to explore the engagement characteristics of students at different levels and to create a comprehensive dataset with more explanatory labeling criteria. We believe the performance of the results can be improved in two parts: data feature extraction and deep learning network construction. In this study, we have shown that considering the temporal features of learning engagement can improve recognition accuracy, but we also found that some data were wasted. The granularity of feature extraction and the complexity of deep learning networks are the future directions of video-based automatic learning engagement recognition methods.

We must acknowledge that recognizing learning engagement carries some risks. The security and ethical issues of educational data are of great importance, which requires not only that researchers maintain data confidentiality throughout the process but also stronger legislative efforts at the policy level. Data security risk arises mainly during upload and storage. We therefore envisage that, in the upload phase, students download the recognition program to their own computers, perform recognition locally, and upload only the recognition results; in the storage phase, the data should be encrypted and stored in a private cloud, which reduces risk since it can be built behind a firewall. Ethical issues of video data mainly concern collection, representation, storage, and analysis [38]. The authenticity and objectivity of the data collection and representation process need attention: we constructed an experimental environment more realistic to the characteristics of the Chinese online learning environment, and we referred to the students' self-reports for labeling, which to a certain extent avoids ethical problems. Storage and analysis mainly raise security-oriented issues, which were discussed previously.

### **6. Conclusions**

This paper aims to address teachers' difficulty in perceiving students' online learning engagement in a timely and accurate way. We collected many online learning videos and constructed an online learning engagement dataset. Furthermore, a deep learning-based engagement recognition method was introduced, and we compared the performance of different methods on the dataset. Finally, we discussed the experimental results, applications of learning engagement, and future developments. This study can provide teachers with reliable assistance in evaluating student engagement and conducting learning interventions.

However, the present study also has some limitations, and we will continue to improve automatic learning engagement recognition in future work. First, we will gather data from various stages, situations, and engagement categories, implement data annotation using more interpretable standards, and build a comprehensive learning engagement dataset. Second, we will continue to refine the model to enhance precision, for example by dealing with data imbalance and extracting finer-grained features for learning. Finally, we will use multimodal data (such as physiological signals) for engagement recognition in the future.

**Author Contributions:** Conceptualization, Y.M., Y.S. and Y.W.; methodology, Y.W. and Y.M.; software, Y.M.; validation, Y.M., X.L. and Y.W.; formal analysis, Y.M., Y.T., Z.Z. and Y.W.; resources, Y.M., Y.S. and Y.W.; data curation, Y.M., Y.T. and Z.Z.; writing—original draft preparation, Y.M.; writing—review and editing, Y.W., Y.S. and X.L.; visualization, Y.M. and Y.S.; supervision, X.L. and Y.W.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Natural Science Foundation of China under Grant 62277029, the National Collaborative Innovation Experimental Base Construction Project for Teacher Development of Central China Normal University under Grant CCNUTEIII-2021-19, the Humanities and Social Sciences of China MOE under Grants 20YJC880100 and 22YJC880061, the Fundamental Research Funds for the Central Universities under Grant CCNU22JC011, and Knowledge Innovation Project of Wuhan under Grant 2022010801010274.

**Institutional Review Board Statement:** This research study was conducted in accordance with the ethical standards of the Helsinki Declaration. The Central China Normal University Institutional Review Board (CCNU IRB) usually exempts educational research from the requirement of ethical approval.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
