**1. Introduction**

### *1.1. Research Background*

Since the outbreak of COVID-19, online learning has garnered considerable attention from schools [1]. Online learning moves face-to-face classes online, allowing real-time interaction between instructors and students even when they are not in the same classroom. In addition to helping students learn from home, online technology increases the flexibility of learning and the use of learning resources [2]. Yet, its widespread adoption has been accompanied by several problems. Online learning does not encourage meaningful relationships between teachers and students or among students themselves [2–4], which may explain why online education has a higher dropout rate than offline education [5]. In certain subjects, online learning can increase the anxiety of students who hold a negative view of their own abilities, which hinders their academic results [6,7]. These problems are highly detrimental to students' education and growth.

Teaching strategies can benefit students in online learning, and teachers can help students regain interest in studying in various ways, such as by offering instructional materials [8]. Due to the limitations of devices and networks, it is difficult for teachers to accurately assess each student's performance in the online learning environment, making it hard for them to intervene effectively in the classroom to ensure the quality of student learning [9,10]. The large number of students in Chinese classrooms also makes it difficult for teachers to pay attention to each student. It is therefore important to help teachers obtain the status of their students' online learning so that they can target their teaching strategies.

**Citation:** Ma, Y.; Wei, Y.; Shi, Y.; Li, X.; Tian, Y.; Zhao, Z. Online Learning Engagement Recognition Using Bidirectional Long-Term Recurrent Convolutional Networks. *Sustainability* **2023**, *15*, 198. https://doi.org/10.3390/su15010198

Academic Editor: Hao-Chiang Koong Lin

Received: 15 November 2022; Revised: 15 December 2022; Accepted: 16 December 2022; Published: 22 December 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Monitoring the quality of students' online learning to retain students who are about to drop out will become the future entry point of online education. Student learning can be evaluated through learning engagement, which is directly related to learning performance [11]. Effective recognition of students' online learning engagement has therefore become an essential basis for teachers to intervene in student learning and improve teaching quality. Initially, manual methods were used to assess student engagement, but these are time-consuming and labor-intensive, and the results can be strongly influenced by subjectivity. There are also methods that assess student engagement through external observation, but they place high demands on the observer. With the development of information technology, automatic recognition methods based on learning data have received much attention from researchers. Compared to the other methods, automatic recognition is non-intrusive and does not interrupt the student learning process.

Currently, most automated methods for learning engagement recognition are based on deep learning models [7,12]. Deep learning is data-driven, but learning engagement data currently face several problems, including complicated data modalities, a shortage of open-access datasets, and uneven annotation standards, which directly limit the results of automatic learning engagement recognition. In addition, differences in learning performance across ethnic groups make it more difficult to systematically advance automatic recognition of learning engagement appropriate for China.

There are two main approaches to learning engagement recognition: using physiological signals (e.g., heart rate, brainwaves, skin conductance) and using behaviors (e.g., posture, gestures, facial expressions) [12]. However, collecting physiological signals in an online learning context requires wearable equipment, which is difficult to deploy. It is more feasible to use student behavior captured in learning videos recorded via webcam, because this method allows data to be collected without intruding on the student learning process.

### *1.2. Learning Engagement and Its Measurement Methods*

Learning engagement often appears as the antithesis of learning burnout, a concept introduced in 1985 by Meier et al. [13]. They describe learning burnout as a state of physical and mental exhaustion that originates from a vicious cycle between the learning environment and the learner, comprising three aspects: emotional exhaustion, behavioral misconduct, and low personal achievement. In 2004, Fredricks et al. [14] provided a widely accepted definition of learning engagement, asserting that it consists of a multidimensional structure of emotional, behavioral, and cognitive engagement.

The level of student engagement is directly correlated with the quality of learning [15]. Learning engagement refers to the learner's positive and engaged state of mind in the learning situation and activity. Measurement of learning engagement dates back to 1980. The main measurement methods include self-feedback reports, external observation, and automatic recognition. Self-feedback reporting methods assess student engagement through self-reports or questionnaires; for example, Greene et al. [16] used a Likert scale to investigate student engagement. Although often convenient and useful, self-feedback reports depend on the learner's understanding of learning engagement, their level of compliance, and their memory of the learning process. External observation is another important method for assessing learning engagement [17], but it requires a certain level of expertise from the observer, which makes it difficult to handle large amounts of data. The automatic recognition method aims to evaluate student engagement using cutting-edge technologies such as machine learning and computer vision, which can successfully address the aforementioned shortcomings [18]. Although automatic recognition of learning engagement still faces problems such as difficult data collection and annotation, limited recognition performance, and low interpretability, we believe it is promising for the future.

### *1.3. Video-Based Recognition of Learning Engagement*

Compared to the constraints of self-feedback reporting and external observation (e.g., being time-consuming and labor-intensive, or unable to handle huge amounts of data), automatic recognition performs better. It collects many performance indicators from the student's learning process and evaluates learning engagement from the gathered data without interfering with that process. Video data have become the primary modality in learning engagement recognition studies due to their ease of collection and rich information content [19,20]. There are also engagement recognition studies that utilize other data modalities such as images [21], audio [22], and physiological signals [23].

In this study, we concentrated on learning engagement recognition based on video data (see Table 1). Gupta et al. [24] proposed the DAiSEE dataset and used traditional Long-term Recurrent Convolutional Networks, C3D, and other networks for four-class engagement prediction. Zaletelj et al. [19] proposed a large-scale analysis mechanism for student classroom behavior data obtained by the Kinect One sensor, which can estimate students' attention and engagement levels in the classroom and give teachers feedback for instructional evaluation, so that teachers can adjust instruction in a way tailored to students to support learning performance. Huang et al. [25] proposed an engagement recognition network (DERN) based on temporal convolution, Bi-directional Long Short-Term Memory, and an attention mechanism for the DAiSEE dataset. Abedi et al. [26] used a hybrid end-to-end network combining ResNet (Residual Network) and TCN (Temporal Convolutional Network) to analyze raw video sequences, and their results outperformed other approaches on the same dataset. Sümer et al. [20] collected facial video data from 128 students in grades 5–12 in a classroom and utilized three methods—SVM, MLP, and LSTM—to predict student learning engagement on a scale of −2 to 2 (off-task to on-task) and to compare engagement levels across grades. Liao et al. [27] extracted facial features from the DAiSEE dataset using a pre-trained SENet and then utilized an LSTM network with a global attention mechanism to predict learning engagement. Mehta et al. [28] proposed a three-dimensional DenseNet self-attention network, compared the results with current methods on two- and four-class metrics, and verified the network's robustness on the EmotiW dataset.

Although learning engagement measurement had garnered attention before the pandemic, its widespread growth was nonetheless a result of the epidemic. Deep learning, a branch of machine learning that uses artificial neural networks to learn features from data, has emerged as a major feasible approach to automatic learning engagement recognition. The models chosen by current deep learning methods mostly focus on the temporal features of students' behavioral performance, but this use of temporal information is often simplistic. It has become a consensus among researchers that videos of students' facial expressions are the most representative data for student learning engagement, yet there is no standard paradigm for handling such data. In addition, the evaluation of existing methods is mostly based on accuracy and mean square error, which is intuitive but lacks comprehensiveness.


**Table 1.** Automatic recognition of learning engagement.

### *1.4. Dataset for Engagement Recognition*

Most learning engagement studies based on open-access datasets use HBCU [30], DAiSEE (Dataset for Affective States in E-Environments) [24], and the in-the-wild dataset [29]. The HBCU data were collected from participants in two distinct pools, nine men and thirty-five women, who took part in Cognitive Skills Training research coordinated by a Historically Black College/University (HBCU) and the University of California (UC). The DAiSEE dataset is a multi-label video classification dataset made up of 9068 video clips from 112 subjects, with labels for boredom, confusion, engagement, and frustration. Each label is rated at level 0 (very low), level 1 (low), level 2 (high), or level 3 (very high); for engagement, the counts at these levels are 61, 459, 4477, and 4071, respectively. The in-the-wild dataset includes 78 people and 195 videos (each lasting around 5 min) collected in unrestricted settings such as computer laboratories, dorm rooms, and open spaces; its labels are disengaged, barely engaged, normally engaged, and highly engaged. The labels in DAiSEE and in-the-wild were determined by crowdsourcing, while human experts performed the labeling in HBCU. Because differing labeling standards can result in unclear engagement labels, some research excludes data with ambiguous labels, which improves the results' accuracy but reduces the data's amount and variety. In addition, there is no learning engagement dataset for Chinese students.
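The engagement counts above imply a heavy class imbalance in DAiSEE, which is worth quantifying before training. The following is a minimal sketch; `label_proportions` is a hypothetical helper, and the counts are taken from the figures quoted above.

```python
# Hypothetical helper illustrating the class imbalance of the DAiSEE
# engagement label; the per-level counts are those quoted in the text.
from collections import OrderedDict

def label_proportions(counts):
    """Return each class's share of the total, rounded to 4 decimals."""
    total = sum(counts.values())
    return {k: round(v / total, 4) for k, v in counts.items()}

engagement_counts = OrderedDict(
    [("very low", 61), ("low", 459), ("high", 4477), ("very high", 4071)]
)
props = label_proportions(engagement_counts)
# The two highest levels together account for over 94% of the clips,
# which is one reason accuracy alone can be a misleading metric here.
```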

Additionally, data collection is more difficult because different learning environments, learning tasks, and student populations require different data-gathering and processing methods, so collected learning engagement datasets are typically small. In conclusion, the data used in current learning engagement research vary in data unit duration, annotators, labeling criteria, and collection processes, making it difficult to develop learning engagement recognition systematically. Researchers may align data using a variety of processing techniques, but before undertaking a study, they must discuss and establish the data annotation process.

### *1.5. Problem Statement*

Based on the above discussion, the main problems currently facing learning engagement recognition are as follows:


### *1.6. Contributions*

This study built an online learning engagement dataset from videos of students recruited from a university in Wuhan, Hubei Province. Learning engagement cues were used to establish tri-categorized labels for this dataset. Based on this dataset, we investigated the automatic recognition of students' learning engagement in online learning scenarios using the BiLRCN network. Finally, we analyze and discuss the results, explore feasible methods for the automatic recognition of learning engagement, and propose future research directions. The contributions of our work can be summarized as follows:


The rest of this article is structured as follows: Section 2 describes the construction of the collected dataset, including the collection process and annotation criteria. Section 3 describes the Bidirectional Long-Term Recurrent Convolutional Network introduced in this study. Section 4 presents the experimental metrics and results, along with a comparison against five other state-of-the-art methods. Section 5 discusses these results. Finally, conclusions and possible future research directions are given in Section 6.

### **2. Dataset Construction**

### *2.1. Data Collection*

An HD webcam (Logitech C930c, 1920×1080, 30 fps) was mounted on a laptop computer and used to collect video data from students engaging in online learning. For this study, 58 undergraduate and graduate students between the ages of 21 and 25 were recruited: 42 females and 16 males. They spanned six majors, including educational technology, computer science, and psychology. We used OBS (Open Broadcaster Software) to record the students' computer screens during the experiment to verify that they performed the assigned learning tasks. Additionally, we used a custom software program to record videos of the students' faces and bodies, and these recordings served as the raw data. Apart from being required to sit in front of the computer, students were not constrained in any other manner.

Three online learning tasks were given to the participants in this experiment:


Besides completing the learning tasks, participants were also required to rate the difficulty of the tasks and indicate whether they had previously been exposed to the tasks' material. Before the experiment, participants were informed of the experimental procedure and requirements. To reduce the Hawthorne effect, students were given breaks before the experiment began and between tasks. Five staff members directed the experiment but did not interfere with the participants' operation. After the experiment, students were asked to use the recorded video to recall their engagement for each full minute. All participants featured in the video signed an informed consent form before the experiment, and each provided only age, gender, and major as identifying information. The experimental setup is shown in Figure 1.

Excluding videos lost or missed during the experiment, we gathered 1073 min of raw video data (saved in AVI containers with H.264 encoding). To align the data for subsequent annotation, we used the FFmpeg tool to crop the videos after the initial data screening, obtaining 6308 raw 10-second videos.
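The paper only states that FFmpeg was used for cropping; one plausible way to cut fixed-length clips without re-encoding is FFmpeg's segment muxer. The sketch below builds such a command; `build_segment_cmd` and the file paths are illustrative assumptions.

```python
# Sketch of the clip-splitting step, assuming FFmpeg is on PATH.
# build_segment_cmd is a hypothetical helper; paths are examples only.
def build_segment_cmd(src, out_pattern, clip_seconds=10):
    """Build an ffmpeg command that splits `src` into fixed-length clips."""
    return [
        "ffmpeg", "-i", src,
        "-c", "copy",                  # stream copy: no re-encode of H.264
        "-f", "segment",               # segment muxer
        "-segment_time", str(clip_seconds),
        "-reset_timestamps", "1",      # each clip starts at t=0
        out_pattern,                   # e.g. "clips/042_%03d.avi"
    ]

cmd = build_segment_cmd("raw/042.avi", "clips/042_%03d.avi")
# To execute: subprocess.run(cmd, check=True)
```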

**Figure 1.** Data Collection. (**a**) Experimental procedure; (**b**) Experimental example.

### *2.2. Data Annotation*

Three annotators with experience in learning engagement annotation performed the annotation in this study. Before starting, the annotators systematically learned the annotation classification and criteria and completed a reliability assessment. Following the common trichotomous classification used in contemporary learning engagement recognition research, the data labels in this study were separated into three groups: low engagement, engagement, and high engagement (marked 0, 1, and 2, respectively). Low engagement indicates that the student is not engaged in the learning content or shows clear engagement in other, non-learning content; engagement indicates that learners are involved in the learning process, but this involvement is limited and susceptible to interruption; high engagement entails complete engagement in the learning process and a high level of stability under disturbance. Annotators were asked to use the intensity and proportion of the brief learning engagement cues (see Table 2) observed in students' videos to determine the level of learning engagement.



Eye movement-related cues served as the primary cue in labeling, followed by facial expressions, body movements, and preceding and subsequent temporal states, to determine the degree of engagement. Given the subjective nature of annotation, the annotators also considered the student subjects' self-feedback when labeling. Annotation quality was controlled through a majority voting (MV) strategy and repeated annotation through collective discussion.
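The majority-voting step described above can be sketched as follows; the tie-handling rule (returning `None` to trigger re-annotation by discussion) is an assumption consistent with, but not stated in, the text.

```python
# Minimal sketch of the majority-voting (MV) quality-control step:
# each clip gets the label chosen by most of the three annotators;
# ties are flagged for collective discussion and re-annotation.
from collections import Counter

def majority_vote(labels):
    """Return the majority label, or None when there is no unique winner."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie -> resolve by discussion
    return counts[0][0]

majority_vote([2, 2, 1])   # clear majority among three annotators
majority_vote([0, 1, 2])   # three-way tie: needs discussion
```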

To facilitate subsequent training, validation, and testing of the model, we shuffled the labeled video data and divided it into training, validation, and testing sets in a 6:2:2 ratio. The final distribution of the data is shown in Table 3. The data are indexed by unique subject numbers; taking subject 042 as an example, Figure 2 shows this subject's partial performance at different engagement levels (10 selected frames).
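The shuffle-and-split procedure above can be sketched as below; the fixed seed and the clip identifiers are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of shuffling the labeled clips and splitting 6:2:2.
import random

def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle items and split into train/val/test by the given ratios."""
    items = list(items)
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# 6308 clips, as reported for this dataset
train, val, test = split_dataset(range(6308))
```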


**Table 3.** Distribution of online learning engagement dataset.

**Figure 2.** Example of online learning engagement dataset. (**a**) disengagement; (**b**) engagement; (**c**) high engagement.

### **3. Online Learning Engagement Recognition Method**

The Long-term Recurrent Convolutional Network (LRCN) [31] is a deep learning network that combines Convolutional Neural Networks (CNN) with Long Short-Term Memory (LSTM) networks. It can process temporal video or single-frame image inputs and produce single-value or temporal predictions, making it a versatile network for sequential inputs or outputs. LRCN has been widely used in activity recognition, image description, and video description. Due to its strong performance, several improvements have been proposed; for example, Yan et al. [32] proposed a bidirectional LRCN for stress recognition, and their results also show that bidirectional LSTM is helpful for video classification.

Given that learning engagement is a process performance, this paper utilized a Bidirectional Long-Term Recurrent Convolutional Network (BiLRCN), which combines a two-dimensional convolutional neural network (2DCNN) wrapped by a TimeDistributed layer with a BiLSTM, for learning engagement recognition. Taking video 0191012 in the dataset as an example, the BiLRCN network structure used in this study is shown in Figure 3. The network consists of four main parts, from left to right: the video frame input layer, the spatio-temporal feature extraction layer, the time-series learning layer, and the determination layer. The model takes the video frame sequence as input and uses the TimeDistributed-wrapped 2DCNN to extract spatio-temporal features. The extracted features are passed through a BiLSTM network for temporal feature learning to obtain the temporal output of the network. The final output is then obtained through a fully connected layer with softmax as the activation function.
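To make the data flow through the four stages concrete, the following sketch tracks tensor shapes from input frames to the softmax output. All layer sizes (frame count, resolution, kernel size, LSTM units) are illustrative assumptions; the paper's actual hyperparameters are not stated in this section.

```python
# Shape bookkeeping for a BiLRCN-style pipeline (illustrative sizes only).

def conv2d_shape(h, w, kernel=3, stride=1, pad=0, filters=32):
    """Output (h, w, channels) of a 2D convolution with 'valid' padding."""
    oh = (h + 2 * pad - kernel) // stride + 1
    ow = (w + 2 * pad - kernel) // stride + 1
    return oh, ow, filters

def bilrcn_shapes(frames=30, h=64, w=64, channels=3, lstm_units=64, classes=3):
    """Trace a clip of `frames` RGB frames through the four network stages."""
    shapes = {"input": (frames, h, w, channels)}
    # Stage 2: TimeDistributed(Conv2D) applies the same conv to every frame
    ch, cw, cc = conv2d_shape(h, w)
    shapes["timedistributed_conv"] = (frames, ch, cw, cc)
    # Flatten each frame's feature map before the recurrent layer
    shapes["per_frame_features"] = (frames, ch * cw * cc)
    # Stage 3: BiLSTM concatenates forward and backward final states
    shapes["bilstm"] = (2 * lstm_units,)
    # Stage 4: fully connected + softmax over the three engagement levels
    shapes["softmax"] = (classes,)
    return shapes

shapes = bilrcn_shapes()
```

Note how the TimeDistributed wrapper leaves the time axis (`frames`) untouched while the convolution transforms each frame, and how the bidirectional layer doubles the LSTM feature width.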

**Figure 3.** Architecture for BiLRCN.
