2.1. Facial Features vs. Body Activity for Affect Recognition
Most of the existing vision-based methods use facial features as the main cue for engagement, even when combined with other input modalities. For example, Ref. [
10] implemented an engagement detection model using facial expressions, full-body motion, and game events. The pro-social game used in the study was designed in two versions, an uninteresting one and a more entertaining one, and every volunteer played both versions for 10–15 min per session. Labeling was conducted using retrospective self-reports and the Game Engagement Questionnaire (GEQ), with data labeled as either “engaged” or “not engaged”. The proposed ANN model produced an accuracy of 85%.
In another work [
11], the authors proposed a new framework to assess engagement based on facial expressions using a lightweight CNN, chosen to reduce the effect of diverse backgrounds and resolutions on the prediction process. The proposed CNN comprised two parts: (1) feature extraction and (2) classification. The model used three public datasets: RAF-DB served as the source-domain training data, while JAFFE and CK+ were used as the target-domain datasets. These datasets include seven emotion classes: anger, disgust, fear, happiness, sadness, surprise, and neutral. Owing to class imbalance in some of these datasets, the authors applied under-sampling and data-augmentation methods to balance the data. The model considers four types of emotion: “understanding”, “doubt”, “neutral”, and “disgust”. The authors reported that their model outperformed its competitors, with 54% accuracy on CK+ and 51% on JAFFE.
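To make the two-part structure concrete, the following is a minimal sketch (in PyTorch) of a lightweight CNN split into a feature-extraction stage and a classification stage, in the spirit of the network in [11]; the layer sizes, input resolution, and four-class output are our own assumptions rather than the published architecture.

```python
# Minimal sketch (PyTorch) of a lightweight CNN split into a feature-extraction
# part and a classification part, in the spirit of [11]. Layer sizes, the 48x48
# input resolution, and the four emotion classes are assumptions for illustration.
import torch
import torch.nn as nn

features = nn.Sequential(                      # (1) feature extraction
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(32, 4)                  # (2) classification: 4 emotion classes

model = nn.Sequential(features, classifier)
logits = model(torch.rand(1, 3, 48, 48))       # one 48x48 RGB face crop
print(logits.shape)                            # torch.Size([1, 4])
```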
In another study [
12], the researchers attempted to measure engagement intensity by fusing both face and body features into a single long short-term memory (LSTM) model. They used the dataset EmotiW 2018 [
13], which contains 195 videos, with 147 clips for training and 48 clips for testing. Their fusion approach achieved performance comparable to state-of-the-art methods, with an accuracy of 75.47%.
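As an illustration of this kind of single-model fusion, the sketch below (PyTorch) concatenates per-frame face and body feature vectors and feeds the resulting sequence to one LSTM; the feature dimensions, hidden size, and number of output levels are assumptions, not the configuration reported in [12].

```python
# Minimal sketch (PyTorch) of fusing per-frame face and body features in a single
# LSTM for engagement-intensity prediction, in the spirit of [12]. The feature
# dimensions, hidden size, and number of output levels are assumptions.
import torch
import torch.nn as nn

class FusionLSTM(nn.Module):
    def __init__(self, face_dim=128, body_dim=64, hidden_dim=256, num_levels=4):
        super().__init__()
        # Concatenated face + body features form the per-frame input.
        self.lstm = nn.LSTM(face_dim + body_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_levels)

    def forward(self, face_seq, body_seq):
        # face_seq: (batch, time, face_dim); body_seq: (batch, time, body_dim)
        fused = torch.cat([face_seq, body_seq], dim=-1)
        _, (h_n, _) = self.lstm(fused)
        # The last hidden state summarizes the clip and predicts its intensity level.
        return self.classifier(h_n[-1])

model = FusionLSTM()
logits = model(torch.randn(2, 30, 128), torch.randn(2, 30, 64))  # 2 clips, 30 frames
```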
Another recent work [
14] developed a two-stage algorithm using behavioral information (on-task and off-task) and emotional information (satisfied, confused, and bored). These two dimensions were combined to detect whether the student was engaged. Facial expressions were used for the emotional dimension to decide whether the student felt satisfied, confused, or bored, while head-pose information was used for the behavioral dimension to determine whether the student was on-task or off-task. The algorithm was tested using five different CNN models applied to the DAiSEE dataset [
15]. For the training and testing phases, they used 1500 and 300 frames of students’ faces, respectively. The reported performance of the five models ranged between 76.8% and 92.5%. In [
7], the proposed framework combined facial expressions, eye gaze, and mouse dynamics. The data were recorded from the subjects in real time during reading sessions and classified into three attention levels, “low”, “medium”, and “high”, using an SVM classifier, with an accuracy of 75.5%.
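A minimal sketch of such a multimodal SVM classifier is shown below using scikit-learn; the per-modality feature dimensions and the synthetic data are placeholders for illustration only and do not correspond to the features used in [7].

```python
# Minimal sketch (scikit-learn) of a multimodal attention classifier in the spirit
# of [7]: facial-expression, eye-gaze, and mouse-dynamics features are concatenated
# and classified with an SVM. Feature dimensions and data are synthetic placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
face = rng.normal(size=(300, 10))    # hypothetical facial-expression features
gaze = rng.normal(size=(300, 4))     # hypothetical eye-gaze features
mouse = rng.normal(size=(300, 6))    # hypothetical mouse-dynamics features
X = np.hstack([face, gaze, mouse])   # simple feature-level fusion
y = rng.choice(["low", "medium", "high"], size=300)  # attention labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict(X[:3]))
```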
As shown in some of the previous works, body activities are one of the main cues of human affect and a means of conveying messages between people [
2,
8,
16,
17], and this is a well-researched and well-established field [
18]. Nevertheless, few methods consider investigating body activities combined with other modalities for detecting engagement in students. Therefore, we next discuss different works that show the significance of body activities and how to relate them to engagement detection.
Many studies have shown that, like facial expressions, body expressions are very effective at conveying emotions and feelings [
19,
20,
21,
22]. Scientists have demonstrated that affect is effectively conveyed through body expressions. Recognizing body expressions, however, is more difficult because the human body has more degrees of freedom than the face. Nevertheless, recent automatic affect recognition systems have started to take the analysis of body activities and expressions into consideration. Several recent works have had a similar aim to ours, focusing on the upper-body gestures of students using e-learning systems. For example, Ref. [
23] implemented a detection method for learner engagement that considered nonverbal behaviors, such as hand-over-face (HoF) gestures, along with head and eye movements and facial expressions during learning sessions. They proposed a novel dataset and detection method for HoF gestures, since such gestures can emphasize affective cues, and also took the effect of time duration into account. However, the study did not correlate the expressed emotions with the HoF gestures.
Another work [
24] considered the effect of emotional experience on the position and motion of the participant’s head and upper body while playing a serious game for financial education. The study included 70 undergraduate students, and a Microsoft Kinect device was used to collect depth-image data on body gestures. The researchers found that bodily expressions changed during the session, serving as an indicator of emotional state. The aforementioned studies related body gestures to emotional states in an e-learning environment but did not explicitly relate them to engagement levels.
However, in our previous work [
9], we proposed a new approach for measuring engagement levels based on analyzing students’ body activities. First, we collected a new realistic video dataset containing 2476 video clips (about 1240 min of recording) of undergraduate students attending real online courses during the COVID-19 lockdown. The clips were recorded in an uncontrolled environment using the volunteers’ built-in webcams, which made working with the dataset very challenging. Based on our research, we established two categories of body activity: (1) macro-body activities and (2) micro-body activities. We defined these categories as follows: macro-body activities, or actions, require major physical change and movement and are most likely voluntary, whereas micro-body activities are involuntary actions that most likely do not require noticeable physical change. These two definitions simplified the data analysis and annotation phase. After the data analysis and labeling phase, we applied several preprocessing pipelines to improve dataset quality by reducing redundancy and unnecessary data, which also reduced computation time. We then prepared the data to fit the chosen pre-trained model for training and, finally, generated the engagement detection model. The model was evaluated through several experiments and achieved an accuracy of 94%. In this paper, we aim to extend our previous work [
9] in connecting e-learners’ emotions and engagement states to their bodily activities by proposing two new prediction models that can detect more precise engagement levels based on the affective model in [
8].
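As a rough illustration of the general idea of translating recognized emotional states into engagement levels, the sketch below uses a simple lookup table; the emotion labels, level names, and pairings are placeholders and do not reproduce the actual affective model of [8].

```python
# Illustrative sketch only: a lookup-style mapping from recognized emotional states
# to engagement levels. The emotion labels, level names, and pairings are
# placeholders and do not reproduce the actual affective model in [8].
from typing import Dict

EMOTION_TO_ENGAGEMENT: Dict[str, str] = {
    "satisfied": "high engagement",   # placeholder pairing
    "neutral": "medium engagement",   # placeholder pairing
    "confused": "medium engagement",  # placeholder pairing
    "bored": "low engagement",        # placeholder pairing
}

def engagement_level(emotion: str) -> str:
    """Return the engagement level associated with a recognized emotion."""
    return EMOTION_TO_ENGAGEMENT.get(emotion, "unknown")

print(engagement_level("confused"))  # -> medium engagement
```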
2.3. Current Approach
Based on our recent work [
9], our proposed method assumes that body activities can be mapped to and correlated with engagement levels based on the emotions they convey. As seen in
Figure 2, we analyzed students’ emotions based on their activities while taking into consideration different factors, such as student preferences and the duration of the recording session, among others, and then used the emotion-based model proposed by the authors in [
8] to map these emotions to engagement levels. For better and more precise analysis and detection, we grouped human activities into two categories, macro-actions and micro-actions, deriving their definitions from the two corresponding definitions of facial expressions in [
25]. In our most recent work [
9], we proposed a prediction model that classifies engagement into two classes: positive engagement and negative engagement. In this work, however, following the affective model in [
8], we aim to propose new prediction models that broaden the detection to a wider range of engagement levels. In the following section, we provide more details about the new datasets used in this work.