Article

MultiFusedNet: A Multi-Feature Fused Network of Pretrained Vision Models via Keyframes for Student Behavior Classification

Division of Computer Science and Engineering, CAIIT, Jeonbuk National University, Jeonju 54896, Republic of Korea
* Authors to whom correspondence should be addressed.
Current address: Division of Computer Science, Surindra Rajabhat University, Surin 32000, Thailand.
Appl. Sci. 2024, 14(1), 230; https://doi.org/10.3390/app14010230
Submission received: 14 November 2023 / Revised: 18 December 2023 / Accepted: 20 December 2023 / Published: 26 December 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This research proposes a deep learning method for classifying student behavior in classrooms that follow the professional learning community teaching approach. We collected data on five student activities: hand-raising, interacting, sitting, turning around, and writing. We used the sum of absolute differences (SAD) in the LUV color space to detect scene changes. The K-means algorithm was then applied to select keyframes using the computed SAD. Next, we extracted features using multiple pretrained deep learning models from the convolutional neural network family. The pretrained models considered were InceptionV3, ResNet50V2, VGG16, and EfficientNetB7. We leveraged feature fusion, incorporating optical flow features and data augmentation techniques, to enrich the spatial features of the selected keyframes. Finally, we classified the students’ behavior using a deep sequence model based on the bidirectional long short-term memory network with an attention mechanism (BiLSTM-AT). The proposed method with the BiLSTM-AT model can recognize behaviors from our dataset with high precision, recall, and F1-scores of 0.97, 0.97, and 0.97, respectively. The overall accuracy was 96.67%. This high efficiency demonstrates the potential of the proposed method for classifying student behavior in classrooms.

1. Introduction

An essential part of a teacher’s role is the effective communication of the teaching material to students. The conventional approaches to learning, such as rote learning, are ingrained in our educational culture [1] and routinely employed in general classrooms. The classroom activities of the traditional educational methods involve a variety of exercises, including memorizing, creating rules and formulas, and ensuring that pupils complete a large amount of homework or take exams. However, despite the students’ listening to and memorizing an excessive amount of information, the traditional educational method does not result in an increased interest level among students in their studies. Furthermore, these traditional learning techniques could result in additional adverse effects on students, such as a lack of critical thinking abilities, inadequate problem-solving strategies, and strained relationships in small groups or the entire classroom. So, we use the professional learning community (PLC) technique [2] to improve student teaching in a classroom environment and overcome the constraints of the traditional approach [3]. However, when teachers use the PLC approach for schooling, they will observe different student behaviors during the learning process. In this stage, it is challenging to simultaneously distinguish an individual student’s behavior from that of the whole class.
In recent years, artificial intelligence (AI) has become more powerful than ever [4], and many industries, including education, are using it to become leaders in their fields. Many student behavior and classroom assessment researchers have recently created and deployed machine learning and deep learning models [5] to help resolve difficulties such as the difficulties in understanding student behavior, student engagement in the classroom, and students’ emotional facial expressions [6,7,8,9,10,11]. AI technology in the computer vision domain can help teachers analyze student behavior better. Furthermore, this technology can be used to develop an improved teaching approach and learning model that will help students enhance their enthusiasm for studying. In addition, this approach and model improve the capacity of students to solve issues and form stronger relationships among themselves.
While AI can potentially analyze student behavior, limitations persist in this analysis due to data complexity, dynamic classroom environments, and behavior variability. The multifaceted nature of student behavior poses challenges for models to capture behaviors precisely. AI model accuracy is also influenced by classroom size, as larger classes can overwhelm the model’s ability to track individual behaviors effectively. Additionally, varying classroom contexts, such as in terms of teaching approaches, classroom layouts, lighting conditions, and classroom learning culture, can significantly impact student behavior.
To overcome these limitations, we propose an AI-based method [12] built on a novel hybrid approach that combines convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [13,14] for the classification of student behavior. Our framework utilizes keyframe selection in video preprocessing and multiple-feature fusion from various pretrained CNN families to extract features from each keyframe and fuse them. We then leverage a deep sequence neural network with an attention mechanism to classify student behavior in the classroom. We collected the dataset from mathematics classrooms in a primary school at Sisaket Rajabhat University Demonstration School, Thailand. The student behavior dataset consists of five categories: hand raising, interacting, sitting, turning around, and writing. Our work uses a student behavior classification approach that differs from that of other researchers, who typically use behavior detection to alert teachers about inappropriate learning behaviors. Student behavior classification, a critical component of student analysis, is essential for designing learning units in each subject. The use of the PLC approach in the classroom makes it even more essential to analyze student behavior in relation to learning outcomes.
The major contributions of this paper are as follows:
  • We have designed an efficient keyframe selection method based on the K-means and the sum of absolute differences (SAD) of the L channel in the LUV color space to extract keyframes from a random length of video clips. The proposed method guarantees the extraction of a unique sequence for any given video clip.
  • Four features are fused to characterize the behavior type efficiently. High-level features are extracted from the natural RGB images, while motion features are extracted from the optical flow. To overcome problems regarding varying luminance conditions and data deficiency, features from the Lab color components and from data-augmented (DA) images are also extracted and fused with the other high-level features. This multi-feature fusion ensures that the feature extraction process captures the most relevant information from the video keyframes, covering both spatial and temporal features.
  • To achieve efficient classification, we introduce an innovative hybrid network that combines a bi-directional long short-term memory (BiLSTM) model with an attention mechanism (AT) to classify student behavior in a classroom setting effectively. The BiLSTM component excels at capturing long-range temporal dependencies within the student behavior data. Simultaneously, the integrated AT empowers the model to identify and prioritize the most relevant data segments for precise classification. Our architectural enhancements include the use of 512 LSTM cells, equally distributed into 256 cells each for both forward and backward directions. This architecture is robust and can support 7168 feature dimensions in each of the ten keyframe feature fusions.
The remainder of this paper is organized as follows: Section 2 reviews the related literature regarding student behavior in the classroom. Section 3 describes the materials and methods, including the student behavior data collection, keyframe extraction, and selection methods, the multiple pretrained models for feature extraction and the fusion approach, and the proposed overall framework using a deep sequence model based on the BiLSTM-AT network to classify students’ behaviors. Section 4 describes the experimental setup, dataset setup, and evaluation metrics. Section 5 presents the experimental results and discusses the findings. Section 6 concludes the paper and presents directions for future work.

2. Literature Review

Analyzing student behavior is important for enhancing teaching approaches and improving student learning. In this context, Zheng et al. [15] proposed a system to analyze student behavior by automatically detecting hand-raising, standing, and sleeping behaviors based on object detection. The researchers adopted ResNet-101 as the backbone of a Faster R-CNN network with different sizes of regions of interest, including low-level and high-resolution feature maps and high-level semantic information. The results showed an mAP of 57.6%, a 3.4% increase over the original model’s mAP of 54.2%, at a fast speed. Jisi et al. [16] presented a new feature fusion network that outperforms existing recognition algorithms for student behavior recognition in education. The network combines spatial affine transformation and CNNs to extract detailed features from video data and uses a weighted sum method to fuse spatial-temporal features for classification. They evaluated the model on the HMDB51 and UCF101 human behavior datasets, with better results than other state-of-the-art recognition algorithms. Hu et al. [17] proposed a bimodal learning engagement recognition method for large-scale classrooms. They used deep learning techniques to recognize three levels of engagement (high, medium, and low) based on emotional and behavioral features. In addition, they designed a bimodal network using ResNet50 and CoAtNet with a KNN classifier and obtained an accuracy of 93.94%. Their method outperformed state-of-the-art techniques.
Zhou et al. [18] proposed a deep learning network for student behavior recognition in the classroom using a 10-layer deep CNN (CNN-10) to extract vital information from the human skeleton. The experimental results showed that their method could effectively exclude irrelevant information, such as students’ physique, dress, and classroom background, and focus on key information such as hands up, head down, listening, and standing in class, resulting in higher recognition accuracy. Lin et al. [19] presented a student behavior recognition system based on skeleton pose estimation and person detection using the OpenPose framework. To reduce errors, they applied pose estimation and person detection techniques, followed by skeleton data preprocessing to eliminate several joints with weak effects on behavior. Feature vectors of human postures were extracted from normalized joint locations, joint distances, and bone angles. A deep neural network was then constructed to recognize student behaviors such as looking, asking, bowing, and bored behavior. The scheme outperformed the skeleton-only approach in terms of complexity, with gains of 15.15% in average precision and 12.50% in average recall. Fu et al. [20] also conducted learning behavior analysis in classroom teaching, covering behaviors such as listening, fatigue, hand-up, turning sideways, and reading-writing. They applied OpenPose to extract key points from human skeletons, faces, and fingers to detect the human body. Afterwards, they classified each behavior using a CNN, achieving 92.86% accuracy in a real-life classroom teaching environment.
Huang et al. [21] presented a novel student action recognition method for classrooms using YOLOv3. Their method could accurately and quickly identify student behaviors such as fatigue, hand-ups, bowing, turning sideways, body position, and facial expressions. Zhang et al. [22] proposed an improved YOLOv3 to detect student behaviors in the classroom, such as sleeping, using mobile phones, and taking notes, by adding a CBAM attention module to the original network. The resulting YOLO-CBAM was fast and effective, could also detect small targets, and achieved high accuracy even though detecting student behavior is challenging. Ren et al. [23] improved YOLOv4 to detect students who sleep or use a smartphone by enhancing the feature extraction network with combined top-down and bottom-up paths. The original YOLOv4 was compared with the improved network (PANet and Faster R-CNN), and the results showed that the enhanced network achieves better results and is suitable for student detection and recognition tasks. Tang et al. [24] presented a classroom behavior detection algorithm based on an improved YOLOv5 object detection model. They combined a weighted bidirectional feature pyramid network (BiFPN) with the feature pyramid network and path aggregation network (FPN+PAN) structure of YOLOv5, which allowed them to extract the fine-grained features of various behaviors at different object scales. They also added a CBAM to the neck network to improve the detection accuracy. Finally, they improved the original non-maximum suppression algorithm using a distance-based intersection ratio to enhance the discrimination of occluded objects. Yang et al. [25] presented a student classroom behavior detection method based on an improved YOLOv7 to address the low accuracy of existing methods. To improve the detection accuracy in crowded scenes, they integrated the BiFormer attention module and Wise-IoU into the YOLOv7 network. Their method achieved a mAP@0.5 of 79% on the SCB dataset, which includes 18.4k labels and 4.2k images covering three behaviors (hand-raising, reading, and writing), representing a 1.8% improvement over previous results.
Wang et al. [26] proposed a novel student classroom behavior detection system that combines deformable DETR with a Swin Transformer and a lightweight feature pyramid network (FPN). To address the limitations of CNN-based target detection methods, their system employs a feature pyramid structure and the CARAFE lightweight operator to process multi-scale feature maps effectively and enhance detection accuracy. The system achieved a significant 6.1% improvement in detection accuracy compared with the original deformable DETR model and outperformed Faster R-CNN, SSD, and YOLOv3 [21], v5 [24], and v7 [25].
According to recent papers, the majority of research has focused on improving accuracy through feature extraction, model combination, or both. For instance, ResNet-101 has been integrated with a Faster R-CNN network to detect student behavior [15]. In addition, a spatial affine transformation network has been combined with CNNs for feature extraction and the detection of student behavior [16]. ResNet50 and CoAtNet have been employed to create a bimodal network for feature extraction [17]. CNN-10, a 10-layer deep CNN, has been used for feature extraction to recognize student behaviors [18], while the OpenPose framework has been utilized to focus on key information [19,20]. YOLOv3 uses multiple convolutional layers to isolate important features [21], and a CBAM has been incorporated into YOLOv3 for feature extraction [22]. For enhanced accuracy, a feature extraction network combining top-down and bottom-up paths has been used with YOLOv4 [23], and a weighted BiFPN combined with FPN+PAN and a CBAM has been applied to YOLOv5 [24]. In YOLOv7, the BiFormer attention module and Wise-IoU have been applied for feature extraction [25]. To enhance feature extraction, the deformable DETR, Swin Transformer, FPN, and the CARAFE operator have been combined [26]. The use of integrated models and combined methods for feature extraction is crucial to improving the accuracy of student behavior detection and classification tasks.
Diverse previous research has demonstrated that a combination of sophisticated feature extraction techniques and model integration strategies can enhance the accuracy of student behavior detection and classification. Our comprehensive analysis of existing approaches found that no single algorithm or model can consistently achieve the highest accuracy across all aspects of student behavior analysis. Therefore, it is essential to employ diverse techniques to capture the multifaceted nature of student behavior and achieve the most accurate and comprehensive results.
In addressing the aforementioned limitations and leveraging the strengths of diverse approaches, we apply a critical feature extraction and fusing methodology using multiple pretrained models [27]. To optimize feature extraction, we leverage the unique strengths of each model [28]. We then use deep sequence neural networks to classify and recognize student behavior. While existing research has introduced various methods for feature extraction, a robust feature extraction technique is still required that can be used with our dataset and in a different context. Our work takes a distinctive and innovative approach to enhance the accuracy and efficiency of the feature extraction and classification processes.

3. Data Collection

The dataset was collected in an actual classroom environment [29] using a purposive sampling technique focusing on a first-grade primary school class at Sisaket Rajabhat University Demonstration School, Thailand. Specifically, data were obtained for students studying mathematics in the second semester of 2020. There were 24 students in the classroom, and a PLC teaching approach was used. For the data collection, the educational researcher assigned the teams, designed the layout for the student seating, and set up video cameras to record all students. The video cameras were placed in front of the whole class on the left and right sides. The top view of the seating layout is depicted in Figure 1a. When a class starts, the photographic specialist records the video until the course ends. Figure 1b shows an actual classroom scene. A SONY Handy Cam 60X (Tokyo, Japan) was used to record the video at a frame rate of 30 frames per second.
The expert analyzed and collected behavioral clips by cropping the original video file and uploading it to the cloud server storage. This research focused on the student’s behavior during lesson activities, and we labeled each clip with one of five categories: hand-raising, interacting, sitting, turning around, and writing. The dataset collection sample for each behavior is shown in Figure 2.
The PLC teaching approach aims to encourage collaboration among educators to enhance teaching, improve student outcomes, and foster continuous improvement within the school. PLCs strive to create an environment that supports educators in optimizing teaching for diverse student needs, promoting academic success through data-driven decisions, ongoing development, and a student-centered focus [2]. By observing how a particular behavior influences and positively impacts students’ learning [1,2,3], teachers gain firsthand insights into the effectiveness of student learning. This observation-based approach encourages teachers to analyze the correlation between student behaviors and the resulting outcomes in students’ understanding. Our collected dataset encompasses five distinct classes of student behaviors observed within a classroom environment employing the PLC teaching method. The classification of student behaviors in our dataset aligns with the core objectives of the PLC [2].

4. Methods and Experiment

4.1. Overview of Framework

Our overall framework can be divided into three parts: pre-processing, pretrained models, and the deep sequence model. In the pre-processing phase, we select ten keyframes by applying the SADLK algorithm. Next, we use multiple pretrained models for feature extraction based on a feature fusion approach. During this process, we apply data augmentation (DA) to the RGB keyframes. Features are extracted from the original RGB keyframes using Inception-V3. For augmented keyframes, produced by techniques such as rotation, width shift, height shift, and horizontal flip, we use ResNet50-V2 for feature extraction. For color injection, the keyframes are transformed from RGB to the L*a*b* color space and features are extracted using VGG-16, while EfficientNet-B7 extracts features from the optical flow keyframes. Our overall process pipeline is depicted in Figure 3.
The MultiFusedNet has been designed based on a deep sequence model with the BiLSTM-AT network to classify the students’ behaviors.
Our model consists of 158 million parameters across 14 layers with different output sizes, all of which are trainable. The model takes frame features as input, together with a mask input that handles the keyframe sequence features. The layer configuration has two BiLSTM layers with attention mechanism units, with 512 and 256 hidden units, each followed by a dropout layer with a 0.3 rate to prevent overfitting. The dropout layers ensure the model learns to rely on different hidden units, preventing it from becoming too specialized for the training data. In addition, a dense layer with 64 hidden units, ReLU activation, and L2 regularization is added, followed by a third dropout layer with a 0.3 rate; these additions further improve the model’s generalization capability. The output layer uses softmax activation for multi-class classification. We then feed the features into the BiLSTM-AT network with a batch size of 64. The combination of keyframe selection and multi-feature fusion with the BiLSTM-AT network is called “KMFF-BiLSTM-AT”, and the same configuration without keyframe selection is called “MFF-BiLSTM-AT”. Figure 4 illustrates the overall framework of MultiFusedNet.
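The layer configuration described above can be sketched in Keras as follows. This is a minimal sketch, not the authors' implementation: the exact attention wiring, the L2 regularization strength, and the pooling after attention are illustrative assumptions; the input shapes and unit counts follow the text (10 keyframes, 7168 fused features, 512- and 256-unit BiLSTM layers, 0.3 dropout, 64-unit dense layer, 5-class softmax).

# Minimal Keras sketch of the BiLSTM-AT configuration described above.
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

MAX_SEQ_LENGTH = 10      # number of selected keyframes per clip
NUM_FEATURES = 7168      # fused feature dimension per keyframe
NUM_CLASSES = 5          # hand-raising, interacting, sitting, turning around, writing

frame_features = layers.Input((MAX_SEQ_LENGTH, NUM_FEATURES), name="frame_features")
mask_input = layers.Input((MAX_SEQ_LENGTH,), dtype="bool", name="mask")

# Two BiLSTM layers: 2 x 256 = 512 and 2 x 128 = 256 total hidden units.
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(
    frame_features, mask=mask_input)
x = layers.Dropout(0.3)(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.Dropout(0.3)(x)

# Dot-product attention over the keyframe axis (assumed form), then pooling.
attn = layers.Attention()([x, x])
x = layers.GlobalAveragePooling1D()(attn)

# Dense head with L2 regularization (regularization factor is an assumption).
x = layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = Model([frame_features, mask_input], outputs, name="MultiFusedNet_BiLSTM_AT")
model.summary()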

4.2. Keyframe Extraction

Analyzing videos often involves processing a large number of frames, many of which contain redundant information. Processing all frames is time-intensive and can result in low accuracy. Nevertheless, using keyframes [30,31] can substantially enhance the processing speed of video analysis. We adopt the SAD [32] in the LUV color space [33], focusing on the L channel, together with the K-means algorithm (SADLK) for frame extraction pre-processing, and select keyframes before transferring them to the multiple pretrained models for feature extraction. The keyframe selection is performed according to Algorithm 1 given below [34].
Algorithm 1: SADLK Keyframe Selection
Input: Video frames (RGB); keyframe count (K)
Output: Selected keyframes
Procedure:
Step 1: Convert RGB to LUV and keep the L channel
  For each frame i in range(1, N + 1):
    // Given α = 0.2126729, β = 0.7151522, γ = 0.072175
    Y_i = α·R_i + β·G_i + γ·B_i
    L*_i = 116·f(Y_i / Y_n) − 16
    // Store the L*_i values in the LUV_L array
  End for
Step 2: SAD calculation
  For each frame i in range(1, N):
    SAD_i = Σ_{(x,y)} | LUV_L[i](x, y) − LUV_L[i + 1](x, y) |
  End for
Step 3: Keyframe selection using K-means
  // Initialize the cluster centroids C with K random SAD values
  While the centroids have not converged:
    For each SAD_j in SAD_L:
      // Assign SAD_j to the closest cluster C_i
      C_i = argmin_i || SAD_j − C_i ||
    End for
    // Update each centroid as the mean of its assigned SAD values
  End while
Step 4: Sort the selected keyframes by frame sequence number (ID)
  // Output the selected keyframes
  Return the selected keyframes in chronological order
End Procedure
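A possible Python sketch of Algorithm 1 is given below, assuming that OpenCV's LUV conversion supplies the L* channel computed in Step 1 and that the frame closest to each cluster centroid is kept as a keyframe. Function and variable names are illustrative, not taken from the authors' code.

# SADLK keyframe selection sketch: L-channel SAD scores clustered with K-means.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sadlk_keyframes(video_path: str, k: int = 10) -> list[int]:
    cap = cv2.VideoCapture(video_path)
    l_channels = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Step 1: BGR -> LUV, keep only the L (lightness) channel.
        luv = cv2.cvtColor(frame, cv2.COLOR_BGR2LUV)
        l_channels.append(luv[:, :, 0].astype(np.int32))
    cap.release()

    # Step 2: SAD between consecutive L-channel frames.
    sad = np.array([np.abs(l_channels[i] - l_channels[i + 1]).sum()
                    for i in range(len(l_channels) - 1)], dtype=np.float64)

    # Step 3: cluster the SAD scores with K-means and keep, for each cluster,
    # the frame whose SAD value lies closest to the cluster centroid.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sad.reshape(-1, 1))
    selected = [int(np.argmin(np.abs(sad - c))) for c in km.cluster_centers_.ravel()]

    # Step 4: return keyframe indices in chronological order (duplicates, if any,
    # are dropped, so slightly fewer than k frames may be returned).
    return sorted(set(selected))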

4.3. The Pretrained Models

The application of multiple features is beneficial for recognizing behaviors correctly. In addition, color information is important and is necessary to provide consistent scene luminosity, regardless of the conditions under which the image was captured. Optical flow features also provide critical information about motion trajectories to distinguish behaviors. For each video, we extracted ten original RGB keyframes. For data augmentation, we applied one geometric transformation (rotation, flipping, or shifting) and one color transformation to each keyframe. Furthermore, we included an optical flow image for each keyframe. Thus, we obtained 40 keyframes per video clip in total. To obtain the features necessary for the classification of student behavior in the classroom, we use four pretrained models, Inception-V3, ResNet50-V2, VGG-16, and EfficientNet-B7, for the RGB color, augmented data, luminosity in the L*a*b* color space, and optical flow features, respectively. These multiple features are concatenated and fed into the classification network. Visualizations of the RGB, DA, L*a*b*, and optical flow keyframes are shown in Figure 5.
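Before describing each backbone, the following hedged sketch illustrates how the four ten-keyframe streams per clip (RGB, augmented, L*a*b*, and optical flow) could be produced from the selected RGB keyframes. The augmentation magnitudes and the HSV flow visualization are assumptions, and the Farnebäck parameters are common defaults rather than values reported here.

# Build the four keyframe streams from the selected RGB keyframes (sketch).
import cv2
import numpy as np

def build_keyframe_streams(rgb_keyframes: list[np.ndarray]):
    """rgb_keyframes: ten selected keyframes as BGR uint8 arrays (OpenCV order)."""
    aug, lab, flow = [], [], []
    for i, frame in enumerate(rgb_keyframes):
        h, w = frame.shape[:2]

        # Geometric augmentation: small rotation, shift, and horizontal flip
        # (illustrative magnitudes; the paper does not list exact values).
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle=10, scale=1.0)
        m[:, 2] += (0.05 * w, 0.05 * h)                       # width/height shift
        aug.append(cv2.flip(cv2.warpAffine(frame, m, (w, h)), 1))

        # Color injection: transform to the L*a*b* color space.
        lab.append(cv2.cvtColor(frame, cv2.COLOR_BGR2LAB))

        # Farneback dense optical flow between consecutive keyframes, rendered
        # as an HSV image so it can be fed to an image backbone.
        prev = rgb_keyframes[max(i - 1, 0)]
        f = cv2.calcOpticalFlowFarneback(
            cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY),
            cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY),
            None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(f[..., 0], f[..., 1])
        hsv = np.zeros_like(frame)
        hsv[..., 0] = ang * 90 / np.pi                        # hue from flow angle
        hsv[..., 1] = 255
        hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
        flow.append(cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))

    return rgb_keyframes, aug, lab, flow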
  • Inception-V3: The Inception-V3 is a highly regarded CNN architecture developed by Google AI that has made significant strides in the field of computer vision. It is known for its ability to extract high-level features from images, which has consistently led to superior performance compared to other CNN architectures. The establishment of Inception-V3 was a turning point in network scaling strategies. According to Szegedy et al. [35], the architects behind Inception-V3 aimed to leverage additional computational resources efficiently, resulting in innovative techniques such as factorized convolutions and advanced regularization. The architecture went through rigorous evaluation, particularly in the ILSVRC ImageNet classification challenge of 2012.
We utilize the key strength of Inception-V3 for high-level feature extraction from an image frame. The Inception-V3 design incorporates inception modules, which are clusters of convolutional layers with different filter sizes. This innovation allows Inception-V3 to analyze image frames across multiple scales, capturing features of varying granularities and complexities in the student behavior dataset. This makes Inception-V3 well suited for extracting features from RGB keyframes, which contain information about the student’s appearance. The architecture for feature extraction from the video keyframes is depicted in Figure 6.
  • ResNet50-V2: The ResNet50-V2 is a highly effective CNN model developed to address the problem of vanishing gradients in deep networks. It is an improved version of ResNet50 that leverages innovative techniques to improve accuracy and performance. One of the key features of ResNet50-V2 is its residual connections, which enable direct information flow between layers, thus mitigating the vanishing gradient challenge. ResNet50-V2 uses bottleneck layers that reduce network parameters while maintaining accuracy via 1 × 1 convolutions followed by 3 × 3 convolutions. These advancements are supplemented by improved batch normalization and initialization techniques [36,37].
In classroom student behavior classification, the model needs to detect the heads and faces of students, for example when they are sitting or turning around. Leveraging the capabilities of ResNet50-V2, we implemented this network to extract keyframe features. The main difference from Inception-V3 is that ResNet50-V2 operates on keyframes produced by augmentation techniques [38], which enrich the keyframe features, whereas Inception-V3 operates on the original RGB keyframes. The model architecture designed for feature extraction from the DA keyframes is illustrated in Figure 7.
  • VGG-16: The VGG-16 excels at feature extraction from images [39], resulting in its popularity. With 16 layers, including 13 convolutional and 3 fully connected layers, VGG-16 processes input images, gradually reducing spatial resolution while deepening channel features. Trained on ImageNet, housing over a million images and 1000 classes [40], VGG-16 employs supervised learning to adapt its weights. This learning equips VGG-16 to recognize common patterns and features in images, making it invaluable for diverse image analysis tasks.
To extract features from each video keyframe, we first transform the image from RGB to the CIELAB (L*a*b*) color space [41]. We then feed the transformed image through a VGG-16 model that uses a global average pooling layer as its final pooling operation. This layer reduces the spatial dimensions of the feature maps to 1 × 1 by averaging over the remaining pixels in each map. As a result, for each 224 × 224 × 3 input keyframe, the VGG-16 model outputs a single vector of 512 features that encapsulates the global characteristics of the input image. This pretrained model thus extracts keyframe features based on the CIELAB color space. Figure 8 shows the architecture of the feature extraction model for the L*a*b* color space.
  • EfficientNet-B7: EfficientNet is a family of CNNs that scale their depth, width, and resolution using a compound coefficient to ensure uniform scaling. EfficientNet models have been pretrained on ImageNet, a large-scale dataset with 1.3 million images from 1000 object classes [40]. According to Tan and Le (2019) [42], EfficientNet models outperform other CNN models in terms of accuracy while being faster and smaller. The EfficientNet-B7 module performs feature extraction using the transfer learning technique. Each input keyframe with dimensions of 224 × 224 × 3 is processed through a stack of 810 layers, with the fully connected layers responsible for the final classification predictions removed for feature extraction. EfficientNet-B7 generates a 2560-dimensional feature descriptor for each keyframe. These feature descriptors are input to the BiLSTM model for video representation and classification. EfficientNet-B7 is well suited to extracting features from optical flow keyframes, which contain information about the students’ movements. It is a relatively efficient model, meaning that it can extract features from optical flow keyframes quickly without using many resources, and it remains strong at extracting features even when color differences are small [43]. We apply this pretrained model to extract features from the optical flow keyframes [44]. The architecture of the feature extraction model for optical flow is shown in Figure 9.
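A minimal sketch of the multi-feature fusion with the four Keras Applications backbones is given below: global average pooling yields 2048-, 2048-, 512-, and 2560-dimensional vectors that concatenate to the 7168-dimensional fused feature per keyframe. Preprocessing choices follow standard Keras usage and are assumptions, not the authors' exact pipeline.

# Four-backbone feature extraction and concatenation (sketch).
import numpy as np
from tensorflow.keras.applications import (
    InceptionV3, ResNet50V2, VGG16, EfficientNetB7)
from tensorflow.keras.applications import (
    inception_v3, resnet_v2, vgg16, efficientnet)

IMG_SHAPE = (224, 224, 3)

backbones = {
    "rgb":  (InceptionV3(weights="imagenet", include_top=False,
                         pooling="avg", input_shape=IMG_SHAPE),
             inception_v3.preprocess_input),      # 2048-dim
    "aug":  (ResNet50V2(weights="imagenet", include_top=False,
                        pooling="avg", input_shape=IMG_SHAPE),
             resnet_v2.preprocess_input),         # 2048-dim
    "lab":  (VGG16(weights="imagenet", include_top=False,
                   pooling="avg", input_shape=IMG_SHAPE),
             vgg16.preprocess_input),             # 512-dim
    "flow": (EfficientNetB7(weights="imagenet", include_top=False,
                            pooling="avg", input_shape=IMG_SHAPE),
             efficientnet.preprocess_input),      # 2560-dim
}

def fuse_features(streams: dict[str, np.ndarray]) -> np.ndarray:
    """streams: {'rgb'|'aug'|'lab'|'flow': array of shape (10, 224, 224, 3)}.
    Returns fused features of shape (10, 7168)."""
    parts = []
    for name in ("rgb", "aug", "lab", "flow"):
        model, preprocess = backbones[name]
        parts.append(model.predict(preprocess(streams[name].astype("float32")),
                                   verbose=0))
    # 2048 + 2048 + 512 + 2560 = 7168 fused features per keyframe.
    return np.concatenate(parts, axis=1)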

4.4. BiLSTM-AT Deep Neural Network

RNNs are effective at modeling sequential patterns in time-series data such as video clips. However, the vanishing gradient problem can hinder RNN training. To overcome this issue, two RNN variations have been introduced: long short-term memory (LSTM) and gated recurrent units. The LSTM network has the same structure as RNNs but with an additional “memory cell” unit that maintains information for longer periods, helping to overcome the vanishing gradient problem. LSTM networks are used in a variety of applications, including video classification. However, LSTMs have a limitation in that they only capture past context. To fully understand any video, it is essential to consider both past and future contexts. Thus, bidirectional LSTMs (BiLSTMs) are a better option for video classification. BiLSTMs preserve information in both directions, allowing them to learn long-term dependencies in the data more effectively [45,46]. The BiLSTM model is highly efficient and more powerful when integrated with an attention mechanism for video classification tasks [47]. The BiLSTM model consists of two hidden layers: the forward hidden layer $h_t^f$ and the backward hidden layer $h_t^b$. The forward hidden layer processes the input vector $x_t$ in ascending order, with $t$ ranging from 1 to $T$. Conversely, the backward hidden layer processes the input vector in descending order, with $t$ ranging from $T$ to 1. The output $y_t$ is generated by combining the results of $h_t^f$ and $h_t^b$. Our work employs the BiLSTM-AT network, which comprises two modules: the BiLSTM and the attention mechanism. This network combines the current forward $h_t^f$ and backward $h_t^b$ hidden states with the context vector $C_t$ to generate an attention vector output for classifying student behavior. The structure of BiLSTM-AT is illustrated in Figure 10.
We implement the BiLSTM-AT model to classify student behavior [47]. After preprocessing the video to extract keyframes, we employ feature fusion to concatenate the keyframe features extracted by Inception-V3, ResNet50-V2, VGG-16, and EfficientNet-B7 into a fused feature vector, denoted by $X_F = [F_{\mathrm{inception}}, F_{\mathrm{resnet}}, F_{\mathrm{vgg}}, F_{\mathrm{efficientnet}}]$ for frame $F$. The sequence of feature vectors is $X = [X_1, X_2, X_3, \dots, X_{\mathrm{max\_sequence}}]$, where $\mathrm{max\_sequence}$ denotes the number of keyframes in each video clip. To capture temporal dependencies, we compute forward and backward hidden states $h_t^f$ and $h_t^b$, and forward and backward cell states $c_t^f$ and $c_t^b$, for each keyframe feature vector $X_t$ through LSTM processing. We implement all the steps of BiLSTM-AT using the following equations:
Forward LSTM equations:
$h_t^f = \mathrm{LSTM}^f(X_t, h_{t-1}^f, c_{t-1}^f)$
$c_t^f = \mathrm{cell}^f(X_t, h_{t-1}^f, c_{t-1}^f)$
At time step $t$, $h_t^f$ is the forward hidden state generated by $\mathrm{LSTM}^f$ by processing $X_t$, an input feature vector typically taken from a video keyframe. $h_{t-1}^f$ and $c_{t-1}^f$ represent the prior forward hidden and cell states at $t-1$, capturing essential temporal information. $c_t^f$, the forward cell state at time step $t$, is the output of $\mathrm{cell}^f$ processing $X_t$, influenced by $h_{t-1}^f$ and $c_{t-1}^f$, which collectively preserve critical information in the video sequence.
Backward LSTM equations:
$h_t^b = \mathrm{LSTM}^b(X_t, h_{t+1}^b, c_{t+1}^b)$
$c_t^b = \mathrm{cell}^b(X_t, h_{t+1}^b, c_{t+1}^b)$
At time step $t$, $h_t^b$ represents the backward hidden state, which $\mathrm{LSTM}^b$ generates by processing $X_t$ in reverse sequence, often starting from the end of the input sequence. $h_{t+1}^b$ and $c_{t+1}^b$ denote the subsequent backward hidden and cell states at $t+1$, capturing information about the sequence’s upcoming context. $c_t^b$, the backward cell state at time step $t$, results from $\mathrm{cell}^b$ processing $X_t$ in reverse order, under the influence of $h_{t+1}^b$ and $c_{t+1}^b$. This arrangement collectively preserves vital information about the sequence’s future context.
Concatenation of Hidden States: For each time step $t$, we combine the forward and backward hidden states to create the final representation $h_t$:
$h_t = [h_t^f, h_t^b]$
This concatenation ensures that we capture the context in both the forward ($h_t^f$) and backward ($h_t^b$) directions, enabling a more comprehensive understanding of the keyframe sequence.
Attention Mechanism: With the optional inclusion of an attention mechanism, we compute an attention score $A_t$ for each keyframe $X_t$ based on query ($Q$), key ($K$), and value ($V$) vectors. This mechanism enhances the model’s ability to focus on essential frames within the sequence:
$A_t = \mathrm{softmax}(Q_t \cdot K_t^T)$
where $A_t$ is the attention score for keyframe $X_t$, reflecting the model’s emphasis on the keyframe at time step $t$. The softmax function converts the dot product of the query $Q_t$ and transposed key $K_t$ vectors into a probability distribution for easier interpretation.
Context Vector Calculation: The context vector $C$ is generated as a weighted sum of the value vectors $V_t$ using the attention scores $A_t$:
$C = \sum_{t=1}^{\mathrm{max\_sequence}} A_t \cdot V_t$
where $C$ summarizes information from the selected keyframes by assigning more importance to the frames identified as crucial by the attention mechanism.
As part of the attention mechanism, each keyframe feature vector $X_t$ is element-wise multiplied by the context vector $C$ to produce an attended representation $Y_t$. This process allows the model to prioritize and enhance the most significant features within each keyframe, thus improving the accuracy of student behavior classification.
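The attention step described by these equations can be sketched as a small custom Keras layer. This is one possible interpretation under assumptions: the projection sizes, the per-frame dot-product scoring, and the mean pooling at the end are illustrative choices, not the authors' exact implementation.

# Dot-product attention over BiLSTM outputs (sketch of the equations above).
import tensorflow as tf
from tensorflow.keras import layers

class DotProductAttention(layers.Layer):
    def __init__(self, units: int, **kwargs):
        super().__init__(**kwargs)
        self.wq = layers.Dense(units)   # query projection
        self.wk = layers.Dense(units)   # key projection
        self.wv = layers.Dense(units)   # value projection

    def call(self, h):
        # h: BiLSTM outputs, shape (batch, max_sequence, hidden)
        q, k, v = self.wq(h), self.wk(h), self.wv(h)
        # A_t = softmax(Q_t . K_t^T): one attention score per keyframe
        scores = tf.reduce_sum(q * k, axis=-1)                  # (batch, seq)
        a = tf.nn.softmax(scores, axis=-1)
        # C = sum_t A_t * V_t: a single context vector per clip
        context = tf.reduce_sum(a[..., None] * v, axis=1)       # (batch, units)
        # Y_t = X_t (element-wise) C: reweight each keyframe representation
        attended = v * context[:, None, :]                      # (batch, seq, units)
        # Collapse the keyframe axis for the classifier head
        return tf.reduce_mean(attended, axis=1)

In the full model, a layer like this would sit between the BiLSTM stack and the dense classifier head sketched in Section 4.1.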

4.5. Experiments

Our methodology comprises multiple stages. First, pre-processing (frame extraction and keyframe selection) is performed using the SADLK algorithm. Second, features are extracted from the various pretrained models and fused. Lastly, a hybrid deep sequence neural network classifies student behavior from the fused features.

4.5.1. Student Behavior Dataset

As detailed in Section 3, regarding student behavior data collection, we have compiled a dataset of student behavior and divided it into five distinct classes: hand-raising, interacting, sitting, turning around, and writing. Hand-raising represents students raising their hands to express opinions or ask questions. Interacting denotes students engaged in group discussions and problem-solving. Sitting indicates students are listening to the teacher’s explanations during lectures. Turning around represents students observing their friends’ activities, and writing captures students solving math problems and recording their answers on paper.
The student behavior dataset comprises 578 video clips recorded at 30 frames per second (fps). The videos vary in duration, from a minimum of 0.7 s (21 frames) to a maximum of 83.87 s (2516 frames). Each clip contributes 40 keyframes, amounting to 23,120 keyframes in total. The dataset is divided into training, validation, and testing subsets containing 70%, 15%, and 15% of the data, respectively. Each video clip yields ten original RGB keyframes, ten geometric-transformation keyframes, ten color-transformation keyframes, and ten optical flow keyframes. To account for frame overlap, we use the SADLK algorithm to select the ten keyframes from each video clip at a consistent 30 fps frame rate, regardless of the clip length.

4.5.2. Experimental Environment

To implement our proposed framework, the software and hardware specifications were as follows: Python 3.11, TensorFlow and Keras 2.12, and OpenCV 4.7 on Ubuntu 22.04 LTS, with an Intel Xeon Gold 6230R server processor (4.0 GHz), 128 GB of RAM, and an Nvidia Quadro RTX 8000 GPU. We performed experiments on the multiple pretrained models with the SADLK keyframe selection method and BiLSTM-AT to classify student behavior in the classroom.

4.5.3. Model Hyperparameters Tuning

The model is compiled with sparse categorical cross-entropy loss, the Adam optimizer with a learning rate of 0.001, and accuracy as the evaluation metric. The model is trained for 50 epochs using the layer configuration of the proposed multiple-pretrained-model and BiLSTM-AT network for classifying student behavior. To load the data into the model, we select ten keyframes and then extract features using the multiple pretrained models. The keyframes have an image size of 224 × 224 with three RGB channels. All of the features are fused, giving 7168 features in total. The complete layer configuration is listed in Table 1. All parameters of our model are used during training; none are non-trainable.
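A hedged sketch of these compilation and training settings follows, assuming the BiLSTM-AT model sketched in Section 4.1 (the variable model) and random placeholder arrays in place of the fused features; the array sizes approximate the 70/15/15 split of 578 clips.

# Compile and train with the settings stated above (sketch, placeholder data).
import numpy as np
from tensorflow.keras.optimizers import Adam

x_train = np.random.rand(404, 10, 7168).astype("float32")   # ~70% of 578 clips
m_train = np.ones((404, 10), dtype=bool)                     # keyframe mask
y_train = np.random.randint(0, 5, size=(404,))               # 5 behavior classes
x_val = np.random.rand(87, 10, 7168).astype("float32")       # ~15% of 578 clips
m_val = np.ones((87, 10), dtype=bool)
y_val = np.random.randint(0, 5, size=(87,))

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=Adam(learning_rate=0.001),
              metrics=["accuracy"])

history = model.fit([x_train, m_train], y_train,
                    validation_data=([x_val, m_val], y_val),
                    epochs=50, batch_size=64)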

4.5.4. Evaluation Metrics

When assessing the efficacy of multiclass student behavior classification models, performance is evaluated by computing accuracy, precision, recall, and the F1-score from the confusion matrix. For the precision metric, TP and FP denote the numbers of true-positive and false-positive samples in the positive class, respectively, while recall measures the proportion of real positive samples that are correctly identified, with FN denoting the number of false negatives. Accuracy is the ratio of the number of correct predictions across all classes to the total number of predictions. The formulae are as follows [48]:
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
$F1\text{-}\mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
$\mathrm{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Total number of classifications attempted}}$
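These metrics can be computed with scikit-learn from the predicted and ground-truth class indices; macro averaging is an assumption about how the per-class scores are aggregated. A toy example:

# Compute the evaluation metrics above with scikit-learn (toy labels).
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [0, 1, 2, 3, 4, 1]   # ground-truth class indices (toy example)
y_pred = [0, 1, 2, 3, 3, 1]   # model predictions (toy example)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-score :", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))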

5. Results and Discussion

In this section, we analyze the accuracy of the single-pretrained and multiple-pretrained models with keyframe selection methods and use the BiLSTM-AT to classify the students’ behavior. Furthermore, the performance of our overall framework has been evaluated with a large dataset consisting of the HMDB51 [49] and UCF101 [50] public datasets. The results of the single-pretrained model and the multiple-pretrained models are described in Section 5.1 and Section 5.2. Finally, the performance evaluation with a large dataset is explained in Section 5.3. A summary of the discussion results is presented in Section 5.4.

5.1. Single Pretrained Model Results with BiLSTM-AT

The BiLSTM-AT network without keyframe selection, using a single pretrained model for feature extraction, is referred to as “S-BiLSTM-AT”. The models compared were Inception-V3, ResNet50-V2, VGG-16, and EfficientNet-B7. The VGG-16 pretrained model achieved the highest accuracy among the single pretrained models, at 88.89%. However, a single pretrained model cannot obtain all the features needed to distinguish similar activities and classify the students’ behavior, which is a complex task with our dataset. Nevertheless, the results show that all four models achieve relatively high accuracy, precision, recall, and F1 scores, suggesting that all four are effective at student behavior recognition, with VGG-16 with BiLSTM-AT being the most effective. The results are displayed in Table 2.
We conducted experiments using BiLSTM-AT with different pretrained models. Specifically, we used two, three, and four pretrained models. The two pretrained models were Inception-V3 and ResNet50-V2. The three pretrained models were Inception-V3, ResNet50-V2, and VGG-16. The four pretrained models were Inception-V3, ResNet50-V2, VGG-16, and EfficientNet-B7. We referred to these models as “IR-BiLSTM-AT”, “IRV-BiLSTM-AT”, and “IRVE-BiLSTM-AT” or “MFF-BiLSTM-AT”, respectively. The results of these experiments are displayed in Table 3.

5.2. Multiple Pretrained Model Results with BiLSTM-AT

The BiLSTM-AT network uses keyframe selection with multiple pretrained models for feature extraction. The MFF-BiLSTM-AT and KMFF-BiLSTM-AT models use the same hyperparameter tuning and training parameters within our overall framework. The results are shown in Table 4, and the training and validation accuracy is shown in Figure 11. Sample model prediction results are shown in Figure 12.
We compare the performance of the five models of our proposed framework: S-BiLSTM-AT, IR-BiLSTM-AT, IRV-BiLSTM-AT, MFF-BiLSTM-AT, and KMFF-BiLSTM-AT. The accuracy of the KMFF-BiLSTM-AT model was found to be higher than the accuracy of the other models. However, even though our framework exhibits a high level of performance, there are still some limitations. For example, there is some confusion when making predictions about actual classes; some students turned around, but the model predicted hand-raising behavior because the student also moved one hand. In addition, when students interacted with their friends, the model predicted that they would turn around as they moved their heads to respond to another student.
The results show that after fusing the features, the single pretrained model is less efficient than the multiple pretrained models, with an accuracy of 88.89 and 92.22%, respectively. However, when compared with applying the keyframe selection methods and using multi-feature fusion, the KMFF-BiLSTM-AT model was the most efficient, with an accuracy of 96.67%. The results can be found in Table 4.

5.3. Comparison with a Large Dataset

The overall performance of our proposed framework was evaluated using the large public UCF101 and HMDB51 datasets. We selected ten keyframes from each video clip using the SADLK algorithm and then applied our proposed multi-feature extraction and fusion methodology. Finally, the BiLSTM-AT network was used for human action recognition. Each dataset was split into “Training”, “Validation”, and “Testing” groups in a 70:15:15 ratio. The number of video clips is shown in Table 5.
Our proposed KMFF-BiLSTM-AT model obtained high accuracy compared with the state-of-the-art models. The model achieved the highest accuracy of 97.54% with the UCF101 dataset and 72.43% with the HMDB51 dataset. Table 6 summarizes the comparative results.

5.4. Discussion

When comparing the results on the student behavior dataset using only a single pretrained model, we found that VGG-16 with the BiLSTM-AT model had a higher accuracy than the others. However, even though the VGG-16 pretrained model is excellent for extracting features, it is still insufficient for identifying and classifying students’ behaviors in the classroom with the BiLSTM-AT model. When we apply the multiple pretrained models based on multi-feature fusion and keyframe selection to address this limitation, the MFF-BiLSTM-AT and KMFF-BiLSTM-AT models achieve a higher accuracy than the S-BiLSTM-AT model. Moreover, of the multiple-pretrained-model variants, KMFF-BiLSTM-AT obtained a higher accuracy than MFF-BiLSTM-AT. The prediction results of KMFF-BiLSTM-AT are higher than those of all our other proposed models, indicating that it is an accurate and robust model for classifying the complex behaviors in our student behavior dataset.
Moreover, our proposed framework adopts keyframe extraction and selection using the SADLK algorithm with multi-feature fusion, combining data augmentation with optical flow motion information in the BiLSTM-AT network. The overall accuracy of our proposed framework on the two large public datasets, HMDB51 and UCF101, is excellent compared with state-of-the-art models.

6. Conclusions and Future Work

There have been several attempts to analyze students’ behavior in a classroom. However, it has been difficult to find an effective method for this analysis due to the constrained classroom environment and the interactions between many students and a teacher. We propose a novel MultiFusedNet network model that can perform individual behavior analysis even in situations where lighting changes are diverse and groups of students are intermingled. To select representative keyframes from a video input of arbitrary length, the SADLK algorithm was designed as an efficient tool that selects a specified number of keyframes by identifying the spatial and color differences between frames and applying the K-means method. The MultiFusedNet network analyzes a given action based on multiple features, which are extracted and fused from the selected keyframes. The spatial and temporal features are extracted from RGB images, along with L*a*b* color images that are robust to changes in lighting, geometrically transformed images that ensure data diversity, and optical flow images that contain motion vectors.
We used the various spatial and temporal features in a deep BiLSTM network with an attention mechanism. The model can effectively analyze highly complex student behavior datasets, even when the data are similar across classes. By leveraging the capabilities of our framework, teachers and educators can develop novel techniques to help students learn more effectively. The model’s ability to analyze student behavior can reveal relationships between learning outcomes, activities, and behaviors, enabling instructors to design more personalized and targeted learning plans based on the PLC approach. In future work, we aim to expand our collection of student behavior datasets with diverse labeling schemes while adding new categories to our existing datasets. To enhance the effectiveness and accuracy of our overall framework, we plan to integrate human-skeleton pose estimation into our existing multiple features and use a BiLSTM-AT deep sequence neural network to classify students’ behaviors. The goal of this work is to comprehensively study student behavior in classrooms across diverse schools in Thailand. By partnering with researchers and using AI technology, we aim to support teachers in analyzing and improving students’ behaviors. This approach will enhance the classroom environment, address learning needs, and improve outcomes in all subjects.

Author Contributions

Conception and design of the proposed method: H.J.L. and S.N.; performance of the experiments: S.N., H.J.L. and S.-H.N.; writing of the paper: S.N.; paper review and editing: H.J.L. and S.-H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub) and partially funded by the TRSI, Ministry of Higher Education, Science, Research and Innovation (MHESI) of Thailand (Project-ID: 63673).

Institutional Review Board Statement

Ethical approval was waived by the Sisaket Rajabhat University Research Ethics Committee for this study because research conducted in educational settings is unlikely to have an adverse impact on students’ learning.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Acknowledgments

In this research, we would like to express our gratitude to Thong-oon Manmai from Sisaket Rajabhat University in Thailand. He specializes in analyzing student behavior datasets, as well as creating specific category labels for datasets related to math class behavior.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Inprasitha, M. Lesson study and open approach development in Thailand: A longitudinal study. Int. J. Lesson Learn. Stud. 2022, 11, 1–15. [Google Scholar] [CrossRef]
  2. Hord, S.M. Professional Learning Communities: Communities of Continuous Inquiry and Improvement; Southwest Educational Development Laboratory: Austin, TX, USA, 1997. [Google Scholar]
  3. Manmai, T.O.; Inprasitha, M.; Changsri, N. Cognitive Aspects of Students’ Mathematical Reasoning Habits: A Study on Utilizing Lesson Study and Open Approach. Pertanika J. Soc. Sci. Humanit. 2021, 29, 2591–2614. [Google Scholar] [CrossRef]
  4. Synced, G.; Shaoyou, L.; Baorui, C.; Qingyan, T.; Chenchen, Z.; Chen, T.; Meghan, H. Year of AI: How Did Global Public Company Adapt to the Wave of AI Transformation: A 2018 Report about Fortune Global 500 Public Company Artificial Intelligence Adaptivity, Kindle Edition; Synced Global Intelligence Research: Cambridge, MA, USA, 2018. [Google Scholar]
  5. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  6. Li, X.; Wang, M.; Zeng, W.; Lu, W. A students’ action recognition database in smart classroom. In Proceedings of the IEEE 14th International Conference on Computer Science & Education (ICCSE), Toronto, ON, Canada, 19–21 August 2019; pp. 523–527. [Google Scholar]
  7. Xie, Y.; Zhang, S.; Liu, Y. Abnormal Behavior Recognition in Classroom Pose Estimation of College Students Based on Spatiotemporal Representation Learning. Trait. Du Signal 2021, 38, 89–95. [Google Scholar] [CrossRef]
  8. Che, B.; Li, X.; Sun, Y.; Yang, F.; Liu, P.; Lu, W. A database of students’ spontaneous actions in the real classroom environment. Comput. Electr. Eng. 2022, 101, 108075. [Google Scholar] [CrossRef]
  9. Zheng, Z.; Liang, G.; Luo, H.; Yin, H. Attention assessment based on multi-view classroom behaviour recognition. IET Comput. Vis. 2022; early view. [Google Scholar] [CrossRef]
  10. Sethi, K.; Jaiswal, V. PSU-CNN: Prediction of student understanding in the classroom through student facial images using convolutional neural network. Mater. Today Proc. 2022, 62, 4957–4964. [Google Scholar] [CrossRef]
  11. Liu, T.; Wang, J.; Yang, B.; Wang, X. Facial expression recognition method with multi-label distribution learning for non-verbal behavior understanding in the classroom. Infrared Phys. Technol. 2021, 112, 103594. [Google Scholar] [CrossRef]
  12. Wikipedia. Artificial Intelligence. Available online: https://en.wikipedia.org/wiki/Artificial_intelligence (accessed on 3 February 2023).
  13. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  14. Ur Rehman, A.; Belhaouari, S.B.; Kabir, M.A.; Khan, A. On the Use of Deep Learning for Video Classification. Appl. Sci. 2023, 13, 2007. [Google Scholar] [CrossRef]
  15. Zheng, R.; Jiang, F.; Shen, R. Intelligent student behavior analysis system for real classrooms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 9244–9248. [Google Scholar]
  16. Jisi, A.; Yin, S. A new feature fusion network for student behavior recognition in education. J. Appl. Sci. Eng. 2021, 24, 133–140. [Google Scholar]
  17. Hu, M.; Wei, Y.; Li, M.; Yao, H.; Deng, W.; Tong, M.; Liu, Q. Bimodal learning engagement recognition from videos in the classroom. Sensors 2022, 22, 5932. [Google Scholar] [CrossRef]
  18. Zhou, J.; Ran, F.; Li, G.; Peng, J.; Li, K.; Wang, Z. Classroom Learning Status Assessment Based on Deep Learning. Math. Probl. Eng. 2022, 2022, 7049458. [Google Scholar] [CrossRef]
  19. Lin, F.C.; Ngo, H.H.; Dow, C.R.; Lam, K.H.; Le, H.L. Student behavior recognition system for the classroom environment based on skeleton pose estimation and person detection. Sensors 2021, 21, 5314. [Google Scholar] [CrossRef]
  20. Fu, R.; Wu, T.; Luo, Z.; Duan, F.; Qiao, X.; Guo, P. Learning behavior analysis in classroom based on deep learning. In Proceedings of the Tenth International Conference on Intelligent Control and Information Processing (ICICIP), Marrakesh, Morocco, 14–19 December 2019; pp. 206–212. [Google Scholar]
  21. You, J.; Huang, Y.; Zhai, S.; Liu, Y. Deep Learning Based a Novel Method of Classroom Behavior Recognition. In Proceedings of the 2nd International Conference on Educational Technology (ICET), Beijing, China, 25–27 June 2022; pp. 155–159. [Google Scholar]
  22. Zhang, Y.; Wu, Z.; Chen, X.; Dai, L.; Li, Z.; Zong, X.; Liu, T. Classroom behavior recognition based on improved yolov3. In Proceedings of the International Conference on Artificial Intelligence and Education (ICAIE), Tianjin, China, 26–28 June 2020; pp. 93–97. [Google Scholar]
  23. Ren, X.; Yang, D. Student behavior detection based on YOLOv4-Bi. In Proceedings of the International Conference on Computer Science, Artificial Intelligence and Electronic Engineering (CSAIEE), Beijing, China, 20–22 August 2021; pp. 288–291. [Google Scholar]
  24. Tang, L.; Xie, T.; Yang, Y.; Wang, H. Classroom Behavior Detection Based on Improved YOLOv5 Algorithm Combining Multi-Scale Feature Fusion and Attention Mechanism. Appl. Sci. 2022, 12, 6790. [Google Scholar] [CrossRef]
  25. Yang, F.; Wang, X. Student Classroom Behavior Detection based on Improved YOLOv7. arXiv 2023, arXiv:2306.03318. [Google Scholar]
  26. Wang, Z.; Yao, J.; Zeng, C.; Li, L.; Tan, C. Students’ Classroom Behavior Detection System Incorporating Deformable DETR with Swin Transformer and Light-Weight Feature Pyramid Network. Systems 2023, 11, 372. [Google Scholar] [CrossRef]
  27. Zhou, D.; Ma, X.; Feng, S. An Effective Plant Recognition Method with Feature Recalibration of Multiple Pretrained CNN and Layers. Appl. Sci. 2023, 13, 4531. [Google Scholar] [CrossRef]
  28. Li, S.; Du, Y.; Tenenbaum, J.B.; Torralba, A.; Mordatch, I. Composing ensembles of pre-trained models via iterative consensus. arXiv 2022, arXiv:2210.11522. [Google Scholar]
  29. Nindam, S.; Manmai, T.O.; Sung, T.; Wu, J.; Lee, H.J. Human Activity Classification Using Deep Transfer Learning. In Proceedings of the Korea Information Processing Society Conference (KIPS), Chuncheon, Republic of Korea, 3–5 November 2022; pp. 478–480. [Google Scholar]
  30. Thepade, S.D.; Patil, P.H. Novel video keyframe extraction using KPE vector quantization with assorted similarity measures in RGB and LUV color spaces. In Proceedings of the 2015 International Conference on Industrial Instrumentation and Control (ICIC), Pune, India, 28–30 May 2015; pp. 1603–1607. [Google Scholar]
  31. Sheng, L.; Xu, D.; Ouyang, W.; Wang, X. Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep slam. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4302–4311. [Google Scholar]
  32. Niitsuma, H.; Maruyama, T. Sum of absolute difference implementations for image processing on FPGAs. In Proceedings of the International Conference on Field Programmable Logic and Applications, Milan, Italy, 31 August–2 September 2010; pp. 167–170. [Google Scholar]
  33. Wikipedia. CIELUV. Available online: https://en.wikipedia.org/wiki/CIELUV (accessed on 29 March 2023).
  34. Dehariya, V.K.; Shrivastava, S.K.; Jain, R.C. Clustering of image data set using k-means and fuzzy k-means algorithms. In Proceedings of the 2010 International Conference on Computational Intelligence and Communication Networks, Bhopal, India, 26–28 November 2010; pp. 386–391. [Google Scholar]
  35. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. He, K.; Girshick, R.; Dollar, P. Rethinking imagenet pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4918–4927. [Google Scholar]
  38. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48. [Google Scholar] [CrossRef]
  39. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  40. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  41. Rimiru, R.M.; Gateri, J.; Kimwele, M.W. GaborNet: Investigating the importance of color space, scale and orientation for image classification. PeerJ Comput. Sci. 2022, 8, e890. [Google Scholar] [CrossRef]
  42. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  43. Xu, R.; Lin, H.; Lu, K.; Cao, L.; Liu, Y. A forest fire detection system based on ensemble learning. Forests 2021, 12, 217. [Google Scholar] [CrossRef]
  44. Farnebäck, G. Two-frame motion estimation based on polynomial expansion. In Proceedings of the Image Analysis: 13th Scandinavian Conference (SCIA), Halmstad, Sweden, 29 June–2 July 2003; pp. 363–370. [Google Scholar]
  45. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  46. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  47. Yousaf, K.; Nawaz, T. A deep learning-based approach for inappropriate content detection and classification of youtube videos. IEEE Access 2022, 10, 16283–16298. [Google Scholar] [CrossRef]
  48. Tanha, J.; Abdi, Y.; Samadi, N.; Razzaghi, N.; Asadpour, M. Boosting methods for multi-class imbalanced data classification: An experimental review. J. Big Data 2020, 7, 1–47. [Google Scholar] [CrossRef]
  49. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
  50. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
  51. Peng, X.; Wang, L.; Wang, X.; Qiao, Y. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Comput. Vis. Image Underst. 2016, 150, 109–125. [Google Scholar] [CrossRef]
  52. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  53. Han, Y.; Zhang, P.; Zhuo, T.; Huang, W.; Zhang, Y. Going deeper with two-stream ConvNets for action recognition in video surveillance. Pattern Recognit. Lett. 2018, 107, 83–90. [Google Scholar] [CrossRef]
  54. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. arXiv 2014, arXiv:1406.2199. [Google Scholar]
  55. Sun, L.; Jia, K.; Yeung, D.Y.; Shi, B.E. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4597–4605. [Google Scholar]
  56. Wang, L.; Qiao, Y.; Tang, X. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4305–4314. [Google Scholar]
  57. Wang, X.; Gao, L.; Song, J.; Shen, H. Beyond frame-level CNN: Saliency-aware 3-D CNN with LSTM for video action recognition. IEEE Signal Process. Lett. 2016, 24, 510–514. [Google Scholar] [CrossRef]
  58. Li, Z.; Gavrilyuk, K.; Gavves, E.; Jain, M.; Snoek, C.G. Videolstm convolves, attends and flows for action recognition. Comput. Vis. Image Underst. 2018, 166, 41–50. [Google Scholar] [CrossRef]
  59. Sun, L.; Jia, K.; Chen, K.; Yeung, D.Y.; Shi, B.E.; Savarese, S. Lattice long short-term memory for human action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2147–2156. [Google Scholar]
  60. Liu, Z.; Li, Z.; Wang, R.; Zong, M.; Ji, W. Spatiotemporal saliency-based multi-stream networks with attention-aware LSTM for action recognition. Neural Comput. Appl. 2020, 32, 14593–14602. [Google Scholar] [CrossRef]
  61. Chen, B.; Tang, H.; Zhang, Z.; Tong, G.; Li, B. Video-based action recognition using spurious-3D residual attention networks. IET Image Process. 2022, 16, 3097–3111. [Google Scholar] [CrossRef]
  62. Dong, W.; Zhang, Z.; Song, C.; Tan, T. Identifying the key frames: An attention-aware sampling method for action recognition. Pattern Recognit. 2022, 130, 108797. [Google Scholar] [CrossRef]
  63. Chen, B.; Meng, F.; Tang, H.; Tong, G. Two-level attention module based on spurious-3d residual networks for human action recognition. Sensors 2023, 23, 1707. [Google Scholar] [CrossRef]
Figure 1. (a) Top view of the classroom seating layout. (b) A sample image of real-class activities.
Figure 2. Sample images from the student behavior dataset; (a) Hand raising; (b) Interacting; (c) Sitting; (d) Turning around; (e) Writing.
Figure 3. Process pipeline of the proposed model.
Figure 4. The overall framework for student behavior classification, which uses a multi-feature fusion network with keyframe selection to classify behavior types. A hybrid deep sequence network based on the BiLSTM-AT model is used for classification.
Figure 5. Examples of (a) RGB, (b) data-augmented (DA), (c) L*a*b*, and (d) optical flow keyframes.
Figure 6. Pretrained architecture for feature extraction from the RGB keyframe. ×n means that the same block is repeated n times.
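To make this step concrete, the following is a minimal sketch (not the authors' released code) of extracting per-keyframe features with the four pretrained Keras backbones and concatenating them. The `build_extractor` helper, the 224 × 224 input size, and the random placeholder keyframes are assumptions for illustration; the concatenated width of 7168 matches the fused feature input listed in Table 1.

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3, ResNet50V2, VGG16, EfficientNetB7

def build_extractor(backbone_cls):
    """Hypothetical helper: a frozen ImageNet-pretrained backbone without its classifier head,
    ending in global average pooling so each keyframe maps to a single feature vector."""
    base = backbone_cls(weights="imagenet", include_top=False,
                        pooling="avg", input_shape=(224, 224, 3))
    base.trainable = False
    return base

extractors = [build_extractor(b) for b in (InceptionV3, ResNet50V2, VGG16, EfficientNetB7)]

# Placeholder batch of 10 RGB keyframes from one clip; real frames would be preprocessed
# with each backbone's own preprocess_input function.
keyframes = np.random.rand(10, 224, 224, 3).astype("float32")

# Concatenating the pooled features fuses the backbones: 2048 + 2048 + 512 + 2560 = 7168.
fused = np.concatenate([m.predict(keyframes, verbose=0) for m in extractors], axis=-1)
print(fused.shape)  # (10, 7168)
```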
Figure 7. Pretrained architecture for feature extraction from the augmented keyframe. ×n means that the same block is repeated n times.
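As a rough illustration of the augmentation branch, the sketch below applies standard Keras random-augmentation layers to the selected keyframes before feature extraction. The specific transforms (flip, rotation, zoom) and their strengths are illustrative assumptions, not necessarily the exact augmentations used in the paper.

```python
import numpy as np
import tensorflow as tf

# Illustrative augmentation pipeline; the exact transforms used in the paper may differ.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomZoom(0.1),
])

keyframes = np.random.rand(10, 224, 224, 3).astype("float32")  # placeholder RGB keyframes
augmented = augment(keyframes, training=True)                  # same shape, randomly transformed
```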
Figure 8. Pretrained architecture for extracting features from the L*a*b* color space of each keyframe. ×n means that the same block is repeated n times.
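The L*a*b* branch requires each keyframe to be converted out of RGB before it is fed to its backbone. A minimal OpenCV-based sketch is shown below; the placeholder image and the uint8 assumption are illustrative, not the authors' data pipeline.

```python
import cv2
import numpy as np

keyframe_rgb = np.zeros((224, 224, 3), dtype=np.uint8)        # placeholder RGB keyframe
keyframe_lab = cv2.cvtColor(keyframe_rgb, cv2.COLOR_RGB2LAB)  # channels become L*, a*, b*
# keyframe_lab would then be passed to the pretrained feature extractor for this branch.
```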
Figure 9. Pretrained architecture for flow color visualization feature extraction.
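Reference [44] is the Farnebäck two-frame motion estimation method commonly used for dense optical flow. The sketch below shows one common way to compute flow between consecutive keyframes with OpenCV and render it as a color image (hue encodes direction, brightness encodes magnitude). The parameter values and the HSV rendering convention are assumptions, not the authors' exact settings.

```python
import cv2
import numpy as np

def flow_to_color(prev_rgb, next_rgb):
    """Dense Farneback optical flow between two keyframes, rendered as an RGB image:
    hue encodes flow direction and brightness encodes flow magnitude."""
    prev_gray = cv2.cvtColor(prev_rgb, cv2.COLOR_RGB2GRAY)
    next_gray = cv2.cvtColor(next_rgb, cv2.COLOR_RGB2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(prev_rgb)                                 # assumes uint8 RGB input
    hsv[..., 0] = ang * 180 / np.pi / 2                           # OpenCV hue range is [0, 179]
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

# Example with placeholder frames; real usage would pass consecutive keyframes of a clip.
frame_a = np.zeros((224, 224, 3), dtype=np.uint8)
frame_b = np.zeros((224, 224, 3), dtype=np.uint8)
flow_image = flow_to_color(frame_a, frame_b)
```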
Figure 10. BiLSTM-AT network architecture: at each time step t, the output of the BiLSTM module is the concatenation of the forward hidden state h_t^f and the backward hidden state h_t^b. This concatenated vector is the input to the attention module, which infers an alignment weight vector a_t over the time steps and uses it to compute a weighted sum C_t of the BiLSTM hidden states h_t. This weighted sum is the output Y_t of the attention module.
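To make the attention step concrete, here is a minimal Keras sketch of the mechanism the caption describes: score each BiLSTM time step, normalize the scores over time to obtain the alignment weights, and form a weighted sum of the hidden states. It is a simplified stand-in, not the authors' exact layer stack (Table 1 additionally uses a Multiply layer, global average pooling, and a second BiLSTM); the layer sizes follow Table 1 where possible.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(10, 7168))                     # 10 keyframes x fused 7168-d features
h = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inputs)  # h_t, shape (10, 512)

scores = layers.Dense(1)(h)                                 # unnormalized score per time step
a = layers.Softmax(axis=1)(scores)                          # alignment weights a_t over the 10 steps
context = layers.Lambda(lambda x: tf.reduce_sum(x[0] * x[1], axis=1))([h, a])  # weighted sum C_t

outputs = layers.Dense(5, activation="softmax")(context)    # five behavior classes
model = tf.keras.Model(inputs, outputs)
model.summary()
```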
Figure 11. Training and validation metrics of (a) MFF-BiLSTM-AT and (b) KMFF-BiLSTM-AT.
Figure 12. Example prediction results of the KMFF-BiLSTM-AT model.
Table 1. The layer configuration of the proposed BiLSTM-AT model for student behavior classification.
Layers | Type | Output Shape | Parameters
Layer 1 | Multiple Pretrained Models | (10 × 224 × 224 × 3) | 124,179,959
Layer 2 | Input 1 | (None, 10, 7168) | 0
Layer 3 | Input 2 | (None, 10) | 0
Layer 4 | BiLSTM (units = 512) | (None, 10, 512) | 15,206,400
Layer 5 | Attention | (None, 10, 512) | 1
Layer 6 | Multiply | (None, 10, 512) | 0
Layer 7 | GlobalAveragePooling1D | (None, 512) | 0
Layer 8 | Dropout 1 (rate = 0.3) | (None, 512) | 0
Layer 9 | Reshape | (None, 1, 512) | 0
Layer 10 | BiLSTM (units = 256) | (None, 256) | 656,384
Layer 11 | Dropout 2 (rate = 0.3) | (None, 256) | 0
Layer 12 | Dense 1 | (None, 64) | 16,448
Layer 13 | Dropout 3 (rate = 0.3) | (None, 64) | 0
Layer 14 | Dense 2 | (None, 5) | 325
Total parameters: 15,879,558. Trainable parameters: 15,879,558. Non-trainable parameters: 0.
Training settings: activation functions ReLU and softmax; loss function sparse categorical cross-entropy; optimizer Adam; 50 epochs; batch size 64; learning rate 0.001.
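The training settings listed under Table 1 can be expressed as a short, self-contained Keras sketch. The tiny stand-in model and random data below are placeholders so the snippet runs on its own; they do not reproduce the full layer stack above.

```python
import numpy as np
import tensorflow as tf

# Stand-in model with the same input/output interface as Table 1 (10 x 7168 fused features in,
# five behavior classes out); the real model adds the attention branch and further layers.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 7168)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),
])

# Settings from the table footer: Adam, learning rate 0.001, sparse categorical cross-entropy,
# 50 epochs, batch size 64.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

x_dummy = np.random.rand(64, 10, 7168).astype("float32")  # placeholder fused keyframe features
y_dummy = np.random.randint(0, 5, size=(64,))              # placeholder behavior labels
model.fit(x_dummy, y_dummy, epochs=50, batch_size=64, verbose=0)
```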
Table 2. The results of single pretrained models, each combined with the BiLSTM-AT network. The bold represents the best value.
Classifier | Total Parameters | Precision | Recall | F1 | Overall Accuracy (%)
Inception-V3 + BiLSTM-AT | 5,393,798 | 0.85 | 0.85 | 0.85 | 84.44
ResNet50-V2 + BiLSTM-AT | 5,393,798 | 0.90 | 0.88 | 0.88 | 87.78
EfficientNet-B7 + BiLSTM-AT | 6,442,374 | 0.88 | 0.88 | 0.88 | 87.78
VGG-16 + BiLSTM-AT | 2,248,070 | 0.90 | 0.89 | 0.89 | 88.89
Table 3. The results of IR-BiLSTM-AT, IRV-BiLSTM-AT, and MFF-BiLSTM-AT models.
Classifier | Total Parameters | Precision | Recall | F1 | Overall Accuracy (%)
IR-BiLSTM-AT | 9,588,102 | 0.91 | 0.90 | 0.90 | 90.00
IRV-BiLSTM-AT | 10,636,678 | 0.91 | 0.92 | 0.91 | 91.11
MFF-BiLSTM-AT | 15,879,558 | 0.93 | 0.93 | 0.92 | 92.22
Table 4. Comparison of the overall accuracy results of our dataset with the five models of the proposed framework. The bold represents the best value.
Classifier | Total Parameters | Inference Time (min) | Precision | Recall | F1 | Overall Accuracy (%)
S-BiLSTM-AT | 2,248,070 | 4.74 | 0.90 | 0.89 | 0.89 | 88.89
IR-BiLSTM-AT | 9,588,102 | 10.61 | 0.91 | 0.90 | 0.90 | 90.00
IRV-BiLSTM-AT | 10,636,678 | 13.94 | 0.91 | 0.92 | 0.91 | 91.11
MFF-BiLSTM-AT | 15,879,558 | 20.07 | 0.93 | 0.93 | 0.92 | 92.22
KMFF-BiLSTM-AT | 15,879,558 | 21.80 | 0.97 | 0.97 | 0.97 | 96.67
Table 5. Dataset settings for training, validation, and testing.
Dataset | Classes | Training (70%) | Validation (15%) | Testing (15%) | Total Clips (100%)
UCF101 | 101 | 9664 | 1705 | 1951 | 13,320
HMDB51 | 51 | 4907 | 865 | 994 | 6766
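As a rough illustration of producing a 70/15/15 clip-level split, the scikit-learn sketch below uses placeholder clip indices and labels. It does not reproduce the authors' exact protocol, and a naive stratified random split will not yield exactly the counts reported in Table 5.

```python
from sklearn.model_selection import train_test_split

# Placeholder clip indices and labels; real usage would list the dataset's video clips.
clips = list(range(13320))
labels = [i % 101 for i in clips]

# First carve out 30%, then split that portion evenly into validation and test (15% + 15%).
train_clips, rest_clips, y_train, y_rest = train_test_split(
    clips, labels, test_size=0.30, stratify=labels, random_state=42)
val_clips, test_clips, y_val, y_test = train_test_split(
    rest_clips, y_rest, test_size=0.50, stratify=y_rest, random_state=42)

print(len(train_clips), len(val_clips), len(test_clips))
```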
Table 6. Comparison of our method with state-of-the-art models on the UCF101 and HMDB51 public datasets.
Methods | UCF101 (%) | HMDB51 (%)
iDT+HSV [51] | 87.90 | 61.10
C3D [52] | 85.20 | -
Deeper temporal net [53] | 84.90 | -
Two-stream [54] | 88.00 | 54.90
FstCN [55] | 88.10 | 59.10
TDD+FV [56] | 90.30 | 63.20
scLSTM [57] | 84.00 | 55.10
VideoLSTM [58] | 89.20 | 56.40
L2STM [59] | 93.60 | 66.20
STS [60] | 90.10 | 62.40
STS-ALSTM [60] | 92.70 | 64.40
S3D+RANs [61] | 93.30 | 71.20
TSN [62] | 93.90 | 69.90
TSN+AS [62] | 94.50 | 71.60
3DCNN+Two-level+AT [63] | 95.68 | 72.60
KMFF-BiLSTM-AT (10 Keyframes) (Ours) | 97.54 | 72.43