Article

Student Behavior Recognition in Classroom Based on Deep Learning

Qingzheng Jia and Jialiang He
1 Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou 571158, China
2 College of Information and Communication Engineering, Dalian Nationalities University, Dalian 116000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7981; https://doi.org/10.3390/app14177981
Submission received: 3 August 2024 / Revised: 26 August 2024 / Accepted: 29 August 2024 / Published: 6 September 2024
(This article belongs to the Special Issue Intelligent Techniques, Platforms and Applications of E-learning)

Abstract

With the widespread application of information technology in education, the real-time detection of student behavior in the classroom has become a key issue in improving teaching quality. This paper proposes a Student Behavior Detection (SBD) model that combines YOLOv5, the Coordinate Attention (CA) mechanism, and OpenPose, aiming to achieve efficient and accurate behavior recognition in complex classroom environments. Integrating the CA mechanism into YOLOv5 strengthens feature extraction and significantly improves recognition in complex backgrounds, such as those with occlusion. In addition, the feature map generated by the improved YOLOv5 replaces VGG-19 in OpenPose, which effectively improves the accuracy of student posture recognition. The experimental results demonstrate that the proposed model achieves a maximum mAP of 82.1% in complex classroom environments, surpassing Faster R-CNN by 5.2 percentage points and YOLOv5 by 4.6 percentage points. The model's F1 score and recall also show clear advantages over these two baselines. The model offers an effective solution for intelligent classroom behavior analysis and the optimization of educational management.

1. Introduction

With the advancement of information technology, intelligent systems are increasingly being applied in the educational field, and intelligent monitoring technology is changing the traditional teaching model. In large-scale teaching scenarios, it is difficult for teachers to focus on each student’s learning status in real time. Traditional monitoring methods have obvious limitations and fail to meet the needs of precise teaching management [1]. Therefore, achieving efficient, real-time, and accurate detection of students’ classroom behavior has become a key issue in the field of educational technology. Building a behavior detection system based on intelligent analysis and dynamically detecting student behavior has profound implications for optimizing teaching processes, improving teaching quality, and enhancing student engagement in the classroom [2].
In recent years, object detection technology based on deep learning has made significant progress in this field. The YOLO series of algorithms, known for their exceptional balance between speed and accuracy, are extensively employed in real-time detection tasks [3]. For instance, Chen et al. [4] proposed an improved YOLOv8-based model that enhances detection accuracy by incorporating the C2f_Res2block module, outperforming the original YOLOv8; however, this approach still carries high computational complexity in real-time applications. Another study [5] introduced a lightweight detection network, BiTNet, which focuses on real-time detection under occlusion. While BiTNet effectively balances detection accuracy and real-time performance by learning features directly from images, it still struggles with insufficient accuracy when processing fine details of student behavior. Although the OpenPose algorithm performs well in human pose recognition, it faces challenges in long-distance scenarios and complex backgrounds [6]. For example, Samkari et al. [7] surveyed deep-learning approaches to improving posture recognition accuracy in complex backgrounds, showing that multi-scale feature fusion significantly enhances the detection of key points in such environments; this line of work, however, remains limited when addressing large-scale variations. Similarly, Park et al. [8] investigated body-cropping augmentation (BCA) for long-distance posture recognition and, through BCA, successfully improved the model's robustness in long-distance scenarios.
However, much of the current research emphasizes behavior recognition in static scenes, presenting challenges for effectively adapting these methods to complex and dynamic classroom environments. Existing systems struggle with complex scenarios, such as occlusion and lighting variations. To address these challenges, this paper introduces a novel model called Student Behavior Detection (SBD). The model is designed to accurately detect the behavioral characteristics and real-time activities of students within the classroom environment. By doing so, it overcomes the limitations of traditional detection methods and fosters the intelligent advancement of educational technology. Specifically, the model leverages a hybrid deep neural network that integrates the object detection capabilities of YOLOv5, the feature enhancement abilities of the Coordinate Attention (CA) mechanism, and the pose recognition capabilities of OpenPose. This integration enables comprehensive detection of students' classroom behavior. The main contributions of this study are as follows:
(i)
We proposed a novel student behavior detection model specifically designed for complex classroom environments, which integrates YOLOv5, the CA mechanism, and OpenPose.
(ii)
By combining the strengths of YOLOv5 and the CA attention mechanism, we constructed a hybrid neural network model. The CA module is integrated into the Backbone of YOLOv5 (referred to as the newly constructed network YOLOv5-A), replacing all convolutional modules with attention module structures to enrich the semantic information in the prediction layer and enhance the detection accuracy of small targets. OpenPose is then utilized based on the newly generated feature maps from YOLOv5-A, replacing the original VGG-19, which not only improves the accuracy of human posture recognition but also reduces computational costs.
The rest of this paper is organized as follows. Section 2 reviews the latest literature relevant to this study. Section 3 introduces the framework for smart classroom behavior detection. Section 4 discusses the implementation of the proposed model and object detection algorithm. In Section 5, we conduct experiments and evaluations using a real classroom dataset. Finally, Section 6 summarizes the study and explores future research directions.

2. Related Work

This section reviews several related issues, including the current state of research on the technologies used for human behavior detection.
Human behavior detection is a significant research area in the fields of computer vision and deep learning. In recent years, with the maturity of deep learning technology, numerous researchers have proposed various methods for detecting human behavior [9].

2.1. Traditional Methods

Traditional methods for detecting human behavior typically rely on manual feature extraction and shallow machine learning models such as Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) [10]. Under specific conditions, such as relatively simple backgrounds and distinct target behaviors, these conventional approaches can achieve satisfactory detection results. For instance, in controlled settings or specific laboratory environments, utilizing SVMs in conjunction with feature descriptors (such as HOG and SIFT) can effectively distinguish different basic actions [11]. However, these methods often underperform in more complex environments. This is primarily due to the limited expressive power of manual features, which struggle to adapt to dynamic environments and intricate behavior patterns.

2.2. Methods Based on Deep Learning

In recent years, the rapid development of deep learning has significantly transformed the field of behavior detection, particularly through the application of convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Deep learning models automatically learn and extract features from data via multi-layer neural network structures, allowing them to excel in complex scenarios. CNNs are proficient at extracting spatial features from image data, while RNNs can capture temporal dynamics in videos. The integration of these two model families provides powerful tools for behavior detection [12].
Zhu et al. [13] proposed a Two-Stream Network for behavior recognition that simultaneously integrates inter-frame optical flow and RGB information, thereby enhancing detection performance. To overcome the limitation that conventional CNN models can only process 2D inputs, Yang et al. [14] developed an innovative 3D CNN action recognition model that extracts features from both spatial and temporal dimensions through 3D convolutions; the final feature output integrates information across all channels, ultimately achieving strong recognition performance. Zhang et al. [15] introduced a simple yet effective end-to-end Semantics-Guided Neural Network (SGN) for skeleton-based action recognition, incorporating high-level semantic information about the skeleton into the network and developing a robust baseline that enhanced performance while reducing model size. Ji et al. [16] proposed the T-TINY-YOLO network model, optimized from YOLO, for detecting abnormal behavior targets; they addressed the large number of zero-weight parameters in YOLO with a CNN pruning scheme and real-time optimization. Kamel et al. [17] utilized convolutional neural networks to extract features from depth images and pose data, combining three CNN channels for maximal feature extraction to recognize human actions. Pabba et al. [18] proposed an intelligent system based on facial expression recognition, which uses a CNN to classify students' facial expressions and automatically assess their engagement levels. Cao et al. [19] introduced the OpenPose network based on Part Affinity Fields for real-time multi-person pose estimation; it simultaneously detects the body key points of multiple individuals, addressing the challenges of pose recognition in multi-person interaction environments.
Table 1 compares different detection methods in the relevant literature with the model in this paper.
However, current research seldom focuses on the dynamic characteristics of students’ behavior in classroom environments, particularly in complex and changing scenarios. To improve the robustness and reliability of behavior detection algorithms, it is essential to optimize feature extraction and fusion methods. This optimization will enhance the accuracy and efficiency of the model in processing student behavior.

3. Framework of Student Behavior Detection

In this section, we first introduce two critical issues related to object recognition in educational settings. We then discuss the fundamental framework of the intelligent student behavior detection model and its two core network modules.

3.1. Problem Scenario

In a teaching environment, accurately detecting and analyzing student behavior in the classroom is crucial for enhancing real-time responsiveness and optimizing the teaching process. As illustrated in Figure 1, the video stream recorded by classroom cameras provides a high-fidelity data input for the detection model. This video stream is essential for enabling the model to accurately identify student behavior characteristics and rapidly detect and recognize target behaviors. Specifically, to enhance the predictive accuracy of the model, we focus on the following two detection challenges in practical applications:
(i)
The detection of multiple student behaviors. In real classroom settings, cameras are typically positioned at the front or on the ceiling to obtain a comprehensive view of student activities and interactions. This configuration results in smaller target sizes for students within the images, necessitating the detection of various behaviors at lower resolutions, such as paying attention, daydreaming, raising hands, reading, and resting. This scenario presents a typical challenge of small-object, multi-class behavior detection.
(ii)
Long-distance human posture recognition. In the teaching environment, students’ posture and behavior serve as important dynamic parameters, characterized by high autonomy and variability. Variations in camera angles and distances often hinder the complete capture of students’ full-body posture characteristics. Traditional recognition algorithms, which rely on key points of the human skeleton, struggle to accurately reflect the actual physical space dynamics [20].
In summary, when target pixels are small and feature information is sparse, the convolutional computations of current deep learning algorithms are often applied to areas where target features are not prominent. This can result in significant computational resource waste and reduced detection efficiency. Small targets are particularly susceptible to losing key information during multi-layer convolution processing, making it challenging to achieve accurate detection and regression analysis, especially in complex classroom environments with varying lighting conditions and student occlusion. Therefore, enhancing the robustness of detection algorithms is essential to improve model accuracy in processing student behavior.

3.2. Framework of Student Behavior Detection

The detection framework proposed in this paper aims to provide an efficient solution for detecting student behavior in classrooms by integrating advanced object detection and posture recognition technologies.
Specifically, the SBD model proposed in this paper first adds the CA attention module to the Backbone of YOLOv5, replaces all convolutional modules with attention module structures, and provides richer semantic information for the prediction layer. Then, the model enhances small target detection accuracy through the fusion of deep and shallow features. In particular, the model extracts student behavior regions to serve as input for the OpenPose network, which is utilized for pose recognition. By eliminating background interference, the model significantly improves the detection of student behaviors.
As the basic framework in Figure 2 shows, our approach centers on two core modules to achieve student behavior detection in the classroom. First, we integrate an attention mechanism into the YOLOv5 model, forming a new network called YOLOv5-A for feature extraction in SBD. By replacing the original convolutional layers with a self-attention mechanism, the model can better focus on target areas, thereby improving the detection accuracy of small objects, such as student gestures and actions. This mechanism enhances the semantic information density of feature maps, allowing the model to maintain high accuracy even in occluded and complex backgrounds. Second, we utilize the high-quality features extracted by YOLOv5-A as inputs for the OpenPose network, replacing the original VGG-19 backbone. This substitution reduces computational demands and alleviates issues related to vanishing gradients and performance degradation in deep convolutions. The integration significantly minimizes the impact of background noise on detection accuracy and focuses on learning skeletal features to enhance the precision of human pose recognition.
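To make the data flow between these two modules concrete, the following is a minimal Python sketch of the detection pipeline. It is illustrative only: `yolov5_a` and `openpose_head` are hypothetical callables standing in for the trained networks described in Sections 4.1 and 4.2, and the output format is an assumption, not the authors' released interface.

```python
def detect_student_behavior(frame, yolov5_a, openpose_head):
    """Illustrative SBD pipeline: YOLOv5-A proposes behavior regions and a
    shared feature map; OpenPose-style branches refine poses from that map.

    `yolov5_a` and `openpose_head` are hypothetical stand-ins for the
    trained networks; `boxes` is assumed to be (N, 6) rows of
    [x1, y1, x2, y2, confidence, class].
    """
    # 1. YOLOv5-A: attention-enhanced backbone returns behavior boxes plus
    #    the fused feature map f_new that replaces VGG-19 downstream.
    boxes, f_new = yolov5_a(frame)

    # 2. OpenPose head: confidence maps S and part affinity fields L are
    #    predicted from f_new rather than from raw pixels.
    S, L = openpose_head(f_new)

    # 3. Collect per-student behavior detections alongside the pose maps.
    results = []
    for box in boxes:
        results.append({"box": box[:4],
                        "score": float(box[4]),
                        "behavior": int(box[5])})
    return results, (S, L)
```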

4. Mechanism of Student Behavior Detection

In this section, we discuss the implementation of the proposed student behavior detection mechanism in detail, including the integration of YOLOv5 and the CA attention mechanism for feature fusion and extraction, as well as its combination with the OpenPose network for long-distance human pose recognition.

4.1. Integration of YOLOv5 and CA for Feature Fusion and Extraction

YOLOv5 is the fifth version of the YOLO series of object detection algorithms, developed by Jocher in 2020 [21]. This version inherits the basic concept of the YOLO series of algorithms, which is to extract features from the input image through a neural network and output the category and location of the target at the same time. YOLOv5 was chosen as the foundational architecture for the detection model due to its lightweight structure and high efficiency. Its structure is shown in Figure 3.
YOLOv5 is widely used in real-time object detection due to its speed, efficiency, and flexibility. However, in scenes with complex backgrounds and significant lighting variations, the model may ignore some important features, leading to reduced detection accuracy. To enhance model accuracy, this paper integrates the CA mechanism into the YOLOv5 framework. The CA mechanism combines spatial and channel attention, enabling the model to more accurately focus on target areas during feature extraction.
CA [22] stands for Coordinate Attention, a mechanism that addresses these challenges by encoding the input features along two spatial directions, thereby embedding positional information into channel attention. As shown in Equation (1), the input is a feature tensor with $C$ channels:
$X = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{H \times W \times C}$ (1)
By transforming the input, an output tensor of the same size but with enhanced representational capacity is produced, as shown in Equation (2).
$Y = [y_1, y_2, \ldots, y_C] \in \mathbb{R}^{H \times W \times C}$ (2)
The CA attention mechanism is shown in Figure 4.
To accurately capture the width and height information of an image and focus attention, the CA module performs global average pooling on the input feature map along the width and height directions separately, generating two direction-aware feature maps. This effectively encodes the positional information of the image and thereby enhances model accuracy. As shown in Equations (3) and (4), given an input $x_c$, each channel is encoded along the horizontal and vertical directions using one-dimensional pooling:
$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (3)
$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (4)
Next, the feature maps obtained from the width and height directions are concatenated and passed through a shared 1 × 1 convolutional transform $F_1$ to reduce their dimensionality. After batch normalization, a Sigmoid activation $\delta$ generates a new intermediate feature map $f$, as shown in Equation (5).
$f = \delta(F_1([z^h, z^w]))$ (5)
Subsequently, along the spatial dimension, the feature map $f$ is split into two parts, $f^h$ and $f^w$. Each part is expanded back to the original channel dimension by a 1 × 1 convolution ($F_h$ and $F_w$), followed by a Sigmoid activation to generate the final attention vectors $g^h$ and $g^w$, as shown in Equations (6) and (7).
$g^h = \delta(F_h(f^h))$ (6)
$g^w = \delta(F_w(f^w))$ (7)
Finally, by applying multiplicative weighting to the original feature map, we obtain the final output $y_c(i, j)$ with attention weights along both the width and height dimensions, as shown in Equation (8):
$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$ (8)
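For concreteness, the following is a minimal PyTorch sketch of a CA block implementing Equations (3)–(8). The reduction ratio and channel counts are assumptions following common practice rather than values stated in the paper; the intermediate Sigmoid mirrors the text above (the original CA formulation uses h-swish at that point and Sigmoid only for $g^h$ and $g^w$).

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of a Coordinate Attention block following Equations (3)-(8)."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # Eq. (3): average over W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # Eq. (4): average over H
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)  # shared 1x1, Eq. (5)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Sigmoid()                        # delta in Eq. (5)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h, Eq. (6)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w, Eq. (7)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        z_h = self.pool_h(x)                         # (n, c, h, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)     # (n, c, w, 1)
        # Concatenate directional encodings, reduce channels, activate.
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)     # split back along space
        g_h = torch.sigmoid(self.conv_h(f_h))                       # Eq. (6)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))   # Eq. (7)
        return x * g_h * g_w                         # Eq. (8), broadcasted
```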
This paper integrates an attention module into the backbone of YOLOv5, replacing all convolutional modules with attention module structures. The specific architecture, referred to as YOLOv5-A, is shown in Figure 5.

4.2. Student Behavior Recognition

OpenPose can accurately locate key points of the human body, particularly in multi-person scenarios, which is crucial for detecting specific student behaviors in the classroom. Additionally, OpenPose offers efficient real-time processing capabilities for posture estimation in multiple individuals, meeting the real-time requirements of classroom behavior detection. As a modular posture estimation framework, OpenPose can be seamlessly integrated with other deep-learning models to enhance the overall performance of the detection system. Based on these strengths, this paper adopted the OpenPose network structure as the foundation for our research.
OpenPose can be viewed as a parallel convolutional network model, in which one convolutional network locates the key points of the human body while the other connects the candidate key points into limbs [23]. The original OpenPose employs the VGG-19 network to extract features, which are then fed into the parallel networks. However, as the number of convolutional layers increases, this approach tends to suffer from vanishing gradients and performance degradation. Therefore, as described above, we adopt the new feature map $f_{new}$ produced by the YOLOv5-A network as the feature input to OpenPose, replacing the original VGG-19. This modification effectively enhances the network's nonlinear fitting ability and further improves the accuracy of long-distance recognition.
As shown in Figure 6, the whole learning framework of OpenPose can be viewed as a "dual-branch, multi-stage CNN". In the first stage, Branch 1 generates a set of confidence maps $S_1$ from the input $f_{new}$, describing the detected human key points, while Branch 2 produces a set of part affinity fields $L_1$, used to assemble connected joints into a predicted human skeleton. In each subsequent stage $t$, the inputs consist of three components: the original $f_{new}$ and the outputs $S_{t-1}$ and $L_{t-1}$ from the previous stage. The two branch predictions $S_t$ and $L_t$, together with $f_{new}$, are then passed to the next stage to progressively refine the pose estimate. Throughout this process, $f_{new}$, which integrates the advantages of both shallow and deep layers, is further refined to highlight skeletal characteristics in complex scenes.
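A compact PyTorch sketch of this dual-branch, multi-stage refinement follows. Layer widths, the number of stages, and the channel counts for the confidence maps and part affinity fields are illustrative assumptions; only the iteration pattern, re-reading $f_{new}$ together with the previous stage's predictions, follows the description above.

```python
import torch
import torch.nn as nn

class PoseStage(nn.Module):
    """One refinement stage: predicts confidence maps S_t (Branch 1) and
    part affinity fields L_t (Branch 2) from its input. Layer sizes are
    illustrative, not the paper's exact configuration."""
    def __init__(self, in_ch: int, n_keypoints: int = 18, n_pafs: int = 38):
        super().__init__()
        self.branch_s = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, n_keypoints, 1))
        self.branch_l = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, n_pafs, 1))

    def forward(self, x):
        return self.branch_s(x), self.branch_l(x)

def refine_pose(f_new, stages):
    """Multi-stage refinement: each later stage re-reads f_new together
    with the previous stage's S and L, progressively sharpening both."""
    s, l = stages[0](f_new)              # stage 1 sees only f_new
    for stage in stages[1:]:
        s, l = stage(torch.cat([f_new, s, l], dim=1))
    return s, l

# Example stage list (counts assumed): the first stage consumes only f_new
# (256 channels assumed); later stages also consume S (18) and L (38).
stages = nn.ModuleList([PoseStage(256)] +
                       [PoseStage(256 + 18 + 38) for _ in range(5)])
```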
The entire detection process of the SBD module is shown in Figure 7. The module has efficient and accurate posture recognition capabilities and can achieve accurate human key point detection in complex backgrounds. In addition, by integrating multi-stage feature extraction, the module improves the detection robustness in long-distance and multi-target scenarios.

5. Experiment and Analysis

In this section, we test the object detection performance of the proposed model and compare it with other learning algorithms.

5.1. Dataset

In deep learning, the quality of dataset construction directly impacts recognition performance. The selection criteria for the dataset are as follows: (1) the footage must have adequate pixel resolution; (2) overexposed footage is excluded; (3) interference from poor camera angles (such as severe tilt or the lens facing a wall) is eliminated; and (4) priority is given to footage featuring large groups, excluding scenes with few or no students in the classroom.
Currently, a comprehensive public dataset for classroom behavior detection has not yet been established. The original data used in this study come from classroom teaching surveillance video, with a total duration of 550 min, that is publicly available on the Internet; most of the students in the footage are primary school students. From this video, 3650 image frames were selected. To mitigate overfitting caused by the high correlation between neighboring frames, we preprocessed the images, removing correlations with techniques such as principal component analysis, and then constructed the final dataset.
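As an illustration of one possible first preprocessing step, the sketch below samples frames from the source video at a fixed time interval with OpenCV. The 9-second step is an assumption, chosen only so that roughly 550 min of video yields on the order of 3650 frames; it complements rather than reproduces the PCA-based decorrelation described above.

```python
import cv2

def sample_frames(video_path: str, step_s: float = 9.0):
    """Sample one frame every `step_s` seconds to reduce inter-frame
    correlation before further decorrelation and labeling (illustrative)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is unknown
    step = max(1, int(fps * step_s))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```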
The purpose of this paper is to analyze student behavior in the classroom. Five common behaviors have been selected to assess students’ engagement: Raising_hand, Standing_up, Writing, Slippage, and Listening. The criteria for evaluating these behaviors are presented in Table 2.
After data cleaning and labeling, the number of labels in each category was as follows: Raising_hand (643), Standing_up (165), Writing (278), Slippage (458), and Listening (3845). To present the prediction results more intuitively, each category label was assigned a numeric code. The encoding scheme is shown in Table 3 below.
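For reference, the encoding scheme of Table 3 and the label counts above can be expressed directly as Python mappings:

```python
# Category-to-code mapping from Table 3 and label counts reported above.
BEHAVIOR_CODES = {"Raising_hand": 0, "Standing_up": 1, "Writing": 2,
                  "Slippage": 3, "Listening": 4}
LABEL_COUNTS = {"Raising_hand": 643, "Standing_up": 165, "Writing": 278,
                "Slippage": 458, "Listening": 3845}
```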
Two well-known machine learning algorithms, Faster R-CNN and YOLOv5, were used as benchmark methods for performance comparison.

5.2. Metrics

In this paper, to comprehensively evaluate the performance of the proposed classroom behavior detection model, the following metrics are selected:
(1)
Precision: Precision measures the proportion of correctly identified positive samples among all the positive predictions made by the model. This metric is chosen because it effectively assesses the model’s ability to minimize false positives. The calculation is represented in Equation (9):
$Precision = \frac{TP}{TP + FP}$ (9)
Here, $TP$ stands for True Positives and $FP$ stands for False Positives.
(2)
Recall (R): The recall rate measures the proportion of true positive samples that the model correctly identifies, reflecting its sensitivity. This metric is crucial for evaluating the model’s ability to detect all relevant targets, ensuring comprehensive performance assessment. As shown in Equation (10):
$Recall = \frac{TP}{TP + FN}$ (10)
$FN$ stands for False Negatives.
(3)
F1-Score: The F1-Score is the harmonic mean of precision and recall, balancing accuracy and sensitivity. It is particularly valuable for evaluating model performance on imbalanced data, as shown in Equation (11).
$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (11)
(4)
mAP: mAP measures the average detection accuracy of the model on multiple categories. This metric is chosen because it can comprehensively evaluate the overall performance of the model in multi-category tasks. As shown in Equation (12):
$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$ (12)
Here, $AP_i$ stands for the average precision of class $i$, and $N$ is the total number of classes.
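A minimal Python sketch of these metrics, computed from per-class counts and per-class average precision values (Equations (9)–(12)):

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from detection counts, Equations (9)-(11)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

def mean_average_precision(ap_per_class):
    """mAP as the mean of per-class average precision, Equation (12)."""
    return sum(ap_per_class) / len(ap_per_class)
```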

5.3. Performance Comparison

In the comparative experiments of this paper, we use the control-variable method to keep all other factors consistent. The specific training parameters are set as follows: the initial learning rate is 0.01, training runs for 200 epochs, the weight decay coefficient is 0.0005, the bounding-box loss coefficient is 0.05, the classification loss coefficient is 0.5, and an adaptive anchor box adjustment strategy is adopted.
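Expressed as a YOLOv5-style hyperparameter dictionary, these settings look as follows. The key names follow the Ultralytics hyp files, and mapping the paper's coefficients onto them is an assumption about the authors' setup rather than their published configuration:

```python
# Training settings from Section 5.3 (key names assumed to follow the
# Ultralytics YOLOv5 hyperparameter convention).
HYP = {
    "lr0": 0.01,             # initial learning rate
    "weight_decay": 0.0005,  # weight decay coefficient
    "box": 0.05,             # bounding-box loss coefficient
    "cls": 0.5,              # classification loss coefficient
}
EPOCHS = 200                 # training epochs; adaptive anchors enabled
```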
The selection of network models for this study includes an experimental analysis of two prevalent detection models: YOLOv5 and Faster R-CNN.
Through experimental comparison, all three models converged within 199 epochs. The maximum mAP value achieved by each model during training is reported in Table 4. The mAP of the SBD model peaked at 0.821, which is 5.2 percentage points higher than Faster R-CNN and 4.6 percentage points higher than YOLOv5. At the end of training, the mAP of the SBD model remained significantly higher than that of the other two models. The full comparison of Faster R-CNN, YOLOv5, and SBD is shown in Table 4 below.
After convergence, further analysis was conducted on additional evaluation metrics, including the F1 score, recall, and mAP. The results indicate that the SBD model holds an advantage on all three. The detailed results are presented in Table 5.
The precision curves of YOLOv5 and SBD are shown in Figures 8 and 9 below. The results demonstrate that the SBD model converges better on these metrics than YOLOv5, and the prediction of Slippage improved notably: its accuracy increased from 31.6% with YOLOv5 to 38.4% with SBD, a gain of 6.8 percentage points.
The precision–recall (PR) curve of the SBD model is shown in Figure 10. The PR curve indicates that the model converges well, further confirming the improvement effect of the SBD model.
Some of the detection results of the SBD model are shown in Figure 11. Table 3 above categorizes the various student engagement states in the classroom, with each coded number representing a specific behavior. The rectangular regions indicate the detection zones identified by the model. It can be seen that the improved SBD model can accurately detect various classroom behaviors of students. The model achieved high accuracy in classroom behavior detection across different behavior types.

6. Conclusions

To enhance the application of information technology in the field of education, this paper explores the problem of detecting student behavior in the classroom. We proposed a student behavior detection framework that integrates advanced deep learning techniques to improve recognition accuracy in teacher–student interaction scenarios.
SBD employed a combination of YOLOv5 and the CA attention mechanism to improve the target feature extraction process and enhance the accuracy of student behavior recognition in complex dynamic scenes. By integrating the CA attention mechanism into YOLOv5, we optimized the semantic information density of feature maps, enabling the model to accurately recognize small targets even in intricate backgrounds.
Additionally, the feature map generated by YOLOv5-A was used in place of VGG-19 in OpenPose as input, significantly improving the accuracy of human posture recognition. This method focuses on long-distance human posture recognition and effectively reduces the impact of background noise on detection.
Through experiments and evaluations in various classroom scenarios, we validated the advantages of this method in recognition accuracy, demonstrating superior performance compared to other models. The framework proposed in this study automates the detection of students’ behaviors by analyzing classroom videos. Based on these detection results, teachers can quantify and assess students’ classroom attention levels by examining the frequency and duration of specific behaviors. Notably, the analysis extends beyond individual student behavior to include the overall attention distribution of the entire class. These comprehensive data enable teachers to identify potential issues within the classroom, such as areas where certain students may be disengaged, and adjust teaching strategies accordingly.
Currently, due to the absence of publicly available datasets for classroom behavior, the data must be sourced from public online courses, which often suffer from image quality issues. Despite the improved accuracy demonstrated by the enhanced SBD network, future research should consider utilizing higher-definition cameras to improve data quality and model training accuracy. Additionally, exploring lighter model architectures, which may compromise some accuracy but enhance parameter efficiency and inference speed, presents a promising avenue for future investigation.

Author Contributions

Writing—original draft, Q.J.; Writing—review & editing, J.H. All authors have read and agreed to the published version of the manuscript.

Funding

The work was funded by the Key Laboratory of Data Science and Intelligence Education (Hainan Normal University), the Ministry of Education, China (No: DSIE202301).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Huang, J.; Saleh, S.; Liu, Y. A review on artificial intelligence in education. Acad. J. Interdiscip. Stud. 2021, 10, 206–217.
2. Bosch, N.; D'Mello, S.K. Automatic detection of mind wandering from video in the lab and in the classroom. IEEE Trans. Affect. Comput. 2019, 12, 974–988.
3. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A review of YOLO algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
4. Chen, H.; Zhou, G.; Jiang, H. Student behavior detection in the classroom based on improved YOLOv8. Sensors 2023, 23, 8385.
5. Zhao, J.; Zhu, H.; Niu, L. BiTNet: A lightweight object detection network for real-time classroom behavior recognition with transformer and bi-directional pyramid network. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101670.
6. Liu, C.; Tao, Y.; Liang, J.; Li, K.; Chen, Y. Object detection based on YOLO network. In Proceedings of the 2018 IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 14–16 December 2018; pp. 799–803.
7. Samkari, E.; Arif, M.; Alghamdi, M.; Al Ghamdi, M.A. Human pose estimation using deep learning: A systematic literature review. Mach. Learn. Knowl. Extr. 2023, 5, 1612–1659.
8. Park, S.; Lee, S.; Park, J. Data augmentation method for improving the accuracy of human pose estimation with cropped images. Pattern Recognit. Lett. 2020, 136, 244–250.
9. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225.
10. Afsar, P.; Cortez, P.; Santos, H. Automatic visual detection of human behavior: A review from 2000 to 2014. Expert Syst. Appl. 2015, 42, 6935–6956.
11. Batool, M.; Jalal, A.; Kim, K. Sensors technologies for human activity analysis based on SVM optimized by PSO algorithm. In Proceedings of the 2019 International Conference on Applied and Engineering Mathematics (ICAEM), Taxila, Pakistan, 27–29 August 2019; pp. 145–150.
12. Fan, Z.; Yin, J.; Song, Y.; Liu, Z. Real-time and accurate abnormal behavior detection in videos. Mach. Vis. Appl. 2020, 31, 72.
13. Zhu, Y.; Lan, Z.; Newsam, S.; Hauptmann, A. Hidden two-stream convolutional networks for action recognition. In Proceedings of Computer Vision–ACCV 2018, 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers, Part III; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 363–378.
14. Yang, H.; Yuan, C.; Li, B.; Du, Y.; Xing, J.; Hu, W.; Maybank, S.J. Asymmetric 3D convolutional neural networks for action recognition. Pattern Recognit. 2019, 85, 1–12.
15. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1112–1121.
16. Ji, H.; Zeng, X.; Li, H.; Ding, W.; Nie, X.; Zhang, Y.; Xiao, Z. Human abnormal behavior detection method based on T-TINY-YOLO. In Proceedings of the 5th International Conference on Multimedia and Image Processing, Nanjing, China, 10–12 January 2020; pp. 1–5.
17. Kamel, A.; Sheng, B.; Yang, P.; Li, P.; Shen, R.; Feng, D.D. Deep convolutional neural networks for human action recognition using depth maps and postures. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 1806–1819.
18. Pabba, C.; Kumar, P. An intelligent system for monitoring students' engagement in large classroom teaching through facial expression recognition. Expert Syst. 2022, 39, e12839.
19. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186.
20. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2969–2978.
21. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Ingham, F.; Poznanski, J.; Fang, J.; Yu, L.; et al. Ultralytics/YOLOv5: v3.1—Bug fixes and performance improvements. Zenodo 2020.
22. Guo, B.; Li, X.; Yang, M.; Zhang, H.; Xu, X.S. A robust and lightweight deep attention multiple instance learning algorithm for predicting genetic alterations. Comput. Med. Imaging Graph. 2023, 105, 102189.
23. Viswakumar, A.; Rajagopalan, V.; Ray, T.; Parimi, C. Human gait analysis using OpenPose. In Proceedings of the 2019 Fifth International Conference on Image Information Processing (ICIIP), Shimla, India, 15–17 November 2019; pp. 310–314.
Figure 1. Detection model diagram.
Figure 2. Framework of Student Behavior Detection.
Figure 3. YOLOv5 network structure.
Figure 4. CA attention mechanism.
Figure 5. YOLOv5-A network structure.
Figure 6. OpenPose network structure.
Figure 7. Detection flow chart.
Figure 8. Precision curve of YOLOv5 model.
Figure 9. Precision curve of SBD model.
Figure 10. PR curve of SBD model.
Figure 11. Actual detection results of SBD.
Table 1. Comparison of model methods.

Reference | Method | Detection Scenario | Innovation
Zhu et al. [13] | Two-Stream Network | Human behavior in simple backgrounds | Fusion of optical flow and RGB information
Yang et al. [14] | 3D CNN | Action recognition in complex scenes | Extraction of spatial and temporal features
Zhang et al. [15] | Skeleton-based action recognition | Multi-person interaction scenarios | Introduction of high-level semantic information
Ji et al. [16] | T-TINY-YOLO | Abnormal behavior detection | CNN pruning and real-time optimization
This work | YOLOv5-CA-OpenPose | Student behavior in complex classroom environments | Integration of YOLOv5, CA mechanism, and OpenPose; improved small-object detection and pose recognition
Table 2. Criteria for judging students' classroom behaviors.

Behavior | Criteria
Raising_hand | Student is raising one hand or both hands
Standing_up | Student is in a standing position
Writing | Student is holding a pen and paper
Slippage | Student is in a head-turning posture
Listening | Student's gaze is directed at the blackboard or teacher
Table 3. Category code.

Category | Raising_hand | Standing_up | Writing | Slippage | Listening
Encoding | 0 | 1 | 2 | 3 | 4
Table 4. Comparison results of Faster R-CNN, YOLOv5, and SBD (mAP at selected epochs).

Model \ Epoch | 40 | 80 | 120 | 160 | 199
Faster R-CNN | 0.521 | 0.607 | 0.769 | 0.756 | 0.741
YOLOv5 | 0.543 | 0.606 | 0.727 | 0.775 | 0.766
SBD | 0.552 | 0.621 | 0.751 | 0.821 | 0.811
Table 5. Other comparison results of Faster R-CNN, YOLOv5, and SBD.

Model | F1 | R | mAP
Faster R-CNN | 0.718 | 0.646 | 0.769
YOLOv5 | 0.726 | 0.711 | 0.775
SBD | 0.732 | 0.793 | 0.821
