Students’ Classroom Behavior Detection System Incorporating Deformable DETR with Swin Transformer and Light-Weight Feature Pyramid Network
Abstract
1. Introduction
The main contributions of this study are summarized as follows:
- We propose a novel neural network that combines the powerful representation capability of the Swin Transformer as a backbone network with the benefits of an encoder–decoder structure for object detection. The Swin Transformer serves as the backbone within the Deformable DETR framework, providing enhanced capabilities for students’ classroom behavior detection and analysis. The source code of this study is publicly available at https://github.com/CCNUZFW/Student-behavior-detection-system (accessed on 1 May 2023).
- We incorporate a feature pyramid network (FPN) structure that effectively fuses the feature maps obtained from the Swin Transformer at four scales: large, medium, small, and extremely small. This integration enables the extraction of robust top-down semantic features, improving the accuracy of detecting and analyzing students’ classroom behavior.
- We introduce the lightweight CARAFE operator to enlarge the receptive field of the FPN during feature reassembly (upsampling). By using the input features themselves to guide the reassembly process, the CARAFE operator further improves the precision and effectiveness of the students’ classroom behavior detection system; a minimal sketch of the resulting FPN-CARAFE module follows this contribution list.
- The development of a dedicated dataset, named ClaBehavior, for detecting students’ classroom behavior is a significant contribution of our study. Reliable, annotated datasets are essential for developing and evaluating machine learning models. Our ClaBehavior dataset, which comprises a diverse collection of classroom images with behavior annotations, addresses a gap in the existing literature and serves as a valuable resource for future research on students’ classroom behavior detection. The dataset is publicly available at https://github.com/CCNUZFW/Student-behavior-detection-system/tree/master/dataset/coco (accessed on 1 May 2023).
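To make the FPN-CARAFE module concrete, below is a minimal PyTorch sketch of a CARAFE upsampler and a top-down pyramid that uses it to fuse the four backbone scales. It follows the published CARAFE design (Wang et al., 2019) with that paper’s default hyperparameters (c_mid = 64, k_enc = 3, k_up = 5); the channel widths, class names, and 256-dimensional pyramid are illustrative assumptions, not the exact configuration used in this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Content-Aware ReAssembly of FEatures: each upsampled pixel is a
    content-dependent weighted sum over a k_up x k_up source window."""
    def __init__(self, channels, scale=2, c_mid=64, k_enc=3, k_up=5):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # Kernel prediction module: compress channels, then predict one
        # k_up*k_up reassembly kernel for every upsampled location.
        self.compress = nn.Conv2d(channels, c_mid, 1)
        self.encode = nn.Conv2d(c_mid, (scale ** 2) * (k_up ** 2), k_enc,
                                padding=k_enc // 2)

    def forward(self, x):
        n, c, h, w = x.shape
        kernels = F.pixel_shuffle(self.encode(self.compress(x)), self.scale)
        kernels = F.softmax(kernels, dim=1)              # (n, k_up^2, sh, sw)
        # Reassembly: gather every k_up x k_up neighbourhood of the source,
        # replicate it over the s x s output block it feeds, and blend.
        x = F.unfold(x, self.k_up, padding=self.k_up // 2)
        x = x.view(n, c, self.k_up ** 2, h, w)
        x = x.repeat_interleave(self.scale, dim=3)
        x = x.repeat_interleave(self.scale, dim=4)
        return (x * kernels.unsqueeze(1)).sum(dim=2)     # (n, c, sh, sw)

class FPNCARAFE(nn.Module):
    """Top-down FPN whose upsampling steps use CARAFE instead of
    nearest-neighbour interpolation (illustrative channel widths)."""
    def __init__(self, in_dims=(96, 192, 384, 768), dim=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(d, dim, 1) for d in in_dims)
        self.up = nn.ModuleList(CARAFE(dim) for _ in in_dims[:-1])
        self.smooth = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=1) for _ in in_dims)

    def forward(self, feats):              # large -> extremely small scale
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + self.up[i - 1](laterals[i])
        return [s(l) for s, l in zip(self.smooth, laterals)]

# Shape check with four Swin-like scales (strides 4/8/16/32 of a 224 input).
feats = [torch.randn(1, d, s, s)
         for d, s in zip((96, 192, 384, 768), (56, 28, 14, 7))]
fused = FPNCARAFE()(feats)                 # four 256-channel pyramid levels
```

Using CARAFE here only changes how the coarser pyramid level is brought up to the finer one; the lateral 1 × 1 projections and the 3 × 3 smoothing convolutions are the standard FPN recipe.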
2. Related Work
2.1. Students’ Classroom Behavior Detection
2.2. Convolutional Neural Networks
2.3. Transformers
3. Materials and Methods
3.1. Research Problem
3.2. Proposed Method
3.2.1. Deformable DETR for Classroom Behavior Detection
3.2.2. Swin-Transformer in Classroom Scenarios
Algorithm 1 The proposed model of students’ classroom behavior detection
Input: Pictures/videos captured in real classroom teaching situations
Output: The behavior of each student in the current classroom
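Read as pseudocode, Algorithm 1 maps one classroom frame to a set of (behavior, bounding box) pairs. The following sketch traces that flow end to end with deliberately simplified stand-ins: a toy convolutional backbone in place of the Swin Transformer, a plain (non-deformable) attention decoder in place of the Deformable DETR decoder, and the seven ClaBehavior classes. Every module, name, and shape here is an illustrative assumption, not the authors’ implementation.

```python
import torch
import torch.nn as nn

BEHAVIORS = ["write", "read", "lookup", "turn_head",
             "raise_hand", "stand", "discuss"]      # the 7 ClaBehavior classes

class ToyBackbone(nn.Module):
    """Stand-in for the Swin Transformer: one coarse feature map per frame."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim // 2, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1))

    def forward(self, x):
        return self.net(x)                           # (B, dim, H/8, W/8)

class ToyQueryDecoder(nn.Module):
    """Stand-in for the Deformable DETR decoder: N learned object queries
    cross-attend to the image tokens, and each emits a class distribution
    plus a normalized (cx, cy, w, h) box."""
    def __init__(self, dim=256, n_queries=100, n_classes=len(BEHAVIORS)):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cls_head = nn.Linear(dim, n_classes + 1)    # +1 "no object" slot
        self.box_head = nn.Linear(dim, 4)

    def forward(self, memory):                       # memory: (B, HW, dim)
        q = self.queries.expand(memory.size(0), -1, -1)
        h, _ = self.attn(q, memory, memory)          # dense attention stands in
                                                     # for deformable attention
        return self.cls_head(h), self.box_head(h).sigmoid()

@torch.no_grad()
def detect_behaviors(image, backbone, decoder, score_thr=0.5):
    """Algorithm 1 end to end: classroom frame -> [(behavior, box), ...]."""
    feat = backbone(image)                           # backbone feature map
    memory = feat.flatten(2).transpose(1, 2)         # (B, HW, dim) tokens
    logits, boxes = decoder(memory)
    probs = logits.softmax(-1)[..., :-1]             # drop the "no object" slot
    scores, labels = probs.max(-1)
    keep = scores[0] > score_thr                     # confident detections only
    return [(BEHAVIORS[int(l)], b.tolist())
            for l, b in zip(labels[0][keep], boxes[0][keep])]

# Untrained toy usage: one 224x224 frame in, a (possibly empty) list out.
print(detect_behaviors(torch.randn(1, 3, 224, 224),
                       ToyBackbone(), ToyQueryDecoder()))
```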
3.2.3. FPN-CARAFE in a Classroom Environment
3.2.4. Training Phase and Adaptive Training
4. Experimental Results and Analysis
4.1. Dataset
4.2. Evaluation Metrics
- True Positive (TP): The model correctly identifies both the location and the behavior type of an object in the classroom-behavior-detection task.
- False Positive (FP): The model correctly identifies the location of an object but misidentifies its behavior type.
- False Negative (FN): The model fails to identify the correct position and type of an object. These three counts determine the precision and recall metrics, as shown in the sketch below.
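From these counts, precision (TP / (TP + FP)) and recall (TP / (TP + FN)) follow directly, and accumulating precision over the recall sweep of one class yields its average precision (AP). The sketch below is a minimal, uninterpolated version with illustrative function names; the full COCO-style mAP(0.50:0.95) reported later additionally interpolates precision and averages the result over IoU thresholds from 0.50 to 0.95 (step 0.05) and over all behavior classes.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(scores, is_tp, n_gt):
    """Uninterpolated AP for a single behavior class: the area under the
    precision-recall curve swept by descending detection confidence.
    `scores` holds detection confidences, `is_tp[i]` flags detection i as
    a true positive, and `n_gt` is the ground-truth instance count."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / n_gt
        ap += (recall - prev_recall) * (tp / (tp + fp))  # rectangle rule
        prev_recall = recall
    return ap

# Example: 10 TPs, 3 FPs, 2 FNs -> precision ~0.769, recall ~0.833.
print(precision_recall(10, 3, 2))
```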
4.3. Baseline Models
- Faster R-CNN [29]: This method uses a ResNet network for feature extraction and a Region Proposal Network (RPN) to generate bounding boxes, followed by a k-means algorithm for post-processing and filtering.
- SSD [18]: This method also uses a ResNet network for feature extraction and improves the accuracy of students’ classroom behavior detection by integrating a Feature Pyramid Network (FPN), which particularly strengthens the detection of small objects.
- YOLOv3 [23]: This method adopts the Darknet-53 network as the backbone for feature extraction and incorporates an FPN for multi-scale detection; the classification stage uses an SVM classifier for behavior classification.
- YOLOv5 [20]: This method extracts features from the input images with the CSPDarknet network, introduces the Spatial Intersection over Union (SIoU) loss function in the Convolutional Block Layer (CBL) module, and employs the GELU activation function.
- YOLOv7 [14]: This method employs CBSDarknet as the backbone network and fuses features across different hierarchy levels with the Path Aggregation Network (PAN) and Feature Pyramid Network (FPN); detection results are produced by Efficient Layer Aggregation Network (ELAN) and Category-aware Transformation (CAT) modules serving as the detection heads.
4.4. Experimental Settings
4.5. Comparison Experiments with the State-of-the-Art Methods
4.6. Ablation Experiments
4.7. Case Study
4.7.1. Representation Capability of the Proposed Method
4.7.2. Sensitivity Analysis of the Proposed Method
4.7.3. Multiple Case Studies
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
FPN | Feature Pyramid Network |
CNNs | Convolutional Neural Networks |
SSD | Single Shot MultiBox Detector |
YOLO | You Only Look Once |
ViT | Vision Transformer |
DETR | Detection Transformer |
W-MSA | Windowed Multihead Self-Attention |
SW-MSA | Shifted-Window Multihead Self-Attention |
MLP | Multilayer Perceptron |
LN | Layer Normalization |
CARAFE | Content-Aware ReAssembly of FEatures |
mAP | mean Average Precision |
RPN | Region Proposal Network |
CBL | Convolutional Block Layer |
PAN | Path Aggregation Network |
ELAN | Efficient Layer Aggregation Network |
CAT | Category-aware Transformation |
References
- Li, L.; Wang, Z.; Zhang, T. GBH-YOLOv5: Ghost Convolution with BottleneckCSP and Tiny Target Prediction Head Incorporating YOLOv5 for PV Panel Defect Detection. Electronics 2023, 12, 561.
- Wang, Z.; Yao, J.; Zeng, C.; Wu, W.; Xu, H.; Yang, Y. YOLOv5 Enhanced Learning Behavior Recognition and Analysis in Smart Classroom with Multiple Students. In Proceedings of the 2022 International Conference on Intelligent Education and Intelligent Research (IEIR), Wuhan, China, 18–20 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 23–29.
- Bhanji, F.; Gottesman, R.; de Grave, W.; Steinert, Y.; Winer, L.R. The retrospective pre–post: A practical method to evaluate learning from an educational program. Acad. Emerg. Med. 2012, 19, 189–194.
- Bunce, D.M.; Flens, E.A.; Neiles, K.Y. How long can students pay attention in class? A study of student attention decline using clickers. J. Chem. Educ. 2010, 87, 1438–1443.
- Chang, J.J.; Lin, W.S.; Chen, H.R. How attention level and cognitive style affect learning in a MOOC environment? Based on the perspective of brainwave analysis. Comput. Hum. Behav. 2019, 100, 209–217.
- Kuh, G.D. What we’re learning about student engagement from NSSE: Benchmarks for effective educational practices. Chang. Mag. High. Learn. 2003, 35, 24–32.
- Ashwin, T.; Guddeti, R.M.R. Unobtrusive behavioral analysis of students in classroom environment using non-verbal cues. IEEE Access 2019, 7, 150693–150709.
- Jain, D.K.; Zhang, Z.; Huang, K. Multi angle optimal pattern-based deep learning for automatic facial expression recognition. Pattern Recognit. Lett. 2020, 139, 157–165.
- Muhammad, K.; Hussain, T.; Baik, S.W. Efficient CNN based summarization of surveillance videos for resource-constrained devices. Pattern Recognit. Lett. 2020, 130, 370–375.
- Sindagi, V.A.; Patel, V.M. A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recognit. Lett. 2018, 107, 3–16.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
- Wenchao, L.; Meng, H.; Yuping, Z.; Shuai, L. Research on intelligent recognition algorithm of college students’ classroom behavior based on improved SSD. In Proceedings of the 2022 IEEE 2nd International Conference on Computer Communication and Artificial Intelligence (CCAI), Beijing, China, 6–8 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 160–164.
- Wang, Z.; Yan, W.; Zeng, C.; Tian, Y.; Dong, S. A Unified Interpretable Intelligent Learning Diagnosis Framework for Learning Performance Prediction in Intelligent Tutoring Systems. Int. J. Intell. Syst. 2023, 2023, 4468025.
- Ren, X.; Yang, D. Student behavior detection based on YOLOv4-Bi. In Proceedings of the 2021 IEEE International Conference on Computer Science, Artificial Intelligence and Electronic Engineering (CSAIEE), Virtual, 20–22 August 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 288–291.
- Tang, L.; Xie, T.; Yang, Y.; Wang, H. Classroom Behavior Detection Based on Improved YOLOv5 Algorithm Combining Multi-Scale Feature Fusion and Attention Mechanism. Appl. Sci. 2022, 12, 6790.
- Hu, M.; Wei, Y.; Li, M.; Yao, H.; Deng, W.; Tong, M.; Liu, Q. Bimodal learning engagement recognition from videos in the classroom. Sensors 2022, 22, 5932.
- Zheng, Z.; Liang, G.; Luo, H.; Yin, H. Attention assessment based on multi-view classroom behaviour recognition. IET Comput. Vis. 2022.
- Zhang, Y.; Zhu, T.; Ning, H.; Liu, Z. Classroom student posture recognition based on an improved high-resolution network. EURASIP J. Wirel. Commun. Netw. 2021, 2021, 140.
- Shi, L.; Di, X. A recognition method of learning behaviour in English online classroom based on feature data mining. Int. J. Reason.-Based Intell. Syst. 2023, 15, 8–14.
- Pabba, C.; Kumar, P. An intelligent system for monitoring students’ engagement in large classroom teaching through facial expression recognition. Expert Syst. 2022, 39, e12839.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Li, L.; Wang, Z. Calibrated Q-Matrix-Enhanced Deep Knowledge Tracing with Relational Attention Mechanism. Appl. Sci. 2023, 13, 2541.
- Lyu, L.; Wang, Z.; Yun, H.; Yang, Z.; Li, Y. Deep Knowledge Tracing Based on Spatial and Temporal Representation Learning for Learning Performance Prediction. Appl. Sci. 2022, 12, 7188.
- Wang, Z.; Hou, Y.; Zeng, C.; Zhang, S.; Ye, R. Multiple Learning Features–Enhanced Knowledge Tracing Based on Learner–Resource Response Channels. Sustainability 2023, 15, 9427.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28.
- Agrawal, P.; Girshick, R.; Malik, J. Analyzing the performance of multilayer neural networks for object recognition. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VII 13; Springer: Cham, Switzerland, 2014; pp. 329–344.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Cham, Switzerland, 2016; pp. 21–37.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3651–3660.
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9650–9660.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022.
- Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Jin, J.; Feng, W.; Lei, Q.; Gui, G.; Wang, W. PCB defect inspection via Deformable DETR. In Proceedings of the 2021 7th International Conference on Computer and Communications (ICCC), Chengdu, China, 10–13 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 646–651.
- Shanliang, L.; Yunlong, L.; Jingyi, Q.; Renbiao, W. Airport UAV and birds detection based on deformable DETR. J. Phys. Conf. Ser. 2022, 2253, 012024.
- Gao, L.; Zhang, J.; Yang, C.; Zhou, Y. Cas-VSwin transformer: A variant swin transformer for surface-defect detection. Comput. Ind. 2022, 140, 103689.
- Kim, J.H.; Kim, N.; Won, C.S. Facial expression recognition with swin transformer. arXiv 2022, arXiv:2203.13472.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13; Springer: Cham, Switzerland, 2014; pp. 740–755.
- Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv 2017, arXiv:1706.02677.
- Zhou, B.; Zhao, H.; Puig, X.; Xiao, T.; Fidler, S.; Barriuso, A.; Torralba, A. Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vis. 2019, 127, 302–321.
- Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and flexible image augmentations. Information 2020, 11, 125.
- Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 2008, 77, 157–173.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Zhu, Z. Recognition and Application of Head-Down and Head-Up Behavior in Classroom Based on Deep Learning. Ph.D. Thesis, Central China Normal University, Wuhan, China, 2019.
Notation | Description |
---|---|
I | Classroom image signal |
B | Student behaviors |
i | Number of identified behaviors |
 | Size of the image |
C | Dimension of the image |
 | Feature map of the original image |
 | Feature map of the FPN |
N | Number of targets in a single image |
 | Parameters of the attention operation |
D | Dimension of Deformable DETR’s attention module |
 | Dimension of the current detection sequence |
 | Class results predicted by the model |
 | Bounding box results predicted by the model |
 | Weight ratio of the loss function |
Behaviors | Write | Read | Lookup | Turn_Head | Raise_Hand | Stand | Discuss |
---|---|---|---|---|---|---|---|
Instances | 1025 | 1075 | 5725 | 1025 | 725 | 94 | 242 |
Models | FLOPs (G) | mAP (0.50:0.95) | Precision | Recall | Image Size |
---|---|---|---|---|---|
Faster R-CNN | 23.38 | 0.491 | 0.617 | 0.643 | 224 × 224 |
SSD | 34.59 | 0.430 | 0.524 | 0.612 | 300 × 300 |
YOLOv3 | 77.1 | 0.378 | 0.378 | 0.572 | 640 × 640 |
YOLOv5 | 77.6 | 0.455 | 0.544 | 0.607 | 640 × 640 |
YOLOv7 | 104.7 | 0.583 | 0.707 | 0.722 | 640 × 640 |
Proposed | 33.21 | 0.605 | 0.738 | 0.751 | 224 × 224 |
Models | FLOPs (G) | mAP (0.50:0.95) | Precision | Recall | Image Size |
---|---|---|---|---|---|
Deformable DETR | 11.01 | 0.544 | 0.654 | 0.722 | 224 × 224 |
Deformable DETR + CSPDarknet | 85.49 | 0.488 | 0.606 | 0.663 | 640 × 640 |
Deformable DETR + Swin | 26.51 | 0.566 | 0.703 | 0.725 | 224 × 224 |
Deformable DETR + Swin + FPN | 28.97 | 0.593 | 0.717 | 0.736 | 224 × 224 |
Proposed | 33.21 | 0.605 | 0.738 | 0.751 | 224 × 224 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).