Author Contributions
Conceptualization, Y.H., W.D. and T.X.; methodology, Y.H., W.D. and T.X.; software, Y.H., W.D. and T.X.; validation, W.D. and T.X.; visualization, Y.H.; writing—original draft, Y.H. All authors have read and agreed to the published version of the manuscript.
Figure 1.
The network architecture of YOLOv8x-Improved.
Figure 2.
Schematic diagram of the MCC module.
Figure 3.
Schematic diagram of WIoU.
Figure 4.
Impact of WIoUv3 hyperparameter selection on the gradient gain curve [27].
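For context, Figure 4 plots the gradient gain r of WIoUv3's non-monotonic focusing mechanism. As defined in [27], with β the outlier degree of an anchor box and α, δ the hyperparameters varied in the figure:

$$
\mathcal{L}_{\text{WIoUv3}} = r\,\mathcal{L}_{\text{WIoUv1}},\qquad
r = \frac{\beta}{\delta\,\alpha^{\beta-\delta}},\qquad
\beta = \frac{\mathcal{L}^{*}_{\text{IoU}}}{\overline{\mathcal{L}}_{\text{IoU}}}\in[0,+\infty)
$$

where $\mathcal{L}^{*}_{\text{IoU}}$ is the IoU loss of the current anchor box (detached from the computation graph) and $\overline{\mathcal{L}}_{\text{IoU}}$ its running mean over training, so boxes of average quality receive the largest gradient gain.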
Figure 5.
Diagram of the adjusted EfficientNetV2 module.
Figure 6.
Comparison of features learned on the MNIST dataset with different loss functions. (a) Center Loss; (b) Cross-Entropy Loss.
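For reference, the center loss visualized in Figure 6a (in Wen et al.'s formulation, trained jointly with the softmax cross-entropy loss $\mathcal{L}_S$ of Figure 6b) pulls each deep feature toward its class center:

$$
\mathcal{L}_C = \frac{1}{2}\sum_{i=1}^{m}\bigl\lVert x_i - c_{y_i}\bigr\rVert_2^2,\qquad
\mathcal{L} = \mathcal{L}_S + \lambda\,\mathcal{L}_C
$$

where $x_i$ is the deep feature of sample $i$, $c_{y_i}$ the learned center of its class $y_i$, and $\lambda$ balances the two terms; this is what produces the tighter intra-class clusters in panel (a).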
Figure 7.
Structure diagram of the AAM module.
Figure 8.
Network architecture of FaceFeatXtractor.
Figure 9.
Network architecture of MultiEmoNet.
Figure 10.
Annotation examples from the CrowdHuman dataset [33].
Figure 11.
Example of video data collection. (a) Front-facing camera video; (b) Rear-facing camera video.
Figure 12.
Example of human keypoint annotation.
Figure 13.
Example of a generated skeleton diagram.
Figure 14.
Actual detection results using YOLOv8.
Figure 15.
Comparison of training results between YOLOv8x-Improved and YOLOv8x.
Figure 16.
Training and validation loss curves.
Figure 17.
Annotation example from the custom student classroom object detection dataset.
Figure 18.
Comparison of object detection. (a) Classroom A; (b) Classroom B.
Figure 19.
Confusion matrix of MultiEmoNet. (a) Face + Sk; (b) Face + Sk + Bg.
Figure 20.
(a) Detailed emotion statistics; (b) Emotion classification statistics.
Figure 21.
Emotion prediction results.
Figure 22.
Time series of classroom emotion statistics (a five-minute segment is shown).
Table 1.
Classification of 26 distinct emotions [12].
| No | Emotion Category | No | Emotion Category |
|---|---|---|---|
| 1 | Affection | 14 | Excitement |
| 2 | Anger | 15 | Fatigue |
| 3 | Annoyance | 16 | Fear |
| 4 | Anticipation | 17 | Happiness |
| 5 | Aversion | 18 | Pain |
| 6 | Confidence | 19 | Peace |
| 7 | Disapproval | 20 | Pleasure |
| 8 | Disconnection | 21 | Sadness |
| 9 | Disquietment | 22 | Sensitivity |
| 10 | Doubt/Confusion | 23 | Suffering |
| 11 | Embarrassment | 24 | Surprise |
| 12 | Engagement | 25 | Sympathy |
| 13 | Esteem | 26 | Yearning |
Table 2.
Parameters of the YOLOv8 model after replacing C2f modules with C2f_MCC.
| No | From | N | Params | Module | Arguments |
|---|---|---|---|---|---|
| 0 | −1 | 1 | 2320 | Conv | [3, 80, 3, 2] |
| 1 | −1 | 1 | 115,520 | Conv | [80, 160, 3, 2] |
| 2 | −1 | 3 | 436,800 | C2f | [160, 160, 3, True] |
| 3 | −1 | 1 | 461,440 | Conv | [160, 320, 3, 2] |
| 4 | −1 | 6 | 3,281,920 | C2f | [320, 320, 6, True] |
| 5 | −1 | 1 | 1,844,480 | Conv | [320, 640, 3, 2] |
| 6 | −1 | 6 | 9,509,760 (reduced by 27%) | C2f_MCC | [640, 640, 6, True] |
| 7 | −1 | 1 | 3,687,680 | Conv | [640, 640, 3, 2] |
| 8 | −1 | 3 | 5,165,760 (reduced by 26%) | C2f_MCC | [640, 640, 3, True] |
| 9 | −1 | 1 | 1,025,920 | SPPF | [640, 640, 5] |
| 10 | −1 | 1 | 0 | Upsample | [None, 2, 'nearest'] |
| 11 | [−1, 6] | 1 | 0 | Concat | [1] |
| 12 | −1 | 3 | 5,575,360 (reduced by 24%) | C2f_MCC | [1280, 640, 3] |
| 13 | −1 | 1 | 0 | Upsample | [None, 2, 'nearest'] |
| 14 | [−1, 4] | 1 | 0 | Concat | [1] |
| 15 | −1 | 3 | 1,948,800 | C2f | [960, 320, 3] |
| 16 | −1 | 1 | 922,240 | Conv | [320, 320, 3, 2] |
| 17 | [−1, 12] | 1 | 0 | Concat | [1] |
| 18 | −1 | 3 | 5,370,560 (reduced by 25%) | C2f_MCC | [960, 640, 3] |
| 19 | −1 | 1 | 3,687,680 | Conv | [640, 640, 3, 2] |
| 20 | [−1, 9] | 1 | 0 | Concat | [1] |
| 21 | −1 | 3 | 5,575,360 (reduced by 24%) | C2f_MCC | [1280, 640, 3] |
| 22 | [15, 18, 21] | 1 | 8,795,008 | Detect | [80, [320, 640, 640]] |
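As a quick consistency check on Table 2, the per-layer parameter counts can be summed and compared against the stock YOLOv8x total (approximately 68.2 M parameters per the Ultralytics model card, an assumption not stated in the table); the resulting ≈16% overall reduction agrees with the 109 MB vs. 130 MB model sizes reported in Table 6:

```python
# Sum the Params column of Table 2 (layers 0-22) and compare against the
# stock YOLOv8x parameter count (~68.2 M per the Ultralytics model card).
improved_params = [
    2_320, 115_520, 436_800, 461_440, 3_281_920, 1_844_480,
    9_509_760, 3_687_680, 5_165_760, 1_025_920, 0, 0,
    5_575_360, 0, 0, 1_948_800, 922_240, 0,
    5_370_560, 3_687_680, 0, 5_575_360, 8_795_008,
]
total = sum(improved_params)          # 57,406,608
baseline = 68_200_000                 # YOLOv8x, approximate
print(f"YOLOv8x-Improved: {total:,} params")
print(f"Reduction vs. YOLOv8x: {1 - total / baseline:.1%}")  # ~15.8%
```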
Table 3.
Number of annotations in the image emotion dataset.
| Emotion Category | Count |
|---|---|
| Doubt | 3857 |
| Engagement | 5987 |
| Fatigue | 4083 |
| Disconnection | 7696 |
| Pleasure | 3779 |
| Peace | 8078 |
| Total | 33,481 |
Table 4.
Experimental configuration.
| Experimental Environment | Configuration |
|---|---|
| Operating system | Windows 10 |
| CPU model | Intel i7-12700KF |
| Memory | 32 GB |
| GPU model | NVIDIA RTX 3080 (10 GB) |
| Computing platform | CUDA 11.6 |
| Programming language | Python 3.8 |
| Development IDE | PyCharm 2022.2.2 |
| Deep learning framework | PyTorch 11.6 |
Table 5.
Training hyperparameters.
| Parameter Name | Value |
|---|---|
| Image size | 640 × 640 |
| Load pretrained weights | True |
| Training epochs | 100 |
| Optimizer | SGD |
| Initial learning rate | 0.01 |
| Final learning rate | 0.001 |
| SGD momentum | 0.937 |
| Weight decay | 0.0005 |
| Warmup epochs | 3.0 |
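For reproducibility, the Table 5 settings map directly onto the Ultralytics training API. A minimal sketch, assuming the standard ultralytics package; the dataset YAML path is a placeholder, and lrf is the final learning rate expressed as a fraction of lr0, so 0.001 corresponds to lrf = 0.1:

```python
from ultralytics import YOLO

# Load pretrained YOLOv8x weights (Table 5: "Load pretrained weights = True").
model = YOLO("yolov8x.pt")

# Train with the hyperparameters of Table 5. "classroom.yaml" is a
# placeholder for the dataset configuration file.
model.train(
    data="classroom.yaml",
    imgsz=640,            # image size 640 x 640
    epochs=100,           # training epochs
    optimizer="SGD",
    lr0=0.01,             # initial learning rate
    lrf=0.1,              # final LR = lr0 * lrf = 0.001
    momentum=0.937,       # SGD momentum
    weight_decay=0.0005,
    warmup_epochs=3.0,
)
```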
Table 6.
Comparison of detection results on the CrowdHuman dataset.
| Model | Precision | Recall | mAP50 | mAP50-95 | Model Size (MB) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| SSD | 0.661 | 0.614 | 0.638 | 0.482 | 105.1 | 15.0 | 43 |
| Faster R-CNN (ResNet + FPN) | 0.684 | 0.621 | 0.662 | 0.524 | 57.22 | 134.5 | 18 |
| YOLOv7x | 0.842 | 0.721 | 0.845 | 0.581 | 135 | 188 | 39 |
| YOLOv8x | 0.847 | 0.769 | 0.851 | 0.603 | 130 | 258.1 | 40 |
| YOLOv8x-Improved | 0.853 | 0.772 | 0.862 | 0.611 | 109 | 232.2 | 48 |
Table 7.
Performance comparison of different GPU configurations.
| Hardware Configuration | CPU | GPU | Inference Time per Image (ms) | FPS |
|---|---|---|---|---|
| 1 | Intel i7-12700KF | NVIDIA RTX 2080 | 24.39 | 41 |
| 2 | Intel i7-12700KF | NVIDIA RTX 2080 Super | 22.22 | 45 |
| 3 | Intel i7-12700KF | NVIDIA RTX 3080 | 20.83 | 48 |
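The FPS values in Table 7 are simply the reciprocals of the per-image latencies; a one-line check:

```python
# FPS is the reciprocal of the per-image inference time in milliseconds.
for latency_ms in (24.39, 22.22, 20.83):
    print(f"{latency_ms} ms -> {1000 / latency_ms:.0f} FPS")
# 24.39 ms -> 41 FPS, 22.22 ms -> 45 FPS, 20.83 ms -> 48 FPS
```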
Table 8.
Comparison of results on the custom classroom object detection dataset.
| Model | Precision | Recall | mAP50 | mAP50-95 |
|---|---|---|---|---|
| YOLOv8x-COCO | 0.744 | 0.698 | 0.770 | 0.479 |
| YOLOv8x | 0.781 | 0.831 | 0.851 | 0.566 |
| YOLOv8x-Improved | 0.802 | 0.852 | 0.863 | 0.572 |
Table 9.
Comparison of experiments using facial information only.
| Model | Precision |
|---|---|
| ResNet50 | 0.689 |
| VGG16 | 0.678 |
| MobileNetV3 | 0.669 |
| EfficientNetV2-S | 0.702 |
| FaceFeatXtractor | 0.721 |
Table 10.
Comparison of experiments using skeleton information only.
| Model | Precision |
|---|---|
| SVM | 0.516 |
| KNN | 0.503 |
| Random Forest | 0.552 |
| EfficientNetV2-S | 0.601 |
Table 11.
Comparison of experiments using background information only.
| Model | Precision |
|---|---|
| ResNet50 | 0.794 |
| VGG16 | 0.774 |
| MobileNetV3 | 0.771 |
| EfficientNetV2-S | 0.813 |
Table 12.
Comparison of experiments using multi-channel information.
| Input Combination | Precision |
|---|---|
| Face + Sk | 0.841 |
| Face + Sk + Bg | 0.914 |
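The gain from 0.841 to 0.914 in Table 12 comes from fusing a third feature stream. The following is a minimal sketch of three-branch late fusion by feature concatenation, with hypothetical feature dimensions; it is not the exact MultiEmoNet head, whose architecture is given in Figure 9:

```python
import torch
import torch.nn as nn

class ThreeBranchFusion(nn.Module):
    """Toy late-fusion head: concatenate face, skeleton, and background
    feature vectors, then classify into the six emotion categories of
    Table 3. Feature dimensions are illustrative, not MultiEmoNet's."""

    def __init__(self, face_dim=512, sk_dim=256, bg_dim=512, num_classes=6):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(face_dim + sk_dim + bg_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, face_feat, sk_feat, bg_feat):
        # Late fusion: concatenate the three feature streams along the
        # channel dimension before classification.
        fused = torch.cat([face_feat, sk_feat, bg_feat], dim=1)
        return self.classifier(fused)

# Usage with dummy features for a batch of 4 samples.
head = ThreeBranchFusion()
logits = head(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 6])
```

Concatenation is the simplest fusion choice; attention-based weighting of the streams is a common alternative when one modality is unreliable for some samples.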