Improved Video Action Recognition Based on Pyramid Pooling and Dual-Stream C3D Networks
Abstract
1. Introduction
2. Methodology
2.1. Static and Dynamic Feature Extraction for Character Movement
2.1.1. First Frame Static Feature Extraction Based on CNN
2.1.2. Trajectory-Based Motion Frame Generation
Algorithm 1. DBSCAN-Based Trajectory-Clustering Algorithm
Input: D: region of significant motion in the frame
ε: radius parameter of the cluster
MinPoints: neighborhood density threshold
Output: set of density-based clusters
1: c ← 0
2: for each P ∈ D do
3:   if P is visited then
4:     continue
5:   end if
6:   NeighborPts = getAllPoints(P, ε)
7:   if size(NeighborPts) < MinPoints then
8:     mark P as noise
9:   else
10:    c = next cluster
11:    addToCluster(P, NeighborPts, c, ε, MinPoints)
12:  end if
13:  mark P as visited
14: end for
15: function addToCluster(P, NeighborPts, c, ε, MinPoints)
16:   add P to cluster c
17:   for each point np ∈ NeighborPts do
18:     if np is not visited then
19:       mark np as visited
20:       NeighborPts′ = getAllPoints(np, ε)
21:       if size(NeighborPts′) ≥ MinPoints then
22:         NeighborPts ← NeighborPts joined with NeighborPts′
23:       end if
24:     end if
25:     if np is not yet a member of any cluster then
26:       add np to cluster c
27:     end if
28:   end for
29: end function
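To make the pseudocode concrete, the following is a minimal NumPy sketch of Algorithm 1. The names (`region_query`, `dbscan`, `eps`, `min_points`) and the Euclidean neighborhood query are illustrative assumptions; the paper's `getAllPoints` may use a different distance.

```python
import numpy as np

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (Euclidean)."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    return list(np.flatnonzero(dists <= eps))

def dbscan(points, eps, min_points):
    """Label each point: -1 = noise, 0, 1, ... = cluster id."""
    n = len(points)
    labels = np.full(n, -1)            # all points start as noise
    visited = np.zeros(n, dtype=bool)
    cluster = -1
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_points:
            continue                   # p remains marked as noise
        cluster += 1                   # open the next cluster
        labels[p] = cluster
        queue = list(neighbors)
        while queue:                   # expand through density-reachable points
            q = queue.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbors = region_query(points, q, eps)
                if len(q_neighbors) >= min_points:
                    queue.extend(q_neighbors)   # q is a core point: keep growing
            if labels[q] == -1:
                labels[q] = cluster    # attach q to the current cluster
    return labels

# Example: points would be 2-D coordinates sampled from the motion region D.
labels = dbscan(np.random.rand(200, 2), eps=0.05, min_points=5)
```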
Algorithm 2. Chebyshev-distance-based noise-reduction algorithm [7]
Input: Dis: largest Chebyshev distance in the cluster
C: a cluster in the frame
Output: the set of cluster points after noise reduction
1: totalPoints ← points within Dis of the center of C
2: currentPoints ← totalPoints
3: while true do
4:   if COUNT(currentPoints) < COUNT(totalPoints) × 0.8 then return currentPoints
5:   end if
6:   Dis ← Dis − 1
7:   currentPoints ← points within Dis of the center of C
8: end while
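A matching sketch of Algorithm 2, assuming the Chebyshev distance (the largest coordinate-wise deviation) is measured from the cluster centroid and that the radius `Dis` is shrunk in integer steps; the 0.8 retention threshold comes from line 4 of the pseudocode, while the function name and centroid choice are assumptions.

```python
import numpy as np

def chebyshev_denoise(cluster, dis, keep_ratio=0.8):
    """Shrink the Chebyshev radius until fewer than keep_ratio of the
    original in-radius points remain, then return the reduced set."""
    center = cluster.mean(axis=0)
    # Chebyshev distance: largest coordinate-wise deviation from the center.
    cheb = np.max(np.abs(cluster - center), axis=1)
    total_points = cluster[cheb <= dis]
    if len(total_points) == 0:
        return total_points            # nothing inside the initial radius
    current_points = total_points
    while True:
        if len(current_points) < keep_ratio * len(total_points):
            return current_points
        dis -= 1
        current_points = cluster[cheb <= dis]
```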
2.1.3. Dynamic Feature Extraction Based on Motion Tubes
2.2. Feature Fusion Based on Dynamic and Static Features
2.2.1. Feature Fusion Based on Cholesky Variation
2.2.2. Gaussian Distribution-Based Feature Fusion
2.2.3. PCA-Based Feature Fusion
2.3. GRU-Based Video Character Classification
2.3.1. GRU-Based Video Character Classification Model
2.3.2. Video Classification Process for GRU Networks
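Alongside the classification process described in Section 2.3, a minimal PyTorch sketch of the GRU classification head may help; all sizes (`feat_dim`, `hidden`, `num_classes`) are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Minimal GRU head: a sequence of per-frame fused features in,
    action-class scores out (sizes are illustrative)."""
    def __init__(self, feat_dim=512, hidden=256, num_classes=101):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):              # x: (batch, frames, feat_dim)
        _, h = self.gru(x)             # h: (1, batch, hidden), final state
        return self.fc(h[-1])          # class scores from the last hidden state

# Example: 4 clips of 16 frames, each frame a 512-dim fused feature.
logits = GRUClassifier()(torch.randn(4, 16, 512))
```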
2.4. Classification of Movements
2.4.1. Three-Dimensional Convolutional Neural Network Variable-Scale-Based Feature Extraction
2.4.2. Video Character Behaviour Classification Network Based on Dual-Stream Improved C3D
3. Experimental Section
3.1. Experiment 1: Experimental Study of Video Behaviour Recognition with Different Feature-Contribution Ratios
3.2. Experiment 2: Behavioural Feature Extraction for Variable-Scale Video
3.3. Experiment 3: Video Character Behaviour Classification Experiment Based on Dual-Stream Improved C3D Network
3.4. Experimental Results and Analyses
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Liu, W.T.; Liu, M.F.; Liu, H.H. A review of human behaviour recognition based on deep learning. Mod. Inf. Technol. 2024, 8, 50–55.
2. Huang, S.; Mao, J. A surface EMG gesture recognition method based on improved deep forest. J. Shanghai Univ. Eng. Technol. 2023, 37, 190–197.
3. Zhang, Y.C. Research on Human Movement Recognition Algorithm Based on Ultra-Wideband Radar. Master’s Thesis, Liaoning University of Engineering and Technology, Liaoning, China, 2022.
4. Li, R. Research on Key Technology of Intelligent Identification of Passenger Flow in High-Speed Railway Stations Based on Deep Learning. Ph.D. Thesis, China Academy of Railway Science, Beijing, China, 2022.
5. Su, X.Y. Research and Application of Human Movement Behaviour Recognition Method Based on AlphaPose and LSTM. Master’s Thesis, Southwest Jiaotong University, Chengdu, China, 2022.
6. Liu, Y.; Zhang, L.; Xin, S.; Zhang, Y. Deep learning web video action classification incorporating spatio-temporal attention mechanism. Chin. Sci. Technol. Pap. 2022, 17, 281–287.
7. Wang, C.; Wei, Z.L.; Chen, S.H. An action recognition method for borderless applications based on self-attention mechanism. Comput. Res. Dev. 2022, 59, 1092–1104.
8. Zhang, A. Human Action Recognition in Table Tennis Based on Improved GoogLeNet Network. Master’s Thesis, Shenyang University of Technology, Shenyang, China, 2021.
9. Wang, F. Through-Wall Radar Human Action Recognition Based on Feature Enhancement and Shallow Neural Network. Master’s Thesis, Taiyuan University of Technology, Taiyuan, China, 2021.
10. Bai, F. Action Recognition Based on Channel State Information; China University of Mining and Technology: Xuzhou, China, 2021.
11. Gong, F.M.; Ma, Y.H. Research on human action recognition based on spatio-temporal two-branch network. Comput. Technol. Dev. 2020, 30, 23–28.
12. Yang, J.T. Human Continuous Action Recognition Based on LSTM. Master’s Thesis, Xi’an University of Technology, Xi’an, China, 2020.
13. Chu, J.H.; Zhang, S.; Tang, W.H.; Lu, W. Driving behaviour recognition method based on tutor-student network. Adv. Lasers Optoelectron. 2020, 57, 211–218.
14. Chen, X.H. Research on Limb Movement Recognition Based on 3D Skeleton. Master’s Thesis, University of Electronic Science and Technology, Chengdu, China, 2019.
15. Ding, H.J.; Gong, F.M. Human activity state recognition and localisation based on time series analysis. Comput. Technol. Dev. 2019, 29, 82–86+90.
16. Han, A. Multimodal action recognition based on deep learning framework. Comput. Mod. 2017, 7, 48–52.
17. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 1–9.
18. Tran, D.; Ray, J.; Le, Q.V. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
19. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Ng, A.Y. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732.
20. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941.
21. Du, Y.; Wang, L.; Wang, X. Hierarchical recurrent neural network for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 111–119.
22. Liu, Z.; Shah, M.; Gool, L.V. Spatiotemporal convolutional networks for video action recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1260–1269.
23. Choutas, V.; Kompatsiaris, I.; Ferrari, V. Pseudo-3D residual networks for action recognition in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5238–5246.
24. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6296–6305.
25. Ng, X.; Socher, R. Action recognition with attention-based LSTMs. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 19–24.
26. Wu, Z.; Xiong, Y.; Yu, S. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 284–293.
27. Zheng, T.H.; Long, W.; Shen, B.; Zhang, Y.J.; Lu, Y.J.; Ma, K.J. Seismic stability assessment of single-layer reticulated dome structures by the development of deep learning. Int. J. Struct. Stab. Dyn. 2024.
28. Wang, M.D.; Zhang, X.L.; Chen, S.Q.; Li, X.M.; Zhang, Y. Modeling the skeleton-language uncertainty for 3D action recognition. Neurocomputing 2024, 608, 128426.
29. Wu, G.S.; Wen, C.H.; Jiang, H.C. Wushu movement recognition system based on DTW attitude matching algorithm. Entertain. Comput. 2025, 52, 100877.
30. Su, Y.X.; Zhao, Q. Efficient spatio-temporal network for action recognition. J. Real-Time Image Process. 2024, 21, 158.
Dynamic and Static Ratios | Relationship Between p1 and p2 | Feature Vector
---|---|---
20% S, 80% M | p1 = 4p2 |
40% S, 60% M | 2p1 = 3p2 |
60% S, 40% M | 3p1 = 2p2 |
80% S, 20% M | 4p1 = p2 |
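The relationships in the table are consistent with reading p1 and p2 as fusion coefficients whose ratio mirrors the motion/static split (e.g., 20% S and 80% M gives p1 = 4p2). Below is a hedged sketch under that reading, with p1 weighting the motion feature and p2 the static feature and the coefficients normalized to sum to 1; this mapping of p1/p2 onto the two streams is our assumption, not stated in the table.

```python
import numpy as np

def fusion_weights(s_share, m_share):
    """Map a static/motion split (e.g., 20, 80) to coefficients with
    s_share * p1 == m_share * p2, normalized so p1 + p2 == 1."""
    p1 = m_share / (s_share + m_share)   # assumed to weight the motion feature
    p2 = s_share / (s_share + m_share)   # assumed to weight the static feature
    return p1, p2

def fuse(static_feat, motion_feat, s_share, m_share):
    """Weighted sum of two same-dimensional feature vectors."""
    p1, p2 = fusion_weights(s_share, m_share)
    return p1 * np.asarray(motion_feat) + p2 * np.asarray(static_feat)

# The 20% static / 80% motion row of the table: p1 = 4 * p2.
assert fusion_weights(20, 80) == (0.8, 0.2)
```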
Ratio (S:M) | UCF101 | Hollywood2
---|---|---
S:M = 8:2 | 96.5% | 76.9%
S:M = 6:4 | 95.2% | 75.3%
S:M = 4:6 | 94.9% | 74.7%
S:M = 2:8 | 91.4% | 71.4%
Training Modality | Original 3D CNN | 3D CNN with 3D Pyramid Pooling
---|---|---
Fixed scale (training from scratch) | 77.8% | 80.2%
Fixed scale (using pre-trained weights) | 82.1% | 82.4%
Variable scale (using pre-trained weights) | Unsupported | 83.8%
Network Architecture | Accuracy | Parameters (Millions)
---|---|---
Original 3D CNN | 82.1% | 77.9
Two-level pyramid pooling | 82.7% | 54.9
Three-level pyramid pooling | 83.6% | 88.4
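The variable-scale support shown above works because pyramid pooling emits a fixed-length descriptor regardless of input size. Here is a minimal PyTorch sketch, using adaptive 3-D max pooling as a stand-in for the paper's pyramid pooling layer; the level sizes (1, 2, 4) are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def pyramid_pool3d(feat, levels=(1, 2, 4)):
    """Pool a 5-D feature map (N, C, T, H, W) at several 3-D grid sizes
    and concatenate, giving a fixed-length vector for any input scale."""
    n, c = feat.shape[:2]
    pooled = [F.adaptive_max_pool3d(feat, output_size=l).reshape(n, -1)
              for l in levels]
    return torch.cat(pooled, dim=1)   # length = c * sum(l**3 for l in levels)

# Two different input scales map to the same descriptor length.
a = pyramid_pool3d(torch.randn(1, 512, 8, 14, 14))
b = pyramid_pool3d(torch.randn(1, 512, 16, 28, 28))
assert a.shape == b.shape
```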
Fusion Strategy | UCF101 | HMDB51
---|---|---
Early fusion | 89.23% | 64.32%
Late fusion | 87.34% | 58.43%
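The two strategies in this table differ only in where the streams meet: before or after classification. A short sketch of both follows; the concatenation rule and score averaging are illustrative assumptions, not necessarily the paper's exact fusion operators.

```python
import torch

def early_fusion(static_feat, motion_feat, classifier):
    """Early fusion: combine stream features first, classify once."""
    return classifier(torch.cat([static_feat, motion_feat], dim=1))

def late_fusion(static_logits, motion_logits):
    """Late fusion: classify each stream, then average class scores."""
    return (static_logits.softmax(dim=1) + motion_logits.softmax(dim=1)) / 2
```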
Method | Top-1 Accuracy (%) | Top-5 Accuracy (%)
---|---|---
Traditional RNN | 75.2 | 92.1
3D-CNN | 84.3 | 96.3
GRU + 3D-CNN | 89.6 | 98.1
Method | Top-1 Accuracy (%) | Top-5 Accuracy (%)
---|---|---
Traditional RNN | 61.7 | 85.5
3D-CNN | 70.2 | 90.8
GRU + 3D-CNN | 75.8 | 92.5
Model | Training Time (Hours)
---|---
Traditional RNN | 10
3D-CNN | 12
GRU + 3D-CNN | 14
Method | UCF101 Top-1 Accuracy (%) | HMDB51 Top-1 Accuracy (%) | Hollywood2 mAP (%) | Something-Something V1 Top-1 Accuracy (%)
---|---|---|---|---
C3D | 75.8 | 46.8 | 59.7 | 32.1
Dual-Stream Early Fusion | 81.2 | 50.1 | 63.4 | 35.4
Dual-Stream Late Fusion | 83.7 | 52.5 | 65.9 | 37.2
Pyramid Pooling + C3D | 85.9 | 55.3 | 68.2 | 40.8
Dual-Stream C3D (Early) | 87.1 | 56.4 | 70.5 | 42.3
Dual-Stream C3D (Late) | 88.3 | 57.8 | 71.8 | 43.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).