Article

Human Action Recognition Based on 3D Convolution and Multi-Attention Transformer

1 College of Information Science & Technology, Qingdao University of Science & Technology, Qingdao 266101, China
2 College of Information Engineering, Xinjiang Institute of Engineering, Urumqi 830023, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2695; https://doi.org/10.3390/app15052695
Submission received: 3 January 2025 / Revised: 24 February 2025 / Accepted: 25 February 2025 / Published: 3 March 2025

Abstract

To address the limitations of traditional two-stream networks, such as inadequate spatiotemporal information fusion, limited feature diversity, and insufficient accuracy, we propose an improved two-stream network for human action recognition that fuses a multi-scale attention Transformer with a 3D convolutional (C3D) network. In the temporal stream, the traditional 2D convolutional network is replaced with a C3D network to capture both temporal dynamics and spatial features. In the spatial stream, a multi-scale convolutional Transformer encoder is introduced to extract features. Using a multi-scale attention mechanism, the model captures and enhances features at several scales, which are then adaptively fused with a weighted strategy to improve feature representation. Furthermore, extensive experiments on feature fusion methods identify the optimal fusion strategy for the two-stream network. Experimental results on the UCF101 and HMDB51 benchmark datasets demonstrate that the proposed model achieves superior performance in action recognition tasks.
Keywords: multi-attention; multi-scale; two-stream network; action recognition; transformer; C3D
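
The two fusion steps named in the abstract (adaptive weighted fusion of multi-scale features, and fusion of the two streams) can be illustrated with a minimal sketch. The abstract does not specify how the weights are computed or which stream-fusion strategy was found optimal, so everything concrete here is an assumption: the softmax over per-scale scores, the convex combination of class scores, and the names `fuse_scales`, `fuse_streams`, `scale_logits`, and `alpha` are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features, scale_logits):
    """Weighted sum of same-length feature vectors, one vector per scale.

    Assumption: `scale_logits` stands in for learnable per-scale scores;
    softmax turns them into non-negative weights that sum to 1.
    """
    if len(features) != len(scale_logits):
        raise ValueError("need exactly one logit per scale")
    weights = softmax(scale_logits)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def fuse_streams(spatial_scores, temporal_scores, alpha=0.5):
    """Late fusion (assumed): convex combination of per-class scores."""
    return [alpha * s + (1.0 - alpha) * t
            for s, t in zip(spatial_scores, temporal_scores)]


# Toy data: three scales, each producing a 2-dimensional feature vector.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_features = fuse_scales(features, [0.0, 0.0, 0.0])  # equal logits -> plain mean

# Toy per-class scores from the spatial and temporal streams.
fused_scores = fuse_streams([0.2, 0.8], [0.6, 0.4], alpha=0.5)
predicted_class = max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

With equal logits the scale fusion reduces to a plain average, which makes the role of the weights easy to see: raising one logit shifts the fused vector toward that scale's features. In a trained model these scores would be learned rather than fixed by hand.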

Share and Cite

MDPI and ACS Style

Liu, M.; Li, W.; He, B.; Wang, C.; Qu, L. Human Action Recognition Based on 3D Convolution and Multi-Attention Transformer. Appl. Sci. 2025, 15, 2695. https://doi.org/10.3390/app15052695

Chicago/Turabian Style

Liu, Minghua, Wenjing Li, Bo He, Chuanxu Wang, and Lianen Qu. 2025. "Human Action Recognition Based on 3D Convolution and Multi-Attention Transformer" Applied Sciences 15, no. 5: 2695. https://doi.org/10.3390/app15052695
