Article

WLiT: Windows and Linear Transformer for Video Action Recognition

1 Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
2 School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
4 State Key Laboratory of Geomechanics and Geotechnical Engineering, Institute of Rock and Soil Mechanics, Chinese Academy of Sciences, Wuhan 430071, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(3), 1616; https://doi.org/10.3390/s23031616
Submission received: 13 December 2022 / Revised: 28 January 2023 / Accepted: 29 January 2023 / Published: 2 February 2023
(This article belongs to the Section Physical Sensors)

Abstract

The emergence of Transformer has driven rapid progress in video understanding, but it also brings high computational complexity. Some previous methods divide the feature maps into windows along the spatiotemporal dimensions and compute attention within each window; others down-sample the features during attention computation to reduce their spatiotemporal resolution. Although these approaches effectively reduce the complexity, there is still room for further optimization. We therefore present the Windows and Linear Transformer (WLiT) for efficient video action recognition, which combines Spatial-Windows attention with Linear attention. We first divide the feature maps into multiple windows along the spatial dimensions and compute attention separately within each window, which further reduces the computational complexity compared with previous methods. However, the receptive field of Spatial-Windows attention is small and cannot capture global spatiotemporal information. To address this problem, we then compute Linear attention along the channel dimension so that the model can capture complete spatiotemporal information. Through this mechanism, our method achieves better recognition accuracy with lower computational complexity. We conduct extensive experiments on four public datasets, namely Something-Something V2 (SSV2), Kinetics400 (K400), UCF101, and HMDB51. On the SSV2 dataset, our method reduces the computational complexity by 28% and improves the recognition accuracy by 1.6% compared to the State-Of-The-Art (SOTA) method. On the K400 and two other datasets, our method achieves SOTA-level accuracy while reducing the complexity by about 49%.
Keywords: action recognition; Spatial-Windows attention; linear attention; self-attention; transformer
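
The abstract describes two complementary attention mechanisms: Spatial-Windows attention computed within non-overlapping spatial windows of each frame, and Linear attention computed along the channel dimension to recover a global spatiotemporal receptive field. The following is a minimal PyTorch-style sketch of how such modules could look; the module names, tensor layout, window size, and scaling are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SpatialWindowAttention(nn.Module):
    # Self-attention restricted to non-overlapping spatial windows of each frame,
    # so the attention cost depends on the window size rather than the full frame.
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C) video feature map (assumed layout)
        B, T, H, W, C = x.shape
        ws = self.ws
        # Partition every frame into ws x ws windows and run attention per window.
        x = x.view(B, T, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, ws * ws, C)
        out, _ = self.attn(x, x, x)
        # Reverse the window partition back to (B, T, H, W, C).
        out = out.view(B, T, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, T, H, W, C)

class ChannelLinearAttention(nn.Module):
    # Attention along the channel dimension: the attention map is C x C instead of
    # N x N, so the cost grows linearly with the number of spatiotemporal tokens,
    # while every token contributes, giving a global spatiotemporal receptive field.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, C) with N = T * H * W flattened spatiotemporal tokens
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q.transpose(1, 2) @ k / x.shape[1], dim=-1)  # (B, C, C)
        return (attn @ v.transpose(1, 2)).transpose(1, 2)                 # (B, N, C)

# Example usage on a dummy clip feature map (shapes are assumptions):
x = torch.randn(2, 8, 14, 14, 96)                      # (B, T, H, W, C)
y = SpatialWindowAttention(96)(x)                      # local spatial attention
z = ChannelLinearAttention(96)(y.flatten(1, 3))        # global channel-wise attention

How the two attentions are interleaved inside the WLiT blocks, and how their outputs are fused, follows the architecture described in the paper itself.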

Share and Cite

MDPI and ACS Style

Sun, R.; Zhang, T.; Wan, Y.; Zhang, F.; Wei, J. WLiT: Windows and Linear Transformer for Video Action Recognition. Sensors 2023, 23, 1616. https://doi.org/10.3390/s23031616
