1. Introduction
One of the most important tasks in video understanding is to understand human actions. The task of recognizing human actions in a video is called video action recognition [1]. The Transformer architecture is driving a new paradigm shift in computer vision, and researchers are rapidly adapting Transformer architectures to improve the accuracy and efficiency of the action recognition task [2]. A video frame sequence has an explicit sequential relationship, much like a sentence, which means that video understanding is highly similar to Natural Language Processing (NLP) tasks [3]. Therefore, rich context information is essential for video understanding tasks. This is one of the reasons that the Transformer structure [4], which is widely used in NLP, has received so much attention in the video field in recent years. A unit in a convolutional network depends only on a region of the input, and this region is the receptive field of that unit. Since any part of an input image outside the receptive field of a unit does not affect the value of that unit, the receptive field must be carefully controlled to ensure that it covers the entire relevant image region [5]. Another important reason for the widespread use of the Transformer structure is therefore that the size of the receptive field matters for many Computer Vision (CV) tasks, especially video understanding. The main previous approach is the three-dimensional convolutional neural network (3D CNN) [6,7,8,9], which uses convolutional kernels with multilevel architectures and down-sampling to gradually enlarge the receptive field. However, as the Transformer [10,11] has become widely used in the CV field over the last two years, capturing long-distance visual dependencies through attention layers has proven to be a better approach. While this approach has a powerful global context modeling capability, its computational complexity grows quadratically with the token length, limiting its ability to scale to high-resolution scenarios. Therefore, designing a more efficient model structure that can still capture global information remains an open issue, and much effort has been undertaken to modify the Transformer's structure in pursuit of this goal.
iGPT [10] uses the standard Transformer model to solve vision tasks, taking the pixels of the input image as the sequence to be processed, so that information interacts at the pixel level. Subsequently, ViT [11] divides the image into non-overlapping patches instead of pixels and, imitating the terminology of the NLP field [12], calls each patch a "token". This effectively reduces the computational effort, improves performance, and lays the foundation for later research. To further reduce the computational cost, several approaches propose to compute local attention within a window [13,14,15,16] or to perform spatial down-sampling operations on the tokens [17,18] during the attention computation. Such methods maintain good accuracy while significantly reducing the computational effort, but they still require operations such as shifting [13] or overlapping patches [17,18] to compensate for the lost spatiotemporal information. In addition, some methods note that the attention overhead lies mainly in the intermediate matrix operations [19,20,21]. In particular, the Softmax function restricts the order of the matrix computations, resulting in a complexity that is quadratic in the sequence length. These methods try to change the computational order of the matrices, but the results are not satisfactory. All of these methods trade accuracy against computational complexity, and researchers still pursue better methods that reduce the complexity while retaining the spatiotemporal information as completely as possible.
This paper aims to propose a more accurate and efficient action recognition method, which reduces the computational complexity as much as possible while ensuring recognition accuracy. Our work is inspired by the above works. Firstly, we divide the feature maps into multiple windows along the spatial dimensions and calculate attention inside each window separately. Unlike Swin [22], which divides the windows along the spatiotemporal dimensions, we divide the windows only along the spatial dimensions, which further reduces the computational complexity. This method also limits the receptive field of the model; that is, the model can only represent and reinforce the information inside each window and lacks the ability to mine long-distance dependencies. Therefore, we consider that features extracted from different dimensions may help to solve this problem. In traditional attention calculations [16], each feature in the spatiotemporal dimensions is called a spatiotemporal token, which carries complete channel information. Conversely, if we transpose the dimensions of the feature maps, each feature in the channel dimension becomes a channel token, which carries complete spatiotemporal information. As shown in Figure 1, we divide the feature maps along the spatial and channel dimensions to obtain Spatial-Windows tokens and Channel tokens, respectively. After the Spatial-Windows attention calculation, we use the Channel tokens for a further attention calculation, which we call Linear attention, to restore the model's ability to represent global information. Unlike some previous Linear attention methods, we still follow standard self-attention without introducing additional operations into the calculation process. To intuitively represent the logical positional relationship between Spatial-Windows attention and Linear attention, the other modules between the two attentions are omitted in Figure 1.
While such a method is theoretically feasible, there are still some challenges to overcome. In the attention calculation, a Spatial-Windows token takes the channel dimension as its hidden dimension, and a Channel token takes the spatiotemporal dimensions as its hidden dimensions; the complete channel or spatiotemporal information of a token is compressed into a single value by the matrix multiplications. This means that channel or spatiotemporal information is inevitably lost during these operations. Therefore, as shown in Figure 1, we use the two kinds of attention alternately to compensate for the information lost in the spatiotemporal and channel dimensions.
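To make the two token views concrete, the following minimal PyTorch sketch shows how a feature map can be reshaped into Spatial-Windows tokens and Channel tokens; the tensor shapes and the window size `M` are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

# Illustrative shapes (assumptions, not the paper's exact values).
B, C, T, H, W = 1, 64, 4, 56, 56       # batch, channels, frames, height, width
M = 14                                  # hypothetical spatial window size
x = torch.randn(B, C, T, H, W)

# Spatial-Windows tokens: partition H and W (but not T) into M x M windows;
# every token keeps the complete channel information of dimension C.
win = x.view(B, C, T, H // M, M, W // M, M)
win = win.permute(0, 2, 3, 5, 4, 6, 1)          # (B, T, H/M, W/M, M, M, C)
win_tokens = win.reshape(-1, M * M, C)          # per window: M*M tokens of dim C

# Channel tokens: flatten space-time, so each of the C channels becomes one
# token that carries the complete spatiotemporal information of length T*H*W.
chan_tokens = x.flatten(2)                      # (B, C, T*H*W)

print(win_tokens.shape, chan_tokens.shape)
```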
To summarize, the main contributions of this article are as follows:
- (1)
We propose a complementary framework of Windows and Linear Transformer (WLiT), which ensures the ability of the model to capture global information while achieving efficient action recognition.
- (2)
We present the Spatial-Windows attention module that only divides the feature maps along the spatial dimensions, which further reduces the computational complexity.
- (3)
We fully analyze and discuss the computational complexity of the attention mechanism and theoretically justify our method.
- (4)
We conduct extensive experiments to verify our method. On the SSV2 dataset, our method achieves higher accuracy than the SOTA methods with lower computational complexity.
This paper consists of five parts: The first part is the introduction, which presents the action recognition task and the mainstream methods in this field and summarizes the methodological basis and main contributions of our study. The second part introduces work related to our research and describes the problems and optimization possibilities of previous methods in detail. The third part describes our research method in detail, not only introducing the overall structure of the method but also fully explaining each part of the model. The fourth part first introduces the datasets selected in this paper and the relevant experimental details and then presents sufficient experimental results to demonstrate the reliability of our research. The last part summarizes the full text and puts forward possible directions for further research.
3. Method
Our method builds on previous approaches and further innovates and optimizes them. We achieve low computational complexity through Spatial-Windows attention and ensure the model's ability to obtain global information through Linear attention and additional Feed-Forward-Network (FFN) modules. In addition, we use a concise adaptive position encoding module, which simply and efficiently keeps the positions of the tokens in the spatiotemporal and channel dimensions stable. In this section, we first give an overview of the overall structure of the model and analyze traditional spatiotemporal self-attention. Next, we detail the key components of the model one by one. Finally, we qualitatively analyze the computational complexity to prove the reliability of the proposed method and explain some key parameters of the model.
3.1. Overview of WLiT Architecture
We follow the structural settings of MViT [46] and other transformer-based methods [22] to facilitate a fairer comparison of results. As shown in Table 1, our WLiT is composed of 4 stages, each having several transformer blocks with a consistent channel dimension. At the beginning of the network, we sample and crop the video to obtain input features of size $8 \times 224 \times 224$ (8 is the number of frames, and 224 is the spatial resolution). WLiT initially projects the input to a channel dimension of 64 with overlapping spatiotemporal cubes. The resulting sequence, of length $4 \times 56 \times 56$, is reduced by a factor of 4 at each additional stage to a final sequence length of $4 \times 7 \times 7$ at the last stage. In tandem, the channel dimension is up-sampled by a factor of 2 at each stage, increasing to 512 at stage 4. To visually show the difference among stages, we also list the attention operator used in each stage and the number of stacked blocks.
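For reference, the stage layout described above can be summarized as follows; the attention assignment and the channel widths follow the text of this section, while the block counts per stage are left to Table 1.

```python
# A compact sketch of the 4-stage layout described in this subsection.
stages = [
    # (stage, attention operators,              channel dim)
    (1, "Spatial-Windows + Linear attention",   64),
    (2, "Spatial-Windows + Linear attention",   128),
    (3, "spatiotemporal self-attention",        256),
    (4, "spatiotemporal self-attention",        512),
]
```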
Figure 2 illustrates the architecture of our WLiT. Each block contains an attention module, two adaptive position encoding modules, and an FFN module. Depending on the stage, the attention module is Spatial-Windows attention, Linear attention, or spatiotemporal self-attention. Spatial-Windows attention and Linear attention are the core of this study and are therefore presented in more detail in Figure 1 (the Norm layers, the adaptive position encoding layers, and the FFN modules of Figure 2 are omitted there). A patch embedding layer is inserted before the start of each stage, together with an adaptive position encoding module. The spatiotemporal resolution and feature dimensions are kept constant within each stage. Moreover, the FFN module is an important part of the transformer structure; it introduces nonlinear feature activation and supplements the model's ability to capture information across all channels. Therefore, after each attention calculation, the features are activated by an FFN module.
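The following is a minimal PyTorch sketch of one such block, assuming LayerNorm, a 4x channel expansion in the FFN, and depth-wise 3D convolutions for the adaptive position encodings; these internals are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class WLiTBlock(nn.Module):
    def __init__(self, dim, attn, ffn_ratio=4):
        super().__init__()
        # Depth-wise 3D convolutions serve as the two adaptive position
        # encodings; their weights are not shared.
        self.pos1 = nn.Conv3d(dim, dim, 3, padding=1, groups=dim)
        self.pos2 = nn.Conv3d(dim, dim, 3, padding=1, groups=dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = attn  # Spatial-Windows, Linear, or spatiotemporal attention
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_ratio * dim), nn.GELU(),
            nn.Linear(ffn_ratio * dim, dim))

    def forward(self, x):  # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        x = x + self.pos1(x)                          # position encoding 1
        t = x.flatten(2).transpose(1, 2)              # (B, T*H*W, C) tokens
        t = t + self.attn(self.norm1(t))              # attention + residual
        x = t.transpose(1, 2).reshape(B, C, T, H, W)
        x = x + self.pos2(x)                          # position encoding 2
        t = x.flatten(2).transpose(1, 2)
        t = t + self.ffn(self.norm2(t))               # FFN + residual
        return t.transpose(1, 2).reshape(B, C, T, H, W)

block = WLiTBlock(64, attn=nn.Identity())  # any token-level attention module
out = block(torch.randn(1, 64, 4, 56, 56))
```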
It should be emphasized that the attention modules in the first two stages of the model differ from those in the last two stages. We use Spatial-Windows attention and Linear attention in the first two stages of the network and retain traditional spatiotemporal self-attention in the last two stages. This is a good trade-off between computational complexity and precision, reached after theoretical analysis and experimental verification; we provide a full analysis later in this section.
We first introduce the calculation process of spatiotemporal self-attention [3]. Assume that there is a video feature $X \in \mathbb{R}^{N \times C}$, where $N = T \times H \times W$ denotes the length of the feature sequence; $C$ denotes the channel dimension; and $T$, $H$, and $W$ mean time, height, and width, respectively. The feature $X$ is projected by adaptive matrices to generate $Q$, $K$, and $V$. Then we perform the calculation as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

After the matrix multiplication $QK^{T}$, the feature's dimensions change from $N \times C$ to $N \times N$. Here, $d$ means the number of channels of $K$. The obtained similarity matrix is normalized by the Softmax operation and multiplied by $V$, so that the feature's dimensions change from $N \times N$ back to $N \times C$. Then, the features are output from the FFN module. In this process, the computational complexity is $O(N^2C + NC^2)$, which is mainly positively related to the sequence length as well as the number of channels.
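As a concrete reference, a minimal single-head sketch of this computation is given below; the learned projections and the FFN are omitted, and the toy shapes are illustrative.

```python
import math
import torch

def self_attention(q, k, v):
    # q, k, v: (B, N, C) with N = T*H*W spatiotemporal tokens
    d = q.shape[-1]                                 # number of channels of K
    sim = q @ k.transpose(-2, -1) / math.sqrt(d)    # (B, N, N): O(N^2 * C)
    return sim.softmax(dim=-1) @ v                  # (B, N, C): O(N^2 * C)

q = k = v = torch.randn(2, 64, 32)                  # toy sizes
out = self_attention(q, k, v)                       # (2, 64, 32)
```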
In the first two stages of the model, $N \gg C$, so we strive to shrink the sequence length involved in each attention calculation to reduce the computational complexity. We achieve this through Spatial-Windows attention and then supplement the global information with Linear attention. The calculation processes of Spatial-Windows attention and Linear attention differ from traditional self-attention, and we introduce them in turn.
3.2. Spatial-Windows Attention
In the first two stages, the influence of the sequence length on the computational complexity dominates [47], so we introduce Spatial-Windows attention to limit the attention calculation to local windows. The whole feature map is divided evenly in a non-overlapping manner. Suppose there are $N_w$ windows; then each window contains $M^2$ patches, where $N_w = T \times \frac{H}{M} \times \frac{W}{M}$, $N = N_w \times M^2$, and $M$ is the window size. Additionally, note that we divide the feature maps into windows only along the spatial dimensions, which means that our windows have only two dimensions, $H$ and $W$. That is unlike the division method of Swin [22], and therefore each window contains fewer patches and is less computationally intensive:

$$\mathrm{Attention}(Q_w, K_w, V_w) = \mathrm{Softmax}\left(\frac{Q_w K_w^{T}}{\sqrt{d}}\right)V_w$$

where $Q_w$, $K_w$, and $V_w$ stand for the query, key, and value inside each window. The computational complexity of the process is $O(N M^2 C)$. It can be observed that there is a significant reduction in computational effort. However, the model loses the ability to mine global contextual information, which can be recovered by channel-based Linear attention.
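A minimal sketch of this procedure, assuming an illustrative window size and omitting the learned projections, might look as follows.

```python
import math
import torch

def spatial_windows_attention(x, M):
    # x: (B, C, T, H, W); H and W must be divisible by the window size M.
    B, C, T, H, W = x.shape
    win = x.view(B, C, T, H // M, M, W // M, M)
    win = win.permute(0, 2, 3, 5, 4, 6, 1)          # (B, T, H/M, W/M, M, M, C)
    win = win.reshape(-1, M * M, C)                 # tokens inside each window
    q = k = v = win                                 # projections omitted
    sim = q @ k.transpose(-2, -1) / math.sqrt(C)    # (n_win, M*M, M*M)
    out = sim.softmax(dim=-1) @ v                   # O(N * M^2 * C) overall
    # Restore the windows back to the original feature-map layout.
    out = out.view(B, T, H // M, W // M, M, M, C)
    return out.permute(0, 6, 1, 2, 4, 3, 5).reshape(B, C, T, H, W)

y = spatial_windows_attention(torch.randn(1, 64, 4, 56, 56), M=14)
```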
3.3. Channel-Based Linear Attention
Spatial-Windows attention is computed within single, non-overlapping windows. This means that there is no interaction of information among different windows, which severely limits the ability of the model to extract and represent features. We note that the Spatial-Windows tokens used in the calculation of Spatial-Windows attention contain all the channel information. So, if the feature maps are instead divided along the channel dimension, the obtained Channel tokens cover all the spatiotemporal information. By using Channel tokens for the attention computation, we can effectively realize the interaction of information among different windows. This improves the ability of the model to represent global information while maintaining a lightweight network. Therefore, after computing Spatial-Windows attention, we restore all windows to the original feature maps and then compute Linear attention on the entire feature maps:

$$\mathrm{Attention}(Q_c, K_c, V_c) = \mathrm{Softmax}\left(\frac{Q_c K_c^{T}}{\sqrt{d}}\right)V_c$$

where $Q_c$, $K_c$, and $V_c$ represent the query, key, and value of Linear attention, respectively. It can be noticed that, unlike in Spatial-Windows attention, the channels account for the main component of the computational complexity in this process. We still follow the computational process of spatiotemporal self-attention without much additional modification, which shows that the effectiveness derives from this structural design.
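A minimal sketch of the channel-based computation, again omitting the learned projections, is shown below; the transpose is the only change relative to standard self-attention.

```python
import math
import torch

def channel_attention(x):
    # x: (B, N, C) with N = T*H*W; transpose to C tokens of dimension N.
    q = k = v = x.transpose(1, 2)                   # (B, C, N), projections omitted
    d = q.shape[-1]
    sim = q @ k.transpose(-2, -1) / math.sqrt(d)    # (B, C, C): O(C^2 * N)
    out = sim.softmax(dim=-1) @ v                   # (B, C, N): O(C^2 * N)
    return out.transpose(1, 2)                      # back to (B, N, C)

x = torch.randn(2, 4 * 56 * 56, 64)
y = channel_attention(x)    # complexity grows with C^2, not N^2
```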
3.4. Adaptive Position Encoding
Compared with image tasks, video has an additional time dimension, which requires a more capable positional encoding layer to exploit the positional relationships of tokens. Instead of the usual absolute and relative position encodings [48,49], we choose a position encoding that adapts to different input lengths and requires no additional clipping or interpolation. The local inductive bias contained in convolution can naturally memorize image positions due to its unique characteristics, such as translation invariance [50]. Some previous works have used convolutions as a position encoding method [51]. We follow this idea but explore it a little more deeply. At the beginning of each stage, we insert an adaptive position encoding to ensure that the positions among tokens are stable. Our method includes a window-division operation; to ensure that this process does not affect the positional relationships of tokens, we add an additional adaptive position encoding module, which does not share weights, before the features are input into the FFN module. To keep the model simple, we choose depth-wise convolution to realize the adaptive position encoding. We also verify in the experiments that the adaptive position encoding improves the accuracy of the model while barely increasing the number of parameters and the computation.
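A minimal sketch of such a depth-wise-convolution encoding is given below; the kernel size and the residual form are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AdaptivePositionEncoding(nn.Module):
    def __init__(self, dim, kernel=3):
        super().__init__()
        # groups=dim makes the 3D convolution depth-wise: each channel is
        # filtered independently, keeping parameters and FLOPs very low.
        self.proj = nn.Conv3d(dim, dim, kernel, padding=kernel // 2, groups=dim)

    def forward(self, x):          # x: (B, C, T, H, W)
        return x + self.proj(x)    # positions are injected as a residual

pe = AdaptivePositionEncoding(64)
y = pe(torch.randn(1, 64, 4, 56, 56))  # adapts to any T, H, W
```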
3.5. Computational Complexity Analysis and Model Structure Design
We first analyze traditional spatiotemporal self-attention. The module that projects the feature maps to generate $Q$, $K$, and $V$ is a linear layer, and its computational complexity is related only to the number of channels, that is, $O(3NC^2)$. Then, in the attention calculation, the complexity of the $QK^{T}$ multiplication is $O(N^2C)$. The complexity of multiplying the weight matrix by $V$ is also $O(N^2C)$. The complexity of the final linear layer and the FFN module is $O(NC^2)$. So, the total computational complexity is $O(N^2C + NC^2)$.
However, when we calculate attention after dividing the feature maps into windows, the computational complexity becomes $O(N_w \cdot M^4 \cdot C) = O(N M^2 C)$, where $N_w$ means the number of windows, and $M^2$ is the square of the window size. To show the reduction of computational complexity more intuitively, we roughly conduct a quantitative analysis. Taking the first stage as an example, $N = 4 \times 56 \times 56 = 12544$ and $C = 64$, so the first part of the traditional spatiotemporal self-attention complexity is $N^2C \approx 1.0 \times 10^{10}$. The complexity of Spatial-Windows attention is only $N M^2 C \approx 1.6 \times 10^{8}$. Thus, the computational complexity is reduced by a factor of 64. Then there is Linear attention, computed along the channel dimension. The computational complexity of its other steps is the same as that of spatiotemporal attention, but the complexity of the attention matrix multiplication becomes $O(C^2N)$. So, the total computational complexity of Linear attention is $O(NC^2)$. At the first stage, $C \ll N$, so this term is comparatively small. Summing the computational complexities of Spatial-Windows attention and Linear attention, the overall complexity is reduced by a factor of 27.8.
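The arithmetic can be checked with a few lines of Python; the stage-1 shapes below are the values assumed in this section, and only the dominant matrix products are counted.

```python
# Rough numeric check of the complexity argument (assumed stage-1 shapes:
# N = 4*56*56 tokens, C = 64 channels, window size M = 14).
N, C, M = 4 * 56 * 56, 64, 14

full_attn   = N * N * C        # QK^T of spatiotemporal self-attention
window_attn = N * M * M * C    # the same product restricted to M x M windows
linear_attn = C * C * N        # channel-based Linear attention

print(full_attn / window_attn)                  # 64.0, matching the text
print(full_attn / (window_attn + linear_attn))  # ~48 under this simplified count
```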
In the first two stages, where $N \gg C$, we reduce the computational complexity from $O(N^2C)$ to $O(NM^2C + NC^2)$. In the last two stages of the network, the spatiotemporal resolution has been reduced to a relatively low level by the layer-by-layer down-sampling, which means that $N$ is no longer much larger than $C$. If we continued with the previous setup, the complexity would be higher than that of normal attention. Thus, we only use Spatial-Windows attention and Linear attention in the first two stages of the network.
We sample 8 frames from a video as input, and we first down-sample the spatiotemporal dimensions of the frames using non-overlapping convolutions, so that the input's dimensions change from $8 \times 224 \times 224$ to $4 \times 56 \times 56$. After each stage, the input's spatial resolution is further halved in each dimension by a convolutional layer. Besides, the input's channel dimension is doubled at each stage, rising from 64 to 512. We set the channel expansion rate of the FFN and the number of layers in the network stages completely following the settings of MViT [46]. Thus, the performance fluctuation caused by hyperparameter changes can be reduced as much as possible, and the effectiveness of our method itself is demonstrated. The specific network structure is shown in Table 1.
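A minimal sketch of this down-sampling path follows; the kernel and stride values (a 2x4x4 patch embedding and 1x2x2 per-stage reductions) are assumptions chosen to reproduce the shape changes discussed above.

```python
import torch
import torch.nn as nn

# Non-overlapping (stride == kernel) 3D convolutions; the kernel/stride
# values are illustrative assumptions, not the paper's confirmed settings.
patch_embed = nn.Conv3d(3, 64, kernel_size=(2, 4, 4), stride=(2, 4, 4))
stage_down = nn.Conv3d(64, 128, kernel_size=(1, 2, 2), stride=(1, 2, 2))

x = torch.randn(1, 3, 8, 224, 224)   # 8 sampled RGB frames
x = patch_embed(x)                   # -> (1, 64, 4, 56, 56), stage 1 input
x = stage_down(x)                    # -> (1, 128, 4, 28, 28), stage 2 input
print(x.shape)
```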