Article

MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Information Engineering College, Henan University of Science and Technology, Luoyang 471000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(5), 948; https://doi.org/10.3390/electronics13050948
Submission received: 30 January 2024 / Revised: 25 February 2024 / Accepted: 26 February 2024 / Published: 29 February 2024

Abstract

Nowadays, the field of video-based action recognition is developing rapidly. Although Vision Transformers (ViT) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNN) and related models perform exceptionally well in video action recognition, but issues such as high computational cost and large memory consumption cannot be ignored. Facing these issues, current research focuses on finding effective methods to improve model performance and overcome current limitations. Therefore, we present a Vision Transformer model based on multi-granularity and multi-scale fusion, designed for action recognition in videos, which effectively reduces computational cost and memory usage. Firstly, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Secondly, a hierarchical structure is utilized to manage information at various scales, and we introduce multi-granularity on top of multi-scale, which allows a selective choice of the number of tokens entering the next computational step, thereby reducing redundant tokens. Thirdly, a coarse-fine granularity fusion layer is introduced to reduce the sequence length of tokens with lower information content. These two mechanisms are combined to optimize the allocation of resources in the model, further emphasizing critical information and reducing redundancy, thereby minimizing computational cost. To assess the proposed approach, comprehensive experiments are conducted on benchmark datasets in the action recognition domain. The experimental results demonstrate that our method achieves state-of-the-art performance in terms of accuracy and efficiency.

1. Introduction

With the advent of the information age in human society, the rapid progress of the Internet has propelled significant advancements in video technology. Videos have become a fundamental medium for information transmission due to their expressiveness and effectiveness, which have been widely applied in various fields of daily life and business. Video streaming is growing at an exponential rate, and video understanding is one of the most important areas of research in computer vision and is developing quickly as well. Human action recognition has garnered considerable attention due to its high practicality in various applications such as video surveillance and human–computer interaction. Action recognition aims to identify and understand human actions and activities in video sequences. In the complex interplay of computer vision and machine learning, researchers are striving to explore innovative approaches to enhance the accuracy, efficiency, and robustness of action recognition systems.
Before the development of deep learning, action recognition mostly relied on traditional computer vision techniques. These traditional methods require manually extracting and representing important feature regions in videos as dense vectors or sparse sets of interest points. However, this manual extraction approach has obvious drawbacks: it relies heavily on human intervention, leading to inefficiency, and the extracted features are strongly subjective and lack sufficient robustness. With the emergence of deep learning, neural networks can automatically learn feature representations, overcoming the limitations of traditional methods; this not only enhances the performance of action recognition systems but also reduces the need for manual intervention. However, Convolutional Neural Networks (CNNs) still have some limitations and weaknesses. Firstly, they are sensitive to translation invariance and hyperparameters. Secondly, CNNs often require large-scale annotated data for training, particularly in deep architectures, demanding significant computational resources. Lastly, when confronted with irregular data, CNNs may yield less satisfactory results. The Transformer [1] was first applied in the field of natural language processing, achieving unprecedented success in tasks such as machine translation. The model mainly consists of self-attention mechanisms, positional encoding, an encoder-decoder architecture, feedforward neural networks, layer normalization, and residual connections, and it exhibits a robust ability to capture long-range dependencies in sequential data, particularly excelling in tasks involving structured data with sequential or spatial characteristics. It was therefore introduced into computer vision, leading to the development of the Vision Transformer (ViT) [2], which has also achieved remarkable results. Unlike traditional CNNs that use convolutional kernels to operate on local image regions, ViT adopts a patch-wise processing approach, where each patch is linearly embedded to form a sequence. Its performance across diverse tasks has motivated researchers to explore the Transformer architecture further, resulting in the development of different variations and enhancements.
This paper presents a new Vision Transformer model named MgMViT, which uses a fusion of multi-scale and multi-granularity mechanisms. The purpose of our method is to reduce the computational cost and the number of parameters of the model so as to identify human behavior more quickly and effectively. MgMViT begins with multi-scale computations that incorporate multi-granularity calculations within each scale. Following the output, a mixed-granularity layer is introduced to combine coarse-grained and fine-grained information, effectively reducing the length of the computation sequence. By incorporating the multi-scale multi-granularity module into the structure of the Transformer blocks, the computational burden of the original module can be effectively reduced; shortening the sequence length involved in computation and reducing the number of tokens is one of the most direct and effective ways to improve computational efficiency. In addition, the introduced mixed-granularity layer is plug-and-play and introduces no additional parameters. The structure of MgMViT is shown in Figure 1. The proposed MgMViT model has been evaluated on widely used action recognition datasets, namely Kinetics-400 [3] and Something-SomethingV2 [4]. The results indicate that MgMViT is capable of significantly reducing the model runtime and parameter count while maintaining effective performance. Compared with existing models, MgMViT is more efficient in terms of both processing time and parameter quantity.

2. Related Work

With the advent of large annotated datasets, greater computational resources, and deep learning, action recognition has made significant progress in recent years. Before 2010, early action recognition methods typically relied on manually crafted features and traditional machine learning; these approaches are constrained by the complexity of human behavior and the challenge of designing effective features. Since the rise of deep learning in 2012, especially with the introduction of convolutional neural networks (CNNs), the landscape of action recognition has undergone a radical transformation. AlexNet [5] introduces a novel deep architecture and the dropout method. The two-stream network model [6], which integrates spatial and temporal information by combining RGB frames and optical flow, represents a significant shift from handcrafted feature extraction to machine-learned features. Limin Wang et al. [7,8] introduce Temporal Segment Networks (TSN), which creatively segment videos into multiple segments and sample frames within each segment to model the structure of long videos. This has led to significant advancements in action recognition with the help of deep learning. Simultaneously, 3D Convolutional Neural Networks (3D CNNs) [9,10] extend the network from traditional 2D convolutional neural networks (2D CNNs) to three dimensions, allowing action recognition to directly capture spatial and temporal features from video data without requiring additional input streams; this innovation has led to the widespread application of 3D convolutional networks in action recognition. In addition, several large action recognition datasets have emerged, such as Kinetics [3] and Something-Something [4], which provide a more extensive and diverse range of data for training and evaluating action recognition research. The introduction of benchmark challenges, such as the ActivityNet Challenge, has also facilitated the standardization of evaluation metrics and the comparison of performance among different models. In the same period, the Non-Local Neural Network [11] and R(2+1)D [12] are introduced. The SlowFast network [13] introduces a dual-stream architecture with different frame rates to capture both slow and fast motion information in videos, thereby enhancing the recognition of actions with varying temporal dynamics. Since 2020, there has been significant advancement in self-supervised learning and pretext-task-based models for action recognition, with the main focus on developing training methods that do not require extensive labeled data; the prevailing approach has been to pre-train models on large datasets and then fine-tune them on smaller, task-specific datasets.
The architecture based on the Transformer [1] model is a relatively recent development. The Transformer was originally designed for natural language processing and gained widespread attention due to its capability to capture long-range dependencies and model complex relationships. Subsequently, researchers began to explore the application of the Transformer in the field of computer vision. In 2020, the Vision Transformer (ViT) [2] was presented and initially applied to image classification. For video data, researchers treat videos as sequences of frames, regarding the frame sequence as a temporal dimension beyond the spatial dimensions, in order to address action recognition tasks. Since its introduction, ViT has been widely adopted as the backbone structure for action recognition models. The Swin Transformer (Swin) [14,15], released in 2021, introduces a hierarchical window-based computation strategy: it restricts self-attention computation to non-overlapping local windows by using shifted windows while still allowing cross-window connections. TimeSFormer [16] captures information from a series of frame-level patches by learning spatiotemporal features and employs a standard Transformer architecture adapted to the processing of video data. ViViT [17] is a purely Transformer-based model, which extracts spatiotemporal tokens from input videos and encodes them through a series of Transformer layers. Transformer-based action recognition models [18,19,20] have developed rapidly. However, because the Transformer backbone is large, recent research has shifted its focus towards reducing the complexity of attention mechanisms, aiming to make the Transformer more efficient in practical applications. For instance, MViT [21,22] combines multi-scale feature hierarchies with Transformer models, starting from the input resolution and a small channel dimension; the channel capacity is progressively increased in stages while the spatial resolution is reduced to lower costs. Meanwhile, studies such as the Video Transformer Network (VTN) [23] and MoViNets [24] have explored ways to enhance the efficiency of Transformer models. These endeavors focus on reducing computational and memory costs while maintaining performance, aiming to better accommodate real-world applications.
Inspired by efficient models [25,26] and multi-granularity models [27,28] in various domains, we propose a Vision Transformer model with multi-scale and multi-granularity fusion, which embeds a multi-scale and multi-granularity module into the Transformer block to reduce the quadratic attention computation, enhancing efficiency in terms of computational cost and memory requirements while maintaining performance.

3. Multi-Granularity and Multi-Scale Vision Transformer

We present a Multi-Scale Multi-Granularity Fusion Transformer model in this paper, which is built upon the concepts of multi-scale and multi-granularity mechanisms. In contrast to the substantial computational load introduced by the traditional attention mechanism, where each frame’s query attends to all key-value pairs, the introduction of multi-scale and multi-granularity helps reduce the computational complexity of the model. The combination of multiple granularities and scales achieves savings in computational costs in terms of both sequence length and sampling.

3.1. The Base Model

The Transformer model represents a novel architectural paradigm. One of its most crucial innovations is the self-attention mechanism, allowing the model to simultaneously consider contextual information across the input sequence. This marks a significant departure from earlier models that sequentially processed input sequences. In the field of computer vision, the application of attention mechanisms is primarily employed to explore the relationships between different blocks. When we aim to identify specific objects, these mechanisms enable the model to discern which blocks are relevant to the identified object and which ones are irrelevant. Consequently, the model learns to focus on key information, enhancing its ability to recognize and process pertinent features. The computation of attention primarily involves three vectors: query, key, and value. The entire calculation process maps a query and a set of key-value pairs to an output, which is the weighted sum of values. The weights are determined by the relevance between the query and the current key, which is defined as Equation (1).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{C}}\right)V \tag{1}$$
where $Q \in \mathbb{R}^{N_q \times C}$, $K \in \mathbb{R}^{N_k \times C}$, and $V \in \mathbb{R}^{N_v \times C}$ serve as inputs, and the scaling factor $\sqrt{C}$ is introduced to address concerns related to weight concentration and gradient vanishing.
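To make the computation concrete, the following is a minimal sketch of Equation (1) in PyTorch; it is an illustrative reimplementation under the definitions above, not the authors' released code.

```python
# Scaled dot-product attention as in Equation (1).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (N_q, C), K: (N_k, C), V: (N_k, C)
    C = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / C ** 0.5   # (N_q, N_k), scaled by sqrt(C)
    weights = F.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ V                            # weighted sum of values: (N_q, C)

# Example: 196 query tokens attending to 196 key/value tokens of width 96.
Q = torch.randn(196, 96); K = torch.randn(196, 96); V = torch.randn(196, 96)
out = scaled_dot_product_attention(Q, K, V)       # -> (196, 96)
```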
With the emergence of attention mechanisms, researchers have further enhanced the performance of self-attention layers by introducing multi-head self-attention. Different attention heads correspond to distinct representation subspaces, achieved by splitting the output into multiple heads along the channel dimension. This allows the model to focus on different aspects of the information, as shown in the following equations:
$$Q_i = XW_i^{Q}, \quad K_i = XW_i^{K}, \quad V_i = XW_i^{V}, \quad i = 1, \dots, 8 \tag{2}$$
$$head_i = \mathrm{Attention}(Q_i, K_i, V_i), \quad i = 1, \dots, 8 \tag{3}$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_8)\,W^{O} \tag{4}$$
where $head_i \in \mathbb{R}^{n \times \frac{C}{h}}$ is the output of the i-th attention head and $W^{O} \in \mathbb{R}^{C \times C}$ is the projection matrix multiplied with the concatenation of the eight heads.
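The eight-head variant of Equations (2)-(4) can be sketched as follows; the fused QKV projection and the module layout are implementation assumptions rather than details taken from the paper.

```python
# A sketch of 8-head self-attention (Equations (2)-(4)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=96, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # W^Q, W^K, W^V fused into one projection
        self.proj = nn.Linear(dim, dim)      # W^O

    def forward(self, x):                    # x: (B, L, dim)
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split the channel dimension into heads: (B, h, L, d)
        q, k, v = (t.view(B, L, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        # per-head scaled dot-product attention (scaling by sqrt(d) per head)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, L, self.h * self.d)
        return self.proj(out)                # Concat(head_1, ..., head_8) W^O

y = MultiHeadSelfAttention()(torch.randn(2, 196, 96))   # -> (2, 196, 96)
```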

3.2. Multi-Scale and Multi-Granularity Fusion

Currently, the Transformer models used for action recognition rely heavily on direct attention computation, which can become quite computationally demanding, particularly as the number of tokens increases. To address this challenge, we develop a new multi-scale and multi-granularity fused Vision Transformer, which utilizes multi-scale processing to shorten the sequence length required for self-attention computation, and leverages multi-granularity modules to sample only the most useful tokens for future calculations. Additionally, mixed-granularity layers are incorporated to reduce the length of token sequences that contain less informative content. We propose a scale-granularity fusion module, which can effectively reduce computational costs by decreasing the length and quantity involved in quadratic attention computations. This module initially performs multi-scale computation, followed by further multi-granularity computation based on the multi-scale foundation. Multi-scale computation is achieved through pooled multi-head attention mechanisms, aiding in the reduction of the sequence length of output tokens. Subsequently, multi-granularity computation further diminishes the number of samples collected. The structure of the multi-scale and multi-granularity module is shown in Figure 2. This section will provide a detailed overview of our model.
We first introduce the multi-scale module, whose mechanism is as follows. Starting from the input resolution and a small channel dimension, the channel capacity is expanded hierarchically while the spatial resolution is reduced. Regarding the reduction of spatial resolution, since the Transformer operates on patches, the video signal is transformed into a one-dimensional sequence of shape $(L, D)$ after patch embedding.
$$S_Q = (L_Q, D), \quad S_K = (L_K, D), \quad S_V = (L_V, D) \tag{5}$$
$$S(QK^{T}) = (L_Q, D) \times (D, L_K) = (L_Q, L_K), \quad S(QK^{T}V) = (L_Q, L_K) \times (L_V, D) = (L_Q, D) \tag{6}$$
where L = T × H × W . Therefore, to reduce spatial resolution, it is necessary to modify the sequence length of the query. The attention layer has a total of N + 1 input tokens. During the attention computation, Q, K, and V are involved in pooling. Specifically, for a D-dimensional input X with a sequence length of L, the head-pooled attention mechanism projects the input X onto Q, K, and V, which are necessary for calculating attention.
$$\hat{Q} = XW_Q, \quad \hat{K} = XW_K, \quad \hat{V} = XW_V \tag{7}$$
The pooling operator $\mathcal{P}(\cdot\,;\Theta)$ is applied to these intermediate tensors, performing pooling along each dimension of the input tensor. Here, $\Theta := (k, s, p)$, where $k_T \times k_H \times k_W$ is the pooling kernel, $s_T \times s_H \times s_W$ is the stride, and $p_T \times p_H \times p_W$ is the padding. The pooled length of the input tensor is:
$$\tilde{L} = \left\lfloor \frac{L + 2p - k}{s} \right\rfloor + 1 \tag{8}$$
The input tensor is thus shortened to $\tilde{L} = \tilde{T} \times \tilde{H} \times \tilde{W}$, and the tensor output by the pooling operation shrinks overall by a factor of $s_T\, s_H\, s_W$.
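As a quick check of the pooled-length formula in Equation (8), the following sketch assumes floor division and applies the formula independently to each of the T, H, and W dimensions.

```python
# Pooled sequence length per dimension (Equation (8)), assuming floor division.
def pooled_len(L, k, s, p):
    return (L + 2 * p - k) // s + 1

# e.g. pooling H = 14 with kernel 3, stride 2, padding 1 halves the resolution:
print(pooled_len(14, k=3, s=2, p=1))   # 7
print(pooled_len(56, k=3, s=4, p=1))   # 14 (a stride-4 pooling of the key/value grid)
```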
After the pooling operation, the attention vectors become $\tilde{Q} = \mathcal{P}(\hat{Q};\Theta_Q)$, $\tilde{K} = \mathcal{P}(\hat{K};\Theta_K)$, and $\tilde{V} = \mathcal{P}(\hat{V};\Theta_V)$, with shortened sequence lengths $\tilde{Q} \in \mathbb{R}^{\tilde{L} \times d}$, $\tilde{K} \in \mathbb{R}^{\tilde{L} \times d}$, and $\tilde{V} \in \mathbb{R}^{\tilde{L} \times d}$, and attention is computed on these shortened sequences. Relative positional embeddings, which depend on the relative positional distance between tokens, are also added into the pooled self-attention computation, so the attention matrix becomes:
$$A = \mathrm{Softmax}\!\left(\left(\mathcal{P}(Q;\Theta_Q)\,\mathcal{P}(K;\Theta_K)^{T} + E^{(\mathrm{rel})}\right)\big/\sqrt{d}\right), \quad \text{where } E^{(\mathrm{rel})}_{ij} = Q_i \cdot R_{p(i),p(j)} \tag{9}$$
where $R_{p(i),p(j)}$ is the positional embedding encoding the relative position between tokens $i$ and $j$, and $p(i)$, $p(j)$ denote their spatio-temporal locations. It decomposes as follows:
$$R_{p(i),p(j)} = R^{h}_{h(i),h(j)} + R^{w}_{w(i),w(j)} + R^{t}_{t(i),t(j)} \tag{10}$$
The channel resolution (i.e., the dimensionality) is gradually increased while the spatio-temporal resolution (i.e., the sequence length) is simultaneously decreased. To increase the number of channels, the vector dimension D is mapped through a fully connected layer; this transition between two stages up-samples the channel dimension, i.e., the output dimension of the final MLP layer of the previous stage is increased by a factor of two before being fed into the next stage. Since K and V use a larger stride than Q, and Q is downsampled only at stage transitions, an additional residual connection is added to the attention block to enhance information flow and facilitate training and convergence. Specifically, the pooled query Q is added to the output sequence, and the attention output becomes:
$$O = A\,\mathcal{P}(V;\Theta_V) + \mathcal{P}(Q;\Theta_Q) \tag{11}$$
Because the sequence lengths and channel dimensions have changed, the residual connection must avoid a dimension mismatch between its two ends. To handle this, we cannot add the input X directly to the output; instead, we pool the input X in the same way as the query Q.
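The pooled self-attention with the pooled-query residual of Equation (11) can be sketched as below. This is a simplified single-head illustration that assumes the pooling operator $\mathcal{P}(\cdot\,;\Theta)$ is a depthwise 3D convolution over the (T, H, W) token grid (consistent with the conv-pooling variant ablated in Section 4.3); the class token, layer normalization, and the relative-position term $E^{(\mathrm{rel})}$ are omitted for brevity.

```python
# Pooled self-attention with a pooled-query residual (Equation (11)), single head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledSelfAttention(nn.Module):
    def __init__(self, dim, thw, stride_q=(1, 2, 2), stride_kv=(1, 4, 4)):
        super().__init__()
        self.thw, self.dim = thw, dim
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        # depthwise conv pooling as P(.;Theta); kernel 3, padding 1
        self.pool_q = nn.Conv3d(dim, dim, 3, stride_q, 1, groups=dim)
        self.pool_k = nn.Conv3d(dim, dim, 3, stride_kv, 1, groups=dim)
        self.pool_v = nn.Conv3d(dim, dim, 3, stride_kv, 1, groups=dim)

    def _pool(self, x, conv):                    # x: (B, L, dim) on a T*H*W grid
        B, _, D = x.shape
        t, h, w = self.thw
        x = x.transpose(1, 2).reshape(B, D, t, h, w)
        x = conv(x)                              # strided pooling shortens the grid
        return x.flatten(2).transpose(1, 2)      # back to (B, L_pooled, dim)

    def forward(self, x):
        q = self._pool(self.q(x), self.pool_q)   # shortened query sequence
        k = self._pool(self.k(x), self.pool_k)
        v = self._pool(self.v(x), self.pool_v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dim ** 0.5, dim=-1)
        return attn @ v + q                      # O = A P(V;Theta_V) + P(Q;Theta_Q)

# attn = PooledSelfAttention(dim=96, thw=(8, 56, 56))
# y = attn(torch.randn(2, 8 * 56 * 56, 96))      # -> (2, 8*28*28, 96)
```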
For multi-granularity, we divide coarse-grained and fine-grained sampling into two parts: the first is added to the multi-scale module and operates on the attention matrix, while the second operates on the output tokens after the attention computation. Multi-granularity computation thus consists of two aspects: a token scoring strategy and a combined coarse- and fine-granularity computation.
(1) Significance Score
The key step in the multi-granularity module is to compute the significance score of the input tokens, which is the basis for the subsequent coarse- and fine-granularity differentiation. The computation of the significance score relies on the self-attention matrix. Specifically, since each row of the attention matrix sums to 1, each output token is obtained as a weighted sum with the attention weights. The attention weights reflect the importance of the input tokens to the output tokens, and the first row of the attention matrix, $A_{1,:}$, represents the attention weights of the classification token. Our significance score calculation is also divided into two parts. Firstly, at the self-attention level, the computation of the output tokens relies heavily on the attention matrix $A$ and the pooled values $\mathcal{P}(V;\Theta_V)$; we additionally incorporate the norm of the pooled values to improve the results. The attention significance score is calculated by the following formula.
$$S_j = \frac{A_{1,j} \times \left\lVert \mathcal{P}(V_j;\Theta_V) \right\rVert}{\sum_{i=2}^{\tilde{L}} A_{1,i} \times \left\lVert \mathcal{P}(V_i;\Theta_V) \right\rVert}, \qquad i, j \in \left[2, \tilde{L}\right] \tag{12}$$
After the attention layer, the resulting output tokens are passed through a hybrid granularity layer before being output, to further reduce the length of the computational sequence. The token significance score is computed by the following formula:
$$I\!\left(x_j^{(t)}\right) = \sum_{i=1,\, i\neq j}^{n} A_{i,j}, \qquad I(x_j) = \frac{1}{h}\sum_{t=1}^{h} I\!\left(x_j^{(t)}\right) \tag{13}$$
where $t$ indexes the $h$ attention heads.
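The two significance scores can be sketched as follows; the shapes and the convention that the classification token sits at index 0 of the attention matrix are illustrative assumptions.

```python
# Attention-level score S_j (Equation (12)) and token-level score I(x_j) (Equation (13)).
import torch

def attention_significance(A, V_pooled):
    # A: (L, L) attention matrix; V_pooled: (L, D) pooled values.
    # Row 0 of A holds the classification token's attention weights (A_{1,:} in the text).
    w = A[0, 1:] * V_pooled[1:].norm(dim=-1)       # A_{1,j} * ||P(V_j; Theta_V)||
    return w / w.sum()                             # normalized scores S_j for j >= 2

def token_significance(A_heads):
    # A_heads: (h, n, n) per-head attention; I(x_j) averages, over heads,
    # the column sums excluding the diagonal term A_{j,j}.
    col_sums = A_heads.sum(dim=1) - A_heads.diagonal(dim1=1, dim2=2)   # (h, n)
    return col_sums.mean(dim=0)                    # (n,) token-level scores
```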
(2) Token Sampling with Combined Coarse and Fine Granularity
After obtaining the significance scores, the process of combining coarse and fine granularity through sampling begins. Previous methods typically apply pruning directly, trimming tokens with low significance scores to reduce the number of tokens involved in the attention matrix computation. However, such pruning may lead to a certain degree of information loss and may filter out tokens that could still be useful at later stages; even if the pruned tokens contain less or redundant information, discarding them outright can result in the loss of valuable information. To circumvent these issues, our model employs a more efficient sampling method and computational units, maintaining a high level of accuracy while limiting memory consumption. The proposed token sampling with combined coarse and fine granularity is also divided into two parts. Firstly, the refined attention matrix reduces redundant output tokens based on the first step. Secondly, the added granularity fusion layer further reduces the length of the token sequence with less informative content.
First, we achieve adaptive sampling by refining the attention matrix according to the attention significance scores. This step filters out redundant tokens and, when encountering tokens with high similarity, performs coarse sampling: the probability of sampling one token from a set of similar tokens equals the sum of their significance scores. Since the significance scores are normalized, they can be treated as a probability distribution, and the Cumulative Distribution Function (CDF) is used to sample tokens based on the significance scores, with the inverse of the CDF serving as the sampling function.
$$\mathrm{CDF}_i = \sum_{j=2}^{i} S_j, \qquad \Psi(k) = \mathrm{CDF}^{-1}(k), \quad k \in [0, 1] \tag{14}$$
The $K$ samples can be obtained by sampling $K$ times from a uniform distribution $\mathcal{U}[0,1]$. To avoid excessive randomization when $K$ is large, we adopt a fixed sampling approach, selecting $k = \frac{1}{2K}, \frac{3}{2K}, \dots, \frac{2K-1}{2K}$. Because $\Psi(\cdot) \in \mathbb{R}$, we select the index of the token with the nearest significance score as the sampling index. When a token is sampled more than once, we retain only one instance of that token; therefore, the number of unique indices selected, $K'$, is typically less than the total number of samples $K$, and cases where $K' = 1$ or $K' = K$ occur only in extreme situations. The number and position of the sampled tokens vary across the stages of the model and with the input image. Across stages, it is common to sample more tokens in the early stages and fewer in the later stages. For images with large uniform backgrounds, only a few tokens may need to be sampled, whereas for more cluttered images, more tokens need to be involved in the computation for classification. In this way, the refined attention matrix is obtained after combining coarse and fine granularity, and the output tokens are then computed from it. Using the selected indices, the attention matrix is refined by sampling its corresponding rows to obtain $A_s \in \mathbb{R}^{(K'+1)\times(N+1)}$, and the final output tokens are given as follows.
$$O = A_s\,\mathcal{P}(V;\Theta_V) + \mathcal{P}(Q;\Theta_Q) \tag{15}$$
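A sketch of the inverse-transform sampling step follows: fixed quantiles $k = \frac{2i-1}{2K}$ are mapped through the CDF of the significance scores and repeated indices are collapsed, yielding the $K'$ rows used to build $A_s$. Batching and the class-token offset are omitted as simplifying assumptions.

```python
# Inverse-transform token sampling over the significance-score CDF (Equation (14)).
import torch

def sample_token_indices(scores, K):
    # scores: (n,) normalized significance scores (sum to 1), class token excluded.
    cdf = torch.cumsum(scores, dim=0)                       # CDF_i = sum_{j<=i} S_j
    quantiles = (2 * torch.arange(1, K + 1) - 1) / (2 * K)  # 1/2K, 3/2K, ..., (2K-1)/2K
    # Psi(k): index whose CDF value is nearest to each fixed quantile.
    idx = torch.argmin((cdf.unsqueeze(0) - quantiles.unsqueeze(1)).abs(), dim=1)
    return torch.unique(idx)                                # K' <= K unique indices

scores = torch.softmax(torch.randn(196), dim=0)
rows = sample_token_indices(scores, K=197)
# The attention matrix A is then refined by keeping the class-token row plus `rows`,
# giving A_s for Equation (15).
```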
After obtaining the output tokens from the refined matrix, we add a hybrid granularity layer to further reduce the length of the computational sequence, sorting the output tokens with the Top-k method. The hybrid granularity layer architecture is shown in Figure 3. Meanwhile, instead of aggressively pruning the tokens at the bottom of the ranking, we replace them with more efficient computational units. This approach aims to preserve the amount of information while reducing the computational cost incurred by pruning. According to the token significance score in Equation (13), the tokens are first classified into more informative and less informative tokens. Let $x_{cls} \cup X$ be the sequence of input token vectors, where $X = [x_1, x_2, \dots, x_s]$ and $s$ is the length of the sequence of output tokens after the adaptive sampling of the previous step. We use Top-k to take the token vectors in the top $k$ positions of the significance score as $X_{in}$ and collect the remaining tokens into a sequence of less informative tokens $X_{ls}$, where $X_{in} \in \mathbb{R}^{k \times D}$ and $X_{ls} \in \mathbb{R}^{(K'-k) \times D}$. These less informative tokens are not removed by pruning operations; instead, we aggregate them along the sequence dimension, for example by average pooling:
$$\hat{X}_{ls} = \mathrm{Pooling}(X_{ls}) \tag{16}$$
or weighted average pooling:
$$\alpha = \mathrm{softmax}\!\left(I(X_{ls})\right), \qquad \hat{X}_{ls} = \mathrm{Pooling}(\alpha \cdot X_{ls}) \tag{17}$$
where $\hat{X}_{ls}$ denotes the aggregated sequence of less informative tokens. After granularity mixing, the token sequence becomes $[x_{cls},\, X_{in},\, \hat{X}_{ls}]$. Concerning the determination of $k$, we introduce a new set of learnable parameters $R = [r_1, \dots, r_s]$, restricted to $r_i \in [0, 1]$ and initialized from a uniform distribution; the output token $x_i$ is then rescaled to $r_{pos(x_i)}\, x_i$, where $pos(x_i)$ is the rank of $x_i$ by the amount of information it contains. The value of $k$ at the $l$-th layer, $k_l$, is given by:
$$k_l = \mathrm{ceil}\!\left(\mathrm{sum}(l; R)\right) \quad \mathrm{s.t.} \quad k_{l+1} \le k_l \tag{18}$$
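The hybrid granularity layer can be sketched as below. For simplicity, $k$ is passed as a fixed integer rather than derived from the learnable vector $R$, and the chunked weighted aggregation into a configurable number of coarse-grained units is a loose interpretation of the 1-unit and 5-unit settings ablated in Section 4.3.

```python
# Hybrid granularity layer: keep top-k informative tokens, aggregate the rest.
import torch

def hybrid_granularity(x_cls, X, scores, k, units=1):
    # x_cls: (1, D) classification token; X: (s, D) sampled tokens; scores: (s,).
    order = scores.argsort(descending=True)
    X_in = X[order[:k]]                                      # fine-grained tokens, kept as-is
    X_ls, s_ls = X[order[k:]], scores[order[k:]]             # less informative tokens
    alpha = torch.softmax(s_ls, dim=0).unsqueeze(-1)         # weights from I(x), Equation (17)
    # aggregate along the sequence dimension into `units` coarse-grained tokens
    chunks = torch.chunk(alpha * X_ls, units, dim=0)
    X_coarse = torch.stack([c.sum(dim=0) for c in chunks])   # (units, D)
    return torch.cat([x_cls, X_in, X_coarse], dim=0)         # [x_cls, X_in, X_ls^]

out = hybrid_granularity(torch.randn(1, 96), torch.randn(100, 96),
                         torch.rand(100), k=60, units=5)     # -> (66, 96)
```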

3.3. Implementing Regulation

In this section, we provide a detailed description of the specific model instance. For the multi-scale module, we follow the pattern of MViTv2, dividing the network into four scale stages, each consisting of Transformer blocks with the same channel dimension. The initial data are first divided into patches of shape 1 × 16 × 16, and the data are then sampled with a 4 × 1 × 1 operator, resulting in an input of shape 16 × 224 × 224. This input is projected into overlapping spatio-temporal cubes of shape 3 × 7 × 7 with a channel dimension of D = 96. After the four scale stages, with resolution downsampling and channel-dimension upsampling by a factor of 2 at each stage transition, the final sequence length becomes 8 × 7 × 7 and the channel dimension becomes D = 768. Importantly, none of the pooling operations involve the processed class token embeddings. For the attention mechanism, the number of heads grows with the channel dimension, starting from h = 1. For the multi-granularity module, we primarily focus on incorporating it into the scale stages. The parameter governing sampling in the attention layer is set to K = 197. At the token level, the parameter sequence R, which determines the value of k, is initialized from a uniform distribution.
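The scale schedule described above can be summarized with the small sketch below, following the MViTv2-S pattern that the model builds on; the per-stage block counts are illustrative assumptions rather than values reported in the paper.

```python
# Rough four-stage scale schedule: halve spatial resolution and double channels
# (and heads) at each stage transition.
stages = []
T, H, W, D = 8, 56, 56, 96          # after the 3x7x7 cube embedding of a 16x224x224 input
for stage, blocks in enumerate([1, 2, 11, 2]):   # block counts: illustrative assumption
    stages.append({"blocks": blocks, "seq_len": (T, H, W), "dim": D, "heads": D // 96})
    if stage < 3:                    # transition: downsample space, upsample channels
        H, W, D = H // 2, W // 2, D * 2
print(stages[-1])   # {'blocks': 2, 'seq_len': (8, 7, 7), 'dim': 768, 'heads': 8}
```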

4. Experiments

4.1. Datasets

The datasets used in this paper are large public datasets established in the field of video action recognition in recent years, namely Kinetics-400 and Something-Something-v2. Kinetics-400 is a high-quality dataset that encompasses a wide range of human activities. It comprises 400 action categories, with each category containing at least 400 videos, and each video clip lasts approximately 10 s. The dataset is sourced from the YouTube platform, but due to video removals or takedowns, the actual number of available videos may be smaller than initially reported. The Something-Something-v2 dataset primarily adopts a first-person perspective and shows humans performing predefined basic actions with everyday objects. It consists of 220,847 videos, with 168,913 in the training set, 24,777 in the validation set, and 27,157 in the test set. The dataset includes 174 labels, allowing models to perform fine-grained action recognition.

4.2. Experimental Details

In terms of model configuration, our model is built on MViTv2-S, and the training and experimental setups are essentially the same as those of MViTv2-S, apart from some differences in the choice of epochs. In the pre-normalization setup, we use residual connections [29] and layer normalization [30], with layer normalization applied at the beginning of the residual function. The MLP consists of linear layers with the GELU activation function; its first layer expands the dimensionality from D to 4D, and the second layer restores it to the input dimension D. At the end of each scale stage, we increase the channel dimension to match the input of the next scale stage, owing to the multi-scale nature of the model. We train the model on GPU with the AdamW optimizer [31] for 100 epochs. For learning rate decay, we employ cosine scheduling [32] with an initial learning rate of $1.6 \times 10^{-3}$. The batch size is set to 8, and a linear warm-up strategy is applied in the early epochs; cosine scheduling then decays the learning rate until its final value is reached. Weight decay is set to 0.05, and the final classifier uses a dropout of 0.5. For temporal sampling, we sample a segment from the full-length video; the network input consists of T frames with a temporal stride of τ, denoted T × τ. For spatial sampling, we employ inception-style [33] cropping: the size of the input region is randomly adjusted within a scale range of [0.05, 1.00] and then cropped to H × W = 224 × 224. For data augmentation, we apply the same set of augmentations to all frames, including random horizontal flipping, mixup, cutmix [34], random erasing [35], and RandAugment [36].
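The learning-rate schedule can be sketched as follows: linear warm-up followed by cosine decay from the $1.6 \times 10^{-3}$ base rate over 100 epochs. The warm-up length and the final learning rate are illustrative assumptions, since the paper does not state them explicitly.

```python
# Linear warm-up + cosine decay schedule (values per epoch).
import math

def lr_at(epoch, base_lr=1.6e-3, warmup=10, total=100, final_lr=1.6e-5):
    # warmup and final_lr are assumed values for illustration only
    if epoch < warmup:                                   # linear warm-up
        return base_lr * (epoch + 1) / warmup
    t = (epoch - warmup) / (total - warmup)              # decay progress in [0, 1]
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t))

# e.g. feed lr_at(epoch) to an AdamW optimizer (weight decay 0.05) at each epoch
print([round(lr_at(e), 5) for e in (0, 9, 10, 55, 99)])
```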

4.3. Ablation Experiment

In typical scenarios, we randomly initialize the model and train it from scratch on the dataset. We conduct ablation experiments on the Kinetics-400 dataset, reporting Top-1 accuracy and computational complexity. Based on these findings, we provide an in-depth analysis of the proposed multi-scale and multi-granularity Vision Transformer.
Pooling function. The configuration of the multi-scale module relies heavily on the multi-head pooled self-attention mechanism, making the choice of pooling function crucial. To validate which pooling method yields better experimental results, we conduct ablation experiments and test three different pooling methods, including max pooling, average pooling, and channel-wise convolution with layer normalization (LN) pooling. Under the same pooling kernel, average pooling reduces the length of the input sequence by sliding a window over the sequence and averaging all elements within the window to a single value. However, this leads to significant performance degradation as it causes the attention mechanism to overlook a portion of information. Compared to max pooling and convolutional pooling, average pooling results in a reduction of 1.8% and 2.5%, respectively. Experimental results demonstrate that the method of channel-wise convolution with LN pooling can achieve the best performance. Compared to max pooling, this method shows a performance improvement of 0.9%, as detailed in Table 1.
Coarse-grained pooling and units. In the granularity fusion layer of the multi-granularity module, pooling is employed to aggregate the tokens subjected to coarse-grained processing. We propose two pooling methods, average pooling and weighted average pooling, and conduct ablation experiments to evaluate their effectiveness. Different configurations are applied to the coarse-grained units for mixed granularity, with one set using 1 coarse-grained unit and another using 5 coarse-grained units. Using pooling to aggregate coarsely sampled tokens reduces the computational sequence length while retaining information. Experimental analysis shows that weighted average pooling achieves an overall accuracy 0.4% higher than direct average pooling. This is because weighted average pooling retains information more comprehensively, taking the importance of each token into account, thereby preserving the input sequence's information more effectively and aiding feature capture. After the reduction operation on the fine-grained attention matrix, setting the number of coarse-grained units to 5 in the later stages provides the model with more discriminative feature information; performance increases by 0.3% compared to using a single unit. The specific results are shown in Table 2.
For the second step of coarse-grained sampling, we investigate the impact of pooling all tokens versus pooling only the lower-ranked tokens, uniformly employing weighted average pooling with all coarse-grained unit counts set to 5. The results do not meet expectations: pooling all tokens leads to a 3.9% decrease in model accuracy. Average pooling can weaken the feature information, especially after the model has undergone two rounds of combined coarse-fine granularity token sampling, and this feature information has a significant impact on the final classification. In contrast, pooling only the lower-ranked tokens not only further saves computational expense but also reduces the impact on the classification results. Details are shown in Figure 4.
Attention significance score rating strategy. The foundation of the model's multi-granularity mechanism is the calculation of significance scores. We evaluate different methods for computing attention significance scores, including randomly selecting a token, self-attention scores (the sum of attention weights over all tokens), and the attention weights of the classification token. In the end, we adopt the attention weights of the classification token as significance scores. This approach allows us to identify tokens with a greater impact on the results, as the classification token directly participates in the final category confirmation, helping the model focus on the tokens that matter for classification and enhancing its sensitivity to key features. To improve effectiveness, we add the L2 norm of the values, resulting in better performance, as shown in Table 3.
Sampling method. For the choice of sampling method, we compare the commonly used Top-k subsampling method for input tokens with the inverse transformation sampling and the combined method with the mixed granularity layer used in this paper. The Top-k method selects the top K tokens with the highest significance scores. For tokens not in the top K rankings, a pruning function is directly applied to discard them. This method leads to the removal of some effective tokens, reducing accuracy. Moreover, Top-k leads to a fixed token selection rate at each stage of the model, and it cannot adaptively reduce token sampling in the later stages. This significantly limits the model’s performance. On the other hand, inverse transform sampling based on CDF rarely discards tokens with lower significance scores, providing more diverse information for the lower layers. This also contributes to the final Transformer block outputting classification tokens. The hybrid granularity layer uses more efficient computation units to replace tokens discarded by the Top-k method. This allows the model to reduce computational costs without losing a significant amount of information. As shown in Figure 5, our method outperforms the results obtained by the Top-k method both in terms of high FLOPs and low FLOPs.

4.4. Comparison with State-of-the-Art Methods

In this section, we compare the performance of the Multi-Granularity Multi-Scale Vision Transformer model with recent methods proposed for human action recognition in the past few years, which include VTN [23], TimeSFormer [16], ViViT [17], MViT [21,22] and others. Table 4 primarily lists the Top-1 and Top-5 classification accuracies, the FLOPs generated by the model, and the number of network parameters. Table 4 presents the results obtained by various models on the Kinetics-400 dataset. Table 4 categorizes models into those based on CNNs and those based on ViTs. Additionally, a separate section lists models that require pretraining on the ImageNet dataset. Table 5 displays the comparison between the proposed model and other pre-trained models on the Something-Something-v2 dataset. The experiments on Kinetics-400 indicate that the multi-scale, multi-granularity fusion model proposed in this paper has a shorter runtime and fewer parameters compared to other models. VTN, TimeSFormer, and ViViT rely on pre-training on ImageNet-21K. TimeSFormer and ViViT achieve further accuracy improvement with an increase in parameters and FLOPs. The MViT series models, including MgMViT, do not require pre-training. Among the compared models, MViT models have relatively lower parameters and computational costs. However, MgMViT specifically has fewer FLOPs. The lower accuracy of MgMViT compared to the base architecture MViTv2-S can be attributed to the multi-granularity filtering. While adaptive sampling reduces the computational sequence length, the adoption of more efficient computational units to replace the directly discarded units in the original method inevitably leads to some information loss. This results in a slight decrease in accuracy compared to the base model. However, despite the lower accuracy compared to models requiring pre-training, MgMViT offers significantly reduced computational costs.

5. Discussion

Although our proposed MgMViT model drastically reduces computational cost, in terms of computation time and number of parameters, compared to existing models, it still has limitations. For example, its recognition accuracy is lower than that of existing models such as MoViNet and MViTv2. In addition, the model targets only action recognition, does not address other video understanding tasks, and is evaluated with a limited set of metrics. Therefore, we will further improve the accuracy of the proposed model and add more metrics to measure model performance, and we will try to apply the proposed modules to other areas of video understanding.

6. Conclusions

It is a crucial challenge in the practical application of computer vision to reduce computational costs and improve efficiency. To address this issue, we present a model that introduces a multi-level refinement strategy by integrating multi-scale and multi-granularity mechanisms. The proposed model encompasses key aspects of attention and tokens, aiming to optimize the model’s performance in real-world scenarios. This strategy further adds the selection of multiple granularities on top of multi-scale, which effectively shortens the computational sequence length and thus reduces the computational cost in action recognition tasks. The extensive experimental results also demonstrate the effectiveness of the proposed method. We expect that our proposed model will be further developed in the future in a wider range of application scenarios, and further take into account the retention of feature information while saving computational costs.

Author Contributions

Conceptualization, B.L. and H.H.; methodology, B.L.; software, B.L.; validation, B.L.; formal analysis, B.L.; investigation, B.L.; resources, H.H.; data curation, B.L.; writing—original draft preparation, B.L.; writing—review and editing, H.H.; visualization, B.L.; supervision, H.H.; project administration, H.H.; funding acquisition, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China under grant NSFC 61672210, the Major Science and Technology Programs in Henan Province under grant 221100210500, the Central Government Guiding Local Science and Technology Development Fund Program of Henan Province under grant no. Z20221343032.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  2. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  3. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  4. Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5842–5850. [Google Scholar]
  5. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar] [CrossRef]
  6. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar] [CrossRef]
  7. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 20–36. [Google Scholar]
  8. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks for action recognition in videos. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2740–2755. [Google Scholar] [CrossRef] [PubMed]
  9. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  10. Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541. [Google Scholar]
  11. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  12. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6450–6459. [Google Scholar]
13. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  15. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference On Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
  16. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the ICML Conference, Virtual, 18–24 July 2021; Volume 2, p. 4. [Google Scholar]
  17. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
  18. Zhang, C.L.; Wu, J.; Li, Y. Actionformer: Localizing moments of actions with transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 492–510. [Google Scholar]
  19. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
  20. Shi, D.; Zhong, Y.; Cao, Q.; Zhang, J.; Ma, L.; Li, J.; Tao, D. React: Temporal action detection with relational queries. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 105–121. [Google Scholar]
  21. Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6824–6835. [Google Scholar]
  22. Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4804–4814. [Google Scholar]
  23. Neimark, D.; Bar, O.; Zohar, M.; Asselmann, D. Video transformer network. In Proceedings of the IEEE/CVF international conference on computer vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3163–3172. [Google Scholar]
  24. Kondratyuk, D.; Yuan, L.; Li, Y.; Zhang, L.; Tan, M.; Brown, M.; Gong, B. Movinets: Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16020–16030. [Google Scholar]
  25. Shi, D.; Zhong, Y.; Cao, Q.; Ma, L.; Li, J.; Tao, D. Tridet: Temporal action detection with relative boundary modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18857–18866. [Google Scholar]
  26. Fayyaz, M.; Koohpayegani, S.A.; Jafari, F.R.; Sengupta, S.; Joze, H.R.V.; Sommerlade, E.; Pirsiavash, H.; Gall, J. Adaptive token sampling for efficient vision transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland; pp. 396–414. [Google Scholar]
  27. Zhao, J.; Wang, Y.; Bao, J.; Wu, Y.; He, X. Fine-and Coarse-Granularity Hybrid Self-Attention for Efficient BERT. arXiv 2022, arXiv:2203.09055. [Google Scholar]
  28. Zhang, X.; Li, P.; Li, H. AMBERT: A pre-trained language model with multi-grained tokenization. arXiv 2020, arXiv:2008.11869. [Google Scholar]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  30. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  31. Loshchilov, I.; Hutter, F. Fixing weight decay regularization in adam. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  32. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  33. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
34. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar]
  35. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference On Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13001–13008. [Google Scholar]
  36. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 702–703. [Google Scholar]
  37. Feichtenhofer, C. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213. [Google Scholar]
38. Lin, J.; Gan, C.; Han, S. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093. [Google Scholar]
Figure 1. The structure of the Vision Transformer model with Multi-Granularity and Multi-Scale Fusion. The multi-granularity multi-scale module is integrated into the Transformer block, adding granularity selection at the attention level of the multi-scale module. The attention matrix $A$ is reduced to $A_s$ by CDF sampling, combined with the pooled values $\mathcal{P}(V;\Theta_V)$ and the residual $\mathcal{P}(Q;\Theta_Q)$, and the result is fed into the mixed-granularity layer to obtain tokens of coarse and fine granularity for the next step of the computation.
Figure 2. The structure of the multi-scale and multi-granularity module. The multi-granularity (attention) module is added to the multi-scale module: the pooled Q, K, and V are fed into the multi-granularity attention module, where significance scores and inverse-transform sampling filter the tokens passed to the following calculations. $R_{p(i),p(j)}$ is the relative positional embedding.
Figure 3. Hybrid granularity layer architecture. Blue denotes fine-grained tokens, green denotes aggregated coarse-grained tokens, and yellow denotes the classification token. Fine-grained tokens are output directly, while coarse-grained tokens are aggregated before being output; the combined sequence enters the feed-forward network.
Figure 4. Pooling all tokens versus pooling only the less informative tokens. When the model operates at the same FLOPs level, pooling all tokens significantly degrades performance compared to pooling only a subset of tokens.
Figure 5. Comparison of sampling methods. The model is constrained to run at the same FLOPs, and the results of the Top-k method and of our method (CDF sampling with a mixed-granularity layer) are recorded; each point corresponds to a Top-1 accuracy.
Table 1. Comparison of three different pooling functions: max pooling, average pooling, and channel-wise convolution with layer normalization (LN) pooling.

Kernel | Pooling Func | Acc (%) | Param (M)
s + 1 | Max | 79.2 | 28.5
s + 1 | Average | 77.4 | 28.5
s + 1 | Conv | 79.2 | 28.7
3 × 3 × 3 | Conv | 80.1 | 28.7
Table 2. Comparison of coarse-grained aggregation methods. The table also compares coarse-grained unit counts of 1 and 5 (the suffix denotes the number of coarse-grained units).

Pool Method | Acc (%) | Param (M)
Avg Pool-1 | 79.4 | 28.3
Avg Pool-5 | 79.7 | 28.3
Weight Avg-1 | 79.8 | 28.7
Weight Avg-5 | 80.1 | 28.7
Table 3. The impact of attention significance score strategies: three different scoring strategies compared at comparable FLOPs.

Scoring Strategy | Acc (%) | Param (M)
Random | 76.3 | 28.7
Self-Attention | 79.6 | 28.7
CLS + |V| | 80.1 | 28.7
Table 4. Comparison with previous work on Kinetics-400. We compare Top-1 and Top-5 accuracies, FLOPs, number of views, and number of parameters. FLOPs and Param are reported in Giga ($10^9$) and Mega ($10^6$), respectively.

Model | Pre-Train | Top-1 | Top-5 | Views | FLOPs | Param
SlowFast [13] | - | 78.7 | 93.5 | 3 × 10 | 116 | 59.9
X3D [37] | - | 79.1 | 93.9 | 3 × 10 | 48.4 | 11.0
VTN [23] | IN-21K | 78.6 | 93.7 | 1 × 1 | 4218 | 114.0
TimeSFormer [16] | IN-21K | 80.7 | 94.7 | 3 × 1 | 2380 | 121.4
ViViT [17] | IN-21K | 81.3 | 94.7 | 3 × 4 | 3992 | 87.2
Swin-L [15] | IN-21K | 84.9 | 96.7 | 5 × 10 | 2107 | 200.0
MoViNet-A6 [24] | - | 81.5 | 95.3 | 1 × 1 | 386 | 31.4
MViTv1, 16 × 4 [21] | - | 78.4 | 93.5 | 1 × 5 | 70.3 | 36.6
MViTv2-S [22] | - | 81.0 | 94.6 | 1 × 5 | 64 | 34.5
MgMViT (Ours) | - | 80.1 | 93.7 | 1 × 5 | 43.8 | 28.7
Table 5. Comparison with previous work on Something-Something-v2.

Model | Pre-Train | Top-1 | Top-5 | Views | FLOPs | Param
TSM-RGB [38] | IN-21K + K400 | 63.3 | 88.2 | 3 × 2 | 62.4 | 42.9
SlowFast R50, 8 × 8 [13] | K400 | 61.9 | 87.0 | 3 × 1 | 65.7 | 34.1
TimeSformer [16] | K400 | 62.5 | - | 3 × 1 | 1703 | 121.4
MoViNet-A3 [24] | N/A | 64.1 | 88.8 | 1 × 1 | 24 | 5.3
Swin-B [15] | IN-21K + K400 | 69.6 | 92.7 | 3 × 1 | 321 | 88.8
MViTv1-B [21] | K400 | 64.7 | 89.2 | 3 × 1 | 70.5 | 36.6
MViTv2-S [22] | K400 | 68.2 | 91.4 | 3 × 1 | 64.5 | 34.4
MgMViT (Ours) | K400 | 67.3 | 90.4 | 3 × 1 | 44.1 | 28.7