Article

Separable ConvNet Spatiotemporal Mixer for Action Recognition

by Hsu-Yung Cheng, Chih-Chang Yu and Chenyu Li
1 Department of Computer Science and Information Engineering, National Central University, Taoyuan 32001, Taiwan
2 Department of Information and Computer Engineering, Chun-Yuan Christian University, Taoyuan 320, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2024, 13(3), 496; https://doi.org/10.3390/electronics13030496
Submission received: 13 December 2023 / Revised: 23 January 2024 / Accepted: 23 January 2024 / Published: 24 January 2024

Abstract:
Video action recognition is a vital research area in computer vision. In this paper, we develop a novel model named Separable ConvNet Spatiotemporal Mixer (SCSM). Our goal is an efficient and lightweight action recognition backbone that can be applied to multi-task models to increase accuracy and processing speed. The SCSM model uses a new hierarchical architecture that combines spatial compression with a spatiotemporal fusion method, consisting of a spatial domain and a temporal domain. It maintains the independence of each frame in the spatial domain for feature extraction and fuses the spatiotemporal features in the temporal domain. The architecture can be adapted to different frame rate requirements due to its high scalability, and its low prediction and training costs make it suitable as a backbone for multi-task video feature extraction or industrial applications. According to the experimental results, SCSM has a low parameter count and low computational complexity, making it highly scalable with strong transfer learning capabilities. The model achieves action recognition accuracy comparable to state-of-the-art models with a smaller parameter size and fewer computational requirements.

1. Introduction

Video recognition is important in computer vision, with applications ranging from capturing highlights in sports events to traffic enforcement cameras [1,2,3]. Many recognition tasks require a powerful and efficient video recognition model to extract comprehensive features from continuous frames for further analysis. In multitasking applications, especially those involving live or streaming video feeds, quick and accurate decision-making is essential. High accuracy ensures that the system can reliably recognize and interpret the content of the video, enabling timely responses to dynamic situations. In addition, efficient video recognition models consume fewer computational resources, making them more suitable for multitasking applications where various processes may run simultaneously; this optimizes hardware usage and allows such models to be deployed in resource-constrained environments. Therefore, a highly accurate and efficient video recognition model, serving as the backbone for feature extraction, is vital for multi-task and industrial applications. Industrial applications requiring action recognition include behavior analysis in surveillance systems, fall detection in home care systems, video annotation or summarization for movies, and video analysis for sports.
The challenges and problems currently faced by video recognition models include the following. Videos involve a temporal dimension, and understanding the temporal context and dependencies between frames is challenging. Recognizing actions or events over time requires models to capture long-term dependencies, and handling variable-length videos can be complex. Training robust video recognition models often requires large-scale labeled datasets, which are expensive to create. Ensuring diversity in the dataset is crucial to building models that generalize well to different scenarios. Videos captured in the real world can have variations in the lighting conditions, camera viewpoints, background clutter, and other factors that make recognition challenging. Models need to be robust to handle these variations and generalize well to diverse real-world scenarios. Distinguishing between subtle actions or activities that share visual similarities is also a challenging task. Fine-grained recognition requires models to focus on subtle details and subtle differences between classes. Video recognition models often involve processing a large number of frames, which can be computationally expensive and slow, especially for real-time applications. Efficient algorithms and model architectures are needed to handle the computational demands of video analysis. In this paper, we propose a novel model, named Separable ConvNet Spatiotemporal Mixer (SCSM), that explores a spatiotemporal hierarchical architecture to reduce computational complexity. By compressing frames in the spatial domain to reduce parameters and investigate the importance of the spatiotemporal fusion mechanism, we aim to develop a lightweight and efficient video recognition model.
Starting from the Two-Stream ConvNet [4], which processed temporal and spatial information separately, the development of 3D convolution [5] attempted to mix spatial and temporal information. The R(2 + 1)D [6] architecture introduced the idea of handling spatial and temporal information in separate layers by splitting the 3D convolution into 2D convolutions for spatial processing and 1D convolutions for temporal processing. The work in [7] expanded the weights of Inception V1 along the temporal dimension to process videos. SlowFast [8] proposed the concept of low and high frame rates. More recently, architectures built on the Transformer [9], such as the Video Vision Transformer (ViViT) [10], have been introduced for video recognition.
Although Transformers have demonstrated excellent performance in the vision field, their practicality for video recognition is limited by the requirement for extensive hardware and datasets, as well as increased parameter complexity. Most Vision Transformer models are trained from scratch using multiple GPUs or TPUs. For example, Touvron et al. [11] used a machine with eight V100 GPUs to train a vision transformer with 53 h of pre-training and, optionally, 20 h of fine-tuning. Developers who do not have access to such computational resources are therefore unable to train such models. Currently, most researchers in video recognition achieve state-of-the-art results using Transformer. However, in many multi-task domains, such as human behavior skeleton prediction [12] and high-level action feature extraction, video feature extractors still relied on models like I3D [7] and R(2 + 1)D [6] prior to 2020. This indicates that the current practicality of Transformer is not ideal. The time complexity and accuracy of a model should ideally strike a balance, rather than solely pursuing accuracy while ignoring the inference time or the need for extensive computational resources for training.
ConvNeXt [13] has shown that Convolutional Neural Networks (CNNs) are not inferior to Transformers, as their convolutional nature is well suited to image-related applications. It is essential to leverage the inherent characteristics of a domain rather than indiscriminately employing Transformer. Based on this perspective, we propose letting CNNs handle spatial feature extraction, at which they excel, while using time-series or spatiotemporal models for temporal processing. Such a design allows each component to contribute its maximum benefit.
Our SCSM model adopts a spatiotemporal hierarchical architecture. It introduces a frame compression input method and a mechanism named Time wrapper that applies the feature extractor to consecutive frames, which makes it easy to utilize pre-trained models. In the temporal domain, instead of general time series models, we design a spatiotemporal fusion mechanism: SCSM incorporates the Mixer Layer from the Multi-Layer Perceptrons Mixer (MLP-Mixer) [14], and the Spatiotemporal Mixer is used for extracting spatiotemporal mixing information. The overall concept of spatiotemporal fusion is similar to the idea in MobileNet [15], which uses pointwise convolution to extract deep spatial information and to address the channel-wise independence of depth-wise convolutions. The SCSM model maintains the independence of each frame in the spatial domain for feature extraction and fuses the spatiotemporal features in the temporal domain, enabling powerful transfer learning and scalability. It achieves recognition accuracy competitive with SOTA models despite its low complexity and small number of parameters.

2. Methodology

2.1. Network Architecture

The SCSM architecture is mainly composed of two parts, the spatial domain and the temporal domain, which are combined as shown in Figure 1. The spatial domain component focuses on extracting spatial feature details, while the temporal domain component handles the fusion of continuous features.

2.2. Frame Compression Method

We need a compression method at the front end of the architecture to reduce the amount of information processed by the model and to reduce each image to a specific size, which improves the speed of information extraction at the back end. Initially, we followed a method similar to Vision Transformer (ViT) and cut patches; we then changed the approach, treating each frame of the video as a patch and adding positional encoding so that the Transformer knows the order of the frames. However, with the height, width, and RGB channels, the vector dimension of a single frame-level patch is several times higher than that of a ViT patch, causing a sharp increase in the number of parameters. For this reason, each image must be compressed before it can be converted into a patch. As the temporal domain of our model processes time sequences, we drew inspiration from the Transformer, a trending architecture for sequence modeling, whose self-attention mechanism primarily captures the correlations between patches. Additionally, the Siamese network [16] architecture utilizes two separate ConvNets to extract one-dimensional features and compare their similarity. Based on these insights, we designed a compression method that employs a feature extractor (ConvNet) to compute a one-dimensional vector, significantly compressing the 2D image information. This approach offers two advantages: it compresses the images while learning precise image features, and it can leverage existing lightweight and efficient image recognition models pre-trained on ImageNet, providing strong transfer learning and scalability. Users can choose larger models for higher accuracy or lightweight models for faster processing. Moreover, the model converges faster and is more stable when built on such a solid foundation, and its predictions are not significantly affected when the training data are slightly altered.
The overall frame compression method is described as follows: given a 2D image $x \in \mathbb{R}^{C \times H \times W}$, we apply a feature extractor $F(x)$ to obtain the feature representation $x_f \in \mathbb{R}^d$, as stated in Equation (1). Then, we apply the linear transform $L(x_f)$ to obtain the feature $x_p$ of a specific patch size, as stated in Equation (2), where $V$ denotes the domain and $W$ the range of $L$.

$x_f = F(x), \quad x_f \in \mathbb{R}^d$   (1)

$x_p = L(x_f), \quad L: V \rightarrow W$   (2)
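A minimal PyTorch sketch of this compression step, assuming a torchvision ResNet-50 with its classification head removed as the feature extractor $F$ and a patch size of 512 (the value adopted in Section 3.5); the class name FrameCompressor and the choice of pretrained weights are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FrameCompressor(nn.Module):
    """Compress one 2D frame into a 1D patch: x_p = L(F(x))."""
    def __init__(self, patch_size: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        feat_dim = backbone.fc.in_features            # d = 2048 for ResNet-50
        backbone.fc = nn.Identity()                   # keep only the feature extractor F(x)
        self.extractor = backbone
        self.proj = nn.Linear(feat_dim, patch_size)   # linear transform L: R^d -> R^p

    def forward(self, x):                             # x: (B, C, H, W)
        x_f = self.extractor(x)                       # x_f: (B, d)
        return self.proj(x_f)                         # x_p: (B, patch_size)

# Example: compress a batch of 4 RGB frames into 512-dimensional patches.
patches = FrameCompressor()(torch.randn(4, 3, 224, 224))   # (4, 512)
```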

2.3. Spatial Domain

The frame compression method is described in Section 2.2. We designed a sequential module named "Time wrapper" to accommodate the feature extractor and enable continuous computation, matching the input requirement of the feature extractor (ConvNet), which takes a 2D image in $\mathbb{R}^{B \times C \times H \times W}$. The Time wrapper input is $clips \in \mathbb{R}^{B \times F \times C \times H \times W}$, where $B$ denotes the batch size (number of clips from different videos), $F$ the number of input frames, $C$ the number of channels, $H$ the height, and $W$ the width. Instead of directly reshaping the clips into a tensor of shape $\mathbb{R}^{(B \cdot F) \times C \times H \times W}$, which could in theory yield better mixing when normalizing within a batch, we aim to maintain the independence of each frame within a video and let the temporal domain component learn the spatiotemporal relationships. Accordingly, normalization should be applied across different videos in a batch rather than across different videos and their consecutive frames together.
The purpose of the Time wrapper is to process the $B$ and $F$ dimensions so that different feature extractors can be used. First, we transpose the input, as stated in Equation (3). Then, the feature extractor $F(\cdot)$ is used to extract the feature $x_{p_i}$ from each frame $I_i$, where $I_i$ in Equation (4) denotes the frames with the same index $i$ from different videos, so the input of $F(I_i)$ has dimension $I_i \in \mathbb{R}^{B \times C \times H \times W}$. We design two different methods to handle batches in the Time wrapper, "All to batch" and "Along to batch", described as follows; the experimental results will show that "Along to batch" performs better.
All to batch: $(B, F, C, H, W) \rightarrow (B \cdot F, C, H, W)$. In this approach, we directly multiply the $F$ and $B$ dimensions, treating them together as the batch for the feature extractor, and then reshape the result back to $(B, F, C, H, W)$. This allows normalization to operate on all frames within the batch, promoting better mixing within the entire batch. However, this approach may not be suitable for consecutive frames from the same video.
Along to batch: $(B, F, C, H, W) \rightarrow (F, B, C, H, W)$. In this approach, we apply the batch to frames with the same index across different videos. This method maintains the independence between frames.
When computing $F(I_i)$ sequentially, normalization is effectively applied to the frames with the same index across the different videos in a batch. This method preserves the independence between frames within a video. The output is a 2D vector in the spatial domain, as shown in Figure 2.
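The two batching strategies differ only in how the clip tensor is laid out before the per-frame extractor is applied. A small illustrative sketch under the tensor shapes defined above (variable names are ours):

```python
import torch

B, F, C, H, W = 2, 16, 3, 224, 224
clips = torch.randn(B, F, C, H, W)

# "All to batch": fold the frame axis into the batch axis; a single extractor call,
# but batch statistics then mix consecutive frames of the same video.
all_to_batch = clips.reshape(B * F, C, H, W)            # (B*F, C, H, W)

# "Along to batch": keep frames separate; the extractor is applied F times,
# each time to the frames sharing the same index i across different videos.
along_to_batch = clips.permute(1, 0, 2, 3, 4)           # (F, B, C, H, W)
frame_batches = [along_to_batch[i] for i in range(F)]   # F tensors of shape (B, C, H, W)
```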
The Time wrapper is described in Algorithm 1. It takes the input $clips \in \mathbb{R}^{B \times F \times C \times H \times W}$ and, after applying the feature extractor to consecutive frames, outputs $patches \in \mathbb{R}^{B \times F \times d}$, where $d$ is the patch length after feature extraction. A characteristic of the Time wrapper is that it facilitates the replacement of different ConvNet or Transformer feature extractors. Each frame's feature is independently concatenated along the time axis, as stated in Equation (5).
Algorithm 1: Time wrapper
Input: $clips \in \mathbb{R}^{B \times F \times C \times H \times W}$
Output: $patches \in \mathbb{R}^{B \times F \times d}$
Step 1: Reshape each clip:
   $clip \in \mathbb{R}^{B \times F \times C \times H \times W} \rightarrow clip \in \mathbb{R}^{F \times B \times C \times H \times W}$   (3)
Step 2: Form the feature $x_{p_i}$ from each frame $I_i$ with the same index $i$ across different videos, using $F(\cdot)$ as the feature extractor:
   $x_{p_i} = F(I_i), \quad I_i \in \mathbb{R}^{B \times C \times H \times W}$   (4)
Step 3: Concatenate the features:
   $patches = x_{p_1} \oplus x_{p_2} \oplus \cdots \oplus x_{p_F}, \quad patches \in \mathbb{R}^{B \times F \times d}$   (5)
   where $\oplus$ denotes concatenation.

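A possible PyTorch rendering of Algorithm 1 with the "Along to batch" strategy is sketched below; the extractor can be any module that maps (B, C, H, W) to (B, d), such as the FrameCompressor sketched in Section 2.2 (class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class TimeWrapper(nn.Module):
    """Apply a 2D feature extractor frame by frame ("Along to batch")."""
    def __init__(self, extractor: nn.Module):
        super().__init__()
        self.extractor = extractor                           # maps (B, C, H, W) -> (B, d)

    def forward(self, clips):                                # clips: (B, F, C, H, W)
        clips = clips.permute(1, 0, 2, 3, 4)                 # Step 1: (F, B, C, H, W)
        feats = [self.extractor(frame) for frame in clips]   # Step 2: F tensors of (B, d)
        return torch.stack(feats, dim=1)                     # Step 3: patches (B, F, d)

# Example: 2 clips of 16 frames compressed to 512-dimensional patches.
# patches = TimeWrapper(FrameCompressor())(torch.randn(2, 16, 3, 224, 224))  # (2, 16, 512)
```
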
2.4. Temporal Domain

The spatial domain produces a spatiotemporal feature whose patches serve as the input to the temporal domain. When considering time sequence processing, Transformer naturally comes to mind, given its excellent performance in various domains with its self-attention mechanism. We believe that Transformer's role in capturing the relationships between patches is irreplaceable: it excels at learning the differences between patches along the time axis. However, it falls short in effectively learning within the patches themselves.
Each frame’s feature space is mutually independent, lacking the interaction and integration of information within and between patches. Hence, we began contemplating how to learn a mixture of spatial and temporal information in the temporal domain. We tried various methods to blend spatial and temporal information, such as adding a Transformer to transpose the patches by 90 degrees to learn spatial correlations in the channel dimension or using multi-scale 1D convolutions in conjunction with pooling. However, the results did not meet our expectations.
Finally, we adopted the Mixer Layer from MLP-Mixer for fusion. Its idea of transposing by 90 degrees aligns with our previous attempts: it approaches the problem from two perspectives, the temporal and the spatial. The Spatiotemporal Mixer provides a fusion of temporal and spatial features. We feed it the 2D feature vectors obtained from the spatial domain, where $t$ represents $t$ consecutive frame features along the time axis and $p$ represents the independent feature of each frame; thus, $Patches \in \mathbb{R}^{t \times p}$. The Mixer layer does not use convolutions or self-attention; instead, it relies only on basic matrix multiplication routines and modifications to the data layout, including reshapes and transpositions. A convolution is more complex than the plain matrix multiplication in MLPs. Therefore, using the Mixer layer significantly reduces the number of parameters and the computational complexity of the proposed architecture.
The first MLP block (MLP1) is named Spatial-Tube mixing. It performs feature extraction on different locations in space at different times. By sliding along the feature axis, it learns the temporal features of the same spatial location along the time axis, as shown in Figure 3a. This is equivalent to mixing the features of the same spatial region across consecutive frames of a video, as shown in Figure 3b. The MLP1 is designed with weight sharing, similar to the operation of 1D convolution. It allows communication and feature mixing between different spatial positions and temporal channels on patches.
We first transpose the patches to obtain $patches^{T}$ and apply the MLP1 mapping $\mathbb{R}^t \rightarrow \mathbb{R}^t$, which maintains the dimensionality and allows different spatial positions in the patches to communicate and mix information across different spatial locations.
The second MLP block (MLP2) is the Temporal-Frame mixing. In this step, we extract features from different spatial positions at the same time. By sliding along the time axis, it learns the features of different spatial positions simultaneously, as shown in Figure 4a. This is equivalent to mixing features of different spatial locations within the same frame of a video, as shown in Figure 4b. The MLP2 is also designed with weight sharing, enabling communication and feature mixing between fused features from different frames. Calculations are performed on the rows of the patches, allowing weight sharing between rows. Through the MLP2 mapping $\mathbb{R}^p \rightarrow \mathbb{R}^p$, which maintains the dimensionality, different patches can communicate, and feature information can be mixed across different times.
Inside the Spatiotemporal Mixer, there are two MLPs. Each MLP is composed of two fully connected layers with Gaussian Error Linear Unit (GELU) activation, where the fully connected layers are denoted as $W$ and GELU as $\sigma$. The spatiotemporal features serve as the input to the first MLP block, producing $U_{*,i}$ as described in Equation (6). Afterwards, $U$ is used as the input to the second MLP block to obtain $Y_{j,*}$. In both blocks, a skip connection adds the block input element-wise to its output, as described in Equations (6) and (7).
$U_{*,i} = Patches_{*,i} + W_2\,\sigma\!\left(W_1\,\mathrm{LayerNorm}(Patches)_{*,i}\right), \quad \text{for } i = 1 \ldots p$   (6)

$Y_{j,*} = U_{j,*} + W_4\,\sigma\!\left(W_3\,\mathrm{LayerNorm}(U)_{j,*}\right), \quad \text{for } j = 1 \ldots t$   (7)
Unlike Transformer, whose cost grows quadratically with the number of input patches, the complexity of the MLPs grows only linearly. The skip connections and LayerNorm in the Spatiotemporal Mixer enable it to learn the varying importance of different stages and automatically adjust the scaling of the inputs.
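A sketch of one Spatiotemporal Mixer layer following Equations (6) and (7), written in the style of an MLP-Mixer block; the hidden width and the class names are our choices for illustration:

```python
import torch
import torch.nn as nn

class MixerMLP(nn.Module):
    """Two fully connected layers with GELU: W2 * sigma(W1 * x)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

class SpatiotemporalMixerLayer(nn.Module):
    def __init__(self, t_frames: int = 16, patch_size: int = 512, hidden: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(patch_size)
        self.mlp1 = MixerMLP(t_frames, hidden)    # Spatial-Tube mixing, maps R^t -> R^t
        self.norm2 = nn.LayerNorm(patch_size)
        self.mlp2 = MixerMLP(patch_size, hidden)  # Temporal-Frame mixing, maps R^p -> R^p

    def forward(self, patches):                   # patches: (B, t, p)
        # Eq. (6): transpose so MLP1 mixes along the time axis for every feature position i.
        u = patches + self.mlp1(self.norm1(patches).transpose(1, 2)).transpose(1, 2)
        # Eq. (7): MLP2 mixes along the feature axis for every frame j, with a skip connection.
        return u + self.mlp2(self.norm2(u))

# Example: three stacked layers on a (batch, 16 frames, 512-dim patch) feature.
x = torch.randn(2, 16, 512)
mixer = nn.Sequential(*[SpatiotemporalMixerLayer() for _ in range(3)])
y = mixer(x)                                      # (2, 16, 512)
```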

2.5. MLP Head

The MLP head consists of LayerNorm [17] and a linear transformation, as shown in Figure 5. The output of the Spatiotemporal Mixer is normalized, and the features are averaged over the different time steps. A final Softmax layer outputs the action classification result.
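A compact sketch of this head, assuming 400 output classes for Kinetics-400; this is a minimal rendering under those assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """LayerNorm -> mean over time -> Linear -> Softmax (Figure 5)."""
    def __init__(self, patch_size: int = 512, num_classes: int = 400):
        super().__init__()
        self.norm = nn.LayerNorm(patch_size)
        self.fc = nn.Linear(patch_size, num_classes)

    def forward(self, x):                  # x: (B, t, p) from the Spatiotemporal Mixer
        x = self.norm(x).mean(dim=1)       # average the normalized features over time steps
        logits = self.fc(x)
        # Softmax gives class probabilities at inference; during training with
        # nn.CrossEntropyLoss (Section 3.2), the raw logits would be used instead.
        return logits.softmax(dim=-1)
```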

3. Experimental Results

The hardware and computational resources used in the experiments are as follows: an Intel® Core™ i7-10700K CPU @ 3.80 GHz, 64 GB RAM, and a GeForce RTX 3090 GPU with 24 GB VRAM. The operating system is Ubuntu 20.04.6 LTS, and the CUDA version is 11.7. The package versions used are Python 3.8.10, PyTorch 2.0.1, PyTorchVideo 0.1.5, and Pillow-SIMD 9.0.0.

3.1. Dataset

Kinetics-400 [18] was sourced from videos on the YouTube platform and consists of 400 different types of actions. Each video has a duration of approximately 10 s, various resolutions, and contains between 250 and 300 frames. The dataset contains a total of 306,245 videos, which are divided into training, validation, and test sets. The training set includes 250 to 1000 videos per category, the validation set includes 50 videos per category, and the test set includes 100 videos per category. Due to factors such as video corruption, video encoding issues, copyright takedowns, and others, the number of usable videos does not match the original dataset's quantity. We obtained a total of 259,664 usable videos from Kinetics-400, including the train and test sets. This represents a decrease of approximately 2.5% in video quantity compared to the original dataset.

3.2. Training Method

We set a batch size of 32. Each frame is randomly cropped to 224 × 224 pixels, with the shorter side randomly sampled in [256, 320], following Non-local Networks [19]. Unless otherwise specified, we performed experiments with an even sampling of 16 frames per video ($F = 16$) and used ResNet-50 [20] as the backbone in the spatial domain. We set the number of epochs to 20, which allowed for faster training. The optimizer used was AdamW [21] with a learning rate of $10^{-4}$, and Cross Entropy was employed as the loss function. We designed a learning rate schedule that reduces the learning rate by a factor of 0.1 at epochs 7 and 15. Due to GPU VRAM limitations, we employed gradient accumulation, which enabled us to simulate batch sizes larger than the available VRAM would otherwise allow.
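A hedged sketch of this training setup (AdamW at $10^{-4}$, step decay at epochs 7 and 15, gradient accumulation); `model`, `train_loader`, and the accumulation factor of 4 are placeholders and assumptions, with `model` assumed to return class logits:

```python
import torch
import torch.nn as nn

accum_steps = 4   # assumed accumulation factor to simulate a larger effective batch size
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[7, 15], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):
    optimizer.zero_grad()
    for step, (clips, labels) in enumerate(train_loader):   # clips: (B, 16, 3, 224, 224)
        loss = criterion(model(clips), labels) / accum_steps
        loss.backward()                                      # accumulate gradients over steps
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()                                         # decay the LR at epochs 7 and 15
```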

3.3. Various Spatiotemporal Mechanism

The temporal domain part of SCSM, as mentioned in Section 2.4, can employ various mechanisms to handle time series features, and we designed several spatiotemporal feature fusion architectures. SC_winslide attempts to fuse spatiotemporal features through multiple sliding convolutions. We then designed SCTA using an approach similar to the ViT architecture, transforming each frame into a patch for training; with the self-attention mechanism of Transformer, it achieved an accuracy of 62.3%. Subsequently, we designed the SCTA_SlowFast architecture to learn differences under fast and slow frame rates, similar to SlowFast; compared to SCTA, adding the fast and slow mechanisms increases the accuracy by approximately 2.5%. In SCTAX, we attempted to add a Transformer that transposes the spatiotemporal features and computes from a different spatiotemporal perspective, aiming to learn spatiotemporal interaction information; however, it did not perform as expected. Finally, the Mixer Layer transposition idea in MLP-Mixer is similar to our SCTAX method, so we developed the SCSM architecture, which employs Mixer Layers to fuse information from the temporal and spatial features. This approach achieved an accuracy of 67.4%, a relative improvement of approximately 8.1% over SCTA.
In the MLP-Mixer paper [14], the accuracy of MLP-Mixer for image classification on ImageNet was lower than that of Transformer. However, when we applied it to spatiotemporal fusion, the situation was exactly the opposite (as shown in Table 1). This precisely illustrates the importance of spatiotemporal fusion information, as previously mentioned. We believe that Transformer lacks the ability to integrate information within patches. Our spatiotemporal fusion method is clearly superior to the self-attention mechanism: the difference in accuracy of 5.1 percentage points between SCSM and SCTA supports our idea of spatiotemporal fusion.

3.4. Spatial Extractor

The SCSM architecture can incorporate various image feature extractors (as shown in Table 2), demonstrating powerful transfer learning capability. We found that feature extractors obtained through neural architecture search (NAS) did not perform as well as they do on ImageNet, and their validation accuracy fluctuated dramatically during training. We think that excessive fine-tuning on the original data led to poor generalization.
From Table 2, we can observe that feature extractors based on the Transformer architecture show better performance. We believe this is due to the Batch Normalization used within the image feature extractors: with the Time wrapper architecture in the spatial domain, SCTA appears considerably less effective when Batch Normalization is used. To validate this idea, we tested the SCSM architecture using ResNet50 and RegNet_X_800MF as backbones and compared the results with SCTA.
We can observe that the Spatiotemporal Mixer in SCSM effectively compensates for the impact of Batch Normalization. There is a significant difference in accuracy between SCSM and SCTA when both use ResNet50 as the backbone, whereas there is no difference when using other image feature extractors without Batch Normalization. The Spatiotemporal Mixer is less sensitive to the presence or absence of Batch Normalization, allowing the SCSM architecture to easily employ various CNN-based extractors (most convolutional architectures still adopt Batch Normalization). The ConvNeXt architecture, which is also a CNN structure, uses Layer Normalization because it mimics the Transformer architecture, and its effectiveness here is consistent with its performance on ImageNet. As the feature extractor improves, our recognition rates grow correspondingly.
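For illustration, a pretrained torchvision ConvNeXt_Tiny could be dropped into the frame compression step in place of ResNet-50 along the following lines; the attribute indices refer to the current torchvision implementation of ConvNeXt, and the helper name is ours:

```python
import torch.nn as nn
import torchvision.models as models

def convnext_tiny_extractor(patch_size: int = 512) -> nn.Module:
    """ImageNet-pretrained ConvNeXt_Tiny with its classifier head replaced by a projection."""
    m = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1)
    feat_dim = m.classifier[2].in_features               # 768 for ConvNeXt_Tiny
    m.classifier[2] = nn.Linear(feat_dim, patch_size)    # project features to the patch size
    return m                                             # maps (B, 3, H, W) -> (B, patch_size)
```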
In the case of using ConvNeXt_Tiny as the backbone, the parameter count for SCTA is 41 M, while for SCSM it is 34 M; with the three-layer temporal domain, SCSM reduces the parameter count by 15% compared to SCTA. By increasing the number of Spatiotemporal Mixer layers in SCSM to five while maintaining a parameter count similar to SCTA, the accuracy improves to 73.7%, surpassing the accuracy achieved by SCTA.

3.5. Patch Size Analysis

The feature generated by the feature extractor in the spatial domain serves as one of the patches in our spatiotemporal features. We tested the impact of different patch sizes on SCSM with a ResNet-50 backbone. A larger patch size generated by the image feature extractor means that more image features can be represented. According to our test results, shown in Figure 6, a patch size of 512 achieved the highest accuracy, while a patch size of 256 showed a slight decrease. Patch sizes below 256 dropped significantly in validation accuracy, indicating an ineffective representation of image features. For a patch size of 1024, we believe the same training time was insufficient to adequately train the model. Therefore, we selected a patch size of 512 for our experiments.

3.6. N Spatiotemporal Mixer Layer

We tested the effectiveness of the Spatiotemporal Mixer using ResNet-50 as a feature extractor at different numbers of layers. According to Table 3, we found that the results were highest when using three layers, and increasing the depth did not lead to a notable enhancement.

3.7. Time Wrapper

We compared the “All to batch” method and “Along to batch” method used to handle batches in the Time wrapper. The results in Table 4 demonstrate that using the “Along to batch” method, which maintains frame independence and allows the SCSM temporal domain to handle it, leads to better results compared to mixing frames within the entire batch.

3.8. ImageNet Pretrain

In Figure 7, we trained SCSM using the ConvNeXt_Tiny backbone, following our training setup, and compared the pretrained model with the non-pretrained model over the first 10 epochs. We observed a significant gap in the first epoch: the red line, representing the pretrained model, achieved an accuracy of over 60%, while the blue line, representing the non-pretrained model, reached less than 10%. Training the feature extractor without pretraining may require more than three times the original training time to reach convergence. Due to our limited resources, we decided to use the model pretrained on ImageNet, which greatly reduced our training time. Most video recognition models [6,25] cannot directly use an image feature extractor as their backbone; our approach allows us to achieve the reported validation accuracy in a very short training time.

3.9. Comparison with State-of-the-Art

The comparisons with the state-of-the-art methods are listed in Table 5. Our SCSM architecture takes a different approach from other methods that pursue higher accuracy with multiple predictions. For example, ip-CSN-152 and SlowFast require 30 predictions (views) per video, which clearly has a substantial impact on the inference time. On the other hand, Transformer-based models such as ViViT-L and TimeSformer rely on fewer predictions but have high complexity and require a longer inference time. With SCSM, we achieve a comparable validation accuracy of 73.4% with just a single prediction. Additionally, our model has a lower number of parameters and computational requirements compared to the other models. Moreover, SCSM offers flexibility in terms of scalability, allowing the frame rate to be chosen and the architecture to be expanded freely.
The SCSM architecture consists of two domains: the spatial domain and the temporal domain. In the spatial domain, the Time wrapper allows the selection of feature extractors with higher complexity to achieve better accuracy; conversely, a lower-complexity extractor can be chosen to trade accuracy for faster prediction. In the temporal domain, SCSM allows multiple layers to be stacked to extract more spatiotemporal fusion information, thereby improving the accuracy. The overall SCSM architecture can be adjusted according to the user's needs.
To showcase the scalability of our design, we developed two versions of the model: SCSM_T (tiny) and SCSM_B (base). SCSM_T consists of the Time wrapper with a ConvNeXt_Tiny spatial extractor and three layers of Spatiotemporal Mixers, while SCSM_B comprises the Time wrapper with a ConvNeXt_Small spatial extractor and five layers of Spatiotemporal Mixers. With these variants, we achieve a balance between accuracy and computational requirements, as well as high flexibility in choosing the model architecture.
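To illustrate how the pieces fit together, one possible assembly of these variants is sketched below, reusing the modules sketched in Sections 2.2–2.5 and 3.4 (TimeWrapper, SpatiotemporalMixerLayer, MLPHead, convnext_tiny_extractor); the builder function and its defaults are illustrative, not the authors' released code:

```python
import torch.nn as nn

def build_scsm(extractor: nn.Module, mixer_layers: int,
               patch_size: int = 512, frames: int = 16, num_classes: int = 400) -> nn.Module:
    """Assemble the SCSM pipeline: Time wrapper -> N Spatiotemporal Mixer layers -> MLP head."""
    return nn.Sequential(
        TimeWrapper(extractor),
        *[SpatiotemporalMixerLayer(frames, patch_size) for _ in range(mixer_layers)],
        MLPHead(patch_size, num_classes),
    )

# SCSM_T: ConvNeXt_Tiny spatial extractor with 3 mixer layers.
scsm_t = build_scsm(convnext_tiny_extractor(), mixer_layers=3)
# SCSM_B would analogously use a ConvNeXt_Small extractor with 5 mixer layers.
```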

4. Conclusions

We developed a novel SCSM video recognition model that achieves a balance between accuracy and computational complexity, demonstrating the feasibility of this network architecture design. The proposed SCSM model uses a new hierarchical architecture that combines spatial compression with a spatiotemporal fusion method, consisting of a spatial domain and a temporal domain. The SCSM model maintains the independence of each frame in the spatial domain for feature extraction and fuses the spatiotemporal features in the temporal domain. SCSM can extract video features with minimal computation and few parameters. The architecture exhibits strong scalability and transfer learning capabilities, allowing users to adjust the model based on their specific needs, whether speed- or accuracy-oriented, and to expand it for different frame rate requirements. It has a simple overall concept and low prediction and training costs, making it highly beneficial as a backbone for multi-task video feature extraction or industrial applications. One limitation of the proposed method is that the recognition accuracy drops when an action is captured from camera angles not included in the dataset. Another limitation is that it is not good at recognizing mutual interactions between people or with dynamic objects in the scene. These limitations are also common in other existing methods, and in future work we will investigate mechanisms to deal with them. We also plan to employ Neural Architecture Search (NAS) [28] to select the most suitable architecture for SCSM. In addition, applying SCSM to other multi-task learning tasks would further demonstrate the benefits of its lightweight and low computational requirements in action localization, video understanding, and video analysis.

Author Contributions

Conceptualization, C.L.; Methodology, H.-Y.C.; Software, C.L.; Validation, C.-C.Y.; Investigation, H.-Y.C.; Resources, H.-Y.C.; Writing—original draft, C.L.; Writing—review and editing, H.-Y.C.; Visualization, C.-C.Y.; Supervision, C.-C.Y.; Project administration, C.-C.Y.; Funding acquisition, H.-Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council (NSTC), Taiwan, under grant number 112-2221-E-008-069-MY3.

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, H.B.; Zhang, Y.X.; Zhong, B.; Lei, Q.; Yang, L.; Du, J.X.; Chen, D.S. A comprehensive survey of vision-based human action recognition methods. Sensors 2019, 19, 1005. [Google Scholar] [CrossRef] [PubMed]
  2. Liang, C.; Yang, J.; Du, R.; Hu, W.; Tie, Y. Non-Uniform Motion Aggregation with Graph Convolutional Networks for Skeleton-Based Human Action Recognition. Electronics 2023, 12, 4466. [Google Scholar] [CrossRef]
  3. Ji, R. Research on basketball shooting action based on image feature extraction and machine learning. IEEE Access 2020, 8, 138743–138751. [Google Scholar] [CrossRef]
  4. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
  5. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef] [PubMed]
  6. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
  7. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  8. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  10. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
  11. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  12. Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362. [Google Scholar] [CrossRef]
  13. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  14. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
  15. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  16. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. Adv. Neural Inf. Process. Syst. 1993, 6. [Google Scholar] [CrossRef]
  17. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  18. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  19. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  21. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  22. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10428–10436. [Google Scholar]
  23. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (PMLR), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  24. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
  25. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? ICML 2021, 2, 4. [Google Scholar]
  26. Hara, K.; Kataoka, H.; Satoh, Y. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6546–6555. [Google Scholar]
  27. Tran, D.; Wang, H.; Torresani, L.; Feiszli, M. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5552–5561. [Google Scholar]
  28. Elsken, T.; Metzen, J.H.; Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 2019, 20, 1997–2017. [Google Scholar]
Figure 1. SCSM architecture. The feature extractor in the Time wrapper compresses frames into one-dimensional vectors and combines them to create a spatiotemporal feature. This feature is then passed via the Spatiotemporal Mixer to extract fusion information. The MLP Head classifies the output into corresponding actions.
Figure 2. Spatiotemporal feature refers to the features extracted from the same frame and represented by the same color. Features generated by consecutive frames are arranged in parallel along the time axis.
Figure 3. Spatial-Tube mixing. (a) Learning the temporal features of the same spatial location along the time axis; (b) Mixing the features of the same spatial region across consecutive frames.
Figure 4. Temporal-Frame mixing. (a) Learning the features of different spatial positions simultaneously; (b) Mixing features of different spatial locations within the same frame.
Figure 5. MLP head is composed of LayerNorm, mean, Linear, and Softmax in sequence.
Figure 6. Validation accuracy on Kinetics-400 for different patch sizes.
Figure 7. Comparison of models trained with and without an ImageNet pre-trained backbone, showing the magnitude of the difference brought about by pre-training.
Table 1. Comparison of different spatiotemporal mechanisms. We experimented with various approaches, and the Spatiotemporal Mixer yielded the best accuracy in our tests.

Model | Mechanism | Top1 | Top5
Spatial extractor (ResNet50) pretrained on ImageNet1k
SC_winslide | Multi-scale window + Pooling | 57.4 | 81.4
SCTA | Transformer | 62.3 | 85.2
SCTAX | Transformer (row) + Transformer (column) | 57.0 | 81.7
SCTA_SlowFast | Transformer (slow) + Transformer (fast) | 63.8 | 86.2
SCSM | Spatiotemporal Mixer | 67.4 | 87.7
Table 2. Different spatial extractors used as the backbone in the spatial domain.

Spatial Extractor (Se) | ImNet Top1 | Se Params (M) | Se GFLOPS | Top1 | Top5
Spatial extractor pretrained on ImageNet1k, using SCTA model
RegNet_X_800MF [22] | 75.2 | 7.3 | 0.80 | 56.0 | 78.1
EfficientNet_B3 [23] | 82.0 | 12.2 | 1.83 | 60.5 | 83.6
ResNet50 | 76.1 | 25.6 | 4.09 | 62.3 | 85.2
Swin_V2_T [24] | 82.0 | 28.4 | 5.94 | 73.0 | 90.8
ConvNeXt_Tiny | 82.5 | 28.6 | 4.46 | 73.5 | 90.5
Spatial extractor pretrained on ImageNet1k, using SCSM model
RegNet_X_800MF | 75.2 | 7.3 | 0.80 | 67.8 | 87.3
ResNet50 | 82.0 | 12.2 | 1.83 | 67.4 | 87.7
ConvNeXt_Tiny | 82.5 | 28.6 | 4.46 | 73.4 | 90.1
ConvNeXt_Tiny (5 layers Spatiotemporal Mixer) | 82.5 | 28.6 | 4.46 | 73.7 | 90.8
Table 3. Spatiotemporal Mixer using different numbers of Spatiotemporal Mixer layers.

Spatiotemporal Mixer Layers | Top1 | Top5
2 | 63.6 | 85.4
3 | 67.4 | 87.7
4 | 67.0 | 87.4
Table 4. Time wrapper. We processed the B and F dimensions using two methods: "All to batch" and "Along to batch".

Time Wrapper | Top1 | Top5
ResNet-50 pretrained (ImageNet1k), using SCSM
All to batch | 64.2 | 85.3
Along to batch | 67.4 | 87.8
Table 5. Comparison with the state-of-the-art on Kinetics-400. The SCSM backbone utilizes ConvNeXt.

Model | Frames | Pretrain | Param (M) | GFLOPs × Views | Top1 | Top5
ResNet3D-50 [26] | 16 | - | 47.0 | 80.3 × 1 | 61.3 | 83.1
I3D-RGB [7] | 25 | - | 12 | 108 × 1 | 68.4 | 88.0
I3D-RGB [7] | 25 | IN1k | 12 | 108 × 1 | 71.1 | 89.3
R(2 + 1)D [6] | 32 | - | 63.6 | 152 × 115 | 72.0 | 90.0
ip-CSN-152 [27] | - | - | 32.8 | 109 × 30 | 77.8 | 92.8
SlowFast 4 × 16, R50 [8] | - | - | 34.4 | 36.1 × 30 | 75.6 | 92.1
SlowFast 8 × 8, R101 [8] | - | - | 53.7 | 106 × 30 | 77.9 | 93.2
ViViT-L/16 × 2 FE [10] | 32 | IN1k | - | 3980 × 3 | 80.6 | 92.7
TimeSformer [25] | 8 | IN1k | 121.4 | 196.6 × 3 | 75.8 | -
TimeSformer-HR [25] | 16 | IN21k | - | 1703.3 × 3 | 79.7 | 94.4
SCSM_T | 16 | IN1k | 34.7 | 71.4 × 1 | 73.4 | 90.1
SCSM_B | 16 | IN1k | 60.6 | 138.9 × 1 | 75.1 | 91.6