1. Introduction
In recent years, video understanding has attracted increasing attention from the academic community [
1,
2,
3]. The accurate recognition of human action in videos is a key step for most video understanding applications. In the literature, several surveys [
4,
5,
6,
7,
8] have summarized efforts to make human action or activity recognition more effective by extracting new semantic or visual features. To explore semantic features, researchers have investigated poses [
9] and poselets [
10], objects [
11], scenes [
12], skeletons [
13,
14], key-frames [
15], attributes [
16] and inference methods [
17], etc. For example, the Dense Trajectory Features (DTF) [
18] and the Improved Dense Trajectories (IDT) [
19] proved highly effective. Recently, with the rise of deep learning, approaches based on convolutional neural networks, temporal feature modeling and multi-branch network designs have been proposed to address the challenges in human action recognition [
Further, to capture spatial and temporal features for video analysis, 3D convolution was proposed by Ji et al. [
20]. However, because 3D CNNs stack 3D convolutions layer by layer, 3D CNN models incur higher training complexity as well as higher memory requirements [
21]. In particular, many researchers have recently focused on modeling motion and temporal structure [
22]. Accordingly, a short-term motion modeling method based on the optical flow technique was proposed in [
23]. The Temporal Segment Network (TSN) [
24] introduced a sparse segment sampling strategy over the entire video to model long-term temporal features. Furthermore, frameworks based on 3D CNNs [
20,
25] utilize convolution along both the temporal and spatial dimensions so that the model can directly learn the relation between temporal and spatial features. More recently, several approaches have suggested using (2+1)D CNNs instead of 3D CNNs to achieve more explicit temporal modeling [
26,
27].
Some actions, which we call “scene-related” actions, can be inferred from spatial information alone, such as “diving” or “riding a horse”, whereas other actions, which we call “temporal-related” actions, depend more closely on temporal information, such as “moving something from left to right” or “moving something approaching something”; the latter are more challenging. From this observation, we deduce that the importance of temporal and spatial features is not uniform across actions, and that the two types of features compete with each other when a model tries to recognize an action. Therefore, we believe that a good spatial-temporal feature extractor should be able to determine which type of feature should dominate in order to build a better model.
In the literature, spatial information (or feature) extraction methods are relatively mature, and how to efficiently extract the temporal information of a video has become one of the key challenges in current action recognition research. We may regard a video as a four-dimensional representation $X \in \mathbb{R}^{W \times H \times T \times C}$. In particular, W and H are the width and height of a single video frame, which can be collectively referred to as the spatial representation of the frame, denoted S; T and C represent the temporal and channel dimensions, respectively.
It should be mentioned that most of the previous work tends to focus on the two dimensions of
S and
T, but this study focuses on the channel dimension (
C) together with the deep learning structure. As a breakthrough, the Temporal Shift Module (TSM) network [
28] made researchers realize that 2D and 3D convolutions are, to a certain extent, equivalent for extracting temporal information from a video, while the 2D network requires far fewer parameters and computations. In consideration of these facts, we propose an efficient convolutional network for action recognition with the following main contributions.
Firstly, given that existing 2D structures extract temporal context features (denoted as foreground changes) less effectively than spatial features, we propose a motion attention (MA) module to obtain enhanced temporal features.
Secondly, for better utilization of the obtained features, a spatial-temporal channel attention (STCA) mechanism is used to perform a trainable feature fusion.
Finally, this research reveals that, for temporal-related action recognition datasets, spatial-temporal features have a complex relationship of cooperation and competition inside the convolution channels. The rest of the paper is organized as follows: first, some related works are introduced in
Section 2; second, the detailed training process for the proposed approach is depicted in
Section 3; third, some experimental results are reported in
Section 4; finally, conclusions and discussions are given in
Section 5.
2. Related Works
Compared with single image recognition [
7], the main difference in video understanding is the presence of temporal information. In recent years, research on the deep mining of spatial information using convolutional neural networks has achieved fruitful results. However, how to effectively model temporal information is still one of the most important challenges in current action recognition research. For this reason, Feichtenhofer et al. [
29] suggested that current
architectures are not able to take full advantage of temporal information, and their performance is consequently often dominated by spatial (appearance) recognition. Overall, according to the research progress reported in the current literature, the video-based action recognition task mainly faces the following three challenges.
Effective extraction of spatial and temporal features. The extraction of temporal features, in particular, is the focus of current research and remains one of the major challenges.
The systematic integration of spatial and temporal features. From the early score fusion [
23] to the recent pixel-level fusion [
29], researchers have proposed many novel spatial-temporal feature fusion methods, but at present, there is still much room for improvement in the effectiveness of each method.
Improving the efficiency of action recognition networks. Efficiency largely determines whether a proposed algorithm can be applied in practice. It is also a focus of current research and will be discussed in the following sections.
In the literature, researchers have proposed several advanced spatial-temporal convolution structures. For an intuitive comparison, the structures proposed by other researchers and the new structure proposed in this paper are presented together here; they can be formally divided into six categories: C2D [
30,
31], C3D [
25,
32], cascade 3D [
27,
33], a reversed cascade 3D (derived from [
27]), parallel [
34] and DTP.
Figure 1a illustrates a C2D Residual Block (TSN [
30] and Temporal Relational Reasoning Network (TRN) [
31] belong to this category);
Figure 1b shows a C3D network;
Figure 1c indicates a cascade 3D network, which decomposes a standard 3D kernel into a 1D temporal convolution followed by a 2D spatial convolution;
Figure 1d demonstrates a reversed cascade 3D architecture, which exchanges the order of temporal and spatial convolution in cascade 3D;
Figure 1e displays a parallel architecture, which models the spatial and temporal information in two independent branches;
Figure 1f shows a new channel-separate temporal convolution proposed to replace the standard temporal convolution in
Figure 1e; the resulting structure is named DTP.
As for the parallel structure, models such as SlowFast [
35] and ArtNet [
34] independently extract the temporal and spatial information in a video. In this paper, based on the parallel structure, we decompose the 3D convolution into a spatial and a temporal convolution. The core contribution is that we no longer perform these two convolutions in a cascade but extract the spatial and temporal information separately (as shown in
Figure 1e). By doing so, the network will not be disturbed by spatial information when extracting temporal information. Moreover, we separate the original temporal convolution in the parallel structure and use a depth-wise temporal convolution (which decouples the temporal and channel dimensions and performs temporal convolution depth by depth) instead of the standard temporal convolution. We call this structure DTP (depth-wise temporal parallel), which is shown in
Figure 1f.
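To make the DTP structure more concrete, the following is a minimal PyTorch sketch of the parallel design described above; the tensor layout (N, T, C, H, W), kernel sizes and the DTP class name are our own illustrative choices, not the exact implementation.

```python
import torch
import torch.nn as nn

class DTP(nn.Module):
    """Sketch of the depth-wise temporal parallel (DTP) idea: a standard 2D spatial
    convolution and a depth-wise 1D temporal convolution run in two independent
    branches. Channel count and kernel sizes are assumptions for illustration."""
    def __init__(self, channels, t_kernel=3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # groups=channels decouples the temporal and channel dimensions (depth-wise).
        self.temporal = nn.Conv1d(channels, channels, kernel_size=t_kernel,
                                  padding=t_kernel // 2, groups=channels)

    def forward(self, x):                       # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        # Spatial branch: fold time into the batch dimension and apply the 2D conv.
        fs = self.spatial(x.reshape(n * t, c, h, w)).reshape(n, t, c, h, w)
        # Temporal branch: fold space into the batch and convolve along T per channel.
        xt = x.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
        ft = self.temporal(xt).reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2)
        return fs, ft                           # fused later (see the STCA module)
```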
Currently, the existing structures can be roughly summarized into 5 categories: CNN+Pooling [
29,
30], CNN+RNN [
31,
36,
37], C3D [
25,
32], Efficient 3D [
33,
38], and new C2D (NC2D) [
3,
28]. We will introduce these architectures separately, together with their representative results on the Something-Something V1 dataset, one of the most widely used datasets in action recognition research.
CNN+Pooling [
29,
30,
39,
40]: Karpathy et al. [
39] were the first to successfully apply CNNs to the video domain. Then the two-stream structure proposed by Simonyan et al. [
40] surpassed hand-crafted feature methods in accuracy for the first time. Later, the TSN network [
30] exploited the redundancy between video frames on the basis of the “two-stream” network and proposed a sparse sampling strategy that greatly reduces the computational cost of the network compared to dense sampling. Although the computational complexity of 2D networks is lower than that of 3D networks, frame-by-frame processing makes it difficult to utilize the full temporal information contained in a video. According to their reported results, the CNN+Pooling method (represented by TSN [
30]) can obtain 19.7% Top-1 accuracy at the computational cost of 33G floating point operations (Flops) on the Something-Something V1 dataset.
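For illustration, the sparse sampling strategy described above can be sketched as follows; the function name and parameters are illustrative and not taken from the original TSN code.

```python
import numpy as np

def sparse_sample(num_frames, num_segments=8, training=True):
    """Sketch of TSN-style sparse sampling: the video is split into num_segments
    equal segments and one frame index is drawn from each segment (randomly during
    training, centrally at test time). Parameter names are illustrative."""
    seg_len = num_frames / num_segments
    if training:
        offsets = np.random.randint(0, max(int(seg_len), 1), size=num_segments)
    else:
        offsets = np.full(num_segments, int(seg_len) // 2)
    indices = (np.arange(num_segments) * seg_len).astype(int) + offsets
    return np.clip(indices, 0, num_frames - 1)
```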
CNN+RNN [
31,
36,
37]: Researchers have proposed several CNN+RNN methods for temporal modeling. However, this category of methods has an inherent flaw: when the CNN is used for encoding, the temporal information of the underlying features is often lost. Besides, only the temporal information in the top-level features is considered, so the recognition accuracy is not significantly improved compared to C2D networks such as the two-stream networks. According to their reported results, the CNN+RNN method (represented by TRN [
31]) can obtain 34.4% Top-1 accuracy at the computational cost of 33G Flops on the Something-Something V1 dataset.
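As a rough illustration of this category (not any specific published model), the sketch below encodes each frame with a 2D CNN and models temporal dependencies only on the resulting top-level features with an LSTM, which is exactly where the limitation described above arises; it assumes a recent torchvision.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CNNRNN(nn.Module):
    """Generic CNN+RNN recognizer for illustration: a 2D CNN encodes each frame
    independently, so temporal information is only modeled by the LSTM on top."""
    def __init__(self, num_classes, hidden=512):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                                  # x: (N, T, 3, H, W)
        n, t = x.shape[:2]
        feats = self.encoder(x.flatten(0, 1)).flatten(1)   # (N*T, 512) per-frame features
        out, _ = self.rnn(feats.view(n, t, -1))            # temporal modeling on top-level features
        return self.fc(out[:, -1])                         # classify from the last hidden state
```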
C3D [
25,
32]: The 3D convolution network can simultaneously extract the spatial-temporal features of a video. Tran et al. [
25] first elaborated on how to apply the C3D network to the field of action recognition. However, the lack of large-scale video datasets for pre-training made training 3D networks very difficult. Although the recognition accuracy of this method was not much improved compared with previous methods, it opened up a new horizon for the application of 3D convolution in action recognition. Carreira et al. [
32] proposed the I3D network, which introduced a deep 3D network for the first time and creatively used the weights of a 2D-InceptionV1 network pre-trained on ImageNet to initialize the network through the “inflated” operation. The 3D network greatly improved the recognition accuracy, but its large number of parameters and heavy computation make it difficult to deploy on mobile devices with low computational budgets. According to their reported results, the C3D method (represented by I3D [
32]) can obtain 41.6% Top-1 accuracy at the computational cost of 306G Flops on the Something-Something V1 dataset.
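The “inflated” initialization mentioned above can be illustrated with a short sketch; the helper below follows the commonly described recipe of repeating a pre-trained 2D kernel along a new temporal axis and rescaling it, with the temporal kernel size as an assumed parameter.

```python
import torch

def inflate_conv_weight(w2d, time_dim=3):
    """Sketch of the "inflated" initialization: a pre-trained 2D kernel of shape
    (out, in, kH, kW) is repeated along a new temporal axis and divided by the
    temporal size so that the 3D filter initially reproduces the 2D response on
    a video made of repeated frames. time_dim is an assumed temporal kernel size."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)   # (out, in, kT, kH, kW)
    return w3d / time_dim
```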
Efficient 3D [
33,
41,
42,
43]: Since the high computational complexity of 3D convolution is mainly due to the simultaneous convolution of the three dimensions of the video (S, T, C), researchers proposed to decouple the two dimensions of time and space, and decompose the original 3D convolution into cascaded spatial convolution and temporal convolution [
33,
42,
43]. Another way to reduce the computational load is to use a hybrid spatial-temporal convolution structure such as “2D-3D” or “3D-2D” to replace part of the 3D convolutions in the 3D network [
44]. However, such a structure, like “CNN+RNN”, may lose part of the temporal information of the features. In addition, many experiments have shown that the spatial-temporal features (low-level and high-level) in different layers of the network are meaningful for action recognition [
27]. More recently, X3D [
38] revealed that networks with a thin channel dimension and high spatial-temporal resolution can be effective for video recognition. This is highly consistent with the view held in this research, i.e., making full use of the spatial-temporal channels can improve the performance of action recognition. According to their reported results, the Efficient 3D method (represented by S3D [
33]) can obtain 47.3% Top-1 accuracy at the computational cost of 66G Flops on the Something-Something V1 dataset.
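A minimal sketch of the cascaded decomposition discussed above (a spatial convolution followed by a temporal convolution) is given below; the channel widths and kernel sizes are assumptions for illustration, not the configuration of any particular published model.

```python
import torch.nn as nn

class Factorized3D(nn.Module):
    """Sketch of the "Efficient 3D" decomposition: a k x k x k 3D convolution is
    replaced by a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal
    convolution. Shapes and widths are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))

    def forward(self, x):                 # x: (N, C, T, H, W)
        return self.temporal(self.spatial(x))
```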
NC2D [
3,
28,
45]: The previous C2D networks have the advantage of fewer parameters and a low computational cost but struggle to use temporal information effectively compared to C3D networks. To address this issue, TSM [
28] shifts part of the channels along the temporal dimension so that channel information is shared between adjacent frames. Recently, the Gated Shift Module (GSM) [
3] has attracted the attention of researchers; it performs very well while being extremely light in terms of network width and parameters. This category has also provided much inspiration for our study. According to their reported results, the NC2D method (represented by GSM [
3]) can currently obtain 49.6% Top-1 accuracy at a computational cost of only 33G Flops on the Something-Something V1 dataset.
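For reference, the temporal-shift idea used by TSM can be sketched as follows; the fraction of shifted channels (fold_div) and the tensor layout are assumptions based on the usual description of the module, not an exact reproduction of the official code.

```python
import torch

def temporal_shift(x, fold_div=8):
    """Sketch of the temporal-shift idea: a fraction of the channels is shifted
    forward/backward along the temporal axis so that a plain 2D convolution can
    mix information from adjacent frames. x: (N, T, C, H, W)."""
    n, t, c, h, w = x.size()
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift one channel group backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift another group forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave the remaining channels untouched
    return out
```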
Motivations: In summary, both 2D and 3D methods have their own advantages, but this research mainly focuses on NC2D methods. On the one hand, 2D and 3D methods are largely equivalent in extracting video features, and 2D networks require far fewer parameters and computations [
28]; on the other hand, compared to 3D methods, NC2D methods generally extract spatial and temporal features separately, which ultimately requires a reasonable fusion of the two types of features. However, the existing methods do not evaluate, qualitatively or quantitatively, the contribution of the individual features extracted by most NC2D methods to the classification accuracy, which largely prevents NC2D methods from achieving better classification results on temporal-related video datasets. Before the contribution of spatial-temporal features can be evaluated reasonably, it is necessary to explore the relationship between them inside the convolution channels, which is the prerequisite for designing the fusion method. To this end, we start from the latest and typical NC2D structures to explore the internal relationship between spatial and temporal features inside the convolution channels. Our study further finds that, compared with their spatial modeling capabilities, the current NC2D structures have relatively weak temporal modeling capabilities in the convolution channels, so we propose an MA module to improve the temporal modeling capability of NC2D. As a result, MA enhances the temporal modeling capability of NC2D to some extent, and the experimental results also reveal the cooperative and competitive relationship of spatial-temporal features inside the convolution channels. Finally, in response to this cooperative and competitive relationship, we propose to use a parameter-trainable STCA module to fuse temporal and spatial features instead of a simple concatenation.
3. The Proposed Approach
Following the above analysis, in order to achieve the effective extraction of temporal features and a better fusion of temporal and spatial features, we propose a channel-wise spatial-temporal aggregation network (CSTANet, as shown in
Figure 2). Intuitively, our CSTANet is a stack of CSTA blocks. As shown in
Figure 2, to sample frames from a video we adopt the same strategy as described in TSN [
30]: the input video is first split into
N equal segments and then one frame is sampled from each segment. In detail, we adopt ResNet-50 [
46] as the backbone and replace its residual blocks with our CSTA blocks.
As shown in
Figure 3, our CSTA block contains two stages: Spatial-Temporal Feature Extraction (STE) and Spatial-Temporal Feature Fusion (STF). We use DTP with MA as the spatial-temporal extractor, in which MA is attached to the temporal branch to enhance the temporal context features. Next, an STF module is used to fuse the spatial-temporal features. For spatial-temporal feature extraction, we use two independent branches to model the spatial and temporal information, respectively: the spatial branch adopts a standard 2D convolution, while the temporal branch applies a 1D depth-wise convolution. To fuse the features extracted by the two branches, we construct a Spatial-Temporal Channel Attention (STCA) module that aggregates the spatial and temporal features channel-wise. In contrast to Group Spatial-Temporal (GST) [
47], which uses a hard-wired channel concatenation, we use a self-adaptive, trainable approach to aggregate spatial-temporal features in each channel of a block. Hence, the new model should be more discerning for different types of video actions. In the following, we first introduce the details of the Motion Attention (MA) module and then describe the Spatial-Temporal Channel Attention (STCA) module.
3.1. Motion Attention Module (MA)
From experience, a clip often contains a moving foreground and a relatively stable background. It can be observed that, in a temporal-related dataset, the backgrounds of many action categories may be closely similar. Therefore, in order to accurately distinguish these actions, we must pay more attention to the moving part of the video, which is denoted as the foreground. However, most current action recognition networks ignore the distinction between foreground and background and simply extract foreground (motion) and background (relatively static) features together. In fact, we believe that temporal features should be more closely related to the moving parts of a video. In order to further strengthen the model’s ability to extract temporal information, we propose a “parameter-free” foreground attention mechanism, which highlights the moving foreground features in the video without adding model parameters and, at the same time, improves the model’s ability to perceive foreground changes.
Given an input $X \in \mathbb{R}^{T \times C \times H \times W}$, the proposed MA module is mainly divided into three processing steps. As an illustration, we select the
t-th frame and the (
t + 1)-th frame as inputs. As shown in
Figure 4, a mapping function is first used to map a 3D tensor to a 2D tensor, based on a statistic of the activation tensor computed across the channel dimension. Next, we perform an element-wise subtraction between adjacent frames, followed by an activation function, to obtain an attention map for frame
t. Finally, the attention map is multiplied pixel-wise with frame
t’s features, where ⊖ denotes element-wise subtraction and ⊙ denotes element-wise multiplication. In detail, we first define the mapping function:
where $X_t(c, h, w)$ represents the pixel value of frame $t$ at position $(c, h, w)$. Correspondingly, $M_t(h, w)$ represents the value of the output 2D tensor at position $(h, w)$. After the global features are extracted, we next calculate the change between two adjacent frames to highlight the moving part, i.e., the M-tensors of the two frames obtained in the first step are subtracted and then activated with an activation function.
We have tried a variety of activation functions, such as Sigmoid, Softmax and tanh, and the experiments indicate that Sigmoid achieves the best activation effect. In the end, the attention map is still a 2D tensor, which has the same shape as
M. In other words, the foreground attention we propose is “spatial-specific”, i.e., the attention weights are applied to each spatial position of the original frame. Finally, we multiply the obtained attention map with frame
t pixel by pixel and obtain $\hat{X}_t = A_t \odot X_t$, where $\hat{X}_t$ represents frame $t$ after foreground attention is applied, and $A_t = \mathrm{Sigmoid}(M_{t+1} \ominus M_t)$ represents the foreground attention of frame $t$. In order to ensure that the output of MA ($\hat{X}$) and the input $X$ are consistent in the temporal dimension, the last frame of the video is directly used as the last frame of $\hat{X}$, i.e., $\hat{X}_T = X_T$.
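A compact sketch of the MA computation is given below, under the assumption that the channel statistic used for the 3D-to-2D mapping is the channel-wise mean (the text above only specifies that it is a statistic over the channel dimension); the tensor layout and function name are illustrative.

```python
import torch

def motion_attention(x):
    """Hedged sketch of the parameter-free Motion Attention (MA) module.
    x: (N, T, C, H, W). The channel-wise mean is assumed as the 3D->2D statistic."""
    m = x.mean(dim=2)                              # (N, T, H, W): per-frame 2D map M_t
    attn = torch.sigmoid(m[:, 1:] - m[:, :-1])     # A_t = Sigmoid(M_{t+1} - M_t)
    out = x.clone()
    out[:, :-1] = x[:, :-1] * attn.unsqueeze(2)    # broadcast over channels: X_t <- A_t * X_t
    # The last frame is copied unchanged, so the temporal length is preserved.
    return out
```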
3.2. Spatial-Temporal Channel Attention Module (STCA)
In the literature, both 3D convolution and cascaded spatial-temporal convolution can only perform a simple fusion of spatial-temporal features. Based on comparisons in the literature and on feature extraction practice, we believe that the spatial-temporal information contained in different layers and channels of the network should be treated differently: some channels may be more related to temporal information, while others may be more related to spatial information. Therefore, we propose a spatial-temporal feature fusion module (STCA), which tightly couples the temporal (
T) and spatial (
S). The structure is shown in
Figure 3. The output feature maps of the spatial and temporal convolutions, denoted $F_s$ and $F_t$, where $F_s \in \mathbb{R}^{T \times C \times H \times W}$ and $F_t \in \mathbb{R}^{T \times C \times H \times W}$, are both fed into the Squeeze module. The Squeeze module consists of a global average pooling layer and two consecutive fully connected (FC) layers. First, we process $F_s$ and $F_t$ with global average pooling to obtain $z_s$ and $z_t$, which are then passed through the two consecutive FC layers. Among them, in order to reduce the number of parameters, the first FC layer reduces the number of channels to $1/r$ of the original. In the experiments, $r$ is set to a fixed value. The outputs of the Squeeze module are $s_s$ and $s_t$. Finally, we concatenate $s_s$ and $s_t$ to form
S, followed by an FC layer and a Softmax activation applied to each row. That is, the temporal and spatial features in each channel obey a probability distribution as a whole, and within a channel, the spatial-temporal features form a competitive relationship. Formally, the attention matrix is obtained as $A = \mathrm{Softmax}(\mathrm{FC}(S))$ with the Softmax applied row-wise,
where
S is the matrix of size $C \times 2$ formed above. In each channel, two attention weights are calculated, representing temporal attention and spatial attention, respectively. In other words, we aggregate the spatial-temporal information channel by channel. Finally, we apply a matrix multiplication between the attention weights and the two branch features $F_s$ and $F_t$ to obtain the output of the CSTA block (denoted as
Y), whose dimension is the same as that of the input.
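The STCA fusion described above can be sketched as follows; the reduction ratio r, the Squeeze module shared between the two branches, and the exact layer shapes are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class STCA(nn.Module):
    """Hedged sketch of Spatial-Temporal Channel Attention (STCA) fusion.
    fs, ft: outputs of the spatial and temporal branches, both (N, T, C, H, W)."""
    def __init__(self, channels, r=16):
        super().__init__()
        # A single Squeeze module is shared by both branches here for brevity;
        # separate per-branch modules are an equally plausible reading.
        self.squeeze = nn.Sequential(nn.Linear(channels, channels // r),
                                     nn.ReLU(inplace=True),
                                     nn.Linear(channels // r, channels))
        self.fuse = nn.Linear(2, 2)                   # FC applied per channel before Softmax

    def forward(self, fs, ft):
        n, t, c, h, w = fs.shape
        # Squeeze: global average pooling over T, H, W followed by two FC layers.
        zs = self.squeeze(fs.mean(dim=(1, 3, 4)))     # (N, C)
        zt = self.squeeze(ft.mean(dim=(1, 3, 4)))     # (N, C)
        s = torch.stack([zs, zt], dim=-1)             # (N, C, 2): one (spatial, temporal) pair per channel
        a = torch.softmax(self.fuse(s), dim=-1)       # per-channel attention; each row sums to 1
        a = a.reshape(n, 1, c, 1, 1, 2)
        # Channel-wise weighted aggregation of the two branches.
        return fs * a[..., 0] + ft * a[..., 1]
```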
In summary, based on the DTP network structure, the MA module gives special attention to the action foreground in a video. Simultaneously, the temporal and spatial features are fused channel-wise by the STCA module. In comparison to the recent work GSM [
3], we keep the backbone structure of GST while developing a new fusion technique.
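Putting the pieces together, one possible wiring of a CSTA block, reusing the DTP, motion_attention and STCA sketches given earlier, might look as follows; the residual connection is assumed from the ResNet-50 backbone and is not spelled out in the text.

```python
import torch.nn as nn

class CSTABlock(nn.Module):
    """Possible wiring of one CSTA block (STE followed by STF), reusing the DTP,
    motion_attention and STCA sketches above. The residual connection is assumed."""
    def __init__(self, channels):
        super().__init__()
        self.dtp = DTP(channels)      # parallel spatial / depth-wise temporal branches
        self.stca = STCA(channels)    # trainable channel-wise spatial-temporal fusion

    def forward(self, x):                          # x: (N, T, C, H, W)
        fs, _ = self.dtp(x)                        # spatial branch on the raw features
        _, ft = self.dtp(motion_attention(x))      # temporal branch on MA-enhanced features
        # (Running DTP twice keeps the sketch short; a real block would share the branches.)
        return x + self.stca(fs, ft)               # residual connection, same shape as input
```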
5. Conclusions
In short, this paper explores how best to extract spatial-temporal information channel-wise from a video, and presents a new spatial-temporal feature architecture composed of DTP+MA+STCA. Based on the empirical finding that the temporal and spatial characteristics are both cooperative and competitive in the channel dimension, this paper proposes to replace the hard-wired fusion methods often used in previous studies with an adaptive feature fusion method. The proposed CSTANet integrates the above findings and provides a reference for researchers working on action recognition in videos.
In addition, we found that some categories, such as “pushing something from left to right” and “moving something closer to something”, are very similar in their motion paths and are sometimes difficult even for humans to distinguish. In fact, we believe that the current understanding of temporal and spatial characteristics is far from satisfactory, because spatial and temporal characteristics are complex and abstract concepts that involve at least the background, the target object, the human body, and the changes of all three along the temporal dimension. Therefore, it is necessary to further analyze in detail the contribution of each feature to action recognition, which will be the direction of our future work.