Article

Micro-Attention Branch, a Flexible Plug-In That Enhances Existing 3D ConvNets

1 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC), Beijing 100029, China
* Author to whom correspondence should be addressed.
Symmetry 2022, 14(4), 639; https://doi.org/10.3390/sym14040639
Submission received: 23 February 2022 / Revised: 16 March 2022 / Accepted: 19 March 2022 / Published: 22 March 2022

Abstract

In the field of video action recognition, increasing network depth produces an asymmetry between the number of parameters and the accuracy gained. We propose a micro-attention branch structure and a method for integrating attention branches into multi-branch 3D convolution networks. Our proposed attention branches alleviate this asymmetry. They can be flexibly added to a 3D convolution network in the form of a plug-in, without changing the overall structure of the original network. Through this structure, the newly constructed network fuses the attention features extracted by the attention branches in real time during feature extraction. By adding attention branches, the model can focus on the action subject more accurately and thus improve its accuracy. Moreover, when the building module of an existing network contains multiple sub-branches, 3D micro-attention branches adapt well to this scenario. On the Kinetics dataset, we use the proposed micro-attention branch structure to construct a deep network and compare it with the original network. The experimental results show that the recognition accuracy of the network with micro-attention branches is improved by 3.6% compared with the original network, while the number of parameters to be trained increases by only 0.6%.

1. Introduction

Video action recognition is a basic research direction of machine learning in the video field, and demand for it has grown significantly in recent years. It is widely used in many fields such as autonomous driving, smart homes, game interaction, video review, security, sports training, etc. A video can be considered a collection of images sampled at a specific frequency over a period of time. Compared with still images, it provides additional motion information, and these additional clues can be used to identify many actions. In addition, video can provide natural data augmentation for single-image classification.
Research into video recognition has usually developed on the basis of image-recognition methods, which are adjusted for video resolution and the number of classes in order to handle video data. A large number of these video action-recognition methods are insensitive to temporal modeling because they rely on shallow high-dimensional coding of local spatiotemporal features. Such methods only use spatial features and do not benefit much from temporal information.
2D ConvNet architectures are not able to take full advantage of temporal information, so their performance is often dominated by spatial (appearance) recognition. In order to better extract features along the temporal dimension, existing work mainly focuses on 3D convolutional neural networks. To date, these CNN models for video tasks extract features from the raw data [1,2] in the spatiotemporal dimensions. Moreover, some models additionally introduce optical flow computed from the video as a feature to better capture motion trajectories.
In order to better address the action-recognition task in video, existing methods are often realized by increasing the depth of the 3D convolution network [3,4,5,6,7,8,9,10,11,12,13,14,15,16]. However, this kind of scheme brings new problems. As network depth grows, the number of parameters to be trained also increases significantly, making the network harder to train; there is thus an asymmetry between parameter scale and accuracy. Deeper models are also more prone to overfitting, which hinders model transfer. Moreover, video clips contain complex backgrounds in addition to the action subjects, which makes it harder to focus on the action subjects. Therefore, we propose an attention branch structure suitable for deep 3D convolution networks. It assists the model in separating the action subject from the complex background. The structure is added to the existing network in the form of a plug-in to improve the accuracy of the network.
In this article, we design an attention branch for 3D convolutional neural networks. Attention branches act on the convolution kernels of the network and can be added to existing, well-performing deep networks in the form of plug-ins, at any location in the network. Since there is no need to change the framework of the existing network, the advantages of the existing network structure are retained. Although adding attention branches increases the number of network parameters, the training of the new network remains efficient due to its special structure. Our contributions are as follows:
  • We propose a 3D attention branch structure for convolution networks. The branch fuses the attention features into the original network in real time, avoiding isolation between the attention features and the original network. The attention branch not only optimizes feature extraction but also reduces training time;
  • In view of the existence of multiple sub-branches in the network module, we propose the concept and implementation of a micro-attention branch. Through this method, the spatiotemporal features extracted by the micro-attention branches can be applied to the corresponding convolution network with sub-branches.
The structure of this paper is organized as follows. Section 2 discusses the existing network framework for video action classification. Section 3 introduces the design of the attention module and micro-attention module suitable for the 3D convolution network. Section 4 shows the setup of the experiment, the experimental results and the analysis of the results, and Section 5 shows our conclusions and future research directions.

2. Related Works

Action recognition in video has been extensively studied over the past few decades. The field has progressed from manually designed local spatiotemporal features [17,18,19], to mid-level descriptors [20,21,22], and finally to end-to-end learned deep video representations. Many kinds of DNNs [3,5,8,23,24] have been proposed to extract spatiotemporal information from video.
2D ConvNet: A simple method in this direction is to select one video frame as the input of a 2D convolution network [25] for action recognition. A single frame shows the instantaneous state of the action but cannot provide complete information about it. The frame also contains a great deal of interference information irrelevant to the action, which causes confusion; even humans cannot accurately judge what action is occurring from a single picture. These single-frame methods ignore the encoding of the action in the temporal dimension, which is usually carried by multiple consecutive frames. Therefore, we use a 3D convolution kernel to replace the original 2D convolution kernel, which can capture and distinguish features in the spatiotemporal dimensions.
2D ConvNet with LSTM: In theory, a more satisfying approach is to add a recurrent layer to the model [12,26,27], such as an LSTM [28], which can encode state and capture temporal ordering and long-range dependencies. However, this design means the LSTM only uses the high-level features extracted by the last layer of the CNN and cannot capture the primary features extracted by the low-level convolution modules. In some cases, these features are crucial. This design also leads to excessive training costs.
3D ConvNet: 3D ConvNets [1,8,29] are naturally suited to video modeling. They use 3D convolution kernels to realize the function of spatiotemporal filters and can directly create hierarchical representations of spatiotemporal data. Compared to 2D ConvNets, 3D ConvNets model temporal information better owing to 3D convolution and 3D pooling operations: in 3D ConvNets, convolution and pooling are performed over the spatiotemporal dimensions, while in 2D ConvNets they are performed only over the spatial dimensions. One problem with these models is that they have many more parameters than 2D ConvNets with similar structures, which makes them more difficult to train. Moreover, as network depth increases, the number of parameters grows further, and the large demand for computing resources has become a bottleneck restricting the development of such networks. The attention branch that we propose for 3D convolution networks can improve network performance while adding only a small number of parameters.
Multi Stream: Video files contain rich clues such as pictures, sound and location information. Therefore, in recent years, an increasing number of network models have adopted multi-stream designs [3,12,26,30,31,32,33,34] to make full use of these modalities and improve performance. In addition to video frames, such networks also use optical flow, sound and other features, and they have been shown to achieve very high performance on existing benchmarks. However, these multi-stream structures lead to huge network scales and many parameters, which increases training difficulty and resource consumption. The above network architectures are summarized in Table 1.
Dataset: In order to meet the needs of machine-learning research in the video field, many research institutions have released relevant datasets [35,36,37,38,39,40]. Table 2 shows the statistics of these video datasets. They contain a large number of action classes, each with hundreds of video clips, which provides important support for the design and verification of new models. Most of the videos in these large-scale datasets are self-made videos uploaded by users to video websites, and they contain interference information (such as camera shake, varying shooting angles, complex lighting, etc.). This causes the classification accuracy of some actions to fall below the average accuracy. For this reason, we propose adding attention branches to the existing neural network to extract the key features in the video and to fuse attention features with the original features in real time.

3. Materials and Methods

3.1. Attention Branch

At this stage, when building large-scale video datasets, researchers download video material uploaded by users to video websites and then process it further. As a result, the video content in the dataset contains not only the action itself but also interference information such as complex backgrounds, jittery pictures and various shooting angles. This interference lengthens model training, makes training harder and consumes more computing resources; as the network becomes deeper, these negative effects become more and more obvious. In order to reduce the impact of interference information in video on the network, we propose adding an independent attention branch to 3D ConvNets for video classification tasks. Our attention branch is intended to be added alongside the 3D convolution kernels of the network, which are the basic structure of 3D ConvNets, so it can be flexibly added to most 3D convolution networks. A deep convolution network can be regarded as different convolution modules connected in sequence according to the design. Based on this understanding of the convolution network structure, the proposed attention branch can be added to the basic modules of the network in the form of a plug-in. This not only maintains the overall structure of the convolution network but also realizes the real-time fusion of attention features with the features extracted by the original network.
The attention branch we propose is composed of a 1∗1∗1 3D convolution kernel and a 3∗3∗3 maximum pooling layer. The method of adding attention branches to the original network draws on the branch-fusion method of the existing residual network [41]. We select a convolution module in the original network to which the attention branch is added. The 1∗1∗1 convolution kernel in the attention branch adjusts the channel dimension of the matrix to keep it the same as the output dimension of the original convolution module. The resulting matrix then passes through the maximum pooling layer to extract the attention features in the data. This construction ensures that the output matrix of the attention branch has the same dimensions as the output matrix of the original convolution branch. The output of the attention branch and the output of the original module are fused by element-wise addition. The newly constructed module can then replace the original module of the network, and the remaining convolution modules connected to it do not need any adjustment. This design facilitates the fusion of the features extracted by the two branches of the newly constructed module: the attention branch works as a feature selector that enhances good features and suppresses noise in the original features.
This design ensures that the attention branch can be applied at most positions in a 3D convolution network. After replacing the original convolution module, the attention features are fused during the network's feature extraction. At the same time, the input and output matrix dimensions of the new module with the attention branch are consistent with those of the original convolution module, so a specific convolution module can be modified without touching other parts of the network. Moreover, the attention branch is composed of a 1∗1∗1 convolution kernel and a parameter-free pooling layer that only performs comparison operations. This design prevents explosive growth of the parameters in the network with attention branches.
We take the i-th 3D convolution module in the network as an example and assume that it consists of a group of 20 channels, each with a 3∗3∗3 convolution kernel. The new network structure with the added attention branch is shown in Figure 1. As can be seen from the figure, the module is composed of two independent branches. The lower branch is the original module of the network, which extracts features in the spatiotemporal dimensions. The upper branch is our attention module, which uses a 1∗1∗1 convolution kernel and a maximum pooling layer to extract feature data of the action subject in the spatial dimension. The output matrices of the two branches are merged into a new matrix through element-wise addition, so the output of the new module has the same dimensions as the output of the original module. Since a 1∗1∗1 convolution kernel is used, its computation is only 1/27th that of a 3∗3∗3 convolution kernel, and the computational complexity of the network with attention branches remains essentially the same as that of the original network.
Suppose the input of the i-th module is the matrix X_i (that is, the output of the previous module). The dimension of the matrix is [N_i, T_i, C_i, H_i, W_i], where N_i represents the number of videos in each group, T_i the number of frames of each video, C_i the number of channels, H_i the height of each frame and W_i the width of each frame. The input passes through the two branches of the network module shown in Figure 1 to generate the matrices X_i^c (the result of the 3D convolution branch) and X_i^a (the result of the 3D attention branch), respectively. The two matrices are then added element-wise to obtain the output X_{i+1} of the module. The calculation of the new module is as follows.
X_i^c = Conv_{3∗3∗3}(X_i)        (1a)
X_i^a = Pool_{3∗3∗3}(Conv_{1∗1∗1}(X_i))        (1b)
X_{i+1} = Add(X_i^c, X_i^a)        (1c)
Formula (1a) represents the convolution operation of the original 3∗3∗3 3D convolution kernel in the network, denoted Conv_{3∗3∗3}. Formula (1b) represents the feature-extraction process of the attention branch, with the pooling layer denoted Pool_{3∗3∗3}. Formula (1c) shows that the results of the two branches are added element-wise to obtain the output of the module, denoted Add.
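As a concrete illustration of Formulas (1a)–(1c), the following minimal sketch builds the new module with tf.keras layers. It assumes a channels-last layout [N, T, H, W, C] (the paper lists dimensions as [N, T, C, H, W]), and the function names are illustrative rather than taken from the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def attention_branch(x, out_channels):
    # Eq. (1b): 1*1*1 convolution to match the channel dimension, then 3*3*3 max pooling.
    a = layers.Conv3D(out_channels, kernel_size=1, padding="same")(x)
    return layers.MaxPool3D(pool_size=3, strides=1, padding="same")(a)

def module_with_attention(x, out_channels):
    xc = layers.Conv3D(out_channels, kernel_size=3, padding="same")(x)  # Eq. (1a): original 3*3*3 branch
    xa = attention_branch(x, out_channels)                              # Eq. (1b): attention branch
    return layers.Add()([xc, xa])                                       # Eq. (1c): element-wise fusion

# Example: a clip of 80 frames of 224*224 RGB images, as in the experiments.
inputs = tf.keras.Input(shape=(80, 224, 224, 3))
outputs = module_with_attention(inputs, out_channels=20)
model = tf.keras.Model(inputs, outputs)
```

Because both branches preserve the spatiotemporal resolution and the channel count, the new module is a drop-in replacement for the original one.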
Due to the characteristics of the new module structure, it not only ensures the fusion of the attention feature with the original feature but also ensures that the output dimension of the new module is the same as that of the original module. Therefore, we can replace the original module with the module carrying the attention branch almost anywhere in the 3D convolutional neural network, or replace any number of the original convolution modules with new modules, without modifying the rest of the network.
By adding attention branches inside the convolution module, separation between attention features and the original network features is avoided. The purpose of the proposed attention branch is to optimize the features extracted by the original module. In our structure, the dimension of the feature matrix obtained by the attention branch changes with the output of the original convolution module. Moreover, as in residual networks in the image field, the attention feature is added to the output of the original module before the activation function is applied. This structure helps avoid vanishing and exploding gradients when the attention branch is applied to a deep neural network framework, thereby improving training efficiency and reducing training difficulty.

3.2. Micro-Attention Branch

3.2.1. Scarecrow Model

The structure of a deep neural network is not composed only of simple convolution-kernel modules stacked in order. In reality, the basic module of the network is often complex: the basic convolution module itself may have multiple convolution sub-branches. The reason for this design is to ensure that feature data can be extracted with different emphases. Let us take I3D, a network with strong recognition accuracy in the field of action classification, to illustrate this. It is composed of several basic modules, connected in sequence, each with four sub-branches: Branch1 (single-layer convolution structure), Branch2 (pooling layer + convolution structure), Branch3 (two-layer convolution structure) and Branch4 (two-layer convolution structure). The four branches independently extract features from the same input, and their outputs are finally concatenated into one matrix, which is used as the output of the module.
When we add attention branches according to Section 3.1, we can consider the multi-branch module as a whole and add 3D attention branches to the entire module. At this time, the output dimension of the new module remains the same as the original module. Under this condition, we can still directly use the new module to replace the original module without modifying other parts of the network.
We also take the i-th module in the network as an example, assuming that the original module is composed of n 3∗3∗3 3D convolution branches (the structure is shown in Figure 2A). We take this module as a whole and add independent attention branches to construct a new module (the structure of the new module is shown in Figure 2B). The calculation formula of the new module is as follows.
X_i^c = Concat(Conv_1(X_i), ..., Conv_n(X_i))        (2a)
X_i^a = Pool_{3∗3∗3}(Conv_{1∗1∗1}(X_i))        (2b)
X_{i+1} = Add(X_i^c, X_i^a)        (2c)
Formula (2a) represents the calculation process of the multi-branch module of the original network: the input matrix X_i passes through n independent convolution kernels, and the resulting matrices are concatenated along the third (channel) dimension with the Concat method to form X_i^c. The calculation in Formula (2b) is the same as in Formula (1b) of the previous section, i.e., the feature-extraction process of the attention branch. Formula (2c) indicates that the final output matrix X_{i+1} of the new module is obtained by element-wise addition of X_i^c and X_i^a.
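The scheme of Formulas (2a)–(2c) can be sketched as follows, under the simplifying assumption that every sub-branch is a single 3∗3∗3 convolution (the actual I3D sub-branches differ); the function name is illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def whole_module_attention(x, branch_channels):
    # Eq. (2a): n independent sub-branches (simplified to single 3*3*3 convolutions), concatenated along channels.
    sub_outputs = [layers.Conv3D(c, kernel_size=3, padding="same")(x) for c in branch_channels]
    xc = layers.Concatenate(axis=-1)(sub_outputs)

    # Eq. (2b): one attention branch sized to the full concatenated channel width.
    xa = layers.Conv3D(sum(branch_channels), kernel_size=1, padding="same")(x)
    xa = layers.MaxPool3D(pool_size=3, strides=1, padding="same")(xa)

    # Eq. (2c): element-wise fusion of the whole module's output with the attention output.
    return layers.Add()([xc, xa])
```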

3.2.2. Micro-Attention Branch

However, is this reasonable? When we add an attention branch to a convolution module with multiple sub-branches as a whole, we must keep the dimensions of the two outputs the same so that the attention features and the original features can be fused smoothly. This weakens the design purpose of a network module with multiple sub-branches, and it means that the feature data extracted by the attention branch cannot be applied to the corresponding sub-branch. Because different sub-branches in the module extract data with different emphases, attention branches should be added to the different sub-branches separately when adding attention features to such modules. Another important reason is that the structures of the sub-branches in a module often differ, so we cannot directly add an attention branch to such a module as a whole.
In order to avoid the above situation in which attention features cannot be effectively integrated, we optimized the scheme in Section 3.2.1 and propose the micro-attention branch scheme. A micro-attention branch can be considered the result of splitting the original attention branch according to the dimension of the corresponding sub-branch. In this way, the extracted attention features can be fused into the corresponding sub-branches. The structural design of the micro-attention branch remains the same as the attention branch in Section 3.1: it is still composed of a 1∗1∗1 convolution kernel and a 3∗3∗3 maximum pooling layer.
The micro-attention branch mainly acts on network modules that have multiple sub-branches whose outputs are combined through the Concat operation. Each sub-branch and its corresponding micro-attention branch take the same input matrix and perform feature extraction independently, and their outputs are added element-wise, achieving the fusion of attention features. Finally, the results of the multiple sub-branches are combined by Concat into the final output matrix, which is used as the input of the next network module.
We still take the i-th basic module of the network as an example, assuming that it contains n convolution sub-branches. We add micro-attention branches to construct a new network module (the structure is shown in Figure 3). The calculation process of the new module is shown in Formula (3).
X_{im}^c = Conv_m(X_i)
X_{im}^a = Pool_m(Conv_m(X_i))
X_{im}^o = Add(X_{im}^c, X_{im}^a)
X_{i+1} = Concat(X_{i1}^o, ..., X_{in}^o)        (3)
In Formula (3), X_{im}^c represents the output matrix of the m-th convolution sub-branch of the i-th module, X_{im}^a represents the result of the attention branch of the m-th sub-branch of the i-th module, and X_{im}^o represents the result of fusing attention features in the m-th sub-branch of the i-th module.
By comparing Formulas (2) and (3), we can see the obvious difference between the two in where the attention features are introduced. Formula (2) indicates that the original module and the attention branch operate independently without interfering with each other, and the features extracted by the two are fused at the end. Formula (3) means that the attention branch is divided into multiple micro-attention branches; each micro-attention branch and its corresponding sub-branch extract features independently, and their results are added element-wise to give the result of that sub-branch. Then, the results of the multiple sub-branches are connected through the Concat method as the output of the network module. The procedure for integrating the micro-attention branch is described in Algorithm 1.
The micro-attention branch inherits the characteristics of the attention branch: it keeps the dimension of the result of the convolution module unchanged and can add attention features to the features of the sub-branches within the module. Based on this property, a deep neural network only needs micro-attention branches added to the selected modules with multiple sub-branches, without adjusting the rest of the network.
Algorithm 1 Micro-Attention Branch Algorithm
Require: X_i: the result of the (i-1)-th module; Module: the i-th module with sub-branches; n: the number of sub-branches of the i-th module
Ensure: X_{i+1}
1: for each m ∈ [1, n] do
2:     X_{im}^c = Conv_m(X_i)
3:     X_{im}^a = Pool_m(Conv_m(X_i))
4:     X_{im}^o = Add(X_{im}^c, X_{im}^a)
5:     X_{i+1} = Concat(X_{i+1}, X_{im}^o)
6: end for
7: return X_{i+1}
In addition to solving the problem of how to correctly integrate attention features into multi-branch modules, micro-attention branches are also more flexible: we can freely choose how many micro-attention branches to use according to need. This flexibility reduces the number of parameters that need training and further reduces resource consumption during training.
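A corresponding sketch of Formula (3) and Algorithm 1 is given below, again with each sub-branch simplified to a single 3∗3∗3 convolution for brevity (the actual I3D sub-branches have the structures listed in Table 3); the function name is illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def micro_attention_module(x, branch_channels):
    fused = []
    for c in branch_channels:
        xmc = layers.Conv3D(c, kernel_size=3, padding="same")(x)             # m-th convolution sub-branch
        xma = layers.Conv3D(c, kernel_size=1, padding="same")(x)             # micro-attention: 1*1*1 convolution
        xma = layers.MaxPool3D(pool_size=3, strides=1, padding="same")(xma)  # micro-attention: 3*3*3 max pooling
        fused.append(layers.Add()([xmc, xma]))                               # per-branch element-wise fusion
    return layers.Concatenate(axis=-1)(fused)                                # module output X_{i+1}
```

Unlike the whole-module scheme, each attention output here matches the channel width of its own sub-branch, so fusion happens before concatenation.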

4. Experiments

This section experimentally verifies the effects of the proposed attention branch and micro-attention branch. The experiments address three questions: (a) Does the application of micro-attention branches produce improvements over the original network structure? (b) When the module has a multi-branch structure, is it reasonable to add the attention unit directly to the whole module? (c) For the same network structure, is there a correlation between the number of micro-attention branches and the classification performance?

4.1. Experimental Setup

Because the proposed attention branch acts on the 3D convolution module in the neural network, we chose the excellent I3D network [3] as the basic network for the experiment. Its network structure is shown in Figure 4. There are multiple basic modules with similar structures in this framework. The structure of these basic modules is shown in Figure 5A, and we named them Module A. The module consists of four independent sub-branches (from right to left in the figure, named Branch1 to Branch4). Existing papers have proved that the network structure has achieved excellent results on multiple video datasets.
According to Section 3.2, we modified the basic module of I3D (as shown in Figure 5). We added attention branches and different numbers of micro-attention branches to construct three further modules, named Module B (Figure 5B: two micro-attention units, added to two sub-branches of the original module), Module C (Figure 5C: one micro-attention unit, used on a single sub-branch of the original module) and Module D (Figure 5D: the original module treated as a whole, with one attention branch added). Taking the first basic module of the I3D network as an example, we list the specific structure of these four modules in Table 3. Branch1 to Branch4 are the four sub-branches of basic Module A; ModuleAtt denotes the attention branch added directly to Module A as a whole; Branch1Att and Branch2Att denote the micro-attention branches added to Branch1 and Branch2, respectively.
Keeping the basic network structure unchanged, these modules are used to replace the basic modules in the original I3D network to construct networks for training and testing. Network 1 is the original network, called I3D; Network 2 uses Module B as the basic module and is called RmA3D_1; Network 3 uses Module C as the basic module and is called RmA3D_2.
The training and test data came from Kinetics [38], an open-source video dataset. We selected seven action categories from it (exercising with an exercise ball, parasailing, washing dishes, pull ups, cleaning shoes, folding paper, pumping fist). The statistics of the selected video clips are shown in Table 4.
Data augmentation is known to be of crucial importance for the performance of deep architectures. We process the selected videos with the OpenCV library and extract 80 frames at 224∗224 resolution from each video as the input of the network. For shorter videos, we loop the video as many times as necessary to satisfy each model's input interface.
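A minimal frame-extraction sketch with OpenCV is shown below; the function name and error handling are illustrative, and the looping strategy simply repeats frames from the start of the clip.

```python
import cv2
import numpy as np

def extract_frames(video_path, num_frames=80, size=(224, 224)):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))  # resize every decoded frame to 224*224
    cap.release()
    if not frames:
        raise ValueError(f"no frames decoded from {video_path}")
    # Loop short videos until the model input interface (80 frames) is satisfied.
    while len(frames) < num_frames:
        frames.extend(frames[:num_frames - len(frames)])
    return np.stack(frames[:num_frames])  # shape: [80, 224, 224, 3]
```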
The above networks were trained independently from scratch in the same software and hardware environment. The machine-learning library is TensorFlow (v1.5) and the programming language is Python (v3.6). Training on videos used standard stochastic gradient descent with momentum set to 0.9 in all cases, and the models were trained with similar learning rates.
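The optimizer setting can be expressed as follows; this sketch uses the current tf.keras API (the original experiments used TensorFlow 1.5, whose API differs), and the small placeholder model and learning rate are illustrative only.

```python
import tensorflow as tf

# Standard stochastic gradient descent with momentum 0.9, as used for all networks.
model = tf.keras.Sequential([
    tf.keras.layers.Conv3D(8, 3, padding="same", input_shape=(80, 224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(7, activation="softmax"),  # 7 selected Kinetics classes
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```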

4.2. Results and Discussion

Table 5 reports the results and the quantities monitored during training. Comparing the first two rows of Table 5 yields an interesting result: the RmA3D_1 network, which uses two micro-attention branches, obtains the best performance. Compared with I3D, the classification accuracy of RmA3D_1 is improved by 3.6%, while the number of parameters is increased by only 0.6%. We believe this is due to the reasonable integration of the attention features extracted by the attention branches with the features extracted by the original module. Comparing rows 1 and 3 of Table 5, the accuracy of the RmA3D_2 network is also increased, by 2.8%, compared to I3D. Thus, micro-attention branches can be added to the network to improve the accuracy of the model on action-recognition tasks.
Next, we compare rows 2 and 3 of Table 5. The recognition accuracy of the RmA3D_1 network with two micro-attention branches is higher than that of the RmA3D_2 network with one micro-attention branch. It can be seen that, when there are multiple sub-branches in the basic module of the network, the more micro-attention branches are added to the module, the higher the accuracy of the model on action recognition. That is, the number of micro-attention branches is directly related to the performance of the final model.
Figure 6 shows the accuracy of each model for the different action categories. RmA3D_1 produces consistently better results than the other networks on all classes. We believe this is the result of the more effective integration of attention features into the feature-extraction process. In addition, for the classes 'washing dishes' and 'cleaning shoes', the results of the RmA3D_1 network are only slightly better than those of the original network, because in these videos the action subject is obvious and there is almost no complex background. For 'pumping fist', where the background is complex, the recognition accuracy of RmA3D_1 is significantly higher than that of the original network.
Finally, we compare the time consumed by one iteration. The training of the I3D network takes the longest time. We attribute this to the residual-style structure of the attention branch, which speeds up training: through this structure, adding the attention branch to the original module improves the efficiency with which the parameters of the feed-forward network are adjusted.
Comparison with scarecrow model. We still choose I3D as the basic network and add attention branches by taking the basic modules as a whole. The compared network that uses module D as the basic module is called RA3D.
It can be found from Table 6 that, although the number of parameters of RA3D is 25.0% larger than that of RmA3D_1, its classification accuracy decreases. The RA3D network performs worse not only than the RmA3D_1 network but also than the original I3D network. This shows that, when there are multiple branches in the network module, directly using a whole-module attention branch increases resource consumption in the training phase and negatively affects the accuracy of the model. The micro-attention branch is therefore the appropriate choice for multi-branch networks: it improves the recognition accuracy of the original network without increasing resource consumption as much.
By comparing Table 5 and Table 6, we find that, among the four networks, the one that uses micro-attention branches has the highest recognition accuracy. This is because, in a network module with multiple sub-branches, different branches extract features with different emphases, so the sub-branches cannot be taken as a whole; a 3D attention branch cannot be used directly for this type of module. In the test phase, the testing time of the networks with attention branches is longer than that of the original network because the new networks add a small number of operations such as convolutions; the increase is within an acceptable range.
During the experiment, we recorded the changes in memory occupation and time consumption of the four network structures during training (see Figure 7). Combined with Table 5 and Table 6, we find the following: although the average memory occupation of the RA3D network is the highest (8.7 GB) and its time consumption the longest (72,917 s), its classification performance is the worst. During the training of I3D, the average memory occupation is the smallest (7.5 GB), but the training time (57,924 s) is slightly higher than that of the networks with micro-attention branches. This shows that adding micro-attention branches through the residual-style structure can improve the training efficiency of the model.
Through the above experiments, it can be found that the attention branch has a positive effect on the model. However, when there are multiple convolution sub-branches in the stacking module of the network, the attention branch cannot be used directly. At this time, micro-attention branches can be added independently to multiple sub-branches of the original network, which can effectively improve the classification effect of the final model and reduce the training time. Under the same conditions, the number of micro-attention units has a positive effect on the training effect.
Comparison with other networks. We compare our results with another attention-style algorithm (2D + LSTM). ResNet [41] is selected as the 2D network in this structure: the last layers of multiple ResNets are connected to an LSTM, and video frames are fed to the different ResNets in temporal order. The number of ResNets in the structure is tied to the number of video frames, which means that as the number of frames increases, the number of ResNets also increases. To control the number of parameters, we limit the input to 10 frames instead of 80, selected evenly from the video clips.
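The compared structure can be sketched roughly as follows; for brevity this sketch shares a single ResNet-50 backbone across the 10 frames via TimeDistributed, whereas the compared setup instantiates one ResNet per frame, and the hidden size is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

frames = tf.keras.Input(shape=(10, 224, 224, 3))                   # 10 evenly sampled RGB frames
backbone = tf.keras.applications.ResNet50(include_top=False, pooling="avg", weights=None)
features = layers.TimeDistributed(backbone)(frames)                # per-frame high-level features
hidden = layers.LSTM(512)(features)                                 # temporal modelling over frame features only
logits = layers.Dense(7, activation="softmax")(hidden)              # 7 selected action classes
lstm_baseline = tf.keras.Model(frames, logits)
```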
It can be found from Table 7 that, although we deliberately reduced the number of input images, the number of parameters of the 2D + LSTM network is 46.81% larger than that of RmA3D_1, and its memory consumption is much higher. However, the accuracy of 2D + LSTM is only about two-thirds that of RmA3D_1. This shows that, although 2D + LSTM can in theory obtain features in the spatiotemporal dimensions, it only uses high-level features and cannot capture the primary features of the video frames. Our proposed attention branch can be added anywhere in the network, so it can capture features at different levels.
Applied to a two-stream 3D convolution network. Although a 3D ConvNet should be able to learn motion features directly from RGB input, existing networks prefer to make full use of multi-source data in video, such as optical flow. Therefore, we apply the 3D micro-attention branch to a two-stream I3D network to test the effect of the structure. We choose the state-of-the-art two-stream configuration, with one network trained on RGB inputs and another trained on optical-flow inputs, which carry optimized, smooth flow information. We used the RmA3D_1 network instead of I3D to construct the comparison; the two streams have exactly the same structure. Video frames are processed in the same way as for the other networks, and optical flow is computed with a TV-L1 algorithm [42]. We trained the two streams separately and averaged their predictions at test time.
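A sketch of the flow computation and the test-time fusion is given below; it assumes an OpenCV build with the optflow contrib module (the factory function name varies between OpenCV versions), and the model variables are placeholders for the two trained streams.

```python
import cv2
import numpy as np

tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()  # TV-L1 optical flow (requires opencv-contrib)

def tvl1_flow(frames):
    # Dense flow between consecutive grayscale frames; output shape [T-1, H, W, 2].
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    return np.stack([tvl1.calc(gray[i], gray[i + 1], None) for i in range(len(gray) - 1)])

def two_stream_predict(rgb_model, flow_model, rgb_clip, flow_clip):
    # Average the class probabilities of the independently trained RGB and flow streams.
    return (rgb_model.predict(rgb_clip) + flow_model.predict(flow_clip)) / 2.0
```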
It can be found from Table 8 that the accuracy of the two-stream network with micro-attention branches is improved compared with the original two-stream network. 3D attention branches can thus be used to extract attention features not only from video frames but also from optical flow. We believe that the applicability of micro-attention branches depends only on the network structure.

5. Conclusions

In order to reduce the interference of complex backgrounds in videos and avoid blindly increasing the depth of the network, we designed a 3D attention branch and 3D micro-attention branch. The structure can be added to the existing 3D ConvNet in the form of a plug-in. The attention branch tends to be added to the 3D convolution kernel of the network. This structure can improve the sensitivity of the network to action subjects while keeping the overall structure of the original network unchanged. At the same time, we use the micro-attention branch to achieve the extraction and fusion of attention features when the original network module has multiple sub-branches. By comparing the experimental results, we find that the network with attention branches can effectively improve the accuracy of the network at the cost of a small increase in parameters. In other words, we can apply micro-attention branches to existing 3D deep neural networks to improve the performance of these networks. The benefits of our attention branch are three-fold: It can extract and use attention features reasonably; it improves the training efficiency of the network; it can be flexibly added to the existing 3D ConvNet.
Video is composed of multi-modal data, such as frames, sound, time and so on. Next, we plan to study the impact of sound on action-classification tasks and the application of attention branches to these kinds of data.

Author Contributions

Conceptualization, Y.L. and W.Y.; methodology, Y.L.; software, Y.L. and H.W.; validation, Y.L. and T.T.; formal analysis, Y.L.; data curation, Y.L. and W.Y.; writing—original draft preparation, Y.L.; writing—review and editing, W.Y.; visualization, Y.L. and H.W.; supervision, W.Y.; project administration, Y.L.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Natural Science Foundation of China (Grant No. 62072051).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank anonymous reviewers for their criticism and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231.
2. Baccouche, M.; Mamalet, F.; Wolf, C.; Garcia, C.; Baskurt, A. Sequential deep learning for human action recognition. In International Workshop on Human Behavior Understanding; Springer: Berlin/Heidelberg, Germany, 2011; pp. 29–39.
3. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733.
4. Diba, A.; Fayyaz, M.; Sharma, V.; Karami, A.H.; Mahdi Arzani, M.; Yousefzadeh, R.; Van Gool, L. Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification. arXiv 2017, arXiv:1711.08200.
5. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6201–6210.
6. Ng, J.Y.H.; Davis, L.S. Temporal Difference Networks for Video Action Recognition. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1587–1596.
7. Sun, L.; Jia, K.; Yeung, D.Y.; Shi, B.E. Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4597–4605.
8. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
9. Tran, D.; Wang, H.; Feiszli, M.; Torresani, L. Video Classification With Channel-Separated Convolutional Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5551–5560.
10. Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 318–335.
11. Yao, L.; Torabi, A.; Cho, K.; Ballas, N.; Pal, C.; Larochelle, H.; Courville, A. Describing Videos by Exploiting Temporal Structure. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4507–4515.
12. Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4694–4702.
13. Zhao, Y.; Xiong, Y.; Lin, D. Trajectory Convolution for Action Recognition. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Montreal, QC, Canada, 2018; Volume 31, pp. 2208–2219.
14. Zolfaghari, M.; Singh, K.; Brox, T. ECO: Efficient Convolutional Network for Online Video Understanding. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 713–730.
15. Yang, C.; Xu, Y.; Shi, J.; Dai, B.; Zhou, B. Temporal Pyramid Network for Action Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 588–597.
16. Feichtenhofer, C. X3D: Expanding Architectures for Efficient Video Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 200–210.
17. Willems, G.; Tuytelaars, T.; Van Gool, L. An efficient dense and scale-invariant spatio-temporal interest point detector. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2008; pp. 650–663.
18. Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558.
19. Laptev, I.; Lindeberg, T. Velocity adaptation of space-time interest points. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Washington, DC, USA, 23–26 August 2004; Volume 1, pp. 52–56.
20. Raptis, M.; Kokkinos, I.; Soatto, S. Discovering discriminative action parts from mid-level video representations. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1242–1249.
21. Jain, A.; Gupta, A.; Rodriguez, M.; Davis, L.S. Representing videos using mid-level discriminative patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 23–28 June 2013; pp. 2571–2578.
22. Wang, L.; Qiao, Y.; Tang, X. Motionlets: Mid-level 3d parts for human motion recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2013; pp. 2674–2681.
23. Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541.
24. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803.
25. Ning, F.; Delhomme, D.; LeCun, Y.; Piano, F.; Bottou, L.; Barbano, P.E. Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process. 2005, 14, 1360–1371.
26. Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 677–691.
27. Wu, Z.; Wang, X.; Jiang, Y.G.; Ye, H.; Xue, X. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 461–470.
28. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
29. Taylor, G.W.; Fergus, R.; LeCun, Y.; Bregler, C. Convolutional learning of spatio-temporal features. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2010; pp. 140–153.
30. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA, 2014; Volume 1, pp. 568–576.
31. Wu, Z.; Jiang, Y.G.; Wang, X.; Ye, H.; Xue, X. Multi-stream multi-class fusion of deep networks for video classification. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 791–800.
32. Chéron, G.; Laptev, I.; Schmid, C. P-cnn: Pose-based cnn features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3218–3226.
33. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional Two-Stream Network Fusion for Video Action Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941.
34. Weinzaepfel, P.; Harchaoui, Z.; Schmid, C. Learning to track for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3164–3172.
35. Goyal, R.; Kahou, S.E.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The "Something Something" Video Database for Learning and Evaluating Visual Common Sense. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5843–5851.
36. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563.
37. Soomro, K.; Roshan Zamir, A.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv 2012, arXiv:1212.0402.
38. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics Human Action Video Dataset. arXiv 2017, arXiv:1705.06950.
39. Monfort, M.; Andonian, A.; Zhou, B.; Ramakrishnan, K.; Bargal, S.A.; Yan, T.; Brown, L.; Fan, Q.; Gutfreund, D.; Vondrick, C.; et al. Moments in time dataset: One million videos for event understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 502–508.
40. Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970.
41. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual Attention Network for Image Classification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6450–6458.
42. Zach, C.; Pock, T.; Bischof, H. A Duality Based Approach for Realtime TV-L1 Optical Flow. In Pattern Recognition; Hamprecht, F.A., Schnörr, C., Jähne, B., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 214–223.
Figure 1. A schematic diagram of a network convolution module with an attention branch. It consists of two independent branches: the lower branch represents the original convolution network structure, and the upper branch is our proposed 3D attention branch. X_i represents the input of the i-th network module, X_i^c the output of the original convolution module, X_i^a the output of the 3D attention branch, and X_{i+1} the output of the convolution module with the added attention branch.
Figure 2. The structure of adding attention branches to network modules with n sub-branches. (A) The original module structure containing n convolution sub-branches; these sub-branches convolve the same input X_i, and their results are passed to the Concat method to obtain the output X_{i+1} of the module. (B) The original module treated as a whole with an attention branch added; after the Concat operation, the result X_i^c is added element-wise with the attention feature matrix of the same dimension to obtain the final output X_{i+1}.
Figure 3. The structure of adding micro-attention branches to each sub-branch.
Figure 4. The original network structure diagram of I3D. The experiments are based on the structure of the network, and construct different deep neural networks by replacing the Base Module with different attention modules.
Figure 5. The structure of the four basic network modules used for replacement. (A) The basic module of I3D, which contains four sub-branches. (B) A new module constructed by adding micro-attention branches to two sub-branches of the basic module. (C) A new module constructed by adding a micro-attention branch to one sub-branch of the basic module. (D) A new module constructed by adding an attention branch to the basic module as a whole.
Figure 6. The prediction accuracy of different networks under different action types.
Figure 7. This figure shows the training time and memory usage of a single batch of data. The x-axis represents time in 1000 s, and the y-axis represents memory usage in GB.
Table 1. Summary of existing action-recognition network structures.
Network Structure | 2D ConvNet | 2D ConvNet + LSTM | 3D ConvNet | Multi-Stream
Input | One Frame | Frames | Frames | Frames + Other (optical flow, sound)
Table 2. Statistics for recent human action-recognition datasets. 'Year' indicates when the dataset was published; 'Actions' the number of action categories in the dataset; 'Clips' the number of video clips per category; 'Total' the number of all video clips; and 'Videos' the number of all videos after de-duplication, from which the clips are extracted.
Dataset | Year | Actions | Clips | Total | Videos
HMDB-51 | 2011 | 51 | min 102 | 6766 | 3312
UCF-101 | 2012 | 101 | min 101 | 13,320 | 2500
ActivityNet-200 | 2015 | 200 | avg 141 | 28,108 | 19,994
Kinetics | 2017 | 400 | min 400 | 306,245 | 306,245
Moments | 2019 | 339 | avg 1757 | 1,000,000 | 1,000,000
Table 3. The structure of the first basic module.
Branch Name | Branch Structure | Module A | Module B | Module C | Module D
Branch1 | Conv 1∗1∗1, 96; Conv 3∗3∗3, 128 | TRUE | TRUE | TRUE | TRUE
Branch1Att | Conv 1∗1∗1, 128; MaxPool 1∗3∗3∗3∗1 | TRUE | TRUE | TRUE | FALSE
Branch2 | Conv 1∗1∗1, 16; Conv 3∗3∗3, 32 | TRUE | TRUE | TRUE | TRUE
Branch2Att | Conv 1∗1∗1, 32; MaxPool 1∗3∗3∗3∗1 | TRUE | FALSE | TRUE | FALSE
Branch3 | MaxPool 1∗3∗3∗3∗1; Conv 1∗1∗1, 32 | TRUE | TRUE | TRUE | TRUE
Branch4 | Conv 1∗1∗1, 64 | TRUE | TRUE | TRUE | TRUE
ModuleAtt | Conv 1∗1∗1, 256; MaxPool 1∗3∗3∗3∗1 | FALSE | FALSE | FALSE | TRUE
Table 4. Statistics of video clips in the dataset.
Class Name | Exercising | Parasailing | Washing Dishes | Pull Ups | Cleaning Shoes | Folding Paper | Pumping Fist
Train | 543,866 | 543,835 | 543,997 | 543,898 | 543,901 | 544,824 | 543,992
Validate | 34,129 | 34,132 | 34,130 | 34,129 | 34,130 | 34,179 | 34,131
Table 5. Results of the comparative experiment between the networks with micro-attention branches and the original network.
Network | Param | Avg-Mem | Train-Time | TOP-1 | TOP-2 | Testing Time
I3D | 12.30 M | 7.5 GB | 57,924 s | 40.1% | 58.8% | 2.52 s
RmA3D_1 | 12.41 M | 8.07 GB | 57,838 s | 43.7% | 63.1% | 2.61 s
RmA3D_2 | 12.38 M | 8.11 GB | 56,985 s | 42.9% | 60.5% | 2.60 s
Table 6. Results of the comparative experiment between the network with micro-attention branches and the scarecrow model.
Network | Param | Avg-Mem | Train-Time | TOP-1 | TOP-2 | Testing Time
RmA3D_1 | 12.41 M | 8.07 GB | 57,838 s | 43.7% | 63.1% | 2.61 s
RA3D | 15.51 M | 8.7 GB | 72,917 s | 38.2% | 49.3% | 2.78 s
Table 7. Results of the comparative experiment between the network with micro-attention branches and another attention network.
Network | Input | Param | Avg-Mem | TOP-1
RmA3D_1 | RGB frames | 12.41 M | 8.07 GB | 43.7%
2D + LSTM | RGB frames | 18.22 M | 16.05 GB | 27.9%
Table 8. Results of the comparative experiment between the two-stream network with micro-attention branches and the original two-stream network.
Network | Input | Param | Avg-Mem | TOP-1
Two-Stream RmA3D_1 | RGB frames + optical flow | 24.82 M | 16.0 GB | 49.1%
Two-Stream I3D | RGB frames + optical flow | 24.80 M | 15.1 GB | 45.4%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
