1. Introduction
Recognizing human actions is a crucial task in video understanding, with practical applications such as abnormal behavior analysis, video retrieval, and human–robot interaction. Over the past decade, video action recognition has attracted a surge of interest, initially driven by hand-crafted features [1,2,3,4] and increasingly by deep learning models [5,6,7,8,9,10], fueled by the availability of large-scale action recognition datasets such as HMDB51 [11], UCF101 [12], and Kinetics [13].
Convolutional neural networks (ConvNets), which have shown remarkable performance in static image classification, have also been applied to video tasks. However, video differs from still images in that it contains continuous and variable spatiotemporal information, which makes action recognition more challenging. Existing deep learning approaches can be broadly categorized into three types: two-stream networks [5,6,14], 3D ConvNets, and computationally efficient methods. Two-stream networks use a separate ConvNet to learn temporal information from optical flow, but computing optical flow is expensive. 3D ConvNets, such as C3D [7], I3D [8], and R3D [15], model temporal information directly, but significantly increase the number of parameters, making them difficult to optimize. Computationally efficient methods reduce the parameter count by factorizing the 3D convolution. For example, P3D [16] factorizes a 3D kernel (e.g., 3 × 3 × 3) into two separate operations: a 2D spatial convolution (e.g., 1 × 3 × 3) and a 1D temporal convolution (e.g., 3 × 1 × 1). This factorization simplifies 3D ConvNets and reduces the number of parameters. Other examples include R(2+1)D [17], TSM [18], and S3D [19].
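As a minimal sketch of this factorization (assuming a PyTorch-style API; the channel counts and input size below are arbitrary illustrative values, not a configuration from any of the cited papers), a single 3 × 3 × 3 convolution can be replaced by a spatial and a temporal convolution in sequence:

```python
import torch
import torch.nn as nn

# A plain 3x3x3 convolution and its factorized counterpart.
# Channel counts (64 -> 64) are arbitrary illustrative values.
full_3d = nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))

factorized = nn.Sequential(
    # 2D spatial convolution applied framewise: kernel 1x3x3.
    nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
    # 1D temporal convolution across frames: kernel 3x1x1.
    nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
)

x = torch.randn(1, 64, 8, 56, 56)  # (batch, channels, T, H, W)
assert full_3d(x).shape == factorized(x).shape

# The factorized pair needs fewer than half the weights:
# 64*64*27 vs. 64*64*(9 + 3), plus biases.
print(sum(p.numel() for p in full_3d.parameters()))     # 110656
print(sum(p.numel() for p in factorized.parameters()))  # 49280
```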
Motivated by methods that improve computational efficiency, we propose a network built from pseudo-3D operations, called the multi-scale receptive fields ConvNet (MSRFNet). Its key component is the multi-scale receptive fields block (MSRFBlock), which processes input frames with receptive fields of several sizes in order to handle moving objects whose size varies across video samples. The introduction of this block allows MSRFNet to improve upon the performance of existing methods.
The contributions of this work can be summarized as follows:
- The multi-scale receptive fields block (MSRFBlock) is introduced to handle moving objects of various sizes, which extracts better action representations.
- The different branches in the MSRFBlock share common parameters to learn action features collaboratively; as a result, the block does not increase the number of parameters of MSRFNet.
- MSRFNet provides comparable or even better results than other leading methods on three benchmarks: UCF101 [12], HMDB51 [11], and Kinetics-400 [13].
The rest of the paper is organized as follows. Section 2 discusses related work on action recognition. Section 3 provides a detailed explanation of MSRFNet. Section 4 presents the experimental results with our analysis. Section 5 concludes the paper.
2. Related Work
Video action recognition is one of the representative tasks in video understanding and has drawn significant attention from researchers over the last decade. Existing approaches fall into two categories: methods based on hand-crafted features and methods based on deep learning models.
Although deep learning methods have become the standard for action recognition, hand-crafted features dominated the video understanding literature before 2015 owing to their high accuracy and robustness. Hand-crafted features can be divided into global features and local features. Global features provide a holistic description of the moving target. For example, Bobick et al. [20] first used contour information as a global feature to describe human action and proposed two temporal template features, the motion energy image (MEI) and the motion history image (MHI); the former indicates where the action occurs, while the latter reflects the temporal order of the action. Unlike global features, local features describe only part of the regional information of the moving target and are more widely used. Laptev et al. [21] extended the two-dimensional Harris corner detector to three-dimensional space to detect the most drastically changing space-time interest points (STIPs) in the spatio-temporal dimension, which promoted the development of local features. Later, various feature descriptors were proposed to describe action features in the vicinity of STIPs, such as the histogram of oriented gradients (HoG) [1], the histogram of oriented optical flow (HoF) [2], and motion boundary histograms (MBH) [22]. The strongest hand-crafted features are based on dense trajectories [3,4], particularly improved dense trajectories (IDT) [4], which obtain action trajectories by computing dense optical flow and then extracting local features along each trajectory. However, hand-crafted features carry a heavy computational cost and are hard to scale and deploy.
In recent years, deep learning has made significant breakthroughs in image classification [23], semantic segmentation [24], object detection [25], and other fields, and action recognition methods based on deep learning have gradually become the focus of research. Unlike hand-crafted features, deep learning models automatically learn feature representations from video data, and the learned features are more general. The most common deep learning model in vision is the ConvNet. The seminal work was proposed by Karpathy et al. [26], who applied a single 2D ConvNet to each video frame independently and investigated late fusion, early fusion, and slow fusion to find the best temporal connectivity pattern. However, its performance on UCF101 is 20% lower than that of IDT. To learn better spatio-temporal features, Simonyan et al. [5] proposed two-stream networks for action recognition. This method comprises two independent 2D ConvNets: the spatial stream takes RGB frames as input to capture appearance information, while the temporal stream takes a stack of optical flow images as input to capture motion information between video frames. The prediction scores of the two streams are averaged to obtain the final prediction. By feeding the extra optical flow information to ConvNets, ConvNet-based approaches achieved performance similar to IDT, the previous best hand-crafted feature. However, optical flow cannot capture long-range temporal information by its nature. To tackle this problem, Wang et al. [14] proposed the temporal segment network (TSN). TSN divides a whole video into several segments, and all segments share the same two-stream networks. A segmental consensus, such as average pooling or max pooling, then aggregates information from the sampled segments. Thanks to this sparse sampling strategy, TSN sees content from the entire video and can thus model long-range temporal information (a minimal sketch of this scheme follows this paragraph). Some works also combine ConvNets with recurrent neural networks (RNNs) [27] for action recognition, such as LRCN [28], scLSTM [29], and CapsGaNet [30]. Although RNNs have a strong ability to process data with temporal relationships, their training process is unstable, which makes them hard to optimize. Besides, it remains challenging for 2D ConvNets to learn temporal information directly from raw video frames.
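To make the segment-and-consensus idea concrete, the following sketch (our PyTorch illustration, not the reference TSN code; `backbone` stands for any 2D ConvNet that maps a frame batch to class scores, and the segment count is an arbitrary choice) samples one frame per segment and averages the per-segment predictions:

```python
import torch
import torch.nn as nn


class SegmentConsensus(nn.Module):
    """TSN-style wrapper: sparse segment sampling + average consensus."""

    def __init__(self, backbone: nn.Module, num_segments: int = 3):
        super().__init__()
        self.backbone = backbone
        self.num_segments = num_segments

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, t, c, h, w = video.shape
        assert t >= self.num_segments
        # Split the frame axis into equal segments and take the middle
        # frame of each (random in-segment sampling is used in training).
        seg_len = t // self.num_segments
        idx = torch.arange(self.num_segments) * seg_len + seg_len // 2
        frames = video[:, idx]                       # (b, segments, c, h, w)
        frames = frames.reshape(b * self.num_segments, c, h, w)
        scores = self.backbone(frames)               # (b * segments, classes)
        scores = scores.reshape(b, self.num_segments, -1)
        # Segmental consensus: average the per-segment predictions.
        return scores.mean(dim=1)
```

Any image classifier, e.g., a torchvision ResNet, could serve as the backbone here.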
Optical flow is an effective motion representation for describing object movement and provides information orthogonal to RGB images, which allows 2D ConvNets to learn temporal information. However, calculating optical flow is computationally intensive and requires significant storage. A natural way to understand a video is to handle the video data with 3D convolution kernels, which leads to 3D ConvNets. The seminal work applying 3D ConvNets to action recognition is [31]; however, its network is not deep enough to show the potential of 3D ConvNets. Tran et al. [7] therefore proposed a deeper 3D ConvNet, named C3D. Although the performance of C3D on standard benchmarks was not as good as expected, it shows strong generalization capability and serves as a generic feature extractor for other video tasks. However, 3D kernels increase the number of parameters, making 3D ConvNets hard to optimize and dependent on large-scale action recognition datasets. This situation was alleviated when Carreira et al. [8] proposed inflated 3D ConvNets (I3D). The main contribution of I3D is inflating ImageNet pre-trained 2D model weights to their counterparts in the 3D model, so that 3D ConvNets need not be trained from scratch (this inflation step is sketched after this paragraph). Moreover, I3D achieves high recognition accuracy on UCF101 and HMDB51 after pre-training on a new large-scale dataset, Kinetics-400. Wu et al. [32] proposed a method for human activity recognition in depth videos that exploits both spatial and temporal characteristics: they construct a hierarchical difference image and feed it to a pre-trained CNN to classify human activities. Although this method achieved better accuracy than skeleton-based models, the limited availability of datasets may have contributed to this result. There are also works aiming to reduce the training complexity of 3D ConvNets. For example, P3D [16] and R(2+1)D [17] factorize a 3D convolution into two separate operations, a 2D spatial convolution and a 1D temporal convolution. This split allows spatial and temporal features to be optimized separately, reducing the optimization difficulty of the model; P3D and R(2+1)D differ in how they arrange the two factorized operations and how they formulate each residual block. Zhou et al. [33] propose another way of simplifying 3D ConvNets: they integrate 2D and 3D ConvNets in a single network, termed MiCTNet, built on the hypothesis that spatiotemporal information is unevenly distributed in video, aiming to generate deeper and more informative feature maps. S3D [19] combines the merits of the approaches above, showing that it is feasible to replace 3D convolution with low-cost 2D convolution.
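The inflation trick can be sketched in a few lines (our PyTorch illustration; the function name and the 7 × 7 example are ours): the 2D kernel is tiled along a new temporal axis and rescaled so that, on a video of identical frames, the 3D filter reproduces the 2D response.

```python
import torch


def inflate_conv_weight(w2d: torch.Tensor, time_dim: int = 3) -> torch.Tensor:
    """Inflate a 2D kernel (out, in, kH, kW) to 3D (out, in, kT, kH, kW).

    Tiling the weights kT times and scaling by 1/kT preserves the filter
    response on temporally constant input, which is the bootstrapping idea
    described for I3D; everything else here is an illustrative choice.
    """
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return w3d / time_dim


# Example: initialize a 3x7x7 3D convolution from a pre-trained 7x7 2D one
# (the bias, if any, would be copied unchanged).
conv2d = torch.nn.Conv2d(3, 64, kernel_size=7)
conv3d = torch.nn.Conv3d(3, 64, kernel_size=(3, 7, 7))
with torch.no_grad():
    conv3d.weight.copy_(inflate_conv_weight(conv2d.weight, time_dim=3))
```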
Table 1 shows the number of parameters of ResNet-50 in its 2D, 3D, and P3D forms. To avoid a large number of parameters, our network adopts the P3D structure.
In this paper, we address the fact that the size of moving objects is not consistent across video samples: some objects occupy only a small part of a frame, while others fill it entirely. We propose a multi-scale receptive fields method to process both large and small moving objects. The P3D structure is adopted, and dilated convolutions are deployed in selected layers to vary the receptive field.
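As a rough illustration of this mechanism (a hypothetical PyTorch sketch, not the actual MSRFBlock, which is specified in Section 3; the dilation rates and the summation fusion are our assumptions), one shared 3 × 3 kernel applied at several dilation rates yields branches with growing receptive fields at the parameter cost of a single convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedDilatedBranches(nn.Module):
    """Apply one shared 3x3 kernel at several dilation rates.

    Illustrates how dilation enlarges the receptive field with no extra
    parameters; the rates and the summation fusion are illustrative
    choices, not the exact MSRFBlock design.
    """

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        self.dilations = dilations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every branch reuses the same weights; padding = dilation keeps
        # the spatial size fixed so the branch outputs can be fused.
        outs = [
            F.conv2d(x, self.weight, padding=d, dilation=d)
            for d in self.dilations
        ]
        return torch.stack(outs).sum(dim=0)
```

Because every branch reuses the same weight tensor, adding branches changes the receptive field mix without changing the parameter count, consistent with the parameter-sharing property stated in the contributions.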