Article

Improved Video Action Recognition Based on Pyramid Pooling and Dual-Stream C3D Networks

College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4454; https://doi.org/10.3390/app15084454
Submission received: 5 March 2025 / Revised: 26 March 2025 / Accepted: 31 March 2025 / Published: 17 April 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This paper proposes a video behaviour classification method based on pyramid pooling and a variable-scale training strategy, which aims to improve the video behaviour-recognition performance of a 3D convolutional neural network (3D-CNN) and a dual-stream C3D network. By introducing pyramid pooling and secondary pooling operations, the pooling layers are optimised, the number of model parameters is significantly reduced, and the recognition accuracy is effectively improved. In the improved dual-stream C3D network, an early fusion strategy is adopted to better combine spatio-temporal features and improve model accuracy. In addition, the introduction of optical flow features enhances the model's perception of dynamic information in video and further improves recognition performance. Experimental results show that the proposed method performs well on multiple video datasets and outperforms existing mainstream methods, demonstrating its innovation and efficiency in the field of video behaviour recognition.

1. Introduction

Action recognition techniques aim to identify and classify different behaviours by analysing features such as human pose, movement trajectory, and speed in video sequences. In recent years, deep learning, especially convolutional neural networks (CNNs), has become a mainstream approach for action recognition due to its powerful image- and video-processing capabilities. CNNs are able to automatically learn feature representations from low level to high level, avoiding the complex process of manually designing feature extractors and thus significantly improving recognition accuracy. As the technology continues to advance, researchers have begun to introduce more complex network architectures, such as long short-term memory networks (LSTMs) and recurrent neural networks (RNNs), in order to better process time-series data and capture the dynamic features of actions [1]. Despite significant progress in action recognition, deep learning methods still face many challenges, mainly including the limited generalisation ability of the models, high real-time requirements, and the effect of environmental disturbances. Due to the diversity and complexity of actions, the performance of existing models often varies greatly across environments and datasets, making it difficult to transfer them effectively to new application scenarios. Many practical applications (e.g., video surveillance and human-computer interaction) require high real-time processing capability [2], and when dealing with large-scale video data, existing methods often face speed bottlenecks. Furthermore, lighting changes, occlusion, background clutter, and other environmental factors can degrade the quality of video data and thus reduce the accuracy of action recognition.
In order to address the limitations of existing methods, this paper proposes an improved video character classification method based on deep learning. The method combines a gated recurrent unit (GRU) with a three-dimensional convolutional neural network (3D-CNN) and an improved two-stream C3D network for more efficient behaviour classification. The specific objectives are to enhance the adaptability of the model through feature fusion and dynamic feature learning so that it performs well across different environments and datasets; to optimise real-time performance by improving the network structure to increase processing speed and reduce latency; and to enhance robustness by automatically extracting features that are resistant to factors such as illumination changes and occlusion. The research steps include extracting static and dynamic features from video data and fusing them, designing a GRU-based video character classification method, and combining the 3D-CNN with an improved C3D network for behaviour recognition. Finally, this paper conducts experiments on standard datasets such as UCF101 and HMDB51 to validate the effectiveness of the proposed method and analyses the impact of different feature fusion strategies and network structures on recognition accuracy. Through these improvements, this paper expects to improve the generalisation ability and real-time processing capability of existing action recognition systems while overcoming the interference of environmental factors, thereby promoting the application of action recognition technology in a wider range of scenarios [3].

2. Methodology

Figure 1 illustrates the structure of this section.

2.1. Static and Dynamic Feature Extraction for Character Movement

2.1.1. First Frame Static Feature Extraction Based on CNN

The static information in video frames can effectively reflect the characteristics of the people or objects in the scene; therefore, extracting features from the first frame of a video has become a common approach. We use a convolutional neural network (CNN) to extract the features of the first frame of each video clip. CNNs perform well in image classification, so in this section a neural network is constructed based on a typical image classification architecture to extract the features of the first frame of a video. The network contains five convolutional layers, two pooling layers, two fully connected layers, and one Softmax layer [4]. The outputs of the convolutional and fully connected layers are processed nonlinearly by the ReLU activation function, and finally the Softmax layer outputs a 1000-dimensional vector as the static feature of the video frame. Figure 2 illustrates the structure of this neural network.
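To make the layer arrangement concrete, the following is a minimal Keras sketch of such a first-frame feature extractor, assuming an input resolution of 112 × 112; the exact kernel sizes and channel widths are not specified in the paper and are illustrative assumptions here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_static_feature_cnn(input_shape=(112, 112, 3)):
    """Five convolutional layers, two pooling layers, two fully connected
    layers and a 1000-way Softmax output, mirroring the description in
    Section 2.1.1. Filter counts and kernel sizes are illustrative."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),                       # pooling layer 1
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),                       # pooling layer 2
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),        # fully connected layer 1
        layers.Dense(4096, activation="relu"),        # fully connected layer 2
        layers.Dense(1000, activation="softmax"),     # 1000-dim static feature vector
    ])

first_frame = tf.zeros((1, 112, 112, 3))                    # placeholder first frame
static_feature = build_static_feature_cnn()(first_frame)    # shape (1, 1000)
```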

2.1.2. Trajectory-Based Motion Frame Generation

In dynamic feature extraction, the optical flow algorithm describes motion by computing video trajectories, which provide pixel-level motion information reflecting the dynamic characteristics of an object, such as direction, velocity, and acceleration, and accurately capture its spatial position and motion changes. Based on these trajectories, the motion patterns of objects can be described in detail. In order to extract effective dynamic information, this section uses the density-based DBSCAN algorithm to cluster the trajectories and identify significant motion regions. (DBSCAN is a density-based trajectory-clustering method that automatically identifies core points and clusters by evaluating the density around data points; it can handle clusters with complex shapes, effectively identifies noise, does not require a preset number of clusters, and is widely used in trajectory analysis, traffic monitoring, and object tracking.) This approach not only reveals the motion patterns in the video but also provides accurate feature inputs for subsequent target tracking and behavioural analysis. The specific algorithm is shown below (Algorithm 1) [5]:
Algorithm 1. DBSCAN-Based Trajectory-Clustering Algorithm
Input: D: region of significant motion in the frame
ε: radius parameter of the cluster
MinPoints: field density thresholds
Output: set of density-based clusters
1: c ← 0
2: for each P ∈ D do
3:   if P is visited then
4:    continue
5: end if
6: NeighborPts = getAllPoints(P,ε)
7: if size (NeighborPts) < MinPoints then
8:   mark P as noise
9: else
10:   c = next cluster
11:   addToCluster(P,NeighborPts,c,ε,MinPoints)
12: end if
13:   mark P as visited
14: end for
15: function addToCluster(P,NeighborPts,c,ε,MinPoints)
16:   add P to cluster c
17:   for each point np ∈ NeighborPts do
18:      if np is not visited then
19:         mark np as visited
20:         NeighborPts′ = getAllPoints(np,ε)
21:         if size (NeighborPts′) ≥ MinPoints then
22:           NeighborPts ← NeighborPts joined with NeighborPts′
23:          end if
24:       end if
25:       if np is not yet member of any cluster then
26:        add np to cluster c
27:       end if
28:       end for
29: end function
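Algorithm 1 is standard DBSCAN applied to trajectory points, so in practice it can be reproduced with an off-the-shelf implementation. Below is a minimal sketch using scikit-learn, where eps corresponds to ε and min_samples to MinPoints; the parameter values and the synthetic points are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_trajectory_points(points, eps=10.0, min_points=5):
    """Cluster (x, y) trajectory points of one frame with DBSCAN.
    `eps` plays the role of the radius parameter ε and `min_points` of the
    density threshold MinPoints in Algorithm 1."""
    labels = DBSCAN(eps=eps, min_samples=min_points).fit_predict(points)
    clusters = {}
    for label in set(labels):
        if label == -1:                 # -1 marks noise points
            continue
        clusters[label] = points[labels == label]
    return clusters

# Example: two dense blobs of trajectory points plus a few isolated noise points.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 2, (50, 2)),
                 rng.normal(40, 2, (50, 2)),
                 rng.uniform(-50, 100, (5, 2))])
print({k: v.shape for k, v in cluster_trajectory_points(pts).items()})
```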
After clustering the trajectories, multiple clusters are formed for the key motion regions in each video frame. To reduce the influence of noise and unimportant regions, some clusters can be culled. According to the culling rule, if the number of trajectory points in a cluster is less than 50% of the number in the largest cluster of the current frame, the cluster is culled, and the remaining clusters are used as motion candidate frames. Since each motion candidate frame is centred on the cluster centre with radius ε, the circular clusters need to be converted to rectangles in order to achieve scale uniformity. By calculating the Chebyshev distance from each clustered point to the centre, the 20% of points with the largest distance are eliminated, thereby performing motion frame noise reduction based on the Chebyshev distance. The Chebyshev distance is a distance metric commonly used in mathematics and computer science, especially in image processing and pattern recognition [6]. It is inspired by the king's move on a chessboard and is therefore also known as the 'chessboard distance' or 'king's distance'. It measures the distance between two points as the largest difference along any coordinate axis in a standard coordinate system (e.g., a 2D or 3D coordinate system). The specific algorithm is shown below (Algorithm 2):
Algorithm 2. Chebyshev’s algorithm [7]:
Input: Dis: largest Chebyshev value in the cluster
            C: the clustered cluster in the frame
Output: the set of clusters after noise reduction
1: totalPoints ← points within Dis of center of C
2: currentPoints ← totalPoints
3: while true do
4:    if COUNT (currentPoints) < COUNT(totalPoints)*0.8 then return currentPoints
5:    end if
6:    Dis ← Dis − 1
7:    currentPoints ← points within Dis of center C
8: end while
The clustered clusters of each frame are subjected to motion frame noise reduction based on the Chebyshev technique to create motion frames, which are described by the array b = (x, y, r, f), where x and y denote the horizontal and vertical coordinates of the upper left corner of the motion frame, r denotes the height of the motion frame, and f denotes the frame where the motion frame is located in the video sequence.
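A minimal sketch of the Chebyshev-based culling in Algorithm 2 and the conversion of a denoised cluster into a motion frame b = (x, y, r, f) might look as follows; treating r as the side length of the resulting box and centring the box on the cluster centre are assumptions made for illustration.

```python
import numpy as np

def chebyshev_denoise(points, center):
    """Algorithm 2: shrink the Chebyshev radius until fewer than 80% of the
    cluster's points remain, i.e., discard roughly the outermost 20%."""
    dist = np.max(np.abs(points - center), axis=1)   # Chebyshev distance to the centre
    radius = float(np.ceil(dist.max()))
    total = len(points)
    while True:
        kept = points[dist <= radius]
        if len(kept) < 0.8 * total:
            return kept, radius
        radius -= 1.0

def cluster_to_motion_frame(points, frame_index):
    """Turn a cluster into a motion frame b = (x, y, r, f): (x, y) is the
    upper-left corner, r the box size, f the frame index (see Section 2.1.2)."""
    center = points.mean(axis=0)
    kept, radius = chebyshev_denoise(points, center)
    x, y = center[0] - radius, center[1] - radius
    return (x, y, 2 * radius, frame_index)

rng = np.random.default_rng(1)
cluster = rng.normal(loc=(50, 80), scale=5, size=(200, 2))
print(cluster_to_motion_frame(cluster, frame_index=12))
```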

2.1.3. Dynamic Feature Extraction Based on Motion Tubes

Motion pipelines are motion trajectories extracted in video frames that are linked together by consecutive time frames to form a spatio-temporal structure connecting multiple time steps [8]. In the previous section, after generating motion candidate frames by clustering the trajectories in the video frames with the DBSCAN algorithm, these motion frames need to be further processed to construct a standardised dynamic feature representation. Since each video frame has a different number and size of motion frames, the number and size of motion frames in each frame must be standardised for subsequent dynamic feature analysis.
Assuming the input video is Vi, it can first be split into n video segments and expressed as Vi = [vi,1, vi,2, …, vi,n]. Since each video frame contains several important motion regions, and thus different motion frames, the video segment vi,t can be described by the motion frames in each of its frames as shown in Equation (1):
$$g\left(v_{i,t}\right)=\begin{bmatrix} b_{t,1,1} & b_{t,1,2} & \cdots & b_{t,1,p} \\ b_{t,2,1} & b_{t,2,2} & \cdots & b_{t,2,p} \\ \vdots & \vdots & \ddots & \vdots \\ b_{t,n,1} & b_{t,n,2} & \cdots & b_{t,n,p} \end{bmatrix} \tag{1}$$
where bt,n,p denotes the pth motion frame in the nth frame of the tth video clip. It is worth noting that each frame may contain a different number of motion frames. When the number of motion frames differs across video frames, some motion frames may fail to form a motion pipeline, and motion frames that do not belong to a motion pipeline may end up inside one, which changes the way the motion is described. Therefore, in order to ensure that motion frames are connected between neighbouring frames, the number of motion frames in each frame must be unified before motion frames can be used to create a motion pipeline across subsequent frames [9].
We first determine the average number of motion frames for each frame in a video clip, assuming that each frame of a video clip with w consecutive frames has an average of N motion frames. This allows us to unify the number of motion frames for each video frame in the video clip. If the number of motion frames in the current frame is more than N, all subsequent motion frames after N are deleted in order to ensure that the video frames contain the same number of motion frames; if the number of motion frames in the current frame is less than N, motion frames are generated manually using linear regression for each x, y, and r in the vector b = (x, y, r, f). By following the above steps, we obtain the video clips with a consistent number of motion frames as shown in Equation (2):
$$h\left(g\left(v_{i,t}\right)\right)=\begin{bmatrix} b_{t,1,1} & b_{t,1,2} & \cdots & b_{t,1,k} \\ b_{t,2,1} & b_{t,2,2} & \cdots & b_{t,2,k} \\ \vdots & \vdots & \ddots & \vdots \\ b_{t,n,1} & b_{t,n,2} & \cdots & b_{t,n,k} \end{bmatrix} \tag{2}$$
Now all that remains is to use Equation (2) to construct the motion pipeline describing the dynamic features. Given that bt,m,1, bt,m,2, …, bt,m,k and bt,m+1,1, bt,m+1,2, …, bt,m+1,k are the motion frames in frame m and frame m + 1, respectively, the distances between all motion frames in these two frames are computed to form the Euclidean distance matrix between motion tubes in Equation (3) [10]:
$$D=\begin{bmatrix} D_{11} & D_{12} & \cdots & D_{1k} \\ D_{21} & D_{22} & \cdots & D_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ D_{k1} & D_{k2} & \cdots & D_{kk} \end{bmatrix} \tag{3}$$
where Dij is the Euclidean distance between the ith motion frame in frame m and the jth motion frame in frame m + 1, and D collects the Euclidean distances between all motion frames of frame m and frame m + 1. After determining the Euclidean distance between the motion frames of neighbouring frames, all motion frames in the neighbouring frames can be connected. A specific connection is made as follows: assuming that Dij is the smallest value in row i, the ith motion frame in frame m is connected to the jth motion frame in frame m + 1 [11]. The description of the motion pipeline for a specific video clip is shown in Equation (4):
$$V_i=\begin{bmatrix} k_1 & x_{z,1} & y_{z,1} & r_{z,1} \\ k_2 & x_{z,2} & y_{z,2} & r_{z,2} \\ \vdots & \vdots & \vdots & \vdots \\ k_n & x_{z,n} & y_{z,n} & r_{z,n} \end{bmatrix} \tag{4}$$
where the matrix describes the ith video clip: the first column denotes the frame index within the clip, and the last three columns denote the horizontal coordinate, vertical coordinate, and height of the motion frame in the zth frame that is connected into the motion tube. The MBH features formed by the motion tubes are computed and encoded using Fisher vectors (FV), and the resulting dynamic feature vector (n = 1000) is written as H = [H1, H2, …, Hn].
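To illustrate the linking step, the sketch below builds the Euclidean distance matrix of Equation (3) between the motion frames of two consecutive frames and connects each motion frame in frame m to its nearest counterpart in frame m + 1; the helper name and the (x, y, r) box representation are assumptions for illustration.

```python
import numpy as np

def link_motion_boxes(boxes_m, boxes_m1):
    """Link the motion frames of frame m to those of frame m+1.

    boxes_m, boxes_m1: arrays of shape (k, 3) holding (x, y, r) for the k
    motion frames of two consecutive frames (Equation (2) guarantees both
    frames carry the same number k of boxes).  D[i, j] is the Euclidean
    distance between box i in frame m and box j in frame m+1 (Equation (3));
    each box i is linked to the box j that minimises row i of D."""
    diff = boxes_m[:, None, :] - boxes_m1[None, :, :]
    D = np.linalg.norm(diff, axis=-1)        # (k, k) distance matrix
    links = np.argmin(D, axis=1)             # index j* chosen for each i
    return links

boxes_m  = np.array([[10., 20., 32.], [100., 40., 28.]])
boxes_m1 = np.array([[98., 43., 30.], [12., 22., 31.]])
print(link_motion_boxes(boxes_m, boxes_m1))  # -> [1 0]
```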

2.2. Feature Fusion Based on Dynamic and Static Features

In video analysis, both dynamic and static features can provide important information for scene understanding. To improve the performance of the model, these two types of features are usually fused together. Several common feature fusion methods are introduced next, namely Cholesky variant-based feature fusion, Gaussian distribution-based feature fusion and PCA-based feature fusion.

2.2.1. Feature Fusion Based on Cholesky Variation

The Cholesky variant is mainly used in matrix factorisation, which decomposes a positive definite Hermitian matrix into the product of a lower triangular matrix and its conjugate transpose. In feature fusion, the Cholesky variant is used to combine static and dynamic features, considering the relationship between the two. Assuming that S represents static feature vectors and M represents dynamic feature vectors, it is first necessary to check whether there exists a matrix that can be represented by S and M as shown in Equation (5).
$$\begin{bmatrix} A \\ B \end{bmatrix}=\begin{bmatrix} 1 & 0 \\ \rho_1 & \sqrt{1-\rho_1^{2}} \end{bmatrix}\times\begin{bmatrix} S \\ M \end{bmatrix} \tag{5}$$
The following inferences can be drawn from the above matrix:
$$A=S,\qquad B=\rho_1 S+\sqrt{1-\rho_1^{2}}\,M \tag{6}$$
Meanwhile, exchanging the positions of S and M can obtain Equation (7), similar to Equation (5):
$$\begin{bmatrix} T \\ D \end{bmatrix}=\begin{bmatrix} 1 & 0 \\ \rho_2 & \sqrt{1-\rho_2^{2}} \end{bmatrix}\times\begin{bmatrix} M \\ S \end{bmatrix} \tag{7}$$
Similarly, the following conclusions can be drawn:
$$T=M,\qquad D=\rho_2 M+\sqrt{1-\rho_2^{2}}\,S \tag{8}$$
Equations (6) and (8) show that B depends only on ρ1 (given S and M), and similarly D depends only on ρ2. If suitable ρ1 and ρ2 are chosen such that ρ2 = √(1 − ρ1²), then B = D for any S and M, and either vector can be used as the fused feature C, i.e., C = B = D. The remaining task is to experimentally find suitable ρ1 and ρ2 so that this relationship holds as well as possible; in general, the appropriate contributions of ρ1 and ρ2 differ from dataset to dataset [12].
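As a minimal sketch, the fusion of Equations (6) and (8) can be written directly in NumPy; here ρ1 is assumed to weight the static feature vector as in Equation (6), and the value 4/√17 is chosen so that the fused vector matches the 80% static / 20% dynamic eigenvector of Table 1.

```python
import numpy as np

def cholesky_fuse(S, M, rho1):
    """Fuse static (S) and dynamic (M) feature vectors.  With
    rho2 = sqrt(1 - rho1**2), the vectors B (Eq. (6)) and D (Eq. (8))
    coincide and serve as the fused feature C = B = D."""
    rho2 = np.sqrt(1.0 - rho1 ** 2)
    B = rho1 * S + np.sqrt(1.0 - rho1 ** 2) * M
    D = rho2 * M + np.sqrt(1.0 - rho2 ** 2) * S
    assert np.allclose(B, D)          # sanity check: the two fusions agree
    return B

S = np.random.rand(1000)              # 1000-dim static features (Section 2.1.1)
M = np.random.rand(1000)              # 1000-dim dynamic features (Section 2.1.3)
C = cholesky_fuse(S, M, rho1=4 / np.sqrt(17))   # C = (4/sqrt(17))S + (1/sqrt(17))M
```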

2.2.2. Gaussian Distribution-Based Feature Fusion

Both dynamic and static features can be regarded as a set of vectors, and the initial description of these two types of features through histograms can be further fitted using Gaussian distribution. Specifically, assuming that the static and dynamic features are represented as two vectors, these features can be described by a two-dimensional Gaussian distribution function as shown in Equation (9) [13].
$$P(N)=\frac{1}{2\pi\left|\Sigma\right|^{1/2}}\,e^{\left[-\frac{1}{2}\left(N-\mu\right)^{T}\Sigma^{-1}\left(N-\mu\right)\right]} \tag{9}$$
where N is the vector consisting of the static and dynamic features, μ is the mean vector containing the mean μs of the static features and the mean μm of the dynamic features, i.e., μ = [μs, μm]ᵀ, and Σ is the covariance matrix representing the correlation between the static and dynamic features, usually written as
$$\Sigma=\begin{bmatrix} \sigma_s^{2} & \rho\sigma_s\sigma_m \\ \rho\sigma_s\sigma_m & \sigma_m^{2} \end{bmatrix}$$
where σs is the standard deviation of the static features, σm is the standard deviation of the dynamic features, and ρ is the correlation coefficient between the static and dynamic features [14].
As the scales of static and dynamic features are usually inconsistent, normalisation is required. To eliminate the effect of scale inconsistency on the results, scaling can be performed using a scale matrix. Specifically, the static and dynamic features are scaled using the following scale matrix as in Equation (10) [15]
$$Scaling\ matrix=\begin{bmatrix} \dfrac{\sigma_s}{\sigma_s+\sigma_m} & 0 \\ 0 & \dfrac{\sigma_m}{\sigma_s+\sigma_m} \end{bmatrix} \tag{10}$$
This scaling process can help to balance the scale differences between static and dynamic features, thus ensuring that static and dynamic features can work together to form a more stable feature fusion in a Gaussian mixture model [16].
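The following sketch estimates the parameters of the joint Gaussian description in Equation (9) and applies the scaling of Equation (10); treating each descriptor's empirical standard deviation as σs and σm is an assumption for illustration.

```python
import numpy as np

def gaussian_parameters(S, M):
    """Mean vector and 2x2 covariance matrix of the joint (static, dynamic)
    description used in Equation (9)."""
    mu = np.array([S.mean(), M.mean()])
    sigma = np.cov(np.vstack([S, M]))   # variances on the diagonal, rho*sigma_s*sigma_m off-diagonal
    return mu, sigma

def scale_features(S, M):
    """Scaling matrix of Equation (10): weight each stream by the ratio of its
    standard deviation to the sum of both standard deviations."""
    sigma_s, sigma_m = S.std(), M.std()
    w = np.array([sigma_s, sigma_m]) / (sigma_s + sigma_m)
    return w[0] * S, w[1] * M

S = np.random.rand(1000)            # static feature vector
M = 10.0 * np.random.rand(1000)     # dynamic feature vector on a larger scale
mu, sigma = gaussian_parameters(*scale_features(S, M))
print(mu, sigma, sep="\n")
```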

2.2.3. PCA-Based Feature Fusion

Principal component analysis (PCA) is a commonly used dimensionality reduction method that maps high-dimensional data into a low-dimensional space while retaining as much information as possible. PCA determines the principal components by computing the eigenvectors of the covariance matrix and reconstructs the data using these components. In feature fusion, the static and dynamic features can each be regarded as 1000-dimensional descriptors; after dimensionality reduction by PCA, the top m principal components (m < 1000) are selected as the new feature representation. PCA improves the efficiency of the feature representation by retaining the maximum variance and reducing redundant features [17]. In this section, PCA-based feature fusion treats the dynamic and static features as a joint description, extracts the principal components through PCA, and uses the leading components as the new features. The explained variance of each principal component indicates its contribution to the variability of the data, and the retained dimensions are chosen based on the cumulative percentage of variance. We usually keep enough dimensions to retain more than 95% of the variance, ensuring that the reduced features still contain most of the information.
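A minimal scikit-learn sketch of this fusion is shown below; stacking the two 1000-dimensional descriptors side by side and letting PCA keep 95% of the variance is one plausible reading of the procedure, and the random data are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_fuse(S, M, variance_to_keep=0.95):
    """Stack the 1000-dimensional static and dynamic descriptors of a set of
    clips and keep the principal components explaining >= 95% of the variance,
    as described in Section 2.2.3."""
    X = np.hstack([S, M])                       # (n_clips, 2000) joint description
    pca = PCA(n_components=variance_to_keep)    # float in (0, 1): keep that share of variance
    return pca.fit_transform(X)

S = np.random.rand(200, 1000)   # static features for 200 clips
M = np.random.rand(200, 1000)   # dynamic features for 200 clips
fused = pca_fuse(S, M)
print(fused.shape)              # (200, m) with m chosen by the variance criterion
```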

2.3. GRU-Based Video Character Classification

2.3.1. GRU-Based Video Character Classification Model

After fusing the above motion features, the fused features are processed further to record the temporal information of the motion features. Figure 3 shows the structure of the GRU-based video character classification model.
The lower part of Figure 3 represents the fusion of static and dynamic features. The fused sequence C = [C1, C2, …, Cn] is input to the GRU units, which form a neural network that captures the temporal characteristics of the fused video-sequence features. The whole network has 128 GRU units with a dropout of 0.8. The final output is passed through the Softmax layer, which completes the classification of the video characters' actions [18].
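A minimal Keras sketch of the model in Figure 3 follows; the sequence length, the 1000-dimensional fused features, and the placement of the 0.8 dropout on the GRU inputs are assumptions, since the paper does not fix these details.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_gru_classifier(seq_len=16, feature_dim=1000, num_classes=101):
    """Fused feature sequence C = [C1, ..., Cn] -> 128 GRU units with
    dropout 0.8 -> Softmax over the action classes (Figure 3)."""
    return models.Sequential([
        layers.Input(shape=(seq_len, feature_dim)),
        layers.GRU(128, dropout=0.8),                     # 128 GRU units, dropout 0.8
        layers.Dense(num_classes, activation="softmax"),  # action-class probabilities
    ])

model = build_gru_classifier(num_classes=101)   # e.g., the 101 classes of UCF101
model.summary()
```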

2.3.2. Video Classification Process for GRU Networks

The GRU is a more efficient variant of the LSTM that handles time-series problems well and has therefore become a popular choice for such tasks. As an optimisation of the LSTM structure, it effectively mitigates the gradient vanishing and gradient explosion problems of RNNs (recurrent neural networks). Figure 4 shows the structure of the GRU unit.
In contrast to the LSTM, the GRU contains only an update gate and a reset gate. In Figure 4, zt represents the update gate and rt the reset gate. The input and forget gates of the LSTM are replaced by the update gate zt, which determines what information is discarded and added, whereas the extent to which the previous data are erased is determined by the reset gate rt [19]:
$$z_t=\sigma\left(W_z x_t+U_z h_{t-1}\right) \tag{11}$$
$$r_t=\sigma\left(W_r x_t+U_r h_{t-1}\right) \tag{12}$$
where xt denotes the input at the current moment, σ denotes the sigmoid function, and W and U denote the weight matrices of the GRU unit that are learned during training. ĥt denotes the candidate value of the hidden state, as shown in Equation (13):
$$\hat{h}_t=\tanh\left(W_h x_t+r_t\odot U_h h_{t-1}\right) \tag{13}$$
The current state of the hidden layer is then obtained by combining ĥt as shown in Equation (14):
$$h_t=\left(1-z_t\right)\odot\hat{h}_t+z_t\odot h_{t-1} \tag{14}$$
When the reset gate rt tends to 0, most of the state information of the previous hidden layer is ignored, and the currently fused dynamic and static feature input takes up a larger proportion, whereas zt determines the ratio between the hidden state at the previous moment and the current candidate state, and thus determines the hidden state that is ultimately used.
Training the fused static and dynamic features with the GRU-based video character classification model ensures that the video is characterised over the time series, which improves recognition accuracy [20].
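For reference, a single GRU update implementing Equations (11)–(14) can be written directly in NumPy as below; the weight shapes and random initialisation are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update following Equations (11)-(14)."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate, Eq. (11)
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate, Eq. (12)
    h_hat = np.tanh(Wh @ x_t + r_t * (Uh @ h_prev))    # candidate state, Eq. (13)
    return (1.0 - z_t) * h_hat + z_t * h_prev          # new hidden state, Eq. (14)

d_in, d_hid = 1000, 128
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(scale=0.01, size=(d_hid, d_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(scale=0.01, size=(d_hid, d_hid)) for _ in range(3))
h = np.zeros(d_hid)
for x_t in rng.normal(size=(16, d_in)):                # a 16-step fused feature sequence
    h = gru_step(x_t, h, Wz, Uz, Wr, Ur, Wh, Uh)
```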

2.4. Classification of Movements

2.4.1. Three-Dimensional Convolutional Neural Network Variable-Scale-Based Feature Extraction

Spatial pyramid pooling (SPP) is a widely used feature extraction method for image classification and target detection [15], which, unlike traditional 2D convolutional methods, is capable of handling images of variable size. In this study, a three-dimensional pyramid pooling method is introduced into the C3D network, replacing the original maximum pooling layer, so as to optimise the spatio-temporal feature extraction of the video. The 3D pyramid pooling operates on the cubic feature maps of different sizes output from the Conv5-b layer of the C3D network and fuses them into a fixed-length feature vector, as shown in Figure 5.
In 2D pyramid pooling, assume that the last convolutional layer produces feature maps of different sizes, each with 256 channels. These feature maps are pooled with 4 × 4, 2 × 2, and 1 × 1 grids and the results are combined, which guarantees that feature maps of different sizes all produce 21 (= 4 × 4 + 2 × 2 + 1 × 1) features per channel after pyramid pooling [21].
In this work, we adopt the concept of 2D pyramid pooling and introduce it into the C3D network by adding a 3D pyramid pooling layer after Conv5-b to replace Pool5 in the original C3D network. The specific 3D pyramid pooling process is shown in the right half of Figure 5. Combining the pooling kernel formula and step size formula of the 2D pyramid network [22], the pooling kernel and step size of the 3D pyramid pooling are given in Equations (15) and (16), respectively:
$$Pooling\ kernel=\left\lceil \frac{L}{pt_i} \right\rceil\times\left\lceil \frac{H}{ps_i} \right\rceil\times\left\lceil \frac{W}{ps_i} \right\rceil \tag{15}$$
$$Step\ length=\left\lfloor \frac{L}{pt_i} \right\rfloor\times\left\lfloor \frac{H}{ps_i} \right\rfloor\times\left\lfloor \frac{W}{ps_i} \right\rfloor \tag{16}$$
where L × H × W is the size of the cubic feature map output by Conv5-b of the C3D network, and pti and psi describe the temporal and spatial pooling levels of the ith pyramid layer; i ∈ {1, 2, 3}, pti is set to 1, and psi is set to 2^(i−1), giving spatial grids of 1 × 1, 2 × 2, and 4 × 4.
In summary, the whole 3D pyramid pooling process is equivalent to dividing the cubic feature maps of different sizes generated by Conv5-b of the C3D network into cubic grids of different sizes, performing a maximum pooling operation on each small cubic cell, and finally fusing these features to obtain a fixed-length feature vector of length 1 × (21 × k), where k is the number of channels of the Conv5-b output, here k = 512.
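The following NumPy sketch makes the pooling concrete for a Conv5-b output of shape L × H × W × k: each pyramid level pools the whole temporal axis and splits the spatial plane into a 1 × 1, 2 × 2, or 4 × 4 grid, yielding a fixed 21 × k vector regardless of the input size. The simple linspace-based cell partitioning stands in for the kernel/stride formulas of Equations (15) and (16).

```python
import numpy as np

def pyramid_pool_3d(feature_map, spatial_levels=(1, 2, 4)):
    """3D pyramid pooling of a cube feature map of shape (L, H, W, K):
    each level max-pools the full temporal axis (pt_i = 1) and a ps_i x ps_i
    spatial grid, giving 1*1*1 + 1*2*2 + 1*4*4 = 21 values per channel."""
    L, H, W, K = feature_map.shape
    pooled = []
    for ps in spatial_levels:
        h_edges = np.linspace(0, H, ps + 1).astype(int)
        w_edges = np.linspace(0, W, ps + 1).astype(int)
        for i in range(ps):
            for j in range(ps):
                cell = feature_map[:, h_edges[i]:h_edges[i + 1],
                                      w_edges[j]:w_edges[j + 1], :]
                pooled.append(cell.max(axis=(0, 1, 2)))   # max over the whole cube cell
    return np.concatenate(pooled)                          # fixed length 21 * K

vec = pyramid_pool_3d(np.random.rand(2, 7, 7, 512))
print(vec.shape)    # (10752,) = 21 * 512, independent of the input L, H, W
```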

2.4.2. Video Character Behaviour Classification Network Based on Dual-Stream Improved C3D

Although the traditional C3D network is capable of extracting spatio-temporal features of the video, its extraction of temporal features is rougher, and thus it is inferior to the dual-stream method [23] in terms of recognition accuracy. In the video action recognition task, spatial and temporal streams are two key components that are responsible for extracting different types of information from the video to ensure that the video content can be fully understood. The spatial stream is mainly responsible for extracting static spatial features from video frames, and by processing each frame, the spatial stream is able to capture structural features in the image, such as the pose of a person, the shape of an object, and the static parts of the background. Since each frame in a video is essentially a still image, spatial streaming focuses on extracting the static information of these images, such as edges, textures, colours, and so on. In this way, spatial streaming helps to capture local details in the video and build an image-level spatial representation; temporal streaming focuses on extracting the dynamic information in the video. Unlike spatial streaming, which deals with static images, temporal streaming deals with the temporal dimension of the video and captures the dynamic behaviour in the video mainly by analysing the motion and changes between consecutive frames [24]. By introducing optical flow features, temporal flow is able to effectively capture the trajectory of objects and the movement process of characters, thus identifying the actions and activities in the video. These dynamic features are essential for recognising rapid changes and continuous actions, so temporal flow often complements spatial flow information by extracting motion patterns in the video. Combining the dual-stream method and the C3D network and using the C3D network to extract RGB features and optical flow features, respectively, can effectively improve the extraction of spatio-temporal features.
Figure 6 illustrates the improved two-stream C3D network architecture. The network is divided into two streams: an RGB stream for extracting spatial features and an optical flow stream for extracting temporal features [25]. The dual-stream C3D network has two main architectural improvements: (1) the maximum pooling layer is replaced with a 3D pyramid pooling layer, and (2) a fully connected layer sized to the number of action categories is added to enhance the representation of action features. In dual-stream networks (e.g., dual-stream C3D networks), the spatial and temporal streams are usually combined by early or late fusion. The spatial stream provides the static spatial characteristics of the video, while the temporal stream complements them with the dynamic information of the video. By combining these two types of information, the model is able to understand both the static structure and the dynamic changes in the video, thus improving the accuracy of action recognition. In the improved approach of this paper, an early fusion strategy is used, which fuses the features of the spatial and temporal streams at an early stage of the network to ensure that they complement each other at a preliminary stage and provide a more comprehensive representation of the features [26]. Compared with the traditional late fusion approach, this fusion approach better preserves the features of each stream and thus improves recognition accuracy.

3. Experimental Section

The experiments were performed on a workstation with an NVIDIA GTX 1080Ti graphics card using the TensorFlow framework. The training parameters were set to a learning rate of 0.001 and a batch size of 32; the optimiser was selected as Adam, and the loss function was cross-entropy. The model training lasted for 10 epochs, and an early-stop strategy was introduced to prevent overfitting, where training is automatically terminated when the validation set loss stops decreasing.
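The sketch below reproduces this training configuration in Keras; the tiny random dataset and the one-layer placeholder model are assumptions standing in for the actual network and video features.

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model: 256 random 1000-dim feature vectors, 101 classes.
x = np.random.rand(256, 1000).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 101, 256), num_classes=101)
model = tf.keras.Sequential([tf.keras.layers.Input(shape=(1000,)),
                             tf.keras.layers.Dense(101, activation="softmax")])

# Adam with learning rate 0.001 and cross-entropy loss, as in the experiments.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Batch size 32, up to 10 epochs, early stopping when the validation loss stops decreasing.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              restore_best_weights=True)
model.fit(x, y, validation_split=0.2, batch_size=32, epochs=10,
          callbacks=[early_stop], verbose=0)
```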

3.1. Experiment 1: Experimental Study of Video Behaviour Recognition with Different Feature-Contribution Ratios

The experiments in this section aim to achieve the optimal combination of static and dynamic features through feature fusion in order to improve the accuracy of video behaviour recognition. Considering that different datasets are affected differently by the feature-contribution ratio, different dynamic and static feature-contribution ratios are used in the experiments. Specifically, we set up four groups of dynamic and static feature-contribution ratios, as shown in Table 1. By combining the relationship between ρ1 and ρ2 given in the second column of Table 1 with ρ1 = √(1 − ρ2²) [27] and adjusting these ratios, we are able to evaluate their effects on recognition under different feature combinations and thus verify the effectiveness of the feature fusion.
After determining the contribution of different motion features, this experiment will be conducted on the UCF101 and Hollywood2 datasets. The Hollywood2 dataset contains 10 scene types and 12 character motion types, with a total of 3669 video clips of 20.1 h, all from 69 Hollywood films [28], thus making it more challenging. Table 2 shows the results of the feature fusion experiments for the UCF101 and Hollywood2 datasets based on different dynamic and static feature contributions.
From the results, it can be seen that a ratio of 8:2 gives the best recognition accuracy on both datasets. However, for certain action categories, such as standing and sitting up, the best dynamic and static feature-contribution ratio is 6:4, which suggests that these actions rely more on dynamic features, and this affects the final classification results.
Figure 7 shows the performance of different action categories in the Hollywood2 dataset when S:M is 8:2 and 6:4. Whether S:M = 8:2 or S:M = 6:4, the contribution of static features accounts for a relatively large proportion, especially in the later stage, where the GRU-based fusion effectively compensates for the temporal features ignored by the CNN network in single-frame image recognition.

3.2. Experiment 2: Behavioural Feature Extraction for Variable-Scale Video

First, the effectiveness of video feature extraction using the variable-scale behavioural algorithm is tested on the UCF101 dataset, a widely used behavioural recognition dataset containing 101 different behavioural categories [29], with videos from YouTube and other web sources. The dataset is widely used for behaviour recognition, action classification, and video-understanding research.
Two feature extraction methods are used in this experiment for video analysis (Table 3): a regular 3D convolutional neural network extracts standard 3D convolutional video features, while a 3D convolutional neural network with pyramid pooling introduces a pyramid pooling layer, which enhances feature extraction under multi-scale video input and can effectively improve recognition accuracy when dealing with videos of different scales. Two training strategies are also used in the experiments: variable input scale training and fixed input scale training. In variable input scale training, the video size is set to different scales (e.g., 161 × 112, 162 × 220, 320 × 112, etc.) [30], the initial learning rate is 0.001 and is reduced to one-tenth of its value after 15,000 and 20,000 iterations, respectively, and each training batch contains 30 videos. Fixed input scale training, on the other hand, unifies the video input size and is initialised using 3D convolutional network weights pre-trained on the Sports-1M dataset, with the same learning rate schedule as variable input scale training. During the experiments, all models are trained on UCF101 Split 1, only the RGB frames of the original video are used as inputs, and the behaviour-recognition accuracy of each model on the dataset is then calculated.
Experiments were then conducted to observe the effect of the pyramid pooling levels and parameters on feature extraction, with the aim of investigating whether the pooling level affects the performance of the extracted features in behaviour recognition and how many parameters need to be trained. Different pyramid pooling settings lead to different feature extraction quality. Two types of pyramid pooling are set up in this section: the first is a two-level pyramid pooling and the second a three-level pyramid pooling, with the pooling size of each level calculated in the same way as described above. Table 4 shows the experimental results. It can be concluded that the first type of pyramid pooling gives slightly higher accuracy than the general 3D convolutional neural network architecture while requiring fewer training parameters, whereas the second type gives the best results but also has a relatively large number of parameters.
The results show that the model with a 3D pyramid pooling layer outperforms the standard 3D convolutional neural network. With both training strategies, variable input size training is about 1% more effective than fixed input size training. This result may be attributed to the fact that the pre-trained weights and 3D pyramid pooling layer are more effective in extracting spatio-temporal features, while the variable-scale training helps to reduce overfitting.

3.3. Experiment 3: Video Character Behaviour Classification Experiment Based on Two-Stream Improved C3D Network

Video behaviour classification experiments based on the dual-stream improved C3D network were conducted on the UCF101 and HMDB51 datasets. The HMDB51 dataset contains 51 behaviours, with data drawn from movies, web videos, and other sources, making it highly challenging. In this experiment, features are trained on the RGB-stream and optical-flow-stream 3D networks, respectively. In the RGB stream, the input is a sequence of 16 unstacked video frames with the same parameters as the original C3D structure; in the optical flow 3D network, the initial learning rate is set to 0.003 and decreases to one-tenth of the initial value after 20,000 and 40,000 iterations. Each training batch contains 30 samples. After training, the features of the two streams are fused using early fusion and late fusion, respectively. Early fusion applies L2 normalisation to the output vectors of each C3D network after the FC6 layer of the two streams and then concatenates the two vectors; late fusion averages the category probabilities after the Softmax output of each stream.
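The two fusion variants can be sketched as follows: early fusion L2-normalises the FC6 outputs of the RGB and optical-flow streams and concatenates them, while late fusion averages the per-class Softmax probabilities; the 4096-dimensional FC6 size and the random vectors are illustrative assumptions.

```python
import numpy as np

def early_fusion(fc6_rgb, fc6_flow):
    """L2-normalise each stream's FC6 output and concatenate them."""
    rgb = fc6_rgb / np.linalg.norm(fc6_rgb)
    flow = fc6_flow / np.linalg.norm(fc6_flow)
    return np.concatenate([rgb, flow])

def late_fusion(probs_rgb, probs_flow):
    """Average the per-class probabilities produced by the two Softmax heads."""
    return (probs_rgb + probs_flow) / 2.0

rng = np.random.default_rng(0)
fused_feature = early_fusion(rng.random(4096), rng.random(4096))   # length 8192
fused_probs = late_fusion(rng.dirichlet(np.ones(101)),             # two valid class
                          rng.dirichlet(np.ones(101)))             # distributions
```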
In order to evaluate the impact of early fusion and late fusion on behaviour-recognition accuracy, this experiment was conducted on the UCF101 and HMDB51 datasets, and the results are shown in Table 5. The experiments show that early fusion is superior to late fusion, which may be due to the fact that early fusion is able to retain more feature information in the convolutional layer, which can be fused before maximum pooling, and more effectively characterise the movement of people in the video.
On the UCF101 dataset, our proposed two-stream improved C3D network shows a significant improvement in recognition accuracy compared to the original 3D convolutional neural network. This may be attributed to the 3D pyramid pooling layer, which extracts and retains spatio-temporal feature information more effectively. Meanwhile, the dual-stream improved C3D network performs better in spatio-temporal feature extraction, especially through the extraction of optical flow features, which further enhances the model's ability to capture the spatio-temporal characteristics of people's movements in the video. Finally, the early fusion of the dual-stream features better preserves the features of the respective streams compared to late fusion, thus improving recognition accuracy.

3.4. Experimental Results and Analyses

The experiments mainly adopt the Top-1 accuracy rate as the key performance metric and compare the proposed method with traditional methods (e.g., SVM and traditional RNN). The specific evaluation metrics include the Top-1 accuracy rate, which measures whether the most probable category in the classification result is consistent with the true category, and the Top-5 accuracy rate, which evaluates whether the five highest-ranked predicted categories contain the true category. These standard evaluation metrics provide the basis for comparing the experimental results.
From the above experimental results, it can be seen that the GRU + 3D-CNN method proposed in this paper achieves a significant performance improvement on both the UCF101 and HMDB51 datasets (Table 6 and Table 7). Compared with the traditional RNN model, the GRU reduces the gradient vanishing problem through a more effective gating mechanism and thus performs better on tasks with long temporal dependencies. The introduction of the GRU also helps to better capture temporal dynamics and further improves classification accuracy compared to the 3D-CNN approach alone. In addition, the model was analysed using a confusion matrix and found to have high accuracy for most action categories, especially in scenes with complex actions and fast movements, where the advantage of the GRU layer is more obvious.
In terms of computational efficiency, the training and inference times of the different models were compared. The training and inference times of the GRU + 3D-CNN model are longer than those of the traditional RNN, but the significant improvement in accuracy makes it a worthwhile option. The specific training times are shown in Table 8.
Although GRU + 3D-CNN takes longer to train, this added time is worthwhile compared to the improved recognition accuracy.
The experimental results show that the dual-stream C3D network (especially with late fusion) and the pyramid pooling technique perform well on several standard datasets and significantly improve the accuracy of action recognition (Table 9). On the UCF101 dataset in particular, the dual-stream C3D network achieves an accuracy of 88.3%, about 5% higher than the other methods, proving the effectiveness of the dual-stream network in fusing spatial and temporal features. Combined with pyramid pooling, the accuracy of the C3D network on UCF101 is improved to 85.9%, enhancing the model's spatial-information-processing capability through multi-scale feature extraction. The experiments also show that dynamic features are crucial for improving recognition accuracy, and the two-stream C3D network outperforms the traditional 3D-CNN on the UCF101 and HMDB51 datasets. In addition, in this comparison the late fusion outperforms the early fusion on most of the datasets, with the difference being most pronounced on UCF101. Pyramid pooling also effectively enhances the multi-scale feature extraction capability of the model when dealing with complex scenes such as Hollywood2 and Something-Something V1. Overall, combining these techniques not only improves the accuracy of video action recognition but also enhances the generalisation ability of the model in complex video scenes, which increases the potential of action recognition technology in practical applications.

4. Conclusions

The video behaviour-recognition method proposed in this study significantly improves the accuracy and computational efficiency of video behaviour recognition by combining pyramid pooling with an improved two-stream C3D network. Experimental validation shows that the method performs well on multiple video datasets, in particular reducing computational complexity while maintaining high recognition accuracy. In the future, the network structure can be further optimised and more feature fusion strategies can be explored to cope with more complex behaviour-recognition tasks. Meanwhile, the model combining optical flow features and spatio-temporal information has promising applications in dynamic video recognition and is expected to promote further development in the field of video analytics.

Author Contributions

Conceptualization, Y.T. and H.L.; methodology, Y.T.; software, Y.T.; validation, Y.T., X.F. and H.L.; formal analysis, Y.T.; investigation, Y.T.; resources, Y.T.; data curation, Y.T.; writing—original draft preparation, Y.T.; writing—review and editing, Y.T., X.F. and H.L.; visualization, Y.T.; supervision, H.L.; project administration, H.L.; funding acquisition, X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number 61962047), the Inner Mongolia Autonomous Region Science and Technology Major Special Project (grant number 2021ZD0005), the Inner Mongolia Autonomous Region Natural Science Foundation (grant number 2024MS06002), the Inner Mongolia Autonomous Region Universities and Colleges Innovative Research Team Program (grant number NMGIRT2313), the Basic Research Business Fund for Inner Mongolia Autonomous Region Directly Affiliated Universities (grant number BR22-14-05), and the Collaborative Innovation Projects between Universities and Institutions in Hohhot (grant numbers XTCX2023-20, XTCX2023-24).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the UCF101, HMDB51, and Hollywood2 datasets, which are publicly accessible. These data were derived from the following resources available in the public domain: UCF101: URL: UCF101 Dataset; HMDB51: URL: http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/ (accessed on 4 March 2025); Hollywood2: URL: http://www.di.ens.fr/~laptev/actions/hollywood2/ (accessed on 4 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, W.T.; Liu, M.F.; Liu, H.H. A review of human behaviour recognition based on deep learning. Mod. Inf. Technol. 2024, 8, 50–55. [Google Scholar]
  2. Huang, S.; Mao, J. A surface EMG gesture recognition method based on improved deep forest. J. Shanghai Univ. Eng. Technol. 2023, 37, 190–197. [Google Scholar]
  3. Zhang, Y.C. Research on Human Movement Recognition Algorithm Based on Ultra-Wideband Radar. Master’s Thesis, Liaoning University of Engineering and Technology, Liaoning, China, 2022. [Google Scholar]
  4. Li, R. Research on Key Technology of Intelligent Identification of Passenger Flow in High-Speed Railway Stations Based on Deep Learning. Ph.D. Thesis, China Academy of Railway Science, Beijing, China, 2022. [Google Scholar]
  5. Su, X.Y. Research and Application of Human Movement Behaviour Recognition Method Based on AlphaPose and LSTM. Master’s Thesis, Southwest Jiaotong University, Chengdu, China, 2022. [Google Scholar]
  6. Liu, Y.; Zhang, L.; Xin, S.; Zhang, Y. Deep learning web video action classification incorporating spatio-temporal attention mechanism. Chin. Sci. Technol. Pap. 2022, 17, 281–287. [Google Scholar]
  7. Wang, C.; Wei, Z.L.; Chen, S.H. An Action Recognition Method for Borderless Applications Based on Self-Attention Mechanism. Comput. Res. Dev. 2022, 59, 1092–1104. [Google Scholar]
  8. Zhang, A. Human Action Recognition in Table Tennis Based on Improved GoogLeNet Network. Master’s Thesis, Shenyang University of Technology, Shenyang, China, 2021. [Google Scholar]
  9. Wang, F. Through-Wall Radar Human Action Recognition Based on Feature Enhancement and Shallow Neural Network. Master’s Thesis, Taiyuan University of Technology, Taiyuan, China, 2021. [Google Scholar]
  10. Bai, F. Action Recognition Based on Channel State Information; China University of Mining and Technology: Xuzhou, China, 2021. [Google Scholar]
  11. Gong, F.M.; Ma, Y.H. Research on human action recognition based on spatio-temporal two-branch network. Comput. Technol. Dev. 2020, 30, 23–28. [Google Scholar]
  12. Yang, J.T. Human Continuous Action Recognition Based on LSTM. Master’s Thesis, Xi’an University of Technology, Xi’an, China, 2020. [Google Scholar]
  13. Chu, J.H.; Zhang, S.; Tang, W.H.; Lu, W. Driving behaviour recognition method based on tutor-student network. Adv. Lasers Optoelectron. 2020, 57, 211–218. [Google Scholar]
  14. Chen, X.H. Research on Limb Movement Recognition Based on 3D Skeleton. Master’s Thesis, University of Electronic Science and Technology, Chengdu, China, 2019. [Google Scholar]
  15. Ding, H.J.; Gong, F.M. Human activity state recognition and localisation based on time series analysis. Comput. Technol. Dev. 2019, 29, 82–86+90. [Google Scholar]
  16. Han, A. Multimodal action recognition based on deep learning framework. Comput. Mod. 2017, 07, 48–52. [Google Scholar]
  17. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 1–9. [Google Scholar]
  18. Tran, D.; Ray, J.; Le, Q.V. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  19. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Ng, A.Y. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. [Google Scholar]
  20. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
  21. Du, Y.; Wang, L.; Wang, X. Hierarchical recurrent neural network for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 111–119. [Google Scholar]
  22. Liu, Z.; Shah, M.; Gool, L.V. Spatiotemporal convolutional networks for video action recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1260–1269. [Google Scholar]
  23. Choutas, V.; Kompatsiaris, I.; Ferrari, V. Pseudo-3D residual networks for action recognition in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5238–5246. [Google Scholar]
  24. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6296–6305. [Google Scholar]
  25. Ng, X.; Socher, R. Action recognition with attention-based LSTMs. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 19–24. [Google Scholar]
  26. Wu, Z.; Xiong, Y.; Yu, S. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 284–293. [Google Scholar]
  27. Zheng, T.H.; Long, W.; Shen, B.; Zhang, Y.J.; Lu, Y.J.; Ma, K.J. Seismic Stability Assessment of Single-Layer Reticulated Dome Structures by the Development of Deep Learning. Int. J. Struct. Stab. Dyn. 2024. [Google Scholar] [CrossRef]
  28. Wang, M.D.; Zhang, X.L.; Chen, S.Q.; Li, X.M.; Zhang, Y. Modeling the skeleton-language uncertainty for 3D action recognition. Neurocomputing 2024, 608, 128426. [Google Scholar] [CrossRef]
  29. Wu, G.S.; Wen, C.H.; Jiang, H.C. Wushu Movement Recognition System Based on DTW Attitude Matching Algorithm. Entertain. Comput. 2025, 52, 100877. [Google Scholar] [CrossRef]
  30. Su, Y.X.; Zhao, Q. Efficient spatio-temporal network for action recognition. J. Real-Time Image Process. 2024, 21, 158. [Google Scholar] [CrossRef]
Figure 1. Related process.
Figure 2. Structure of the CNN used to extract static features.
Figure 3. A model for classifying video characters based on GRU.
Figure 4. GRU unit structure.
Figure 5. Two-dimensional pyramid pooling and three-dimensional pyramid pooling.
Figure 6. Structure of dual-stream improved C3D network.
Figure 7. Different categories of motion on Hollywood2 for S:M of 8:2 and 6:4.
Table 1. Eigenvector expressions for different dynamic and static feature contributions.

Dynamic and Static Ratios | Relationship Between ρ1 and ρ2 | Eigenvector
20% S, 80% M | ρ1 = 4ρ2 | C = (4/√17)M + (1/√17)S
40% S, 60% M | 2ρ1 = 3ρ2 | C = (3/√13)M + (2/√13)S
60% S, 40% M | 3ρ1 = 2ρ2 | C = (2/√13)M + (3/√13)S
80% S, 20% M | 4ρ1 = ρ2 | C = (1/√17)M + (4/√17)S
Table 2. Results of the contribution of various dynamic and static features for both datasets.

Feature Ratio | UCF101 | Hollywood2
S:M = 8:2 | 96.5% | 76.9%
S:M = 6:4 | 95.2% | 75.3%
S:M = 4:6 | 94.9% | 74.7%
S:M = 2:8 | 91.4% | 71.4%
Table 3. Effects of variable-scale training and pyramid pooling.

Training Modality | Original 3D Convolutional Neural Network | 3D Convolutional Neural Network with 3D Pyramid Pooling
Fixed scale (training from scratch) | 77.8% | 80.2%
Fixed scale (using pre-trained weights) | 82.1% | 82.4%
Variable scale (using pre-trained weights) | unsupported | 83.8%
Table 4. Comparison of accuracy and number of parameters for different layers of pyramid pooling versus raw 3D.

Network Infrastructure | Accuracy | Parameters (Millions)
Raw 3D convolutional neural network | 82.1% | 77.9
Two-level pyramid pooling | 82.7% | 54.9
Three-level pyramid pooling | 83.6% | 88.4
Table 5. Recognition rates of early fusion and late fusion in both datasets.

Fusion Strategy | UCF101 | HMDB51
Early fusion | 89.23% | 64.32%
Late fusion | 87.34% | 58.43%
Table 6. Experimental results of the UCF101 dataset.

Method | Top-1 Accuracy (%) | Top-5 Accuracy (%)
Traditional RNN | 75.2 | 92.1
3D-CNN | 84.3 | 96.3
GRU + 3D-CNN | 89.6 | 98.1
Table 7. Experimental results on the HMDB51 dataset.

Method | Top-1 Accuracy (%) | Top-5 Accuracy (%)
Traditional RNN | 61.7 | 85.5
3D-CNN | 70.2 | 90.8
GRU + 3D-CNN | 75.8 | 92.5
Table 8. Training time comparison.

Model | Training Time (Hours)
Traditional RNN | 10
3D-CNN | 12
GRU + 3D-CNN | 14
Table 9. Method comparison results.

Method | UCF101 Top-1 Accuracy (%) | HMDB51 Top-1 Accuracy (%) | Hollywood2 mAP (%) | Something-Something V1 Top-1 Accuracy (%)
C3D | 75.8 | 46.8 | 59.7 | 32.1
Dual-Stream Early Fusion | 81.2 | 50.1 | 63.4 | 35.4
Dual-Stream Late Fusion | 83.7 | 52.5 | 65.9 | 37.2
Pyramid Pooling + C3D | 85.9 | 55.3 | 68.2 | 40.8
Dual-Stream C3D (Early Fusion) | 87.1 | 56.4 | 70.5 | 42.3
Dual-Stream C3D (Late Fusion) | 88.3 | 57.8 | 71.8 | 43.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
