Pose action normalized feature is a 3D human pose-based action recognition feature description operator that can effectively remove scene interference from the description of body posture differences. It is mainly composed of the normalized joint series and the key angle change series. For an action video sequence G of length t, it can be expressed as:

$$G = \{g_1, g_2, \ldots, g_t\},$$

where $g_j$ represents a frame containing the human body. After 3D pose estimation, G yields the 3D pose sequence P:

$$P = \{p_{i,j} \mid i = 1, \ldots, v;\ j = 1, \ldots, t\},$$

where $p_{i,j}$ represents the position coordinate of joint point i in frame j, and v is the number of joint points; in this paper, v = 17. After transformation, P yields v joint series $P_i$, which can be expressed as:

$$P_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,t}), \quad i = 1, \ldots, v.$$

After the pose action normalized feature estimation process, FOLLOWER obtains the pose action normalized feature set F, which is composed of the v normalized joint series together with the key angle change series.
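To make these definitions concrete, the following minimal sketch (the array shapes and variable names are our own, not from the paper) represents the 3D pose sequence as a NumPy array and extracts the per-joint series:

```python
import numpy as np

# Hypothetical sizes for illustration: t frames, v = 17 joints, 3D coordinates.
t, v = 120, 17

# P[j, i] holds the (x, y, z) coordinate p_{i,j} of joint i in frame j.
P = np.zeros((t, v, 3), dtype=np.float32)

def joint_series(P: np.ndarray, i: int) -> np.ndarray:
    """Return the length-t series of 3D positions of joint i."""
    return P[:, i, :]  # shape (t, 3)

# The v joint series together are the raw material from which the
# normalized feature set F is later built.
all_series = [joint_series(P, i) for i in range(v)]
```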
2.2.1. 3D Human Pose Estimation
To guarantee performance and efficiency, 3D human pose estimation is composed of the human detector Yolo-v3 [24], the 2D human pose estimator Pruned HRnet, and the 2D-to-3D pose estimator VideoPose3D [25]. Pruned HRnet is a lightweight model we designed to improve the real-time performance of the algorithm; it is obtained from HRnet using the channel pruning method with self-determined pruning parameters. Yolo-v3 and VideoPose3D use pre-trained models directly.
As shown in Figure 2, we utilize a video action sequence G to generate the 3D pose series P. Yolo-v3 generates the detection box H of the human body, from which Pruned HRnet generates the 2D human pose S. VideoPose3D then performs 3D pose estimation over adjacent multi-frame sequences and finally obtains the 3D pose series P.
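A minimal sketch of this three-stage pipeline is given below; the callables detect_person, estimate_2d_pose, and lift_to_3d are hypothetical stand-ins for Yolo-v3, Pruned HRnet, and VideoPose3D, not the actual APIs of those projects.

```python
import numpy as np

def pose_pipeline(frames, detect_person, estimate_2d_pose, lift_to_3d):
    """Sketch of G -> H -> S -> P under assumed callables.

    frames:           iterable of images (the action sequence G)
    detect_person:    image -> bounding box H          (stand-in for Yolo-v3)
    estimate_2d_pose: (image, box) -> 2D pose S        (stand-in for Pruned HRnet)
    lift_to_3d:       2D pose sequence -> 3D poses     (stand-in for VideoPose3D)
    """
    poses_2d = []
    for frame in frames:
        box = detect_person(frame)                      # detection box H
        poses_2d.append(estimate_2d_pose(frame, box))   # 2D pose S per frame
    # VideoPose3D-style lifting uses temporal context over adjacent frames,
    # so the whole 2D sequence is passed at once.
    return lift_to_3d(np.stack(poses_2d))               # 3D pose series P, e.g. (t, 17, 3)
```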
To lighten the algorithm, we prune the original HRnet model. In the original HRnet [26], to achieve reliable high-resolution representations, the algorithm connects multiple high-resolution subnetworks in parallel and performs various multi-scale fusions, which leads to a complicated model structure and makes effective pruning difficult.
The essential components of HRnet can be divided into the BasicBlock, the Bottleneck, and the multi-scale fusion layer. The multi-scale fusion layer contains more information and fewer parameters, so it is not pruned during the pruning process. The structures of BasicBlock and Bottleneck are shown in Figure 3; both are residual structures. The convolution operation before the Add operation is not pruned because it is involved in multi-scale fusion. The remaining layers are pruned, and the pruning area is shown in the red box in Figure 3.
The pruning strategy draws on the network slimming algorithm [27]. The batch-normalization layer (BN layer) coefficient $\gamma$ is directly used as the criterion for measuring the importance of a channel. The function of the BN layer is generally expressed as:

$$\hat{z} = \frac{z_{in} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}, \qquad z_{out} = \gamma \hat{z} + \beta.$$

Each channel C corresponds to a set of $\gamma$ and $\beta$: $\gamma$ represents the scale parameter, $\beta$ represents the shift parameter, and $\hat{z}$ represents the normalized activation. The higher the value of $\gamma$ is, the more important C is. Given a pruning rate $\rho$, each $\gamma$ in the BN layers that need to be pruned is collected and sorted in ascending order to obtain the sequence L, whose length is m. The global threshold $t_g$ is:

$$t_g = L_{\lceil \rho \cdot m \rceil}.$$

For a BN layer to be pruned, its $\gamma$ sequence is B, from which a local threshold $t_l$ is obtained. Within a BN layer, a channel C whose $\gamma$ is smaller than the threshold is the target of pruning.
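Assuming a PyTorch implementation, the global threshold step can be sketched as follows; treating the BatchNorm weight as $\gamma$ follows network slimming, while the choice of prunable layers and the handling of the local threshold are simplified here.

```python
import torch
import torch.nn as nn

def global_gamma_threshold(model: nn.Module, prune_rate: float) -> float:
    """Collect |gamma| from all BN layers considered prunable, sort them,
    and return the value below which `prune_rate` of the channels fall."""
    gammas = []
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            # BatchNorm2d.weight is the per-channel scale parameter gamma.
            gammas.append(module.weight.detach().abs().clone())
    L = torch.sort(torch.cat(gammas)).values       # sequence L, length m
    m = L.numel()
    idx = min(int(prune_rate * m), m - 1)
    return L[idx].item()                           # global threshold t_g

def prune_mask(bn: nn.BatchNorm2d, threshold: float) -> torch.Tensor:
    """Channels whose gamma falls below the threshold become pruning targets."""
    return bn.weight.detach().abs() >= threshold   # True = keep channel
```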
The pruning rate $\rho$ is selected by plotting a scatterplot of the pruning rate $\rho$ against the performance indexes of the pruned model on the COCO val2017 data set [28], so as to maximize the pruning rate $\rho$ while keeping the performance indexes at acceptable levels.
We sample 50 candidate pruning rates $\rho$ at equal intervals and obtain 50 candidate Pruned HRnet models by pruning. After testing, we obtain the two corresponding performance indexes of each candidate model and collect them into two sets, from which the abscissa x and the ordinate y of the scatterplot are constructed.
By drawing the scatterplot of y versus x, we find that the distribution of the scatters takes the form of a convex function, so the model corresponding to the inflection point is selected as Pruned HRnet. Simultaneously, according to the size of the model, we also conducted pruning training with a higher pruning rate and achieved a higher degree of model compression.
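The sweep-and-select procedure might be scripted roughly as in the sketch below; the evaluation function, the exact definitions of x and y, and the knee-point criterion are assumptions for illustration rather than the paper's formulas.

```python
import numpy as np

def select_pruning_rate(candidate_rates, build_pruned_model, evaluate):
    """Sweep candidate pruning rates, score each pruned model, and pick the
    knee of the accuracy-vs-compression curve.

    build_pruned_model: rate -> pruned model         (assumed helper)
    evaluate:           model -> (accuracy, cost)    (e.g., AP and FLOPs; assumed)
    """
    points = []
    for rate in candidate_rates:
        model = build_pruned_model(rate)
        accuracy, cost = evaluate(model)
        points.append((cost, accuracy, rate))        # x = cost, y = accuracy (assumed)

    points.sort()                                    # ascending x
    xs = np.array([p[0] for p in points])
    ys = np.array([p[1] for p in points])
    # Simple knee heuristic: point farthest from the line joining the endpoints.
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    dists = np.abs((y1 - y0) * xs - (x1 - x0) * ys + x1 * y0 - y1 * x0)
    knee = int(np.argmax(dists))
    return points[knee][2]                           # pruning rate at the knee

# Example: rates = np.linspace(0.05, 0.95, 50); best = select_pruning_rate(rates, ...)
```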
We select w32-256x192-HRnet and w48-384x288-HRnet as the original models and sample 50 candidate pruning rates $\rho$ at equal intervals to prune them. The candidate models are quickly tested on the validation set to obtain the corresponding performance indicators. The scatter plot is shown in Figure 4.
We selected four candidate models for training: w48-best, w48-extreme, w32-best, and w32-extreme. After training, we take the model that best balances accuracy and computational complexity as the 2D pose estimation model.
2.2.2. Pose Action Normalized Feature Estimation
Pose action normalized feature estimation mainly includes coordinate transform, scale transform, and key angle calculation. Among them, coordinate transform and scale transform correspond to the function Norm.
The visual representation of the 3D pose is shown in Figure 5. There are 17 joints in one pose.
The coordinates of each joint generated by 3D human pose estimation are absolute coordinates relative to the original coordinate system of the pose estimator. As this coordinate system changes, the values of the joint coordinates also change. Therefore, a coordinate transform is needed to obtain coordinate descriptions that are independent of it. In FOLLOWER, the revised human body central point, computed with the midpoint calculation function Mid, is selected as the coordinate origin.
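A sketch of the coordinate transform is given below; which joints define the revised human body central point is an assumption here (the midpoint of the two hip joints), since we only know that Mid is a midpoint function.

```python
import numpy as np

# Hypothetical joint indices for illustration; the actual indices depend on the
# 17-joint layout used by the 3D pose estimator.
LEFT_HIP, RIGHT_HIP = 11, 12

def mid(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Midpoint calculation function Mid."""
    return (a + b) / 2.0

def center_pose(pose: np.ndarray) -> np.ndarray:
    """Shift a (17, 3) pose so the revised body central point is the origin."""
    origin = mid(pose[LEFT_HIP], pose[RIGHT_HIP])  # assumed definition of the center
    return pose - origin
```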
Affected by human body shape and shooting location, the human body postures generated by 3D human pose estimation show great differences in scale. Therefore, this paper designs a normalization method based on the human body scale. As shown in Figure 5, the pose is divided into 11 blocks, from which the staff r is computed. S and E represent the coordinate pair sets of the connection relationships between the joints, whose length is e; the line between $S_k$ and $E_k$ represents a skeleton segment. From this, we can obtain the normalization function Norm of the joints.
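The scale transform could then look like the following sketch; computing the staff r as the total length of the e skeleton segments defined by S and E is our reading of the text, not a formula reproduced from the paper.

```python
import numpy as np

def staff(pose: np.ndarray, S: list, E: list) -> float:
    """Body-scale staff r: total length of the e skeleton segments (assumed)."""
    return float(sum(np.linalg.norm(pose[s] - pose[e]) for s, e in zip(S, E)))

def norm(pose: np.ndarray, origin: np.ndarray, S: list, E: list) -> np.ndarray:
    """Norm = coordinate transform (re-centering) + scale transform (divide by r)."""
    r = staff(pose, S, E)
    return (pose - origin) / r
```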
At the same time, we consider that some key angle information can also describe human limb movements. Based on this, as shown in Figure 5, we select nine key angles as features to represent the changes of the legs, arms, and torso during the progress of the action, which are used for motion threshold analysis and action filtering.
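Each key angle can be obtained from three joints as the angle at the middle joint; the concrete joint triplets of the nine angles follow Figure 5 and are not listed here, so the sketch below shows only the generic computation.

```python
import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle (in radians) at joint b formed by the segments b->a and b->c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Illustrative example: a knee angle from hypothetical hip/knee/ankle indices.
# knee_angle = joint_angle(pose[HIP], pose[KNEE], pose[ANKLE])
```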