4.1. Datasets and Metrics
Datasets. We use the Human3.6M [21] and HumanEva-I [43] datasets for our experiments. Human3.6M is the most widely used dataset for 3D pose estimation. It contains 3.6 million images captured from four synchronized cameras at different viewpoints. The dataset comprises 11 human subjects performing 15 indoor daily activities, such as walking, talking on the phone, sitting, and participating in discussions. Precise 3D joint coordinates are captured with a motion capture system, and the corresponding 2D poses are then obtained by projection using the cameras' intrinsic and extrinsic parameters. HumanEva-I is an earlier and smaller dataset for 3D human pose estimation, with relatively simple pose estimation scenarios. We used the same training and testing split as in previous work [10,41,44,45], evaluating multiple subject actions including walking, jogging, and boxing. Following previous work [6,7,8], standard normalization is applied to the distributions of the 2D and 3D poses before they are fed to the model.
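As a concrete illustration, the sketch below shows one common form of this preprocessing, assuming resolution-based normalization of the 2D keypoints and root-relative 3D poses; the paper does not spell out its exact statistics here, so the function names and details are illustrative only.

```python
import numpy as np

def normalize_screen_coordinates(kpts_2d, width, height):
    """Map pixel coordinates to [-1, 1] on x, preserving aspect ratio on y.
    kpts_2d: (..., 2) array of (x, y) pixel coordinates (assumed layout)."""
    return kpts_2d / width * 2.0 - np.array([1.0, height / width])

def root_relative(poses_3d, root_idx=0):
    """Express 3D joint positions relative to the root (pelvis) joint."""
    return poses_3d - poses_3d[..., root_idx:root_idx + 1, :]
```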
Metrics. There are two standard protocols for evaluating our model on Human3.6M. Following previous work [6,7,8], five subjects (S1, S5, S6, S7, and S8) are used for training and two subjects (S9 and S11) for testing, evaluated under protocol 1 and protocol 2. As in previous work [3,5,7,12,46], we use two metrics on Human3.6M, one per protocol. Protocol 1 uses the mean per-joint position error (MPJPE), which measures the average Euclidean distance between the prediction and the ground truth after aligning the root joint. Protocol 2 uses the mean per-joint position error after rigid alignment (P-MPJPE), which is unaffected by rotation and scaling and is therefore more robust.
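For concreteness, the sketch below computes both metrics; the Procrustes alignment in p_mpjpe (translation, rotation, and scale via SVD) follows the standard formulation, and the array shapes are our assumption.

```python
import numpy as np

def mpjpe(pred, gt):
    """Protocol 1: mean Euclidean distance per joint, with both poses
    already aligned at the root joint. pred, gt: (N, J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """Protocol 2: MPJPE after a rigid (Procrustes) alignment of each
    predicted pose to its ground truth, removing rotation and scale."""
    errs = []
    for p, g in zip(pred, gt):
        p0, g0 = p - p.mean(0), g - g.mean(0)      # remove translation
        U, s, Vt = np.linalg.svd(p0.T @ g0)        # orthogonal Procrustes
        if np.linalg.det(U @ Vt) < 0:              # avoid reflections
            U[:, -1] *= -1
            s[-1] *= -1
        R = U @ Vt
        scale = s.sum() / (p0 ** 2).sum()          # optimal isotropic scale
        errs.append(np.linalg.norm(scale * p0 @ R - g0, axis=-1).mean())
    return float(np.mean(errs))
```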
4.2. Implementation Details
Following previous work [6,7], we use the predicted 2D keypoints released by [41], obtained from the Cascaded Pyramid Network (CPN), as the input to our 3D pose model. Since clips from the same video are strongly correlated, we sample from different video clips to avoid biased statistics for batch normalization [47].
We trained for 100 epochs using the AMSGrad optimizer [48]. An exponentially decaying learning rate scheme was employed, starting from an initial learning rate and applying a fixed shrink factor per epoch. We also set the batch size and dropout rate to 1024 and 0.2, respectively. The pose data are augmented by horizontal flipping.
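A minimal training-loop sketch of this setup follows; STUNet, train_loader, initial_lr, and shrink_factor are placeholders (the excerpt above does not give the learning-rate values), and the joint-position loss is assumed rather than quoted from the paper.

```python
import torch

model = STUNet(num_joints=17, receptive_field=243)  # hypothetical constructor
optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr,
                             amsgrad=True)          # AMSGrad variant [48]

for epoch in range(100):
    model.train()
    for batch_2d, batch_3d in train_loader:  # clips sampled from different videos;
        pred_3d = model(batch_2d)            # horizontal flips applied in the loader
        # Mean per-joint position error as the training loss (assumed).
        loss = torch.norm(pred_3d - batch_3d, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Exponential decay: shrink the learning rate by a fixed factor each epoch.
    for group in optimizer.param_groups:
        group["lr"] *= shrink_factor
```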
4.3. Experimental Results
Learned weighting matrices. The proposed U-Net architecture contains S-GCN layers at each level.
Figure 5 visualizes the learned weighting matrices in the network, including the weights for the original graph with 17 keypoints and for the fused graph with five keypoints. The weights in the upper-left region are larger than those in the lower right, meaning that central nodes have a higher impact than end nodes. In other words, keypoint information is propagated through the S-GCN along the skeleton structure, showing that the skeleton structure information is fully exploited. In addition, we observe that the weights capture the structural features of the human skeleton; for example, the head, nose, and neck have relatively fixed structural relationships, and the connection weights obtained through training are correspondingly high.
Figure 5 demonstrates that the S-GCN correctly resolves the structure of the human skeleton keypoints, thus improving 3D body pose estimation performance.
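One plausible reading of such a learnable weighting is sketched below: a per-edge weight matrix, masked by the skeleton adjacency, modulates neighbor aggregation. This is an illustrative layer in the spirit of semantic graph convolutions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SGCNLayer(nn.Module):
    """Graph convolution with a learnable edge-weighting matrix (sketch)."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("adj", adjacency)                      # (J, J) skeleton graph
        self.weighting = nn.Parameter(torch.ones_like(adjacency))   # learned per-edge weights
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):  # x: (B, J, C_in)
        # Mask learned weights with the skeleton adjacency, then
        # normalize over neighbors before aggregating features.
        w = torch.softmax(self.weighting.masked_fill(self.adj == 0, -1e9), dim=-1)
        return torch.relu(w @ self.linear(x))

# Usage on a 17-joint skeleton (adjacency with self-loops; edges omitted here):
adj = torch.eye(17)
layer = SGCNLayer(2, 64, adj)
out = layer(torch.randn(8, 17, 2))  # -> (8, 17, 64)
```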
Comparison with the state of the art.
Table 1 and
Table 2 compare our proposed STUNet model with other baselines on Human3.6M. We report the performance of our 27-frame, 81-frame, and 243-frame models to examine how performance varies with the receptive field; the best results are shown in bold. The experiments show that our 243-frame model achieves the best mean error under both protocol 1 and protocol 2. Specifically, six actions reach the optimum under protocol 1 and four actions outperform the other models under protocol 2. The results also show that the model error increases as the number of input frames decreases: the 81-frame and 27-frame models have about 0.5 mm and 1.9 mm higher average error than the 243-frame model, respectively. Although they do not surpass the current best results, they remain competitive. For protocol 1 in
Table 1, our model leads the previous best result [49] by 0.6 mm on average. Compared to the baseline model [41], our 243-frame model improves by 6.8 mm and 5.7 mm on the "Sitting" and "Sitting Down" actions, respectively, indicating that our model copes better with complex situations such as occlusion and overlap. For protocol 2 in
Table 2, our method achieves a minimum error of 35.4 mm, a reduction of 0.2 mm compared to the previous best result [49]. It is worth noting that our model has a clear advantage on highly dynamic action sequences, especially "Walk" and "Walk Together", where the error is reduced by more than 4 mm under protocol 1 compared to the base model.
The test results on the HumanEva-I dataset are shown in
Table 3, where "-" indicates that no corresponding result is reported in that work. The experiments show that we achieve the best results in seven of the nine sequence tasks across Walk, Jog, and Box. Notably, even though HumanEva-I is a relatively easy benchmark, our metrics still hold a slight lead over existing methods. We attribute this to the spatio-temporal semantic fusion, which smooths dynamic action prediction and reduces the 3D pose estimation error. Overall, our model performs strongly across the various action tasks and evaluation protocols compared with existing models.
To visualize the model output,
Figure 6 shows qualitative results of our model on multiple action sequences, namely "Walk", "Wait", "Pose", and "Purch". Following previous work [41], we use the 2D keypoint estimates of the Cascaded Pyramid Network (CPN) as input. The figure shows that our STUNet model produces stable and accurate results across these action sequences.
4.5. Ablation Study and Analysis
Granularity of spatio-temporal features. We performed ablation experiments on our models using the Human3.6M dataset. To explore the impact of spatio-temporal fusion at different granularities, we tested the 81-frame and 243-frame models and compared them using the MPJPE metric. The ablation studies are designed to investigate the effects of the order and granularity of temporal and spatial fusion.
The hyper-parameters p1 and p2 denote the positions of the two spatial graph pooling operations, which determine when spatio-temporal features are fused in the framework. At higher levels of the spatio-temporal hierarchy, features of specific keypoints are grouped and fused, and some fine-grained information is lost; this fine-grained information is reintroduced in the subsequent upsampling process through skip connections. The larger the values of p1 and p2, the later the spatio-temporal features are fused, meaning that more layers of the model process features at the fine-grained spatio-temporal scale. We achieve the best results at (p1, p2) = (1, 3) for the 81-frame model and (2, 4) for the 243-frame model; in both cases the optimal values are intermediate. It is worth noting that
Table 5 shows that our model is not sensitive to the values of p1 and p2: their effect on the results is within 1 mm, indicating strong robustness.
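To make the role of these two hyper-parameters concrete, the toy sketch below places the two spatial graph pooling steps at encoder levels p1 and p2. The first pooled group size (17 to 5 keypoints) follows the grouping described above; the second pooling size and the number of levels are our assumptions.

```python
def build_encoder_plan(num_levels, p1, p2):
    """Illustrative layout: where the two spatial graph pooling
    operations sit among the encoder levels."""
    plan = []
    for level in range(num_levels):
        plan.append(f"level {level}: temporal dilated conv + S-GCN")
        if level == p1:
            plan.append(f"level {level}: graph pooling, 17 -> 5 keypoints")
        if level == p2:
            plan.append(f"level {level}: graph pooling, 5 -> 1 keypoint (assumed)")
    return plan

# Best settings reported above (num_levels is assumed):
print(build_encoder_plan(num_levels=5, p1=1, p2=3))  # 81-frame model
print(build_encoder_plan(num_levels=5, p1=2, p2=4))  # 243-frame model
```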
U-Net architecture. This work presents a novel STUNet architecture that cascades temporal convolution layers and graph convolution layers to gradually integrate local temporal and spatial semantic features. Skip connections combine coarse-grained feature embeddings from the decoder sub-network with fine-grained spatio-temporal feature embeddings from the encoder sub-network to boost the final 3D pose estimation performance. Experiments demonstrate that skip connections are effective in recovering and fusing spatio-temporal features of different granularities.
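A minimal sketch of such a skip connection, assuming channel-first feature maps and generic upsampling/fusion modules:

```python
import torch
import torch.nn as nn

def decoder_step(decoder_feat, encoder_feat, upsample, fuse):
    """One U-Net decoder step: upsample coarse decoder features, then
    concatenate the matching fine-grained encoder features (the skip
    connection) along channels before fusing them."""
    up = upsample(decoder_feat)                    # coarse -> fine resolution
    merged = torch.cat([up, encoder_feat], dim=1)  # channel-wise skip concat
    return fuse(merged)

# Usage with generic 1D modules (shapes are illustrative):
up = nn.Upsample(scale_factor=2)
fuse = nn.Conv1d(96, 64, kernel_size=1)
out = decoder_step(torch.randn(2, 32, 40), torch.randn(2, 64, 80), up, fuse)
# out: (2, 64, 80)
```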
Spatio-temporal feature fusion. Our approach performs feature fusion and compression in both the temporal and spatial dimensions. The proposed structural temporal dilated convolution layer retains the spatially structured graph information while fusing long keypoint sequences along the temporal dimension, eliminating the jitter and ambiguity that arise in single-frame 3D pose estimation. The semantic features of the human body structure are processed by the semantic graph convolution, which yields a significant accuracy improvement over the base model.
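The sketch below illustrates this idea under our own assumptions: a 2D convolution whose kernel spans only the time axis, so joints are never mixed spatially and the graph structure is preserved for the S-GCN layers.

```python
import torch
import torch.nn as nn

class StructuralTemporalConv(nn.Module):
    """Dilated temporal convolution applied per keypoint (sketch)."""
    def __init__(self, channels, dilation):
        super().__init__()
        # Kernel (1, 3) convolves time only; the joint axis is untouched.
        self.conv = nn.Conv2d(channels, channels, kernel_size=(1, 3),
                              dilation=(1, dilation))

    def forward(self, x):  # x: (B, C, J, T)
        return torch.relu(self.conv(x))

# Usage: 17 joints, 81 frames; dilation 3 shrinks time by 2 * 3 frames.
layer = StructuralTemporalConv(channels=64, dilation=3)
out = layer(torch.randn(2, 64, 17, 81))  # -> (2, 64, 17, 75)
```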
Optimization of multi-frame input. Models with multi-frame input are usually harder to deploy in practical scenarios than single-frame models. However, we apply a dilated TCN to the temporal features, which makes inference-time optimization possible. As in other dilated TCN-based approaches [41], the temporal features are processed hierarchically, so features at different time steps need only be computed once and can then be reused at inference time. Consequently, the 243-frame input only affects the start and end of model inference, while runtime inference can be parallelized and can reuse cached features, maintaining good performance.
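A toy sketch of this reuse pattern for a single dilated layer, under our own assumptions: cached past inputs let each new frame trigger only one incremental convolution instead of reprocessing the whole multi-frame window.

```python
from collections import deque
import torch
import torch.nn as nn

# One causal dilated layer: kernel 3 with dilation 9 needs 2 * 9 past steps.
conv = nn.Conv1d(32, 32, kernel_size=3, dilation=9)
buffer = deque(maxlen=2 * 9 + 1)  # cache of past per-frame features

def step(frame_feat):  # frame_feat: (1, 32, 1) features for one new frame
    """Process one incoming frame, reusing cached past features so the
    window is never recomputed from scratch."""
    buffer.append(frame_feat)
    if len(buffer) < buffer.maxlen:
        return None                              # still filling the window
    window = torch.cat(list(buffer), dim=-1)     # (1, 32, 19)
    return conv(window)                          # one new output step
```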