Article

Skeleton-Based Spatio-Temporal U-Network for 3D Human Pose Estimation in Video

1 Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China
2 University of Chinese Academy of Sciences, Beijing 100029, China
* Author to whom correspondence should be addressed.
Sensors 2022, 22(7), 2573; https://doi.org/10.3390/s22072573
Submission received: 22 February 2022 / Revised: 16 March 2022 / Accepted: 21 March 2022 / Published: 28 March 2022
(This article belongs to the Section Biosensors)

Abstract

Despite the great progress in 3D pose estimation from videos, there is still a lack of effective means to extract spatio-temporal features of different granularity from complex dynamic skeleton sequences. To tackle this problem, we propose a novel, skeleton-based spatio-temporal U-Net (STUNet) scheme to deal with spatio-temporal features in multiple scales for 3D human pose estimation in video. The proposed STUNet architecture consists of a cascade structure of semantic graph convolution layers and structural temporal dilated convolution layers, progressively extracting and fusing the spatio-temporal semantic features from fine-grained to coarse-grained. This U-shaped network achieves scale compression and feature squeezing by downscaling and upscaling, while abstracting multi-resolution spatio-temporal dependencies through skip connections. Experiments demonstrate that our model effectively captures comprehensive spatio-temporal features in multiple scales and achieves substantial improvements over mainstream methods on real-world datasets.

1. Introduction

Recently, 2D keypoint detection and 3D pose estimation have received increasing attention [1,2,3,4,5,6,7,8]. The difficulty with 3D pose estimation is that multiple 3D poses can map to the same 2D keypoints. Some studies address this ambiguity by modeling temporal information using recurrent neural networks (RNNs) [9,10] or graph convolutional networks (GCNs) [11,12,13]. These studies simply connect temporal-dimension features to form a spatio-temporal grid of keypoints for subsequent feature extraction and pose estimation. However, we observe two main problems with these existing methods:
  • previous methods have difficulty in handling complex, long-time sequence action features;
  • existing methods neglect the compression and fusion of features in both the temporal and spatial dimensions for human pose estimation.
First, previous RNN-based and graph-based methods have difficulty in handling complex, long-time sequence action features. A typical approach is to connect keypoints into spatio-temporal sequences based on the skeletal structure and use a recurrent neural network [9] or graph convolution network [11] for pose estimation. Recent studies have shown that temporal convolutional networks outperform traditional RNNs and other methods in modeling temporal information, for tasks such as machine translation [14], language modeling [15], speech generation [16], and speech recognition [17]. Therefore, we employ temporal convolutions to capture long-term pose information for 3D pose estimation.
Second, the existing methods neglect the compression and fusion of features in both the temporal and spatial dimensions for human pose estimation. Most of the existing work on human pose recognition focuses on the fusion of spatial skeleton-based keypoint features, such as ESR [12] and ST-GCN [13]. The difficulty lies in the need to aggregate features in both spatio-temporal dimensions to obtain intermediate representations in local space-time for pose estimation. For example, the concept of "a waving arm" is a semantic fusion of multiple keypoints and frames in local space-time. Such local spatio-temporal semantic representations may play a key role in estimating the final human pose. The downsampling and upsampling structures work as a bottleneck, encouraging the network to compress feature representations to obtain high-level semantics. The proposed spatio-temporal U-Net architecture takes full advantage of this bottleneck structure to achieve scale compression and feature squeezing through the downsampling and upsampling of spatio-temporal semantic features.
The U-Net structure [18] extracts features at different resolutions to capture visual patterns or semantics and improve algorithm performance, and it has been successfully applied in many fields such as semantic segmentation, image compression, and denoising. U-Net performs bottom-up processing through the upsampling of feature maps combined with the underlying high-resolution features, as in the stacked hourglass network for 2D pose estimation. This U-Net structure only handles different resolutions in the spatial dimension and cannot handle spatio-temporal information. Inspired by this, as shown in Figure 1, we extend the structure and propose a U-Net model for spatio-temporal information. Its purpose is to learn fused temporal and spatial semantic features for 3D human pose estimation. The data flow is bottom-up: 2D keypoints are fed into the proposed network, and 3D pose estimates are generated as output.
To the best of our knowledge, we are the first to leverage the temporal convolution and graph convolution to deal with spatio-temporal features of different granularities. Our main contributions are summarized as follows:
  • this work presents a novel spatio-temporal U-Net architecture with a cascade structure of temporal convolution layers and semantic graph convolution layers to gradually integrate the semantic features of local time and space;
  • the proposed structural temporal dilated convolution layer fuses long-time keypoint sequences in the temporal dimension to eliminate jitter and blur in 3D pose estimation in the single-frame case;
  • the proposed semantic graph convolution layer fuses the semantic features of the human body in the spatial dimension with novel graph convolution, pooling, and unpooling layers.

2. Related Work

2.1. 3D Human Pose Estimation

The 3D human pose estimation task aims to infer 3D body keypoints from a single image. Prior to the success of deep learning, most work [19,20,21,22] used feature engineering and modeling of bone and joint mobility to estimate the 3D pose. Later, convolutional neural network (CNN) methods were used for end-to-end 3D pose reconstruction [1,23,24,25]. Unlike previous model-based methods, they estimate the 3D pose directly from RGB images without intermediate supervision.
Two-step 3D pose estimation. 3D pose estimation is usually built on top of a 2D pose estimator, first using 2D pose estimation to predict 2D joint positions in image space and then lifting them to 3D [1,2,3,9,26,27]. Some work [3] shows that predicting 3D pose is relatively straightforward given real 2D keypoints, and that the quality of 2D keypoint estimates has a large impact on the final result. Some methods [1,2,28] use both image features and 2D keypoints for 3D pose estimation. Recent work [29] predicts 3D pose by predicting the depth of keypoints. There are methods [30] for 3D pose estimation that use prior knowledge about bone length and projection consistency. Some recent studies [31,32] apply transformer networks to human pose estimation tasks. A differentiable epipolar transformer network in a synchronized and calibrated multi-view setup was proposed [31], enabling the 2D detector to leverage 3D-aware features to improve 2D pose estimation. A spatial-temporal two-stream transformer network [32] models dependencies between joints using the Transformer self-attention operator. In addition, the human skeleton can be represented as a directed graph [33] to explicitly reflect the hierarchical relationships among the nodes and leverage varying non-local dependence for different poses by conditioning the graph topology on input poses.
Skeletal-based keypoint feature fusion. GCNs are introduced to learn high-level representations of relationships between nodes based on skeletal graphs. A recent study [11] designed a semantic GCN capable of capturing local and global relationships between human joints for human pose estimation. GCNs are used to learn multi-scale representations [12] to encode human skeletal joints, thereby converting 2D human joints to 3D. Most of these existing studies focus on the analysis of the spatial features of skeleton-based keypoints. Inspired by this, we extend the spatial dimensional fusion method to achieve the fusion of pose features in the spatio-temporal dimension.
Action recognition. Unlike pose estimation, which outputs 3D keypoint coordinates, action recognition directly classifies human behavior. Although the tasks differ slightly, much work in the field of action recognition also focuses on the analysis of the human skeleton structure, such as the spatial-temporal two-stream transformer network [32] mentioned above. Additionally, some work [34] explores and compares different ways of extracting human pose features and extends a TCN-like unit to extract the most relevant spatial and temporal characteristics from a sequence of frames.

2.2. Video Pose Estimation

Most previous work takes a single frame or a single image as input; more recent studies use the temporal information of video to disambiguate pose estimation, producing more reliable and robust results. Previous studies [35,36] have used LSTMs to predict 3D poses from single-image estimates. Additionally, an LSTM sequence-to-sequence learning model [37] was introduced to encode 2D pose sequences in videos into fixed-size vectors, which are then decoded into 3D pose sequences. There is also work on RNN methods that consider prior information on body part connectivity [10]. Some research [13,38,39,40] connects skeleton graphs into keypoint sequences and uses GCNs for action recognition. Further, ref. [41] uses a TCN to process pose-encoded sequences, but this method ignores the structural features of the skeleton.
Since none of the existing 3D pose estimation methods simultaneously consider feature representations at different temporal and spatial granularities, we propose a spatio-temporal U-Net scheme to learn fused spatio-temporal semantic features and perform 3D human pose estimation.

3. Skeleton-Based Spatio-Temporal U-Net

As shown in Figure 2, we propose a novel spatio-temporal U-Net scheme to deal with spatio-temporal features of different granularities for 3D human pose estimation in video. The STUNet architecture consists of a cascading structure of structural temporal convolution network (S-TCN) layers and semantic graph convolution network (S-GCN) layers to progressively integrate semantic features in local time and space.
We improve upon the underlying U-Net [18] structure, using skip connections to connect spatio-temporal features from the encoding stage to the decoding stage of each decoder layer. This structure can gradually abstract complex spatio-temporal information to obtain high-level semantic features of pose, and preserve local spatio-temporal information through skip connections. Modeling the 3D pose estimation problem as a U-Net model helps to predict more accurate 3D coordinates through high-level abstractions, and also helps to discover potential relationships between keypoints in temporal and spatial dimensions. The spatio-temporal features of different granularities are extracted and fused in the U-Net structure, which ultimately improves the accuracy of 3D human pose estimation.
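To make the overall data flow concrete, the following is a minimal sketch of such a U-shaped encoder-decoder over spatio-temporal pose tensors with skip connections. It is not the authors' exact implementation: the layer types, channel sizes, pooling factors, and the use of simple interpolation for upsampling are illustrative assumptions; only the tensor layout (batch, channels, frames, joints) and the skip-connection pattern follow the description above.
```python
# Minimal sketch of a spatio-temporal U-shaped network with skip connections.
# Tensors have shape (batch, channels, frames, joints); all sizes are illustrative.
import torch
import torch.nn as nn


class TinySTUNet(nn.Module):
    def __init__(self, in_ch=2, hid=64, out_ch=3):
        super().__init__()
        # Encoder: each stage halves the temporal resolution (stride on the time axis).
        self.enc1 = nn.Conv2d(in_ch, hid, kernel_size=(3, 1), stride=(2, 1), padding=(1, 0))
        self.enc2 = nn.Conv2d(hid, hid * 2, kernel_size=(3, 1), stride=(2, 1), padding=(1, 0))
        # Decoder: upsample back and fuse with the skip connection by concatenation.
        self.dec2 = nn.Conv2d(hid * 2 + hid, hid, kernel_size=(3, 1), padding=(1, 0))
        self.dec1 = nn.Conv2d(hid + in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0))

    def forward(self, x):                                       # x: (B, 2, T, J) 2D keypoints
        s0 = x
        s1 = torch.relu(self.enc1(x))                           # (B, hid,   ~T/2, J)
        s2 = torch.relu(self.enc2(s1))                          # (B, 2*hid, ~T/4, J)
        u2 = nn.functional.interpolate(s2, size=s1.shape[2:])   # back to encoder-1 resolution
        u2 = torch.relu(self.dec2(torch.cat([u2, s1], dim=1)))  # skip connection (concat)
        u1 = nn.functional.interpolate(u2, size=s0.shape[2:])   # back to full temporal resolution
        return self.dec1(torch.cat([u1, s0], dim=1))            # (B, 3, T, J) 3D pose


poses_2d = torch.randn(4, 2, 27, 17)                            # 27 frames, 17 joints
print(TinySTUNet()(poses_2d).shape)                             # torch.Size([4, 3, 27, 17])
```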

3.1. Structural Temporal Dilated Convolutional Layer

To improve long-range temporal perception while avoiding an excessive increase in training parameters, we employ temporally dilated convolutional layers to fuse long-term keypoint sequences in the temporal dimension and alleviate jitter and ambiguity in 3D pose estimation. A dilated convolution is a sparsely structured convolution with uniformly spaced kernel points and zeros in between. The dilated convolution of two signals f and h, with lengths N and 2M + 1 respectively, can be computed as:
(f \ast_D h)[n] = \sum_{m=-M}^{M} f[D(n - m)] \, h[m]   (1)
where D represents the dilation factor, and n and m are the indices of the signals f and h, respectively. Figure 3 shows the structure of our model in the temporal dimension, whose receptive field grows with each additional layer. In the implementation, we use an approach similar to our previous work [41]; the difference is that our S-TCN retains the skeleton information and is able to fuse temporal features from graph structures of different granularities. The proposed S-TCN incurs roughly the same computational cost as conventional convolution while enlarging the receptive field.
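The sketch below illustrates how stacking dilated 1D convolutions grows the temporal receptive field, in the spirit of Equation (1) and of the dilated TCN of [41]. The dilation base of 3, the channel width, and the absence of padding (so that 27 input frames collapse to a single output frame) are assumptions for illustration, not the exact S-TCN configuration.
```python
# Minimal sketch of a stack of dilated temporal convolutions: with kernel size 3
# and dilations 1, 3, 9 (no padding), 27 input frames are fused into one output frame.
import torch
import torch.nn as nn

channels = 64
layers = []
for level in range(3):                       # dilations 3**0, 3**1, 3**2
    layers += [
        nn.Conv1d(channels, channels, kernel_size=3, dilation=3 ** level),
        nn.ReLU(),
    ]
tcn = nn.Sequential(*layers)

x = torch.randn(1, channels, 27)             # (batch, channels, frames)
print(tcn(x).shape)                          # torch.Size([1, 64, 1]): 27 frames -> 1 frame
```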

3.2. Semantic Graph Convolutional Network

The semantic graph convolution layer fuses the semantic features of the human body in the spatial dimension with novel graph convolution, pooling, and unpooling layers. The network consists of a skeletal structure-based graph network layer and a data-dependent non-local layer in series. The structure-based graph layer is used to capture the spatial dimensional human skeletal structure information and progressively pool it into a high-level feature representation. The data-dependent non-local layer is used to analyze the features of long-range nodes since the graph convolution network does not handle long-range relationships well.

3.2.1. Structure-Based Graph Layer

In the classic GCN [13], the graph convolution operation on vertex v_i is expressed as:
f_{out}(v_i) = \sum_{v_j \in B_i} \frac{1}{Z_{ij}} f_{in}(v_j) \cdot w(l_i(v_j))   (2)
where v represents a joint vertex of the skeletal graph and f is the feature map. B_i represents the convolution sampling region of v_i, defined as the neighboring vertices v_j of the target vertex v_i; Z_{ij} is a normalizing term; and w is a weighting function that processes the input values to provide a weight vector. Using a design similar to SemGCN [11], we introduce learnable matrices that transform the traditional GCN (2) as follows:
f_{out} = \sum_{k}^{K_v} W_k (f_{in} A_k) \otimes M_k   (3)
A_k = D_k^{-\frac{1}{2}} (\tilde{A}_k + I) D_k^{-\frac{1}{2}}, \quad D_k^{ii} = \sum_j (\tilde{A}_k^{ij} + I^{ij})   (4)
where K_v is the kernel size in the spatial dimension, \tilde{A}_k is the adjacency matrix of the keypoint graph representing the connections, I is the identity matrix, W_k is the trainable weight matrix, and M_k is the learnable weighting matrix.
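As a reference, the following is a minimal sketch of a structure-based graph layer in the spirit of Equations (3)-(4), for a single partition (K_v = 1). The symmetric normalization follows Equation (4); the learnable mask M_k is applied element-wise to the normalized adjacency, which is one common way to realize Equation (3) but is an implementation assumption rather than the authors' exact code.
```python
# Minimal sketch of a graph convolution with normalized adjacency (Eq. 4),
# a trainable weight W_k, and a learnable adjacency mask M_k (cf. Eq. 3).
import torch
import torch.nn as nn


class StructureGraphConv(nn.Module):
    def __init__(self, adjacency, in_ch, out_ch):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.size(0))             # \tilde{A}_k + I
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))          # D_k^{-1/2}
        self.register_buffer("A", d_inv_sqrt @ a_hat @ d_inv_sqrt)   # Eq. (4)
        self.W = nn.Linear(in_ch, out_ch, bias=False)                # trainable W_k
        self.M = nn.Parameter(torch.ones_like(self.A))               # learnable mask M_k

    def forward(self, f_in):                 # f_in: (batch, joints, in_ch)
        # Aggregate over the masked, normalized adjacency, then apply W_k.
        return self.W((self.A * self.M) @ f_in)


# Toy 3-joint chain graph: 0-1-2 (illustrative, not the 17-joint skeleton).
A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
layer = StructureGraphConv(A, in_ch=2, out_ch=16)
print(layer(torch.randn(8, 3, 2)).shape)     # torch.Size([8, 3, 16])
```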

3.2.2. Graph Pooling and Upsampling

As shown in Figure 4, we divide the body keypoint nodes into five subsets according to the characteristics of the skeleton structure and then perform a max-pooling operation on each subset. Next, the coarsened graph is further max-pooled into a single node that contains global information for the entire skeleton. In this process, the skeleton spatial structure is gradually fused and connected with the corresponding decoding layer through the skip connections of the U-Net. During upsampling, vertex features in the graph of the same granularity are assigned to the corresponding vertices to fully preserve local spatio-temporal features.
Based on the skeletal structure, the associated neighborhoods are established on the graph to perform the pooling operation, and the semantically similar vertices are clustered together to learn the key representations based on the graph. In this work, we progressively cluster the entire skeleton at each frame according to the structure of human limbs. For the bottom-up process, we use a simple upsampling procedure to copy the features of the vertices in the coarser graph to the corresponding vertices in the fine-grained graph. These higher-level features are concatenated with the lower-level features from skip connections for subsequent processing. Furthermore, temporal connections remain unchanged across different levels of spatio-temporal abstraction.
The fusion of the spatial skeleton involves only two graph pooling steps. In contrast, as shown in Figure 3, the number of temporal fusion steps increases with the length of the input sequence; for example, the receptive fields of the 27-frame and 81-frame models are fused three and four times, respectively. Therefore, graph pooling layers are selectively inserted into the TCN layers, depending on the granularity of temporal and spatial fusion. We discuss this issue in detail in the subsequent ablation experiments.
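The subset-based pooling and the copy-based upsampling can be sketched as follows. The five-part grouping of a 17-joint skeleton shown here is a hypothetical assignment for illustration, not necessarily the exact grouping defined in Figure 4.
```python
# Minimal sketch of subset-based graph max pooling and copy-based upsampling.
import torch

# Hypothetical mapping: joint indices grouped into 5 body-part subsets
# (torso/head, left arm, right arm, left leg, right leg).
subsets = [
    [0, 7, 8, 9, 10],        # spine, thorax, neck, head
    [11, 12, 13],            # left arm
    [14, 15, 16],            # right arm
    [4, 5, 6],               # left leg
    [1, 2, 3],               # right leg
]


def graph_pool(x):                        # x: (batch, joints, channels)
    # Max over each subset -> (batch, 5, channels)
    return torch.stack([x[:, idx, :].max(dim=1).values for idx in subsets], dim=1)


def graph_unpool(pooled, num_joints=17):  # pooled: (batch, 5, channels)
    # Copy each subset feature back to its member joints -> (batch, joints, channels)
    out = pooled.new_zeros(pooled.size(0), num_joints, pooled.size(2))
    for k, idx in enumerate(subsets):
        out[:, idx, :] = pooled[:, k:k + 1, :]
    return out


x = torch.randn(2, 17, 64)
coarse = graph_pool(x)                    # (2, 5, 64)
fine = graph_unpool(coarse)               # (2, 17, 64), ready to concatenate with skip features
print(coarse.shape, fine.shape)
```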

3.2.3. Data-Dependent Non-Local Layer

Since the basic GCN has difficulty handling long-distance relationships, we design a data-dependent non-local layer to capture the global and long-distance relationships between joints in the body skeletal map. We follow the non-local [42] concept and define the operation as:
x_i^{l+1} = x_i^l + \frac{W_x}{K} \sum_{j=1}^{K} f(x_i^l, x_j^l) \cdot g(x_j^l)   (5)
where W_x is the weight matrix, f denotes the pairwise function that computes the affinity between node i and the other nodes j, g is the function that computes the node representation, and K is the number of nodes. In the implementation, we instantiate the non-local operation of [42] in the form of Equation (5).
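A minimal sketch of such a non-local layer over the joint dimension is given below, following the general recipe of [42]. The embedded-Gaussian (softmax) affinity, the linear embeddings, and the embedding sizes are assumptions; the residual form matches Equation (5).
```python
# Minimal sketch of a non-local (self-attention-like) layer over skeleton joints.
import torch
import torch.nn as nn


class NonLocalJoint(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Linear(channels, channels)   # query embedding of x_i
        self.phi = nn.Linear(channels, channels)     # key embedding of x_j
        self.g = nn.Linear(channels, channels)       # value embedding g(x_j)
        self.w_x = nn.Linear(channels, channels)     # output weight W_x

    def forward(self, x):                 # x: (batch, joints, channels)
        affinity = self.theta(x) @ self.phi(x).transpose(1, 2)   # pairwise f(x_i, x_j)
        affinity = affinity.softmax(dim=-1)                      # normalize over j
        y = affinity @ self.g(x)                                 # sum_j f(.) * g(x_j)
        return x + self.w_x(y)                                   # residual update, cf. Eq. (5)


x = torch.randn(4, 17, 64)
print(NonLocalJoint(64)(x).shape)          # torch.Size([4, 17, 64])
```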

4. Experiments and Results

4.1. Datasets and Metrics

Datasets. We use the Human3.6M [21] and HumanEva-I [43] datasets for our experiments. Human3.6M is the most widely used dataset for 3D pose estimation tasks. It contains 3.6 million images captured from different views by four synchronized cameras. The dataset consists of 11 human subjects performing 15 indoor daily activities, such as walking, talking on the phone, sitting, and participating in discussions. Precise 3D coordinates are captured with a motion capture system, and 2D poses are then obtained by projection using the intrinsic and extrinsic camera parameters. HumanEva-I is an earlier dataset for 3D human pose estimation, with a small amount of data and relatively simple pose estimation scenarios. We used the same training and testing split as in previous work [10,41,44,45], evaluating multiple subject actions including walking, jogging, and boxing. Following previous work [6,7,8], standard normalization is applied to the distributions of the 2D and 3D poses before they are input to the network.
Metrics. Two standard protocols are used to evaluate our model on Human3.6M. Following previous work [6,7,8], five subjects (S1, S5, S6, S7, and S8) are used as the training set and two subjects (S9 and S11) as the test set. As in previous work [3,5,7,12,46], we report two metrics on Human3.6M, corresponding to protocol 1 and protocol 2. Protocol 1 uses the mean per-joint position error (MPJPE), which measures the average Euclidean distance between the ground truth and the prediction after alignment of the root joint. Protocol 2 uses the mean per-joint position error after rigid alignment (P-MPJPE), so that it is not affected by rotation and scaling and is more robust.
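For reference, the sketch below computes the two metrics with NumPy. It is a standard implementation of root-aligned MPJPE and Procrustes-aligned P-MPJPE, not necessarily the exact evaluation script used in this work.
```python
# Minimal sketch of MPJPE (protocol 1) and P-MPJPE (protocol 2).
# Shapes assume (frames, joints, 3), in millimeters.
import numpy as np


def mpjpe(pred, gt, root=0):
    # Align both poses to the root joint, then average the per-joint distances.
    pred = pred - pred[:, root:root + 1]
    gt = gt - gt[:, root:root + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()


def p_mpjpe(pred, gt):
    errors = []
    for p, g in zip(pred, gt):
        # Center, then find the optimal similarity transform (rotation + scale + translation).
        p0, g0 = p - p.mean(0), g - g.mean(0)
        u, s, vt = np.linalg.svd(g0.T @ p0)
        r = u @ vt                                   # optimal rotation
        if np.linalg.det(r) < 0:                     # avoid reflections
            u[:, -1] *= -1
            s[-1] *= -1
            r = u @ vt
        scale = s.sum() / (p0 ** 2).sum()
        aligned = scale * p0 @ r.T + g.mean(0)
        errors.append(np.linalg.norm(aligned - g, axis=-1).mean())
    return float(np.mean(errors))


pred, gt = np.random.randn(2, 17, 3), np.random.randn(2, 17, 3)
print(mpjpe(pred, gt), p_mpjpe(pred, gt))
```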

4.2. Implementation Details

Following previous work [6,7], we use the predicted 2D keypoints released by [41] from the Cascaded Pyramid Network (CPN) as the input of our 3D pose model. Since there is a strong correlation between clips of the same video screen, we sample from different video clips to avoid biased statistics for batch normalization [47].
We trained for 100 epochs using the AMSGrad optimizer [48]. An exponential learning rate decay was employed, starting from η = 0.001 with a shrink factor of α = 0.95 per epoch. The batch size and dropout rate were set to 1024 and 0.2, respectively. The pose data were augmented by horizontal flipping.
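The corresponding training setup can be expressed roughly as follows. The model and the data loop are placeholders; only the optimizer choice and the learning-rate schedule mirror the settings described above.
```python
# Minimal sketch of the training configuration: AMSGrad, lr 1e-3 with 0.95 decay per epoch.
import torch

model = torch.nn.Linear(2 * 17, 3 * 17)                     # placeholder for the STUNet model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(100):
    # ... iterate over mini-batches of 1024 clips, with dropout(p=0.2) inside the model
    # and horizontal-flip pose augmentation ...
    optimizer.step()          # per-batch parameter update (shown once for brevity)
    scheduler.step()          # decay the learning rate once per epoch
print(scheduler.get_last_lr())
```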

4.3. Experimental Results

Learned weighting matrices. The proposed U-Net contains S-GCN layers at each level. Figure 5 visualizes the learned weighting matrices in the network, including the weights of the original graph with 17 keypoints and of the fused graph with five keypoints. The weights in the upper left are larger than those in the lower right, which means that the central nodes have a higher impact than the end nodes. In other words, keypoint information is passed through the S-GCN along the skeleton structure, showing that the skeleton structure information is fully utilized. In addition, we observe that the weights learn the structural features of the human skeleton. For example, the head, nose, and neck have relatively fixed structural relationships, and the connection weights obtained through training are correspondingly high. Figure 5 demonstrates that the S-GCN correctly resolves the structure of the keypoints of the human skeleton, thus improving the performance of 3D body pose estimation.
Comparison with the state of the art. Table 1 and Table 2 compare our proposed STUNet model with other baselines on Human3.6M. We report the performance of our 27-frame, 81-frame, and 243-frame models to compare their behavior under different receptive fields; bold indicates the best result. The experiments show that our 243-frame model achieves the best mean error under both protocol 1 and protocol 2. Specifically, six tasks reach the optimum under protocol 1 and four tasks outperform the other models under protocol 2. The results show that the model error increases as the number of input frames decreases. Our 81-frame and 27-frame models have average errors about 0.5 mm and 1.9 mm higher than the 243-frame model, respectively. The 81-frame and 27-frame models do not outperform the current best results, but remain relatively competitive. For protocol 1 in Table 1, our model leads the previous best result [49] by 0.6 mm on average. Compared to the baseline model [41], our 243-frame model improves by 6.8 mm and 5.7 mm on the "Sitting" and "Sitting Down" actions, respectively, indicating that our model is better able to cope with complex situations such as occlusion and overlap. For protocol 2 in Table 2, our method achieves a minimum error of 35.4 mm, a reduction of 0.2 mm compared to the previous best result [49]. It is worth mentioning that our model has a clear advantage on highly dynamic action sequences, especially the "Walk" and "Walk Together" sequences, which improve by more than 4 mm under protocol 1 compared to the base model.
The test results on the HumanEva-I dataset are shown in Table 3, where "-" indicates that no corresponding result was reported. The experiments show that we achieve the best results in seven of the nine sequence tasks across Walk, Jog, and Box. Even though HumanEva-I is a relatively easy task, our metrics still hold a slight lead over existing methods. We attribute this to the spatio-temporal semantic fusion, which smooths dynamic action predictions and reduces the 3D pose estimation error. In general, our model performs well across the various behavioral tasks and evaluation protocols compared to existing models.
To visualize the output of the model, Figure 6 shows qualitative results of our model on multiple action sequences, namely the "Walk", "Wait", "Pose", and "Purch." sequences. We follow our previous work [41] and use the 2D keypoint estimation results of the Cascaded Pyramid Network (CPN) as input. The figure shows that our STUNet model produces stable and accurate results on multiple action sequences.

4.4. Computational Complexity

As shown in Table 4, we report the number of model parameters and floating point operations (FLOPs) and compare them with previous work. The performance comparison under protocol 1 is also shown in Table 4. For the 243-frame model, with a slight increase in parameters and FLOPs, the average error of the model reaches its best value. In the 27-frame and 81-frame models, the number of parameters and computations is reduced substantially as the receptive field shrinks, while still largely maintaining a relatively competitive performance. This is attributed to the fact that we retain the skeleton graph information when performing feature fusion to improve accuracy, while the U-Net structure compresses the spatio-temporal information to control the number of parameters.

4.5. Ablation Study and Analysis

Granularity of spatio-temporal features. We performed ablation experiments on our models using the Human3.6M dataset. To explore the impact of spatio-temporal fusion at different granularities, we tested the 81-frame and 243-frame models and compared them using the MPJPE metric. The ablation studies investigate the effects of the order and granularity of temporal and spatial fusion. N_1 and N_2 denote the locations of the two spatial graph pooling operations, which determine the timing of spatio-temporal feature fusion in the framework. At higher levels of the spatio-temporal hierarchy, features of specific keypoints are grouped and fused, and some fine-grained information is lost; this fine-grained information is concatenated back in during the subsequent upsampling through skip connections. The larger the values of N_1 and N_2, the later the spatio-temporal features are fused, which means that more layers of the model process features at the fine-grained spatio-temporal level. As shown in Table 5, the 81-frame and 243-frame models achieve their best results at (1,3) and (2,4), respectively; in both cases the best N_1 and N_2 take intermediate values. Table 5 also shows that our model is not sensitive to the hyper-parameters N_1 and N_2, with an effect on the results within 1 mm, indicating strong robustness.
U-Net architecture. This work presents a novel STUNet architecture with a cascade structure of temporal convolution layers and graph convolution layers to gradually integrate the semantic features of local time and space. Skip connections combine coarse-grained feature embeddings from the decoder sub-network with fine-grained spatio-temporal feature embeddings from the encoder sub-network to boost the final 3D pose estimation performance. Experiments demonstrate that skip connections are effective in recovering and fusing spatio-temporal features of different granularities.
Spatio-temporal feature fusion. Our approach performs feature fusion and compression in both the temporal and spatial dimensions. The proposed structural temporal dilated convolution layer retains the spatially structured graph information while fusing long-time keypoint sequences in the temporal dimension to eliminate jitter and blur in 3D pose estimation in the single-frame case. The semantic features of the human body structure are processed by the semantic graph convolution, which leads to a significant improvement in the accuracy of our model compared to the base model.
Optimization of multi-frame input. Models with multi-frame input are usually harder to deploy in practical scenarios than single-frame models. However, we apply dilated TCNs to handle the temporal-dimension features, which makes inference-time optimization possible. Similar to other dilated TCN-based approaches [41], since the temporal features are processed hierarchically, features at different time steps can be reused once they have been computed at inference time. Therefore, the 243-frame input only affects the start and end of model inference, while runtime inference can be optimized through parallelism and reuse, maintaining good performance.

5. Conclusions

In this paper, we address the problem that existing methods lack effective means to extract dynamic complex features from spatio-temporal structures. We propose a novel spatio-temporal U-Net scheme to deal with spatio-temporal features in multiple scales for 3D human pose estimation in video. The proposed STUNet architecture consists of a cascade structure of semantic-structural graph convolution layers and temporal dilated convolution layers, progressively extracting and fusing the spatio-temporal semantic features from fine-grained to coarse-grained. Our method achieves competitive performance on 3D body pose estimation benchmarks. The experiments demonstrate that the U-shaped network structure optimizes 3D pose estimation through the downscaling and upscaling of spatio-temporal fusion features, and they show the importance of representing spatio-temporal features at different granularities in the pose recognition task, which opens up many possible directions for future work; for example, replacing manual hard rules with trainable graph pooling methods [50,51] to automatically fuse spatio-temporal features of different granularities for pose estimation tasks.

Author Contributions

W.L., R.D. and S.C. were all responsible for the design of the research, the implementation of the approach, and the analysis of the results. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDC02070600.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GCN: Graph Convolutional Network
TCN: Temporal Convolutional Network
CNN: Convolutional Neural Network
ST-GCN: Spatial Temporal Graph Convolutional Network

References

  1. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1263–1272. [Google Scholar]
  2. Tekin, B.; Márquez-Neila, P.; Salzmann, M.; Fua, P. Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3961–3970. [Google Scholar]
  3. Martinez, J.; Hossain, R.; Romero, J.; Little, J. A Simple Yet Effective Baseline for 3d Human Pose Estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2659–2668. [Google Scholar]
  4. Sun, X.; Shang, J.; Liang, S.; Wei, Y. Compositional Human Pose Regression. Comput. Vis. Image Underst. 2018, 176–177, 1–8. [Google Scholar]
  5. Fang, H.; Xu, Y.; Wang, W.; Liu, X.; Zhu, S.C. Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation. In Proceedings of the AAAI 2018, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  6. Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal Depth Supervision for 3D Human Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7307–7316. [Google Scholar]
  7. Yang, W.; Ouyang, W.; Wang, X.; Ren, J.S.J.; Li, H.; Wang, X. 3D Human Pose Estimation in the Wild by Adversarial Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5255–5264. [Google Scholar]
  8. Luvizon, D.C.; Picard, D.; Tabia, H. 2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5137–5146. [Google Scholar]
  9. Hossain, M.R.I.; Little, J. Exploiting temporal information for 3D pose estimation. arXiv 2017, arXiv:1711.08585. [Google Scholar]
  10. Lee, K.; Lee, I.; Lee, S. Propagating LSTM: 3D Pose Estimation Based on Joint Interdependency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  11. Zhao, L.; Peng, X.; Tian, Y.; Kapadia, M.; Metaxas, D.N. Semantic Graph Convolutional Networks for 3D Human Pose Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3420–3430. [Google Scholar]
  12. Cai, Y.; Ge, L.; Liu, J.; Cai, J.; Cham, T.; Yuan, J.; Magnenat-Thalmann, N. Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2272–2281. [Google Scholar]
  13. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. arXiv 2018, arXiv:1801.07455. [Google Scholar]
  14. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  15. Dauphin, Y.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  16. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.W.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
  17. Collobert, R.; Puhrsch, C.; Synnaeve, G. Wav2Letter: An End-to-End ConvNet-based Speech Recognition System. arXiv 2016, arXiv:1609.03193. [Google Scholar]
  18. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the MICCAI, Munich, Germany, 5–9 October 2015. [Google Scholar]
  19. Sminchisescu, C. 3D Human Motion Analysis in Monocular Video Techniques and Challenges. In Proceedings of the AVSS, Sydney, Australia, 22–24 November 2006. [Google Scholar]
  20. Ramakrishna, V.; Kanade, T.; Sheikh, Y. Reconstructing 3D Human Pose from 2D Image Landmarks. In Proceedings of the 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012. [Google Scholar]
  21. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
  22. Ionescu, C.; Carreira, J.; Sminchisescu, C. Iterated Second-Order Label Sensitive Pooling for 3D Human Pose Estimation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1661–1668. [Google Scholar]
  23. Li, S.; Chan, A.B. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network. In Proceedings of the ACCV 2014, Singapore, 1–5 November 2014. [Google Scholar]
  24. Tekin, B.; Rozantsev, A.; Lepetit, V.; Fua, P.V. Direct Prediction of 3D Body Poses from Motion Compensated Sequences. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 991–1000. [Google Scholar]
  25. Tekin, B.; Katircioglu, I.; Salzmann, M.; Lepetit, V.; Fua, P.V. Structured Prediction of 3D Human Pose with Deep Neural Networks. arXiv 2016, arXiv:1605.05180. [Google Scholar]
  26. Jiang, H. 3D Human Pose Reconstruction Using Millions of Exemplars. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 1674–1677. [Google Scholar]
  27. Chen, C.H.; Ramanan, D. 3D Human Pose Estimation = 2D Pose Estimation + Matching. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5759–5767. [Google Scholar]
  28. Park, S.; Hwang, J.; Kwak, N. 3D Human Pose Estimation Using Convolutional Neural Networks with 2D Pose Information. In Proceedings of the ECCV Workshops, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  29. Zhou, X.; Huang, Q.; Sun, X.; Xue, X.; Wei, Y. Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 398–407. [Google Scholar]
  30. Brau, E.; Jiang, H. 3D Human Pose Estimation via Deep Learning from 2D Annotations. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 582–591. [Google Scholar]
  31. He, Y.; Yan, R.; Fragkiadaki, K.; Yu, S.I. Epipolar Transformers. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7776–7785. [Google Scholar]
  32. Plizzari, C.; Cannici, M.; Matteucci, M. Spatial Temporal Transformer Network for Skeleton-based Action Recognition. In Proceedings of the ICPR Workshops, Virtual Event, 10–15 January 2021. [Google Scholar]
  33. Hu, W.; Zhang, C.; Zhan, F.; Zhang, L.; Wong, T.T. Conditional Directed Graph Convolution for 3D Human Pose Estimation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021. [Google Scholar]
  34. Nan, M.; Trascau, M.; Florea, A.M.; Iacob, C.C. Comparison between Recurrent Networks and Temporal Convolutional Networks Approaches for Skeleton-Based Action Recognition. Sensors 2021, 21, 2051. [Google Scholar] [CrossRef]
  35. Lin, M.; Lin, L.; Liang, X.; Wang, K.; Cheng, H. Recurrent 3D Pose Sequence Machines. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5543–5552. [Google Scholar]
  36. Katircioglu, I.; Tekin, B.; Salzmann, M.; Lepetit, V.; Fua, P.V. Learning Latent Representations of 3D Human Pose with Deep Neural Networks. Int. J. Comput. Vis. 2018, 126, 1326–1341. [Google Scholar] [CrossRef] [Green Version]
  37. Hossain, M.R.I.; Little, J. Exploiting Temporal Information for 3D Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  38. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3590–3598. [Google Scholar]
  39. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition with Directed Graph Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7904–7913. [Google Scholar]
  40. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12018–12027. [Google Scholar]
  41. Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  42. Wang, X.; Girshick, R.B.; Gupta, A.K.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  43. Sigal, L.; Balan, A.O.; Black, M.J. HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion. Int. J. Comput. Vis. 2009, 87, 4–27. [Google Scholar] [CrossRef]
  44. Yeh, R.A.; Hu, Y.T.; Schwing, A.G. Chirality Nets for Human Pose Regression. In Proceedings of the NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  45. Xu, J.; Yu, Z.; Ni, B.; Yang, J.; Yang, X.; Zhang, W. Deep Kinematics Analysis for Monocular 3D Human Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 896–905. [Google Scholar]
  46. Ci, H.; Wang, C.; Ma, X.; Wang, Y. Optimizing Network Structure for 3D Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2262–2271. [Google Scholar]
  47. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  48. Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. arXiv 2018, arXiv:1904.09237. [Google Scholar]
  49. Liu, R.; Shen, J.; Wang, H.; Chen, C.; Cheung, S.C.S.; Asari, V.K. Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5063–5072. [Google Scholar]
  50. Ying, R.; You, J.; Morris, C.; Ren, X.; Hamilton, W.L.; Leskovec, J. Hierarchical Graph Representation Learning with Differentiable Pooling. arXiv 2018, arXiv:1806.08804. [Google Scholar]
  51. Lee, J.; Lee, I.; Kang, J. Self-Attention Graph Pooling. arXiv 2019, arXiv:1904.08082. [Google Scholar]
Figure 1. Illustration of the proposed skeleton-based spatio-temporal U-Net. In the semantic pooling stage, spatio-temporal semantic features are gradually compressed and fused into different granularities. In the semantic upsampling phase, spatio-temporal features are decoded and multi-resolution spatio-temporal dependencies are abstracted through skip connections in the U-Net structure.
Figure 2. Overview of the proposed spatio-temporal U-Net scheme. The STUNet architecture consists of a cascade structure of semantic-structural graph convolution network (S-GCN) layers and structural temporal convolution network (S-TCN) layers to progressively integrate semantic features in local time and space. Taking 27 frames of input as an example, the model contains two layers of graph pooling in the spatial dimension and three layers of TCN compression in the temporal dimension.
Figure 3. Data flow in the proposed S-TCN model, from bottom input to top output.
Figure 4. The defined hierarchical graph pooling strategy for the human body.
Figure 5. Visualization of learned weighting matrices, M, of S-GCN in the network.
Figure 6. The visualized qualitative results of 3D pose estimation in video.
Table 1. Reconstruction error on Human3.6M under protocol 1. Results are in millimeters.
Method | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg
Pavlakos et al. [1] | 67.4 | 71.9 | 66.7 | 69.1 | 72.0 | 77.0 | 65.0 | 68.3 | 83.7 | 96.5 | 71.7 | 65.8 | 74.9 | 59.1 | 63.2 | 71.9
Fang et al. [5] | 50.1 | 54.3 | 57.0 | 57.1 | 66.6 | 73.3 | 53.4 | 55.7 | 72.8 | 88.6 | 60.3 | 57.7 | 62.7 | 47.5 | 50.6 | 60.4
Pavlakos et al. [6] | 48.5 | 54.4 | 54.4 | 52.0 | 59.4 | 65.3 | 49.9 | 52.9 | 65.8 | 71.1 | 56.6 | 52.9 | 60.9 | 44.7 | 47.8 | 56.2
Yang et al. [7] | 51.5 | 58.9 | 50.4 | 57.0 | 62.1 | 65.4 | 49.8 | 52.7 | 69.2 | 85.2 | 57.4 | 58.4 | 43.6 | 60.1 | 47.7 | 58.6
Luvizon et al. [8] | 49.2 | 51.6 | 47.6 | 50.5 | 51.8 | 60.3 | 48.5 | 51.7 | 61.5 | 70.9 | 53.7 | 48.9 | 57.9 | 44.4 | 48.9 | 53.2
Hossain et al. [9] | 48.4 | 50.7 | 57.2 | 55.2 | 63.1 | 72.6 | 53.0 | 51.7 | 66.1 | 80.9 | 59.0 | 57.3 | 62.4 | 46.6 | 49.6 | 58.3
Lee et al. [10] | 40.2 | 49.2 | 47.8 | 52.6 | 50.1 | 75.0 | 50.2 | 43.0 | 55.8 | 73.9 | 54.1 | 55.6 | 58.2 | 43.3 | 43.3 | 52.8
Pavllo et al. [41] | 45.9 | 48.5 | 44.3 | 47.8 | 51.9 | 57.8 | 46.2 | 45.6 | 59.9 | 68.5 | 50.6 | 46.4 | 51.0 | 34.5 | 35.4 | 49.0
Cai et al. [12] | 44.6 | 47.4 | 45.6 | 48.8 | 50.8 | 59.0 | 47.2 | 43.9 | 57.9 | 61.9 | 49.7 | 46.6 | 51.3 | 37.1 | 39.4 | 48.8
Yeh et al. [44] | 44.8 | 46.1 | 43.3 | 46.4 | 49.0 | 55.2 | 44.6 | 44.0 | 58.3 | 62.7 | 47.1 | 43.9 | 48.6 | 32.7 | 33.3 | 46.7
Xu et al. [45] | 37.4 | 43.5 | 42.7 | 42.7 | 46.6 | 59.7 | 41.3 | 45.1 | 52.7 | 60.2 | 45.8 | 43.1 | 47.7 | 33.7 | 37.1 | 45.6
Liu et al. [49] | 41.8 | 44.8 | 41.1 | 44.9 | 47.4 | 54.1 | 43.4 | 42.2 | 56.2 | 63.6 | 45.3 | 43.5 | 45.3 | 31.3 | 32.2 | 45.1
Ours (27 frames) | 43.5 | 44.8 | 43.9 | 44.1 | 47.7 | 56.5 | 44.0 | 44.2 | 55.8 | 67.9 | 47.3 | 46.5 | 45.7 | 33.4 | 33.6 | 46.6
Ours (81 frames) | 42.6 | 43.6 | 42.8 | 43.1 | 46.1 | 54.6 | 43.3 | 42.4 | 53.5 | 63.2 | 45.8 | 44.2 | 44.9 | 31.9 | 32.0 | 45.0
Ours (243 frames) | 41.9 | 43.1 | 42.3 | 42.9 | 46.3 | 54.2 | 42.9 | 41.8 | 53.1 | 62.8 | 45.3 | 43.9 | 43.4 | 31.2 | 31.8 | 44.5
Table 2. Reconstruction error on Human3.6M under protocol 2. Results are in millimeters.
Method | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg
Martinez et al. [3] | 39.5 | 43.2 | 46.4 | 47.0 | 51.0 | 56.0 | 41.4 | 40.6 | 56.5 | 69.4 | 49.2 | 45.0 | 49.5 | 38.0 | 43.1 | 47.7
Sun et al. [4] | 42.1 | 44.3 | 45.0 | 45.4 | 51.5 | 53.0 | 43.2 | 41.3 | 59.3 | 73.3 | 51.0 | 44.0 | 48.0 | 38.3 | 44.8 | 48.3
Fang et al. [5] | 38.2 | 41.7 | 43.7 | 44.9 | 48.5 | 55.3 | 40.2 | 38.2 | 54.5 | 64.4 | 47.2 | 44.3 | 47.3 | 36.7 | 41.7 | 45.7
Pavlakos et al. [6] | 34.7 | 39.8 | 41.8 | 38.6 | 42.5 | 47.5 | 38.0 | 36.6 | 50.7 | 56.8 | 42.6 | 39.6 | 43.9 | 32.1 | 36.5 | 41.8
Yang et al. [7] | 26.9 | 30.9 | 36.3 | 39.9 | 43.9 | 47.4 | 28.8 | 29.4 | 36.9 | 58.4 | 41.5 | 30.5 | 29.5 | 42.5 | 32.2 | 37.7
Hossain et al. [9] | 35.7 | 39.3 | 44.6 | 43.0 | 47.2 | 54.0 | 38.3 | 37.5 | 51.6 | 61.3 | 46.5 | 41.4 | 47.3 | 34.2 | 39.4 | 44.1
Pavllo et al. [41] | 34.2 | 36.8 | 33.9 | 37.5 | 37.1 | 43.2 | 34.4 | 33.5 | 45.3 | 52.7 | 37.7 | 34.1 | 38.0 | 25.8 | 27.7 | 36.8
Cai et al. [12] | 35.7 | 37.8 | 36.9 | 40.7 | 39.6 | 45.2 | 37.4 | 34.5 | 46.9 | 50.1 | 40.5 | 36.1 | 41.0 | 29.6 | 33.2 | 39.0
Xu et al. [45] | 31.0 | 34.8 | 34.7 | 34.4 | 36.2 | 43.9 | 31.6 | 33.5 | 42.3 | 49.0 | 37.1 | 33.0 | 39.1 | 26.9 | 31.9 | 36.2
Liu et al. [49] | 32.3 | 35.2 | 33.3 | 35.8 | 35.9 | 41.5 | 33.2 | 32.7 | 44.6 | 50.9 | 37.0 | 32.4 | 37.0 | 25.2 | 27.2 | 35.6
Ours (27 frames) | 34.3 | 35.7 | 34.9 | 36.6 | 37.5 | 42.7 | 33.1 | 36.0 | 44.4 | 53.7 | 38.5 | 33.5 | 38.4 | 26.0 | 28.4 | 36.9
Ours (81 frames) | 33.5 | 35.1 | 33.9 | 36.0 | 36.9 | 42.1 | 32.3 | 34.5 | 42.9 | 50.1 | 37.7 | 33.0 | 37.8 | 25.6 | 27.6 | 36.0
Ours (243 frames) | 33.3 | 34.8 | 33.6 | 35.2 | 36.3 | 42.2 | 32.1 | 33.7 | 42.6 | 49.4 | 36.9 | 32.8 | 37.4 | 25.1 | 27.2 | 35.4
Table 3. Reconstruction error on HumanEva-I dataset under protocol 2. Results are in millimeters.
Method | Walk S1 | Walk S2 | Walk S3 | Jog S1 | Jog S2 | Jog S3 | Box S1 | Box S2 | Box S3
Pavlakos et al. [6] | 22.3 | 19.5 | 29.7 | 28.9 | 21.9 | 23.8 | - | - | -
Lee et al. [10] | 18.6 | 19.9 | 30.5 | 25.7 | 16.8 | 17.7 | 42.8 | 48.1 | 53.4
Pavllo et al. [41] | 13.9 | 10.2 | 46.6 | 20.9 | 13.1 | 13.8 | 23.8 | 33.7 | 32.0
Yeh et al. [44] | 15.2 | 10.3 | 47.0 | 21.8 | 13.1 | 13.7 | 22.8 | 31.8 | 31.0
Xu et al. [45] | 13.2 | 10.2 | 29.9 | 12.6 | 12.3 | 13.0 | 13.2 | 18.1 | 20.4
Liu et al. [49] | 13.1 | 9.8 | 26.8 | 16.9 | 12.8 | 13.3 | - | - | -
Ours (243 frames) | 12.8 | 9.7 | 26.5 | 16.0 | 12.2 | 12.7 | 14.6 | 16.9 | 19.3
Table 4. Computational complexity of various models under protocol 1.
Model | Parameters | FLOPs | MPJPE (mm)
Hossain et al. [9] | 16.96 M | 33.88 M | 58.3
Pavllo (81 frames) et al. [41] | 12.75 M | 25.48 M | 47.7
Pavllo (243 frames) et al. [41] | 16.95 M | 33.87 M | 46.8
Ours (27 frames) | 14.80 M | 29.03 M | 45.8
Ours (81 frames) | 19.67 M | 38.45 M | 45.0
Ours (243 frames) | 29.58 M | 64.84 M | 44.5
Table 5. Ablation study for our 81-frame and 243-frame model under protocol 1 on Human3.6M.
Frames | N_1 | N_2 | MPJPE (mm)
81 | 1 | 2 | 45.8
81 | 1 | 3 | 45.0
81 | 2 | 3 | 45.3
243 | 1 | 3 | 45.5
243 | 1 | 4 | 45.3
243 | 2 | 3 | 44.9
243 | 2 | 4 | 44.5
243 | 3 | 4 | 44.7
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
