1. Introduction
Three-dimensional human pose estimation is the task of predicting the 3D spatial positions of human joints from images or videos, with broad applications such as action recognition [1,2,3,4], human–computer interaction [5], augmented reality [6], and autonomous driving [7]. Many advanced approaches [8,9,10,11] solve this task by decoupling it into two subtasks: first locating 2D keypoint coordinates with a 2D pose detector and then designing a 2D-to-3D lifting network to infer the joint positions in 3D space from the 2D keypoints. Despite their impressive performance, the problem remains ill-posed, as different poses in 3D space may project to the same 2D pose due to depth ambiguity. Many works [12,13] have made significant progress in addressing this issue by employing multiple synchronized cameras to view subjects from various viewpoints. However, in contrast to monocular methods, multi-view methods impose strict requirements on equipment and the environment, which are often impractical in reality. Therefore, many approaches have begun to exploit the spatial and temporal information in monocular videos [8,9,14,15]. These models take video as the input and perceive the depth information of moving subjects using the temporal information of the sequence.
Currently, convolutional neural networks (CNNs) are widely applied to various tasks in computer vision [16,17,18,19,20,21] and have achieved relatively favorable performance. Unlike CNNs, graph convolutional networks (GCNs) are better suited to processing graph-structured data. A GCN describes the spatial relationships between joints by a hand-crafted graph adjacency matrix. The graph is built on the articulated human skeleton, where the human joints are the nodes and the bones form the edges. Cai et al. [15] designed a spatio-temporal graph model based on a graph convolutional network to estimate 3D joint positions from 2D pose sequences using spatio-temporal relations. However, GCN-based approaches have two limitations. First, every node in the graph shares one transformation matrix. This weight sharing prevents the GCN from learning distinct relationship patterns between different body joints, yet the relationships between joints differ across poses [22]. For example, in a running pose there is a close relationship between the hands and the feet, whereas in a sitting pose there is not. Such information is difficult to capture with a static skeleton graph. Liu et al. [23] addressed this problem through weight unsharing, applying different feature transformations to different nodes before aggregating the features; however, this significantly increases the size of the model. Second, GCN-based methods with small temporal receptive fields cannot model long-term dependencies across temporal sequences. In recent years, the Vision Transformer has seen widespread adoption across a broad range of computer vision tasks. Its self-attention mechanism allows flexible modeling of long-range, globally consistent information about the input sequence. Zheng et al. [8] proposed a novel approach for 3D human pose estimation from video using a Transformer-based spatio-temporal network that does not rely on convolutional architectures. It first models the intrinsic structural relationships between joints in the spatial domain and then captures the temporal consistency of the video sequence. However, this sequential network only models the static spatial relationships within each frame and ignores the effect of temporal information on the spatial structure.
Based on this, we designed a Transformer-based spatial–temporal interaction enhancement network (STFormer) for estimating 3D body poses from monocular videos. It uses the interaction of features from different domains to enhance the representation of the current domain, which is crucial for accurately predicting the positions of body joints. To accomplish this, STFormer starts with the generation of coarse temporal and spatial representations and then continuously communicates between them to eventually produce a more accurate 3D prediction. This framework extracts spatial and temporal features more effectively and also builds stronger connections between them. Specifically, in the first stage, we propose the Feature Extraction (FE) module, which consists of two branches, the Spatial Feature Extraction (SFE) branch and the Temporal Feature Extraction (TFE) branch, responsible for extracting the intrinsic spatial structure of each frame and the temporal dependencies between frames, respectively. Although spatial and temporal features are extracted in the first stage, there is no information interaction between them. Given the influence of temporal information on the spatial structure discussed above, the second stage establishes the connection between spatial and temporal information through the Cross-Domain Interaction (CDI) module. It consists of two blocks: the Spatial Reconstruction (SR) block, which injects temporal features into the spatial domain, and the Temporal Refinement (TR) block, which uses the reconstructed features to refine the temporal features. CDI captures mutual spatio-temporal correlations to construct cross-domain communication, enabling messages to be passed between spatial and temporal representations for better interaction modeling.
With the proposed STFormer, temporal and spatial features are explicitly incorporated into the Transformer model. The spatial structure of each video frame is dynamically adjusted according to the video sequence information, and this adjusted spatial structure, in turn, contributes to the temporal representation. As a result, both representations are significantly enhanced, yielding more accurate poses. Our contributions are summarized as follows:
To predict the 3D human pose more accurately from monocular videos, we designed a spatio-temporal interaction enhanced Transformer network, called STFormer. STFormer is a two-stage method: the first stage extracts features from the spatial and temporal domains independently, and the second stage exchanges spatial and temporal information across domains to enrich the representations.
In the second stage, we designed the Spatial Reconstruction block and the Temporal Refinement block. The Spatial Reconstruction block injects the temporal features into the spatial domain to adjust the spatial structure relationship; the reconstructed features are then sent to the Temporal Refinement block to complement the weaker intra-frame structural information in the temporal features.
3. Method
Figure 1 illustrates the overall framework of our STFormer. We adopted the 2D-to-3D lifting approach used in [11,14,29], taking a 2D video pose sequence as the input and predicting the 3D pose of the intermediate frame. The presented two-stage network STFormer consists of Feature Extraction (FE) and Cross-Domain Interaction (CDI). Specifically, the FE contains two branches, the Spatial Feature Extraction (SFE) branch and the Temporal Feature Extraction (TFE) branch, which model the inherent structure of the human joints and the temporal dependencies, respectively. The CDI includes a Spatial Reconstruction (SR) block and a Temporal Refinement (TR) block, which are responsible for the interaction of the information extracted in the previous stage.
3.1. Preliminary
Due to the excellent performance of the Transformer across tasks, our model also adopted a Transformer-based architecture. In this part, we briefly describe the components of the Transformer [31], including scaled dot-product attention, multi-head self-attention, and the multi-layer perceptron.
Scaled dot-product attention is shown in Figure 2a; its input consists of queries, keys, and values. First, the dot-products of the query with all keys are computed and scaled; then, the Softmax function is applied to obtain the weights, which are finally multiplied with the values to produce the output. The process is expressed formulaically as follows:

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ denote the queries, keys, and values, respectively, and $d_k$ is the dimension of the keys.
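For concreteness, here is a minimal PyTorch sketch of scaled dot-product attention; the tensor shapes are illustrative assumptions, not the paper's implementation:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, tokens, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # scaled dot-products: (batch, tokens, tokens)
    weights = torch.softmax(scores, dim=-1)        # attention weights
    return weights @ v                             # weighted sum of the values
```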
To enhance the expressive capacity of the network, a multi-head self-attention (MSA) mechanism is further employed. MSA divides the queries, keys, and values into $h$ heads, with each head performing scaled dot-product attention in parallel (Figure 2b). The information in different representation subspaces at different locations is jointly modeled by the multiple heads. After that, the outputs of all heads are concatenated and linearly projected to generate the final output:

$$\text{MSA}(Q, K, V) = \text{Concat}(H_1, \dots, H_h)\,W^{O}, \quad H_i = \text{Attention}(Q_i, K_i, V_i),$$

where $i$ is the index of the head and $W^{O}$ is the linear projection.
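A corresponding multi-head sketch, splitting the embedding into $h$ heads and linearly projecting the concatenated head outputs; the class name and sizes are our own assumptions, and the per-head attention uses PyTorch's built-in `scaled_dot_product_attention`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # joint projection producing Q, K, V
        self.proj = nn.Linear(dim, dim)      # output projection W^O

    def forward(self, x):                    # x: (batch, tokens, dim)
        B, T, D = x.shape
        h = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, tokens, head_dim)
        q, k, v = (t.view(B, T, h, D // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)  # attention per head, in parallel
        out = out.transpose(1, 2).reshape(B, T, D)     # concatenate the heads
        return self.proj(out)                          # linear projection of the result
```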
The multi-layer perceptron (MLP) consists of two linear layers and an activation function for non-linearity and feature transformation. The process is defined as

$$\text{MLP}(X) = \sigma(XW_1 + b_1)W_2 + b_2,$$

where $W_1$ and $W_2$ are the weights of the two linear layers, $b_1$ and $b_2$ are the bias terms, and $\sigma$ denotes the GELU activation function [38].
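As a sketch, the MLP is simply two linear layers around a GELU; the widths below are illustrative, not the paper's settings:

```python
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(256, 1024),  # X W_1 + b_1 (illustrative widths)
    nn.GELU(),             # sigma: GELU activation
    nn.Linear(1024, 256),  # (...) W_2 + b_2
)
```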
3.2. Feature Extraction
In the first stage of STFormer, the input is a 2D pose sequence $X \in \mathbb{R}^{N \times J \times 2}$ with $N$ video frames and $J$ joints per pose, where 2 represents the 2D coordinates of each joint. The sequence is pre-processed and fed to the Spatial Feature Extraction branch and the Temporal Feature Extraction branch to extract the respective features.
3.2.1. Spatial Feature Extraction Branch
For the spatial domain, we propose the SFE, composed of Transformer encoders, to model the intrinsic structural relationships of the human joints in each frame (see Figure 3a,c). More specifically, we treated each joint of a 2D pose as a token and used a linear projection layer to embed it into a high-dimensional space: $X_S^i \in \mathbb{R}^{J \times C}$, where $i$ denotes the $i$-th frame and $C$ is the joint embedding dimension. Then, to preserve the spatial location information, we added a learnable spatial position embedding $E_{pos}^{S} \in \mathbb{R}^{J \times C}$ to this high-dimensional feature, yielding $Z_0^i = X_S^i + E_{pos}^{S}$. Finally, $Z_0^i$ was fed into the SFE to extract spatial structural information across all joints. These processes can be described as:

$$Z_l'^{\,i} = Z_{l-1}^{i} + \text{MSA}(\text{LN}(Z_{l-1}^{i})),$$
$$Z_l^{i} = Z_l'^{\,i} + \text{MLP}(\text{LN}(Z_l'^{\,i})),$$

where $\text{LN}(\cdot)$ denotes the layer normalization layer and $l \in [1, L_1]$ is the index of the FE layers. After going through the $L_1$-layer FE module, the output of frame $i$ on the SFE branch is $S^i \in \mathbb{R}^{J \times C}$.
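A sketch of how the spatial branch could be realized with PyTorch's stock pre-norm encoder; the class name, depth, and dimensions are our own assumptions, not the official implementation:

```python
import torch
import torch.nn as nn

class SpatialFeatureExtraction(nn.Module):
    """One token per joint, per frame -> Transformer encoder (illustrative sketch)."""
    def __init__(self, num_joints=17, dim=32, depth=4, num_heads=8):
        super().__init__()
        self.embed = nn.Linear(2, dim)                            # (x, y) -> C-dim token
        self.pos = nn.Parameter(torch.zeros(1, num_joints, dim))  # learnable E_pos^S
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                           activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):             # x: (frames_as_batch, J, 2)
        z = self.embed(x) + self.pos  # embed joints, add spatial position embedding
        return self.encoder(z)        # S^i: (frames_as_batch, J, C)
```

Since each frame is encoded independently, the $N$ frames of a clip can simply be folded into the batch dimension.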
3.2.2. Temporal Feature Extraction Branch
Although the SFE can capture the intrinsic structure between joints, it ignores the temporal dependencies between video frames. To explore temporally consistent information, we propose the TFE, an SFE-like branch (see Figure 3b,c), which constructs a feature representation in the temporal domain. Unlike the operation in the spatial domain, we treated each frame of the 2D pose sequence as a token and sent it to the TFE branch to learn the global relationships within the input sequence. For the input $X \in \mathbb{R}^{N \times J \times 2}$, we first combined the coordinates of all joints in each frame, denoted as $X_T \in \mathbb{R}^{N \times (J \cdot 2)}$. Then, as with the initial spatial features, we embedded them into a high-dimensional feature $Z_0 \in \mathbb{R}^{N \times D}$ ($D$ indicates the embedding dimension per frame) and added a learnable position encoding $E_{pos}^{T} \in \mathbb{R}^{N \times D}$ to retain the frame position information. Finally, we fed the embedded features into the TFE. These procedures can be expressed as:

$$Z_l' = Z_{l-1} + \text{MSA}(\text{LN}(Z_{l-1})),$$
$$Z_l = Z_l' + \text{MLP}(\text{LN}(Z_l')),$$

where $l \in [1, L_1]$ is the index of the FE layers. After the $L_1$-layer FE, the output of the TFE branch is $T \in \mathbb{R}^{N \times D}$.
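The temporal branch differs mainly in tokenization: the $J \times 2$ coordinates of each frame are flattened into a single token. A sketch under the same assumptions as above:

```python
import torch
import torch.nn as nn

class TemporalFeatureExtraction(nn.Module):
    """One token per frame: flatten (J, 2) -> D-dim embedding (illustrative sketch)."""
    def __init__(self, num_frames=81, num_joints=17, dim=512, depth=4, num_heads=8):
        super().__init__()
        self.embed = nn.Linear(num_joints * 2, dim)               # X_T -> D-dim frame token
        self.pos = nn.Parameter(torch.zeros(1, num_frames, dim))  # learnable E_pos^T
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4,
                                           activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                        # x: (batch, N, J, 2)
        z = self.embed(x.flatten(2)) + self.pos  # combine all joint coordinates per frame
        return self.encoder(z)                   # T: (batch, N, D)
```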
3.3. Cross-Domain Interaction
The structural relationships between the joints of the body are not invariant across different states of motion [22]. For example, in the “running” pose, there is a strong connection between the hands and feet: reaching forward with the left hand is accompanied by a corresponding step of the right foot, whereas for other poses, such a relationship may not exist. Therefore, the corresponding spatial structure relationships should be modeled for each pose. Moreover, in the temporal domain, we treated each frame's pose as a token to extract temporal information, which, however, ignores the spatial structure within each frame. To tackle the above issues, we propose a Cross-Domain Interaction module consisting of the Spatial Reconstruction (SR) block and the Temporal Refinement (TR) block. The SR block injects the temporal action information into the spatial domain, allowing the network to adjust the relationships between joints based on this information. Then, the output of the SR block is converted to the temporal domain as the input to the TR block to complement the lack of spatial structure information in the temporal features. For notational convenience, within the Cross-Domain Interaction module, we use $S \in \mathbb{R}^{N \times J \times C}$ and $T \in \mathbb{R}^{N \times D}$ to denote the spatial and temporal features, respectively.
3.3.1. Spatial Reconstruction
In the first stage, we extracted features from the temporal and spatial domains independently; however, there was no interaction between them. To employ temporal action information to reconstruct the joint structural relationships, we need to inject the temporal features into the spatial domain. Before that, note that the temporal and spatial features are inconsistent: in the temporal domain, each frame of the 2D pose sequence is represented by a token of dimension $D$, while in the spatial domain, the feature dimension is $C$ and each token represents one joint. As a result, the information from the two domains cannot interact directly. To achieve cross-domain temporal information injection, we propose the SR block (see Figure 4a), which consists of a feature transforming unit (FTU), multi-head cross-attention (MCA), MSA, and MLP.

We first utilized the FTU to convert the temporal features to the spatial domain. Specifically, a linear layer converts the number of temporal feature channels from $D$ to $J \cdot C$, followed by a LayerNorm layer; finally, the channels are grouped and reshaped into $\mathbb{R}^{N \times J \times C}$. After the above operations, the temporal features are successfully transformed from the temporal domain to the spatial domain, denoted as $\widetilde{T} \in \mathbb{R}^{N \times J \times C}$. Next, $\widetilde{T}$ and the spatial feature $S$ are fed to the MCA. The MCA enables the interaction of information from different domains and has a structure similar to the MSA (see Figure 2c). The feature $S$ serves as the query, whereas the feature $\widetilde{T}$ serves as the key and value. We used the MCA to inject the temporal feature $\widetilde{T}$ into the spatial feature; the resulting feature then goes through the MSA and MLP to restructure the relationships between the joints, which can be formulated as follows:

$$S_l' = S_{l-1} + \text{MCA}(\text{LN}(S_{l-1}), \text{LN}(\widetilde{T}), \text{LN}(\widetilde{T})),$$
$$S_l'' = S_l' + \text{MSA}(\text{LN}(S_l')),$$
$$S_l = S_l'' + \text{MLP}(\text{LN}(S_l'')),$$

where $l \in [1, L_2]$ is the index of the CDI layers. After these operations, we successfully injected the temporal features into the spatial domain and obtained the final spatial reconstruction features $S' \in \mathbb{R}^{N \times J \times C}$ from the SR block following the $L_2$-layer CDI module.
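A sketch of the SR block's data flow, using `nn.MultiheadAttention` for both the cross- and self-attention; the module layout follows the description above, but all names and sizes are our own approximations:

```python
import torch
import torch.nn as nn

class SpatialReconstruction(nn.Module):
    """FTU: temporal (N, D) -> spatial (N, J, C); then MCA(query=S, key/value=FTU(T))."""
    def __init__(self, num_joints=17, c=32, d=512, num_heads=8):
        super().__init__()
        self.num_joints, self.c = num_joints, c
        self.ftu = nn.Sequential(nn.Linear(d, num_joints * c),
                                 nn.LayerNorm(num_joints * c))
        self.mca = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.msa = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(c, 4 * c), nn.GELU(), nn.Linear(4 * c, c))
        self.n1, self.n2, self.n3 = nn.LayerNorm(c), nn.LayerNorm(c), nn.LayerNorm(c)

    def forward(self, s, t):
        # s: (N, J, C) joint tokens (frames as batch); t: (N, D) frame tokens
        t = self.ftu(t).view(-1, self.num_joints, self.c)          # reshape to (N, J, C)
        s = s + self.mca(self.n1(s), t, t, need_weights=False)[0]  # inject temporal info
        y = self.n2(s)
        s = s + self.msa(y, y, y, need_weights=False)[0]           # restructure joint relations
        return s + self.mlp(self.n3(s))
```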
3.3.2. Temporal Refinement
We propose the TR block, which utilizes the output of the SR block to refine the temporal features, whose spatial structure relationships are weaker. The TR consists of an FTU, MCA, and MLP (see Figure 4b). As with the previous operation in the SR, we need to convert the spatial features to the temporal domain. Each frame of the SR output $S' \in \mathbb{R}^{N \times J \times C}$ is reshaped into a vector of dimension $J \cdot C$; the $N$ frame vectors are then concatenated, and a linear layer changes the feature dimension from $J \cdot C$ to $D$. Finally, the feature is normalized with a LayerNorm layer, denoted as $\widetilde{S} \in \mathbb{R}^{N \times D}$, and sent to the MCA along with the temporal feature $T$ to complement the weaker spatial structure relationships in the temporal domain. The TR differs from the SR in that it uses the temporal features as the query and the reconstructed spatial features as the key and value:

$$T_l' = T_{l-1} + \text{MCA}(\text{LN}(T_{l-1}), \text{LN}(\widetilde{S}), \text{LN}(\widetilde{S})),$$
$$T_l = T_l' + \text{MLP}(\text{LN}(T_l')),$$

where $l \in [1, L_2]$ is the index of the CDI layers. The output of the TR block in the last layer of the CDI module is $T' \in \mathbb{R}^{N \times D}$.
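The TR block mirrors the SR block with the roles swapped; a sketch under the same assumptions, where the SR output is flattened back to frame-level vectors and the temporal features act as the query:

```python
import torch
import torch.nn as nn

class TemporalRefinement(nn.Module):
    """FTU: spatial (N, J, C) -> temporal (N, D); then MCA(query=T, key/value=FTU(S'))."""
    def __init__(self, num_joints=17, c=32, d=512, num_heads=8):
        super().__init__()
        self.ftu = nn.Sequential(nn.Linear(num_joints * c, d), nn.LayerNorm(d))
        self.mca = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, t, s):
        # t: (1, N, D) temporal tokens; s: (N, J, C) reconstructed spatial features
        s = self.ftu(s.flatten(1)).unsqueeze(0)                    # frame vectors: (1, N, D)
        t = t + self.mca(self.n1(t), s, s, need_weights=False)[0]  # refine with spatial info
        return t + self.mlp(self.n2(t))
```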
3.4. Regression Head
In the regression head, our model learns two different linear regression functions to regress the 3D pose from the spatial and temporal domains, respectively. In the spatial domain, the spatial head is applied to $S'$ to map the dimension of each joint from $C$ to 3, generating the 3D pose of each frame $i$, where 3 denotes the joint coordinates in 3D space; the pose of the center frame is denoted as $\hat{Y}_S$. For the temporal domain, the temporal head is applied to $T'$ to regress the 3D poses of all $N$ frames; the pose of the intermediate frame is then selected and denoted as $\hat{Y}_T$. Ultimately, the final prediction of the model is the average of $\hat{Y}_S$ and $\hat{Y}_T$.
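A sketch of the two heads and the averaging step; the head names and dimensions are hypothetical:

```python
import torch
import torch.nn as nn

num_joints, c, d = 17, 32, 512
spatial_head = nn.Linear(c, 3)                 # per joint: C -> 3
temporal_head = nn.Linear(d, num_joints * 3)   # per frame: D -> J*3

def predict_center_frame(s, t):
    # s: (N, J, C) spatial features S'; t: (N, D) temporal features T'
    center = s.shape[0] // 2
    y_s = spatial_head(s[center])                       # (J, 3) from the spatial head
    y_t = temporal_head(t[center]).view(num_joints, 3)  # (J, 3) from the temporal head
    return (y_s + y_t) / 2                              # averaged final 3D pose
```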
3.5. Loss Function
We employed the mean-squared error (MSE) as the loss function of our model, which is a common choice in 3D human pose estimation. The MSE optimizes the parameters of the model by minimizing the error between the predicted and ground truth joint positions.
Spatial loss:

$$\mathcal{L}_S = \frac{1}{NJ}\sum_{n=1}^{N}\sum_{j=1}^{J}\left\| \hat{Y}_{n,j} - Y_{n,j} \right\|_2^2,$$

where $Y_{n,j}$ and $\hat{Y}_{n,j}$ are the ground truth and estimated 3D positions of joint $j$ at frame $n$, respectively. The temporal loss $\mathcal{L}_T$ is defined analogously over the predictions of the temporal head.
The final loss of the whole model during the training stage is:

$$\mathcal{L} = \lambda_S \mathcal{L}_S + \lambda_T \mathcal{L}_T,$$

where $\lambda_S$ and $\lambda_T$ are the weighting factors for the spatial and temporal losses, respectively.
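A sketch of the combined objective, assuming MSE over all frames and joints; the weight values are illustrative:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_s, pred_t, gt, lambda_s=0.5, lambda_t=0.5):
    # pred_s, pred_t, gt: (N, J, 3) predicted and ground truth 3D joint positions
    loss_s = F.mse_loss(pred_s, gt)   # spatial-head MSE, L_S
    loss_t = F.mse_loss(pred_t, gt)   # temporal-head MSE, L_T
    return lambda_s * loss_s + lambda_t * loss_t
```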
5. Conclusions
In this work, we proposed STFormer, a two-stage network for estimating 3D human poses from monocular video based on Transformer. Unlike previous sequential modeling approaches that first extract spatial structural features and then learn temporal consistency, STFormer not only extracts spatio-temporal features, but also considers the effect between information from different domains (temporal and spatial domains). The model consists of two stages: Feature Extraction (FE) and Cross-Domain Interaction (CDI). In FE (Stage 1), we proposed two similar blocks, SFE and TFE, for extracting the features from the spatial and temporal domains, respectively. However, there was no interaction between these two features. To enhance the representation of the current domain by utilizing features from other domains, we proposed SR and TR in the CDI module (Stage 2). The SR block injects temporal information into the spatial domain to adjust the spatial structure relationships, and then, the output of SR is fed to TR to compensate for the weaker spatial structure relationships of the temporal features. Detailed experiments demonstrated that the proposed STFormer has a basic benefit over purely spatial or temporal Transformer and achieved better performance than mainstream methods on benchmark datasets.