1. Introduction
With the rapid development of artificial intelligence technology, an increasing number of intelligent devices are being applied in industrial production and daily human life. Human motion prediction, a key technology for enhancing device intelligence, aims to capture the intrinsic temporal evolution within historical human motion sequences to generate predictions for future motion. Human motion prediction has been widely applied in fields such as autonomous driving [1], human–computer interaction [2,3], human emotion recognition [4], and human behavior analysis [5,6,7]. However, due to the high dimensionality, joint spatial collaboration, hierarchical structure, and strong temporality of human motion, capturing the temporal dynamics and spatial dependencies required for precise human motion prediction remains a challenging research hotspot.
Human motion prediction is a typical task in the computer vision field. Traditional human motion prediction algorithms, such as hidden Markov models (HMMs) [8], Gaussian process dynamic models (GPDMs) [9], and restricted Boltzmann machines [10], as shown in Figure 1, often require extensive prior knowledge and assumptions, making it challenging to capture the complexity and diversity of human motion and thus restricting their practical impact.
As more and more large-scale motion capture datasets have become available, an increasing number of deep learning models have been designed and have demonstrated excellent performance, such as convolutional neural networks (CNNs) [11], graph neural networks (GNNs) [12,13,14], and temporal modules such as recurrent neural networks (RNNs) [15,16,17,18,19,20], temporal convolutional networks (TCNs) [21,22], and attention mechanisms [23,24]. Although these deep learning models have proven effective in human motion prediction, two limitations remain:
(a) Spatial relationship modeling: In most previous studies, spatial joint graphs were designed based on the human physical structure, typically utilizing graph neural networks (GNNs) [25] to capture spatial correlations. However, GNNs are limited by the local and linear aggregation of node features and may not effectively capture the global and nonlinear dynamics of human motion. Adaptive graphs were introduced to overcome these limitations, but they still have drawbacks, such as overlooking the correlation between critical 3D coordinate channels, which results in a loss of relevant internal feature information.
(b) Simultaneously capturing complex short-term and long-term temporal dependencies: Most research has employed temporal learning components to capture temporal correlations. RNNs are a classic approach, but they face vanishing or exploding gradients when learning long sequences. More advanced models such as LSTMs and GRUs mitigate vanishing gradients to a certain degree, but they are difficult to train and lack parallel computation capability. Self-attention mechanisms [26,27] attempt to capture temporal dependencies but still struggle to effectively model long-range dependencies. TCNs [22,28] capture long-term dependencies through fixed kernel sizes, adopting an independent module framework that can only capture a single dependency relationship at one temporal scale. Fixed receptive fields limit their ability to adaptively learn multi-scale temporal dependencies.
To tackle these challenges in human motion prediction, we propose a novel method based on dual attention and multi-granularity temporal convolutional networks (DA-MgTCN). This approach effectively captures spatial correlations and multi-scale temporal dependencies. Specifically, joint attention and channel attention were combined into a dual-attention structure for extracting spatial features and capturing spatial correlations both between and within human joints. TCNs were employed to model long-term temporal dependencies, and the concept of multi-granularity was introduced into the TCN to further enhance performance. The multi-granularity TCN (MgTCN) employed convolution kernels of varying scales across multiple branches, enabling it to flexibly capture multi-scale temporal dependencies.
The MgTCN module comprised multi-granularity causal convolutions, dilated convolutions, and residual connections. Each branch of the module was composed of multiple causal convolution layers with varying dilation factors. This design enabled the adaptive selection of different receptive fields based on varying motion styles and joint trajectory features for short-term and long-term human motion prediction.
The main contributions of this paper are as follows:
(1) We designed a dual-attention model for extracting inter-joint and intra-joint spatial features, more effectively mining spatial relationships between joints and different motion styles, providing richer information sources for motion prediction.
(2) We introduced a multi-granularity temporal convolutional network (MgTCN) that employed multi-channel TCNs with different receptive fields for learning, thus achieving discriminative fusion at different time granularities, flexibly capturing complex short-term and long-term temporal dependencies, and thereby further improving the model’s performance.
(3) We conducted extensive experiments on the Human3.6M and CMU-MoCap datasets, demonstrating that our method outperformed most state-of-the-art approaches in short-term and long-term prediction, verifying the effectiveness of the proposed algorithm.
The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 details the proposed methodology. In Section 4, we describe experiments conducted on two large-scale datasets, comparing the performance of the proposed method with baselines. Section 5 provides a summary and conclusion, as well as a discussion of future work.
3. Approach
3.1. Problem Formulation
Our goal was to forecast future human posture sequences based on previous 3D human pose sequences. Three-dimensional joint positions were employed as the pose representation to prevent the ambiguity produced by the joint angle representation. A graphical representation of the human pose was created by analysing the properties of human joint positions over time. Let $X_{1:T} = [x_1, x_2, \ldots, x_T]$ represent the set of joint positions for $T$ time steps, where $x_t \in \mathbb{R}^{J \times C}$; $T$ specifies the number of input time steps, $J$ the number of human pose joints, and $C = 3$ the feature dimension $(x, y, z)$. Our goal was to anticipate the pose's future $T_f$ steps $X_{T+1:T+T_f}$. We began by copying the latest pose $x_T$ $T_f$ times to build a time series of length $T + T_f$, as described in the literature [25,45]. As a result, the goal became generating a time series of length $T + T_f$ from the input sequence $X_{1:T}$ to produce the output sequence $\hat{X}_{1:T+T_f}$, where each $\hat{x}_t \in \mathbb{R}^{J \times 3}$ designates the 3D coordinates of the $J$ body joints.
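The padding step described above can be sketched in a few lines of PyTorch. The tensor layout (batch, time, joints, coordinates) is our own illustrative assumption, not the authors' released code:

```python
import torch

def pad_last_pose(x, t_future):
    """Replicate the last observed pose t_future times and append it,
    so the network works on a sequence of length T + t_future.
    x: (B, T, J, C) batch of observed 3D pose sequences."""
    last = x[:, -1:].expand(-1, t_future, -1, -1)   # (B, t_future, J, C)
    return torch.cat([x, last], dim=1)              # (B, T + t_future, J, C)
```

The network then refines this padded sequence into the predicted future frames.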
3.2. Overview
We employed a residual depth network consisting of DA-MgTCN modules to capture the global spatial correlation and multi-scale temporal dependence of human motion. Each DA-MgTCN module consisted of a two-branch attention structure module (DA) and a multi-granularity TCN module (MgTCN) connected in series to capture the temporal dependence of historical motion sequences. The DA module was used to extract spatially significant information from the joint-level and channel-level dimensions. A combination of multi-granularity convolution and TCN was used in the MgTCN module to increase the prediction quality and adapt to varied forms of human motion and multiple temporal scales. The complete model architecture was trained end-to-end, with global and local residual connections improving the deep neural network's performance. Figure 2 shows a detailed description of the module; the specifics of the DA and MgTCN modules are provided below.
3.3. Dual Attention (DA)
The self-attention mechanism is regarded as an efficient method for modeling long-range dependencies. Tang et al. [43] and Cai et al. [44] used attention modules for information extraction along the temporal dimension and for modeling global spatial dependencies, respectively. However, we observed that the 3D coordinate information of the human joints is also crucial for spatial representations. As a result, we proposed a dual-attention module that took into account both joint-level attention and channel-level attention in order to extract joint-related and channel-related information for spatial correlation. The DA module is depicted in the lower left corner of Figure 2 and is described in detail below.
Given a human motion feature $X$, a linear transformation was first performed using the weight matrices $W_Q$, $W_K$, and $W_V$ to obtain the query $Q$, the key $K$, and the value $V$. The two branches shared the same embeddings $Q$, $K$, and $V$. The embeddings were reorganized into $\mathbb{R}^{J \times (T \cdot C)}$ (for $Q$, $K$, and $V$ in the joint branch) and $\mathbb{R}^{C \times (T \cdot J)}$ (for $Q$, $K$, and $V$ in the channel branch). The joint and channel attention were used to simultaneously mine the dependencies between joints in the spatial and channel dimensions. This was computed as follows:

$$A_{J} = \mathrm{softmax}\!\left(\frac{Q_{J} K_{J}^{\top}}{\sqrt{d_{K}}}\right) V_{J}, \qquad A_{C} = \mathrm{softmax}\!\left(\frac{Q_{C} K_{C}^{\top}}{\sqrt{d_{K}}}\right) V_{C},$$

where $K_{J}, V_{J}$ and $K_{C}, V_{C}$ represent the branch-specific deformations of the $K$ and $V$ matrices, respectively; $W_Q$, $W_K$, and $W_V$ are trainable weights; and $d_K$ is the dimension of $K$. The subscripts $J$ and $C$ denote the joint-level and channel-level branches, respectively. $A_{J}$ and $A_{C}$ are the output features of the joint-level and channel-level networks. After obtaining the joint-level and channel-level features, we summed them element-wise to obtain the spatially attended feature representation $X_{s}$ fed to the MgTCN, as shown in Equation (5):

$$X_{s} = A_{J} + A_{C}.$$
After obtaining the spatially attended feature representation of the motion data, we could feed this representation into the subsequent layers of the network. This process helped to capture joint-level and channel-level contextual information, which is crucial for effective motion prediction modeling.
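A minimal PyTorch sketch of this dual-attention idea is given below. The shared Q/K/V projections are reshaped once along the joint axis and once along the channel axis, and the two attention outputs are summed. The layer names, tensor layout, and projection sizes are our own illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    """Sketch of a dual-attention (DA) module: shared Q/K/V embeddings,
    attended once over J joint tokens and once over C channel tokens."""
    def __init__(self, joints, channels, frames):
        super().__init__()
        d = frames * channels                 # feature length per joint token
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

    @staticmethod
    def attend(q, k, v):
        # standard scaled dot-product attention
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    def forward(self, x):                     # x: (B, T, J, C)
        B, T, J, C = x.shape
        tokens = x.permute(0, 2, 1, 3).reshape(B, J, T * C)   # joint tokens
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        out_j = self.attend(q, k, v)                          # (B, J, T*C)

        # channel branch: regroup the same embeddings into C tokens of length T*J
        def regroup(z):
            return z.reshape(B, J, T, C).permute(0, 3, 2, 1).reshape(B, C, T * J)
        out_c = self.attend(regroup(q), regroup(k), regroup(v))   # (B, C, T*J)

        out_j = out_j.reshape(B, J, T, C).permute(0, 2, 1, 3)     # (B, T, J, C)
        out_c = out_c.reshape(B, C, T, J).permute(0, 2, 3, 1)     # (B, T, J, C)
        return out_j + out_c                  # element-wise sum, Equation (5)
```

The element-wise sum at the end corresponds to the fusion of the two branches into the spatially attended representation.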
3.4. Multi-Granularity TCN (MgTCN)
To learn human motion temporal features efficiently, we extended the concept of multi-granularity temporal convolution kernels to TCN networks and proposed the MgTCN for extracting temporal features at multiple scales for different motion styles. The MgTCN module is shown in the lower right corner of Figure 2 and consisted of multi-granularity causal convolution, dilated convolution, and residual blocks. The MgTCN contained three causal convolution channels, using kernels with granularity sizes of 2, 3, and 5, respectively, for feature extraction. Each channel consisted of three residual blocks connected in series. These blocks increased the receptive field with dilation rates of [1, 2, 4] and used ReLU as the activation function. In addition, a dropout unit was included in each residual block for regularization.
Causal convolution: For standard 1D convolution, the output at the $t$-th timestamp is computed from the $k$ elements of the previous layer centered around time step $t$, which is not reasonable for the human motion prediction task [31]. The goal of this research was to find the best function for generating human-like future poses based on previous motion capture sequences. As a result, the predicted pose at time step $t$ could be derived only from representations of previously observed frames and not from later poses. MgTCN's causal convolution ensured that only past data were used as the model input, preventing future information leakage. This was easily accomplished by shifting the standard convolution output by a few time steps, as shown in the equation below:

$$y_{t} = \sum_{i=0}^{k-1} w_{i} \, x_{t-i},$$

where $y_{t}$ is the output, $x_{t-i}$ are the inputs, $w_{i}$ is the convolution weight at time step $i$, and $k$ is the kernel size.
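One common way to realize this shift, sketched here in PyTorch (our own illustration, not the authors' implementation), is to left-pad the input by $(k-1) \cdot d$ steps before a standard convolution, so each output depends only on the current and earlier inputs:

```python
import torch
import torch.nn.functional as F

def causal_conv1d(x, weight, dilation=1):
    """1D causal convolution: left-pad by (k - 1) * dilation so that y_t
    depends only on x_t and earlier inputs, never on the future.
    x: (B, C_in, T); weight: (C_out, C_in, k)."""
    k = weight.shape[-1]
    x = F.pad(x, ((k - 1) * dilation, 0))           # pad the past side only
    return F.conv1d(x, weight, dilation=dilation)   # output length stays T
```

Changing any future input leaves all earlier outputs untouched, which is exactly the no-leakage property required above.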
Dilated convolution: Causal convolution alone captures historical information inadequately, since stacking layers grows the receptive field only linearly. Meanwhile, increasing the network depth substantially increases the number of parameters, making network training more difficult. Oord et al. [35] suggested using dilated convolution to extend a causal convolutional network's receptive field and better capture historical information.
Dilated convolution is implemented by introducing a dilation factor into the moving convolutional kernel. Compared to traditional deep convolutional networks, dilated convolution can obtain a larger receptive field without significantly increasing the number of parameters, thus capturing information over a longer time range. When dealing with human motion prediction tasks, this approach can focus on both local details and motion trends over a longer time span.
Dilated causal convolution can be expressed by the following equation. For a filter $f : \{0, 1, \ldots, k-1\} \to \mathbb{R}$ and $\mathbf{x} \in \mathbb{R}^{T}$ denoting the given 1D time-series input, the dilated convolution operation $F$ on element $s$ of the sequence is computed as:

$$F(s) = (\mathbf{x} *_{d} f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i},$$

where $d$ is the dilation factor, $k$ is the size of the filter, and the index $s - d \cdot i$ restricts the convolution kernel to slide only at the current position and to its left (i.e., past information).
The receptive field $R$ for a three-layer convolution is calculated as:

$$R = 1 + (k - 1)(d_{1} + d_{2} + d_{3}),$$

where $d_{1}$, $d_{2}$, and $d_{3}$ are the dilation factors of the three convolution layers.
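As a quick check of this receptive-field formula, the short sketch below evaluates it for the module's dilation rates [1, 2, 4] and the three kernel granularities; the exact per-branch fields are our own computation from the formula, not figures reported by the paper:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions:
    each layer with dilation d_i adds (kernel_size - 1) * d_i past steps."""
    return 1 + (kernel_size - 1) * sum(dilations)

# with dilation rates [1, 2, 4], the kernel granularities 2, 3, and 5
# yield receptive fields of 8, 15, and 29 time steps, respectively
fields = [receptive_field(k, [1, 2, 4]) for k in (2, 3, 5)]
```

This makes concrete how the three branches see short, medium, and long histories of the motion sequence.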
Our TCN residual block was calculated as:

$$o = \mathrm{ReLU}\big(x + F(x)\big),$$

where $F(x)$ denotes the stacked dilated causal convolutions applied to the block input $x$. Figure 3 shows an example of a three-layer dilated causal convolutional network (TCN), consisting of a series of dilated causal convolutions with dilation factors $d = 1, 2, 4$ and a fixed filter size $k$.
Multi-granularity convolution: In order to handle the complex multi-action, multi-joint prediction of the human body, MgTCN used convolutional kernel filters with different granularities to extract time-series features at different scales. This satisfied the needs of short- and long-term prediction, which require capturing time-series features of different lengths. The three time series were processed separately in MgTCN, which made it possible to combine multiple time granularities in the feature extraction process and better represent a large range of spatio-temporal features. The remaining challenge was therefore to integrate time-series data with different time granularities to obtain better results.
The MgTCN network output could be used to extract multi-temporal-granularity features (short-term and long-term) using the aforementioned spatial and temporal feature extraction steps. We combined the outputs $H_{1}$, $H_{2}$, and $H_{3}$ of the three TCN channels and used the equation below to achieve the integration of the multi-granularity information:

$$\hat{Y} = f\!\left(\sum_{i=1}^{3} \alpha_{i} H_{i}\right),$$

where $\alpha_{i}$ is a learnable parameter that adjusts the weights for the different time granularities, and $f(\cdot)$ represents a mapping function that maps the fused features to the predicted values.
With this multi-granularity temporal convolution (MgTCN) method, we could both observe the general long-term trend of human motion and capture short-term outliers. Modeling this temporal correlation improved the model's predictive power.
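The weighted fusion step can be sketched as a small PyTorch module with learnable per-branch weights followed by a linear mapping. The layer sizes and the use of a single linear layer for $f(\cdot)$ are our own illustrative assumptions:

```python
import torch
import torch.nn as nn

class GranularityFusion(nn.Module):
    """Fuse the three TCN branch outputs with learnable per-branch
    weights alpha_i, then map the fused features to predictions."""
    def __init__(self, feat_dim, out_dim, branches=3):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(branches) / branches)
        self.proj = nn.Linear(feat_dim, out_dim)   # placeholder mapping f(.)

    def forward(self, branch_feats):               # list of (B, T, feat_dim)
        fused = sum(a * h for a, h in zip(self.alpha, branch_feats))
        return self.proj(fused)
```

Because the alphas are trainable, the network can learn which temporal granularity matters most for a given motion style rather than using fixed fusion weights.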
3.5. Global and Local Residual Connection
A residual connection skips a layer of the network and adds that layer's input directly to its output. This mitigates vanishing gradients by propagating the gradient straight from later layers to earlier layers, and it simplifies representation learning in deeper network structures.
Figure 2 illustrates the use of global residual connections between the encoder and decoder modules and local residual connections in each DA-MgTCN module to enhance neural network training and deeper structural performance. This method assisted the network in capturing complex data patterns in human motion prediction.
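A local residual connection of this kind can be sketched as follows; the inner transform is a placeholder standing in for a DA-MgTCN module, not the authors' actual block:

```python
import torch
import torch.nn as nn

class LocalResidualBlock(nn.Module):
    """Minimal local residual connection: the block input is added to the
    transformed output so gradients flow directly to earlier layers."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.body(x)   # skip connection around the block
```

The global residual connection between encoder and decoder follows the same pattern at the level of the whole network.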
3.6. Loss Function
To train our DA-MgTCN model, we employed an end-to-end training strategy. The mean per joint position error (MPJPE) between the predicted motion sequence and the ground-truth motion sequence was used as the loss to measure the difference between the predicted outcomes and the true pose, defined as follows:

$$\ell = \frac{1}{N \times T} \sum_{j=1}^{T} \sum_{i=1}^{N} \left\| \hat{p}_{i,j} - p_{i,j} \right\|_{2},$$

where $N$ is the number of human joints, $T$ is the number of time steps in the future sequence, $\hat{p}_{i,j} \in \mathbb{R}^{3}$ is the prediction of the $i$th joint at the $j$th time step, and $p_{i,j}$ is the corresponding ground truth.
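In PyTorch, this MPJPE loss reduces to a per-joint Euclidean norm followed by a mean; the tensor layout below is an assumption for illustration:

```python
import torch

def mpjpe(pred, target):
    """Mean per joint position error: average Euclidean distance (in mm)
    between predicted and ground-truth 3D joint positions.
    pred, target: (..., T, N, 3)."""
    return torch.linalg.norm(pred - target, dim=-1).mean()
```

The same function serves both as the training loss and as the evaluation metric reported in the experiments.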
We optimized the loss function using an improved Adam method (AdamW [46]), which mitigates overfitting by decoupling the weight decay term from the gradient update and can significantly improve the robustness of the model.
4. Experiments
In this section, we evaluate the performance of the proposed method using two large-scale human motion capture benchmark datasets: Human3.6M and CMU-Mocap.
4.1. Datasets
Human3.6M [47] is the largest existing human motion analysis database, consisting of 7 actors (S1, S5, S6, S7, S8, S9, and S11) performing 15 actions: walking, eating, smoking, discussing, directions, greeting, phoning, posing, purchases, sitting, sitting down, taking photos, waiting, walking a dog, and walking together. Some actions are periodic, such as walking, while others are non-periodic, such as taking photos. Each pose includes 32 joints, represented in the form of an exponential map. After converting these to 3D coordinates and eliminating redundant joints, global rotation, and translation, the resulting skeleton retains 17 joints that provide sufficient human motion detail, including key joints that locate major body parts (e.g., shoulders, knees, and elbows). This strategy ensures that no crucial joints are overlooked. We downsampled the frame rate to 25 fps and used S5 and S11 for testing and validation, while the remaining five actors were used for training.
CMU-MoCap, available at http://mocap.cs.cmu.edu/ (accessed on 13 June 2023), is a 3D human motion dataset released by Carnegie Mellon University that used 12 Vicon infrared MX-40 cameras to record the positions of 41 markers attached to the human body. The dataset can be divided into six motion themes: human interaction, interaction with the environment, locomotion, physical activities and sports, situations and scenarios, and test motions. These motion themes can be further subdivided into 23 sub-motion themes. The same data preprocessing method as in the literature [25] was adopted, simplifying each human skeleton and reducing the motion rate to 25 frames per second. Furthermore, eight actions (basketball, basketball signals, directing traffic, jumping, running, soccer, walking, and washing the face) were selected from the dataset to evaluate the model's performance. No hyperparameters were tuned on this dataset, and we only used the training and testing sets, applying a splitting method consistent with common practice in the literature.
4.2. Implementation Details
All experiments in this paper were implemented using the PyTorch deep learning framework. The experimental environment was Ubuntu 20.04 with an NVIDIA A100 GPU. During training, the batch size was set to 16, and the AdamW optimizer was used to optimize the model. The initial learning rate was set to 0.003 and decayed by 5% every 5 epochs. The model was trained for 60 epochs, and each experiment was conducted three times, taking the average result to ensure a more robust evaluation of the model's performance. The input motion sequence length was 25 frames (1000 ms), and the prediction generated 25 frames (1000 ms). The choice and configuration of the relevant hyperparameters are shown in Table 1.
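The optimizer and learning-rate schedule described above can be expressed directly in PyTorch; the weight-decay coefficient is an assumed value for illustration, not one reported by the paper:

```python
import torch

model = torch.nn.Linear(10, 10)   # placeholder standing in for the DA-MgTCN network
opt = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-2)
# decay the learning rate by 5% every 5 epochs
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.95)
```

Calling `sched.step()` once per epoch reproduces the 5%-per-5-epochs decay over the 60 training epochs.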
4.3. Evaluation Metrics and Baselines
The same evaluation metrics as those used in existing algorithms [25,45] were employed to assess model performance. The standard mean per joint position error (MPJPE) was used to measure the average Euclidean distance (in millimeters, mm) between the predicted joint 3D coordinates and the ground truth, as illustrated in Equation (12). In addition, to further illustrate the advantages of the method, we conducted a comparative analysis of our method against Res. sup. [17], convSeq2Seq [11], DMGNN [13], LTD [25], LPJP [44], Hisrep [48], MSR [49], and ST-DGCN [45].
4.4. Experimental Results and Analysis
Human3.6M: Following existing work, we divided the prediction results into short-term (80–400 ms) and long-term (500–1000 ms) predictions. The experimental results are shown in Table 2, which reports the joint position error and mean error for short-term (80 ms, 160 ms, 320 ms, 400 ms) and long-term (560 ms, 1000 ms) predictions across 15 kinds of movements. Existing methods usually showed high prediction accuracy when dealing with more periodic and regular movements, such as "walking" and "eating", but their accuracy decreased significantly on more random and irregular movements, such as "directions", "posing", and "purchases". The algorithm proposed in this paper maintained high prediction accuracy even when dealing with highly complex, non-periodic, and irregular movements.
Our experimental results revealed that the proposed DA-MgTCN method outperformed most baseline methods in both short-term and long-term motion prediction, with the improvement being more significant in long-term prediction: each MPJPE index reached the optimum, with excellent results for both the 560 ms and 1000 ms metrics. This success can be attributed to the ability of DA-MgTCN to fully capture spatial correlations and multi-granularity temporal features, which was a key factor in enhancing the model's prediction accuracy.
Qualitative comparison: We visualized the results of the aforementioned motion prediction to further assess the model’s performance.
Figure 4 illustrates the visualization results for actions including “walking”, “discussion”, “posing”, and “sitting down”. The first row in every subplot shows the ground truth pose sequences (in black), followed by the predicted poses (in blue), i.e., each row displays the prediction results of one model. From the visualization results, it was observed that the predictions generated by the DA-MgTCN method showed higher similarity to the actual sequences and exhibited lower distortion and better continuity between frames. This was due to the dual-branch spatial attention and multi-granularity temporal convolution modeling joint motion trajectories, which provided richer and smoother joint motion temporal context information. The model could sufficiently capture global spatial dependencies, allowing it to encode joint information with distant hidden dependencies. For example, in the “sitting down” motion visualization, the motion between the hands and feet was more coordinated and coherent. This demonstrated once again how well the suggested DA-MgTCN forecasted very complicated irregular movements and complex periodic motions.
CMU-MoCap: To further validate the generalization of the DA-MgTCN method, we compared its performance with existing algorithms on the CMU-MoCap dataset, including Res. sup. [17], convSeq2Seq [11], DMGNN [13], LTD [25], LPJP [44], MSR [49], and ST-DGCN [45]. The experimental results are shown in Table 3, presenting the mean per joint position error and corresponding average error for short-term and long-term predictions across eight actions. From the table, it can be observed that the DA-MgTCN method's short-term and long-term prediction accuracy was significantly higher than that of the other seven existing prediction algorithms, including Cai et al.'s method [44], even when handling relatively complex non-periodic actions. Compared to the state-of-the-art ST-DGCN method, DA-MgTCN improved the average prediction accuracy by about 1.5% in short-term prediction and 3% in long-term prediction. These comprehensive experimental results once again confirmed the effectiveness and generalization capability of the DA-MgTCN method.
4.5. Ablation Study
To deeply evaluate the contribution of each component in our model, we conducted a series of ablation experiments on the Human3.6M dataset. These experiments focused on the impact of the channel-attention (channel-att) and multi-granularity (Mg) convolution modules on the model's performance. The results are shown in Table 4.
In terms of channel attention, the prediction accuracy significantly decreased when only joint attention was used without dual attention. The multi-granularity convolutional TCN module showed excellent performance in capturing long-term temporal dependence, thus improving the long-term prediction accuracy. Furthermore, when the channel-att or Mg module was removed, the error at 1000 ms increased by 1.9% and 4.0%, respectively, on the Human3.6M dataset, and by 2.9% and 4.0%, respectively, on the CMU-MoCap dataset. The best performance could be achieved by combining these two components. The multi-granularity model demonstrated better performance compared to the single-granularity model, especially for long prediction cycles. Additionally, the use of learnable weight parameters led to better prediction performance compared to fixed weights. This suggested that by designing a multi-granularity temporal structure, we could extract the temporal correlation between different time periods more effectively, thus improving the prediction performance.
Effects of the Number of DA-MgTCNs: To validate the effect of stacking multiple DA-MgTCN modules, we increased the number of DA-MgTCNs from 6 to 14 in steps of 2 and measured the prediction error and running time for both datasets, as shown in Table 5. The experimental results showed that as the number of DA-MgTCNs increased from 6 to 10, the predicted MPJPE decreased while the time cost continued to increase. With 12 or 14 DA-MgTCNs, the prediction error remained stable at a low level, but the time cost kept growing. Therefore, 10 DA-MgTCNs were used to balance prediction accuracy and operational efficiency.
In summary, the experimental results revealed the importance of the dual-attention and multi-granularity convolutional design of the DA-MgTCN method for performance improvement. Modeling joint motion trajectories with dual-branch spatial attention and multi-granularity temporal convolution provided richer and smoother temporal context information related to joint motion. This adequately modeled global spatial dependencies and enabled the model to encode joint information with distant hidden dependencies, thus improving the overall performance for both short-term and long-term motion prediction.
4.6. Limitations
In addition to the qualitative results presented in Figure 4, challenging cases encountered by the DA-MgTCN model were also investigated. Figure 5 illustrates an example of a predicted skeleton for the "walking a dog" action. Evidently, the last few frames did not perfectly align with the ground-truth pose. This misalignment resulted from the high degree of uncertainty inherent in human motion, where a series of past poses can suggest various possible future outcomes, making long-term dependencies between joints and frames more difficult to predict. Furthermore, the experiments were constrained by the available data scenarios and experimental conditions, which may limit the scope of our algorithm's validation. In the future, we will consider motion prediction in more intricate scenarios and investigate novel methods for multi-granularity human motion prediction in multi-domain contexts, with the aim of enhancing the adaptability and performance of the model.