Efficient Human Motion Prediction in 3D Using Parallel Enhanced Attention with Sparse Spatiotemporal Modeling
Abstract
1. Introduction
- We propose a novel parallel enhanced attention (PEA) module with a dual-branch architecture covering the channel and temporal dimensions. It dynamically perceives joint importance by integrating soft pooling and depthwise separable convolution, which enhances key-frame localization while preserving sensitivity to multi-scale spatiotemporal features.
- We design an STSSA module that combines a single-head compression architecture with adaptive sparsification. This reduces the number of model parameters and roughly triples inference speed with only a minimal loss in accuracy.
- We propose an efficient, lightweight, end-to-end diffusion model for human motion prediction that achieves competitive results on the Human3.6M, HumanEva-I, and AMASS datasets and offers a practical solution for real-world applications.
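
The exact PEA and STSSA formulations are not reproduced in this section, so the following is only a minimal, dependency-free sketch of two ideas the contributions name: soft pooling (a softmax-weighted average of activations) and sparsified single-head attention. Fixed top-k selection is used here as a simple stand-in for adaptive sparsification, and the function names `soft_pool` and `sparse_single_head_attention` are hypothetical, not taken from the paper.

```python
import math


def soft_pool(activations):
    """Soft pooling: sum_i softmax(a)_i * a_i over a list of activations.

    The result lies between the mean and the max, so strong responses
    dominate without discarding the rest.
    """
    peak = max(activations)  # subtract the max for numerical stability
    weights = [math.exp(a - peak) for a in activations]
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, activations)) / total


def sparse_single_head_attention(queries, keys, values, top_k):
    """Single-head scaled dot-product attention keeping top_k keys per query.

    ASSUMPTION: fixed top-k is an illustrative substitute for the adaptive
    sparsification described in the text.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Indices of the top_k highest-scoring keys for this query.
        kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        peak = max(scores[i] for i in kept)
        weights = {i: math.exp(scores[i] - peak) for i in kept}
        total = sum(weights.values())
        # Weighted sum of the values that survived sparsification.
        outputs.append([
            sum(weights[i] / total * values[i][c] for i in kept)
            for c in range(len(values[0]))
        ])
    return outputs
```

With `top_k` equal to the number of keys, the function reduces to ordinary dense attention; smaller values of `top_k` skip the softmax and value aggregation for the discarded keys, which is where a sparse variant would save computation.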
2. Related Works
2.1. Diffusion Models
2.2. Diffusion-Based Human Motion Prediction
2.3. Attention Mechanisms in Human Motion Analysis
3. Methodology
3.1. Problem Definition and Notations
3.2. Transformer-Based Diffusion Model
3.3. Model Structure
3.3.1. PEA
3.3.2. STSSA
3.3.3. Historical Information Guiding Mechanisms
Algorithm 1: Training Procedure
Algorithm 2: Inference Procedure
4. Results and Discussion
4.1. Datasets and Evaluation Metrics
4.2. Baselines
4.3. Mathematical Model Validation
4.4. Comparison with State-of-the-Art Models and Discussion
4.5. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Dataset | Human3.6M | | | | | HumanEva-I | | | | | AMASS | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | APD↑ | ADE↓ | FDE↓ | MMADE↓ | MMFDE↓ | APD↑ | ADE↓ | FDE↓ | MMADE↓ | MMFDE↓ | APD↑ | ADE↓ | FDE↓ | MMADE↓ | MMFDE↓ |
DeLiGAN | 6.509 | 0.483 | 0.534 | 0.520 | 0.545 | 2.177 | 0.306 | 0.322 | 0.385 | 0.371 | - | - | - | - | - |
DLow | 11.741 | 0.425 | 0.518 | 0.495 | 0.531 | 4.855 | 0.251 | 0.268 | 0.362 | 0.339 | 13.170 | 0.590 | 0.612 | 0.618 | 0.617 |
DivSamp | 15.310 | 0.370 | 0.485 | 0.475 | 0.516 | 6.109 | 0.220 | 0.234 | 0.342 | 0.316 | 24.724 | 0.564 | 0.647 | 0.623 | 0.667 |
BoM | 6.265 | 0.448 | 0.533 | 0.514 | 0.544 | 2.846 | 0.271 | 0.279 | 0.373 | 0.351 | - | - | - | - | - |
DSF | 9.330 | 0.493 | 0.592 | 0.550 | 0.599 | 9.330 | 0.493 | 0.592 | 0.550 | 0.590 | - | - | - | - | - |
MOJO | 12.579 | 0.412 | 0.514 | 0.497 | 0.538 | 4.181 | 0.234 | 0.244 | 0.369 | 0.347 | - | - | - | - | - |
GSPS | 14.757 | 0.389 | 0.496 | 0.476 | 0.525 | 5.825 | 0.233 | 0.244 | 0.343 | 0.331 | 12.465 | 0.563 | 0.613 | 0.609 | 0.633 |
Motron | 7.168 | 0.375 | 0.488 | - | - | 7.168 | 0.375 | 0.488 | - | - | - | - | - | - | - |
MotionDiff | 15.353 | 0.411 | 0.509 | 0.508 | 0.536 | 5.931 | 0.232 | 0.236 | 0.352 | 0.320 | - | - | - | - | - |
TransFusion | 5.975 | 0.358 | 0.468 | 0.506 | 0.539 | 1.031 | 0.204 | 0.234 | 0.408 | 0.427 | 8.853 | 0.508 | 0.568 | 0.589 | 0.606 |
HumanMAC | 6.301 | 0.369 | 0.480 | 0.509 | 0.545 | 6.554 | 0.209 | 0.223 | 0.342 | 0.335 | 9.321 | 0.511 | 0.554 | 0.593 | 0.591 |
PEA-STDiff | 6.185 | 0.365 | 0.470 | 0.502 | 0.535 | 1.632 | 0.208 | 0.229 | 0.404 | 0.413 | 9.211 | 0.521 | 0.552 | 0.580 | 0.589 |
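
For readers unfamiliar with the column names, the sketch below shows how APD, ADE, FDE, and their multimodal variants are commonly computed in the diverse motion prediction literature. It assumes each predicted sample is a list of frames and each frame is a flat list of joint coordinates; the function names are illustrative, and the paper's own evaluation code may differ in details such as units or joint selection.

```python
import math
from itertools import combinations


def apd(samples):
    """Average Pairwise Distance: mean L2 distance between all pairs of
    predicted samples (higher means more diverse predictions)."""
    flat = [[c for frame in s for c in frame] for s in samples]
    dists = [math.dist(a, b) for a, b in combinations(flat, 2)]
    return sum(dists) / len(dists)


def ade(samples, gt):
    """Average Displacement Error: per-frame L2 error averaged over time,
    taken for the sample closest to the ground truth (lower is better)."""
    return min(
        sum(math.dist(p, g) for p, g in zip(s, gt)) / len(gt)
        for s in samples
    )


def fde(samples, gt):
    """Final Displacement Error: L2 error at the last frame for the
    closest sample (lower is better)."""
    return min(math.dist(s[-1], gt[-1]) for s in samples)


def mmade(samples, multimodal_gts):
    """Multimodal ADE: ADE averaged over several plausible ground truths."""
    return sum(ade(samples, g) for g in multimodal_gts) / len(multimodal_gts)


def mmfde(samples, multimodal_gts):
    """Multimodal FDE: FDE averaged over several plausible ground truths."""
    return sum(fde(samples, g) for g in multimodal_gts) / len(multimodal_gts)
```

Because ADE and FDE take the minimum over samples, a model can score well on them with only one accurate sample, which is why APD is reported alongside them as a measure of diversity.
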
Model | Params | Avg. Inference Time (s) | APD | ADE | FDE | MMADE | MMFDE |
---|---|---|---|---|---|---|---|
HumanMAC | 28.40 M | 1.266 | 6.301 | 0.369 | 0.480 | 0.509 | 0.545 |
TransFusion | 19.73 M | 1.110 | 5.975 | 0.358 | 0.468 | 0.506 | 0.539 |
MotionDiff | 45.30 M | 1.524 | 15.353 | 0.411 | 0.509 | 0.508 | 0.536 |
PEA-STDiff | 10.71 M | 0.4104 | 6.185 | 0.365 | 0.470 | 0.502 | 0.535 |
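
The efficiency gains implied by the table above can be checked with a few lines of arithmetic. The sketch below copies the parameter counts and average inference times from the table; the helper name `compare_to` is hypothetical.

```python
# (parameters in millions, average inference time in seconds), from the table above.
MODELS = {
    "HumanMAC": (28.40, 1.266),
    "TransFusion": (19.73, 1.110),
    "MotionDiff": (45.30, 1.524),
    "PEA-STDiff": (10.71, 0.4104),
}


def compare_to(ours, models):
    """Return speedup and relative parameter reduction of `ours` versus each other model."""
    our_params, our_seconds = models[ours]
    report = {}
    for name, (params, seconds) in models.items():
        if name == ours:
            continue
        report[name] = {
            "speedup": seconds / our_seconds,
            "param_reduction": 1.0 - our_params / params,
        }
    return report


for name, row in compare_to("PEA-STDiff", MODELS).items():
    print(f"{name}: {row['speedup']:.2f}x faster, {row['param_reduction']:.0%} fewer parameters")
# HumanMAC: 3.08x faster, 62% fewer parameters
# TransFusion: 2.70x faster, 46% fewer parameters
# MotionDiff: 3.71x faster, 76% fewer parameters
```

Relative to HumanMAC, whose accuracy figures match the ablation baseline, the inference time drops by a factor of about 3.08, which is consistent with the stated tripling of inference speed.
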
Dataset | Human3.6M | | | | | AMASS | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Method | APD | ADE | FDE | MMADE | MMFDE | APD | ADE | FDE | MMADE | MMFDE |
Baseline | 6.301 | 0.369 | 0.480 | 0.509 | 0.545 | 9.321 | 0.511 | 0.554 | 0.593 | 0.591 |
Baseline + PEA | 6.197 | 0.362 | 0.475 | 0.498 | 0.540 | 9.123 | 0.505 | 0.553 | 0.586 | 0.591 |
Baseline + STSSA | 6.217 | 0.371 | 0.472 | 0.512 | 0.537 | 9.478 | 0.531 | 0.551 | 0.584 | 0.590 |
Baseline + SSA | 5.898 | 0.385 | 0.494 | 0.523 | 0.561 | 9.018 | 0.557 | 0.576 | 0.625 | 0.612 |
Baseline + SSA + PEA | 6.058 | 0.372 | 0.481 | 0.511 | 0.547 | 9.126 | 0.533 | 0.562 | 0.591 | 0.603 |
PEA-STDiff | 6.185 | 0.365 | 0.470 | 0.502 | 0.535 | 9.211 | 0.521 | 0.552 | 0.580 | 0.589 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Jiang, M.; Hu, L. Efficient Human Motion Prediction in 3D Using Parallel Enhanced Attention with Sparse Spatiotemporal Modeling. Electronics 2025, 14, 1773. https://doi.org/10.3390/electronics14091773