Pyramidal Predictive Network V2: An Improved Predictive Architecture and Training Strategies for Future Perception Prediction
Abstract
1. Introduction
- Temporal signal processing redesign: We resolve temporal inconsistencies in PPNV1 (e.g., sensory information lag and aliasing from integrated sensory-prediction computations) through redesigned signal pathways, improving interpretability and reducing computational overhead.
- Anti-aliasing architecture: We mitigate information aliasing via modulation modules (pre/post predictive units) and redesigned low-pass filtering for downsampling/upsampling, enabling smoother Fourier feature reconstruction.
- Multi-level hybrid training strategy: A novel training framework combines simultaneous multi-level loss calculation and hybrid LPIPS–Euclidean loss [29], enhancing robustness and prediction sharpness.
- Robotic generalization: We validate PPNV2 on robotic tasks (pedestrian/vehicle prediction, navigation) and align its hierarchical predictive coding with biological principles for real-world deployment.
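The anti-aliasing contribution above rests on classical sampling theory: low-pass filter a signal before subsampling it so that high frequencies do not fold into low ones. As a minimal blur-pool-style sketch of that idea (the binomial kernel and stride here are illustrative assumptions, not the authors' exact modulation module):

```python
import numpy as np

def binomial_kernel(size=5):
    """1-D binomial (approximate Gaussian) low-pass kernel, e.g. [1,4,6,4,1]/16."""
    k = np.array([1.0])
    for _ in range(size - 1):
        k = np.convolve(k, [1.0, 1.0])
    return k / k.sum()

def blur_downsample(x, stride=2):
    """Low-pass filter along each axis, then subsample.
    Reduces aliasing compared with plain strided subsampling."""
    k = binomial_kernel()
    # Separable filtering: rows first, then columns, with 'same' padding.
    x = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, x)
    x = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, x)
    return x[::stride, ::stride]

img = np.random.rand(8, 8)
out = blur_downsample(img)   # (8, 8) -> (4, 4) with aliasing suppressed
```

Plain `x[::2, ::2]` would keep frequencies above the new Nyquist limit; filtering first attenuates them, which is the same motivation behind the redesigned downsampling/upsampling filters in Section 5.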
2. Related Works
2.1. Video Prediction and Future Perception Prediction
2.2. Predictive Coding and Hierarchical Models
2.3. Pyramid Architectures and Multi-Scale Processing
2.4. Pyramidal Predictive Network (PPNV1)
3. PPNV1
4. Signal Flows for Prediction
4.1. Improving the Propagation of Sensory Information
4.2. Separating the Processing of Sensory Input and Prediction Error
5. Anti-Aliasing Design
5.1. Modulation Module
5.2. Downsampling Artifact
5.3. Upsampling Artifact
6. Improved Training Strategies
6.1. Long-Term Prediction Training Strategy
6.2. Loss Function
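The hybrid loss combines a perceptual term with a pixel-wise Euclidean term. A dependency-free sketch of the weighted combination follows; note that a simple image-gradient distance stands in for the learned LPIPS distance [29], and the weight `alpha` is an assumption, not the paper's setting:

```python
import numpy as np

def euclidean_loss(pred, target):
    """Pixel-space L2 term: keeps predictions numerically close to ground truth."""
    return float(np.mean((pred - target) ** 2))

def feature_distance(pred, target):
    """Stand-in perceptual term. The paper uses LPIPS (a learned deep-feature
    distance); image gradients are a dependency-free placeholder with the
    same structure (compare features, not raw pixels)."""
    def grads(x):
        return np.diff(x, axis=0), np.diff(x, axis=1)
    pgx, pgy = grads(pred)
    tgx, tgy = grads(target)
    return float(np.mean((pgx - tgx) ** 2) + np.mean((pgy - tgy) ** 2))

def hybrid_loss(pred, target, alpha=0.5):
    """Weighted sum of perceptual and Euclidean terms."""
    return alpha * feature_distance(pred, target) \
        + (1 - alpha) * euclidean_loss(pred, target)

a = np.zeros((4, 4))
b = np.ones((4, 4))
print(hybrid_loss(a, a))  # → 0.0 (identical frames)
```

The intuition is that the Euclidean term anchors overall intensity while the perceptual term penalizes blur, which pure L2 losses tolerate; the hybrid aims at both fidelity and sharpness.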
7. Results
7.1. Evaluation Against the State of the Art
7.2. Ablations and Comparisons
8. Discussion and Conclusions
- Real-time embedded deployment: Adapt PPNV2 for resource-constrained robotic systems by optimizing its hierarchical architecture for real-time inference on embedded hardware (e.g., via quantization or neural architecture search).
- Closed-loop control integration: Embed PPNV2 within a reinforcement learning framework [78] to enable end-to-end training of perception–action loops, where predictions directly inform control policies.
- Neurosymbolic interpretability: Combine PPNV2’s hierarchical features with symbolic reasoning modules [79] to generate human-readable explanations of predicted trajectories (e.g., “pedestrian will turn left”).
- Biologically plausible learning: Collaborate with cognitive scientists to align PPNV2’s predictive coding mechanisms with empirical neural data (e.g., fMRI or EEG studies [80]), advancing neurorobotic models.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- DiCarlo, J.J.; Zoccolan, D.; Rust, N.C. How does the brain solve visual object recognition? Neuron 2012, 73, 415–434.
- Serre, T. Deep learning: The good, the bad, and the ugly. Annu. Rev. Vis. Sci. 2019, 5, 399–426.
- Felleman, D.J.; Van Essen, D.C. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1991, 1, 1–47.
- Tanaka, K. Inferotemporal cortex and object vision. Annu. Rev. Neurosci. 1996, 19, 109–139.
- Friston, K. A theory of cortical responses. Philos. Trans. R. Soc. B Biol. Sci. 2005, 360, 815–836.
- Rao, R.P.; Ballard, D.H. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 1999, 2, 79–87.
- Wu, B.; Nair, S.; Martin-Martin, R.; Fei-Fei, L.; Finn, C. Greedy hierarchical variational autoencoders for large-scale video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2318–2328.
- Wu, H.; Yao, Z.; Wang, J.; Long, M. MotionRNN: A flexible model for video prediction with spacetime-varying motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15435–15444.
- Liu, B.; Chen, Y.; Liu, S.; Kim, H.S. Deep learning in latent space for video prediction and compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 701–710.
- Lee, S.; Kim, H.G.; Choi, D.H.; Kim, H.I.; Ro, Y.M. Video prediction recalling long-term motion context via memory alignment learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3054–3063.
- Jin, B.; Hu, Y.; Tang, Q.; Niu, J.; Shi, Z.; Han, Y.; Li, X. Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4554–4563.
- Chatterjee, M.; Ahuja, N.; Cherian, A. A hierarchical variational neural uncertainty model for stochastic video prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9751–9761.
- Franceschi, J.Y.; Delasalles, E.; Chen, M.; Lamprier, S.; Gallinari, P. Stochastic latent residual video prediction. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 3233–3246.
- Chang, Z.; Zhang, X.; Wang, S.; Ma, S.; Gao, W. STRPM: A spatiotemporal residual predictive model for high-resolution video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13946–13955.
- Morris, B.T.; Trivedi, M.M. Learning, modeling, and classification of vehicle track patterns from live video. IEEE Trans. Intell. Transp. Syst. 2008, 9, 425–437.
- Wei, J.; Dolan, J.M.; Litkouhi, B. A prediction- and cost-function-based algorithm for robust autonomous freeway driving. In Proceedings of the 2010 IEEE Intelligent Vehicles Symposium, La Jolla, CA, USA, 21–24 June 2010; pp. 512–517.
- Finn, C.; Levine, S. Deep visual foresight for planning robot motion. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2786–2793.
- Gao, X.; Jin, Y.; Zhao, Z.; Dou, Q.; Heng, P.-A. Future frame prediction for robot-assisted surgery. In Information Processing in Medical Imaging; Feragen, A., Sommer, S., Schnabel, J., Nielsen, M., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 533–544.
- Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Deep learning for precipitation nowcasting: A benchmark and a new model. Adv. Neural Inf. Process. Syst. 2017, 30.
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Adv. Neural Inf. Process. Syst. 2015, 28.
- Han, T.; Xie, W.; Zisserman, A. Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
- Wang, J.; Jiao, J.; Liu, Y.H. Self-supervised video representation learning by pace prediction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 504–521.
- Ling, C.; Zhong, J.; Li, W. Pyramidal Predictive Network: A model for visual-frame prediction based on predictive coding theory. Electronics 2022, 11, 2969.
- Ling, C.; Li, W.; Zeng, J.; Zhong, J. Combined deterministic and stochastic streams for visual prediction using predictive coding. In Proceedings of the 2023 IEEE International Conference on Development and Learning (ICDL), Macau, China, 9–11 November 2023; pp. 467–472.
- Softky, W.R. Unsupervised pixel-prediction. Adv. Neural Inf. Process. Syst. 1996, 8, 809–815.
- Deco, G.; Schürmann, B. Predictive coding in the visual cortex by a recurrent network with Gabor receptive fields. Neural Process. Lett. 2001, 14, 107–114.
- Hollingworth, A. Constructing visual representations of natural scenes: The roles of short- and long-term visual memory. J. Exp. Psychol. Hum. Percept. Perform. 2004, 30, 519.
- Lotter, W.; Kreiman, G.; Cox, D. Deep predictive coding networks for video prediction and unsupervised learning. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595.
- Hosseini, M.; Maida, A.S.; Hosseini, M.; Raju, G. Inception-inspired LSTM for next-frame video prediction. arXiv 2019, arXiv:1909.05622.
- Srivastava, N.; Mansimov, E.; Salakhudinov, R. Unsupervised learning of video representations using LSTMs. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 843–852.
- Finn, C.; Goodfellow, I.; Levine, S. Unsupervised learning for physical interaction through video prediction. Adv. Neural Inf. Process. Syst. 2016, 29.
- Mathieu, M.; Couprie, C.; LeCun, Y. Deep multi-scale video prediction beyond mean square error. arXiv 2015, arXiv:1511.05440.
- Vondrick, C.; Pirsiavash, H.; Torralba, A. Generating videos with scene dynamics. Adv. Neural Inf. Process. Syst. 2016, 29.
- Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971.
- Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2255–2264.
- Friston, K. The free-energy principle: A unified brain theory? Nat. Rev. Neurosci. 2010, 11, 127–138.
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
- Burt, P.J.; Adelson, E.H. The Laplacian pyramid as a compact image code. In Readings in Computer Vision; Elsevier: Amsterdam, The Netherlands, 1987; pp. 671–679.
- Ke, T.W.; Maire, M.; Yu, S.X. Multigrid neural architectures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6665–6673.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Ling, C.; Zhong, J.; Li, W. Predictive coding based multiscale network with encoder-decoder LSTM for video prediction. arXiv 2022, arXiv:2212.11642.
- Friston, K. Learning and inference in the brain. Neural Netw. 2003, 16, 1325–1352.
- Friston, K. Hierarchical models in the brain. PLoS Comput. Biol. 2008, 4, e1000211.
- Hohwy, J.; Roepstorff, A.; Friston, K. Predictive coding explains binocular rivalry: An epistemological review. Cognition 2008, 108, 687–701.
- Xu, Z.Q.J.; Zhang, Y.; Xiao, Y. Training behavior of deep neural network in frequency domain. In Proceedings of the International Conference on Neural Information Processing, Vancouver, BC, Canada, 8–14 December 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 264–274.
- Xu, Z.Q.J.; Zhang, Y.; Luo, T.; Xiao, Y.; Ma, Z. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv 2019, arXiv:1901.06523.
- Clark, A. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 2013, 36, 181–204.
- Aitchison, L.; Lengyel, M. With or without you: Predictive coding and Bayesian inference in the brain. Curr. Opin. Neurobiol. 2017, 46, 219–227.
- Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv 2018, arXiv:1811.12231.
- Oprea, S.; Martinez-Gonzalez, P.; Garcia-Garcia, A.; Castro-Vargas, J.A.; Orts-Escolano, S.; Garcia-Rodriguez, J.; Argyros, A. A review on deep learning techniques for video prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2806–2826.
- Azulay, A.; Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv 2018, arXiv:1805.12177.
- Zou, X.; Xiao, F.; Yu, Z.; Li, Y.; Lee, Y.J. Delving deeper into anti-aliasing in convnets. Int. J. Comput. Vis. 2022, 131, 67–81.
- Vasconcelos, C.; Larochelle, H.; Dumoulin, V.; Romijnders, R.; Le Roux, N.; Goroshin, R. Impact of aliasing on generalization in deep convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10529–10538.
- Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863.
- Shannon, C.E. Communication in the presence of noise. Proc. IRE 1949, 37, 10–21.
- Madisetti, V. The Digital Signal Processing Handbook; CRC Press: Boca Raton, FL, USA, 1997.
- Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, Cambridge, UK, 23–26 August 2004; Volume 3, pp. 32–36.
- Straka, Z.; Svoboda, T.; Hoffmann, M. PreCNet: Next frame video prediction based on predictive coding. arXiv 2020, arXiv:2004.14878.
- Lin, X.; Zou, Q.; Xu, X.; Huang, Y.; Tian, Y. Motion-aware feature enhancement network for video prediction. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 688–700.
- Lee, A.X.; Zhang, R.; Ebert, F.; Abbeel, P.; Finn, C.; Levine, S. Stochastic adversarial video prediction. arXiv 2018, arXiv:1804.01523.
- Su, J.; Byeon, W.; Kossaifi, J.; Huang, F.; Kautz, J.; Anandkumar, A. Convolutional tensor-train LSTM for spatio-temporal learning. Adv. Neural Inf. Process. Syst. 2020, 33, 13714–13726.
- Villegas, R.; Yang, J.; Hong, S.; Lin, X.; Lee, H. Decomposing motion and content for natural video sequence prediction. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
- Oliu, M.; Selva, J.; Escalera, S. Folded recurrent neural networks for future video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 716–731.
- Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. Adv. Neural Inf. Process. Syst. 2017, 30.
- Wang, Y.; Gao, Z.; Long, M.; Wang, J.; Philip, S.Y. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 5123–5132.
- Jin, B.; Hu, Y.; Zeng, Y.; Tang, Q.; Liu, S.; Ye, J. VarNet: Exploring variations for unsupervised video prediction. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 5801–5806.
- Wang, Y.; Jiang, L.; Yang, M.H.; Li, L.J.; Long, M.; Fei-Fei, L. Eidetic 3D LSTM: A model for video prediction and beyond. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Liu, Z.; Yeh, R.A.; Tang, X.; Liu, Y.; Agarwala, A. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4463–4471.
- Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Liu, G.; Tao, A.; Kautz, J.; Catanzaro, B. Video-to-video synthesis. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018.
- Wu, Y.; Gao, R.; Park, J.; Chen, Q. Future video synthesis with object motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5539–5548.
- Shapiro, L. The embodied cognition research programme. Philos. Compass 2007, 2, 338–346.
- Smith, L.; Gasser, M. The development of embodied cognition: Six lessons from babies. Artif. Life 2005, 11, 13–29.
- Hoffmann, M.; Marques, H.; Arieta, A.; Sumioka, H.; Lungarella, M.; Pfeifer, R. Body schema in robotics: A review. IEEE Trans. Auton. Ment. Dev. 2010, 2, 304–324.
- Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2014; pp. 818–833.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
- Bhuyan, B.P.; Ramdane-Cherif, A.; Tomar, R.; Singh, T. Neuro-symbolic artificial intelligence: A survey. Neural Comput. Appl. 2024, 36, 12809–12844.
- Kietzmann, T.C.; McClure, P.; Kriegeskorte, N. Deep neural networks in computational neuroscience. bioRxiv 2017, 133504.
Approach | Key Contribution | Shortcomings | PPNV2 Differentiation
---|---|---|---
RNNs/LSTMs [31,32] | Temporal modeling with recurrent architectures | |
Adversarial Methods [33,34] | Sharp frame generation via GANs | |
Social Models [35,36] | Socially aware trajectory prediction | |
Predictive Coding [5,6] | Neuroscience-inspired error correction | |
Pyramid Architectures [41,42] | Multi-scale feature extraction | |
PPNV1 [43] | Temporal pyramid predictive coding | |
Method | SSIM↑ (10 → 20) | PSNR↑ (10 → 20) | LPIPS↓ (10 → 20) | SSIM↑ (10 → 40) | PSNR↑ (10 → 40) | LPIPS↓ (10 → 40)
---|---|---|---|---|---|---
MCNet [64] | 0.804 | 25.95 | - | 0.73 | 23.89 | -
fRNN [65] | 0.771 | 26.12 | - | 0.678 | 23.77 | -
PredRNN [66] | 0.839 | 27.55 | - | 0.703 | 24.16 | -
PredRNN++ [67] | 0.865 | 28.47 | - | 0.741 | 25.21 | -
VarNet [68] | 0.843 | 28.48 | - | 0.739 | 25.37 | -
SAVP-VAE [62] | 0.852 | 27.77 | 8.36 | 0.811 | 26.18 | 11.33
E3D-LSTM [69] | 0.879 | 29.31 | - | 0.810 | 27.24 | -
STMF [11] | 0.893 | 29.85 | 11.81 | 0.851 | 27.56 | 14.13
Conv-TT-LSTM [63] | 0.907 | 28.36 | 13.34 | 0.882 | 26.11 | 19.12
LMC-Memory [10] | 0.894 | 28.61 | 13.33 | 0.879 | 27.50 | 15.98
PPNet [23] | 0.886 | 31.02 | 13.12 | 0.821 | 28.37 | 23.19
MSPN [43] | 0.881 | 31.87 | 7.98 | 0.831 | 28.86 | 14.04
PPNV2 (Ours) | 0.893 | 32.05 | 4.76 | 0.833 | 28.97 | 8.93
Method | Metric | T = 2 | T = 4 | T = 6 | T = 8 | T = 10
---|---|---|---|---|---|---
fRNN [65] | PSNR | 27.58 | 26.10 | 25.06 | 24.26 | 23.66
fRNN [65] | SSIM | 0.9000 | 0.8885 | 0.8799 | 0.8729 | 0.8675
fRNN [65] | LPIPS | 0.0515 | 0.0530 | 0.0540 | 0.0539 | 0.0542
MAFENet [61] | PSNR | 31.36 | 28.38 | 26.61 | 25.47 | 24.61
MAFENet [61] | SSIM | 0.9663 | 0.9528 | 0.9414 | 0.9326 | 0.9235
MAFENet [61] | LPIPS | 0.0151 | 0.0219 | 0.0287 | 0.0339 | 0.0419
MSPN [43] | PSNR | 31.95 | 29.19 | 27.46 | 26.44 | 25.52
MSPN [43] | SSIM | 0.9687 | 0.9577 | 0.9478 | 0.9382 | 0.9293
MSPN [43] | LPIPS | 0.0146 | 0.0271 | 0.0384 | 0.0480 | 0.0571
PPNV2 (Ours) | PSNR | 32.07 | 30.08 | 28.81 | 28.12 | 27.55
PPNV2 (Ours) | SSIM | 0.9645 | 0.9566 | 0.9510 | 0.9461 | 0.9421
PPNV2 (Ours) | LPIPS | 0.0169 | 0.0239 | 0.0288 | 0.0337 | 0.0381
Method | SSIM (Caltech) | PSNR (Caltech) | LPIPS (Caltech) | SSIM (KITTI) | PSNR (KITTI) | LPIPS (KITTI)
---|---|---|---|---|---|---
MCNet [64] | 0.705 | - | 37.34 | 0.555 | - | 37.39
PredNet [28] | 0.753 | - | 36.03 | 0.475 | - | 62.95
Voxel Flow [70] | 0.711 | - | 28.79 | 0.426 | - | 41.59
Vid2vid [71] | 0.751 | - | 20.14 | - | - | -
FVSOMP [72] | 0.756 | - | 16.50 | 0.608 | - | 30.49
PPNet [23] | 0.812 | 21.3 | 14.83 | 0.617 | 18.24 | 31.07
MSPN [43] | 0.818 | 23.88 | 10.98 | 0.629 | 19.44 | 32.10
PPNV2 (Ours) | 0.865 | 25.44 | 5.287 | 0.621 | 19.32 | 15.45
Ablation | SSIM↑ (10 → 20) | PSNR↑ (10 → 20) | LPIPS↓ (10 → 20) | SSIM↑ (10 → 40) | PSNR↑ (10 → 40) | LPIPS↓ (10 → 40)
---|---|---|---|---|---|---
Default | 0.893 | 32.05 | 4.76 | 0.833 | 28.97 | 8.93
Add | 0.886 | 31.87 | 5.08 | 0.817 | 28.46 | 9.75
Concat | 0.888 | 31.92 | 5.16 | 0.820 | 28.71 | 9.56
NoFilter | 0.882 | 31.70 | 5.21 | 0.808 | 28.39 | 9.71
Bilinear | 0.887 | 31.85 | 5.10 | 0.816 | 28.38 | 9.37
NoLPIPS | 0.889 | 31.99 | 11.45 | 0.824 | 28.79 | 20.41
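For orientation, the arrows in the tables above mark the direction of improvement: SSIM and PSNR are higher-is-better, LPIPS is lower-is-better. PSNR is the most mechanical of the three; a minimal sketch, assuming image intensities scaled to [0, 1]:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB. Higher means the prediction
    is numerically closer to the ground-truth frame."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)   # uniform error of 0.1 -> MSE = 0.01
print(psnr(pred, target))     # → 20.0 dB
```

SSIM additionally compares local luminance, contrast, and structure, while LPIPS measures distance in a learned deep-feature space, which tracks human judgments of sharpness more closely; this is why the NoLPIPS ablation degrades LPIPS far more than PSNR or SSIM.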
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ling, C.; Zhong, J.; Li, W.; Dong, R.; Dai, M. Pyramidal Predictive Network V2: An Improved Predictive Architecture and Training Strategies for Future Perception Prediction. Big Data Cogn. Comput. 2025, 9, 79. https://doi.org/10.3390/bdcc9040079