Article

Pyramidal Predictive Network V2: An Improved Predictive Architecture and Training Strategies for Future Perception Prediction

1 School of Shien-ming Wu Intelligent Engineering, South China University of Technology, Guangzhou 510641, China
2 Faculty of Science and Technology, University of Wollongong (College Hong Kong), Hong Kong, China
3 School of Engineering, Chukyo University, Nagoya 466-0825, Japan
4 College of Information Engineering, Shenzhen University, Shenzhen 518060, China
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(4), 79; https://doi.org/10.3390/bdcc9040079
Submission received: 13 January 2025 / Revised: 13 March 2025 / Accepted: 14 March 2025 / Published: 28 March 2025

Abstract

In this paper, we propose an improved version of the Pyramidal Predictive Network (PPNV2), a theoretical framework inspired by predictive coding, which addresses the limitations of its predecessor (PPNV1) in the task of future perception prediction. While PPNV1 employed a temporal pyramid architecture and demonstrated promising results, its naive signal processing led to aliasing in the predictions, restricting its application in robotic navigation. We analyze the signal dissemination and characteristic artifacts of PPNV1 and introduce architectural enhancements and training strategies to mitigate these issues. The improved architecture focuses on optimizing information dissemination and reducing aliasing in neural networks. We redesign the downsampling and upsampling components to enable the network to construct images more effectively from Fourier features of low-frequency inputs, replacing the simple concatenation of different inputs in the previous version. Furthermore, we refine the training strategies to alleviate input inconsistency between the training and testing phases. The enhanced model exhibits increased interpretability, stronger prediction accuracy, and improved quality of predictions. The proposed PPNV2 offers a more robust and efficient approach to future video-frame prediction, overcoming the limitations of its predecessor and expanding its potential applications in various robotic domains, including pedestrian prediction, vehicle prediction, and navigation.

1. Introduction

Visual perception, the ability to interpret and understand visual information from the environment, is a crucial cognitive process in humans and animals. It involves the complex interplay of various brain regions and mechanisms to transform raw sensory input into meaningful representations [1]. Visual perception is among the most extensively studied subjects in neuroscience and psychology. In artificial intelligence (AI), it has also inspired a large body of work in computer vision [2], as it holds the key to understanding how organisms interact with their surroundings and make decisions based on visual cues.
Several neuroscience theories have been proposed to explain the underlying mechanisms of visual perception. The hierarchical model of visual processing suggests that visual information is processed in a series of stages, starting from the primary visual cortex (V1) and progressing to higher-level cortical areas [3]. Each stage extracts increasingly complex features, from simple edges and contours in V1 to object recognition and scene understanding in the inferotemporal cortex (IT) [4]. Another influential theory is the predictive coding framework, which posits that the brain constantly generates predictions about incoming sensory input and updates these predictions based on prediction errors [5,6]. This iterative process of prediction and error correction is thought to enable efficient and robust perception by reducing redundancy and focusing on informative features.
Future visual perception prediction has experienced rapid advancements in AI and robotics in recent years [7,8,9,10,11,12,13,14]. This looking-ahead ability has found applications in various fields, including autonomous driving [15,16], robotic systems [17,18], rainfall forecasting [19,20], and more. Visual perception prediction encompasses a wide range of tasks, such as predicting future video frames, estimating object trajectories, and anticipating scene changes. Technically, visual perception prediction is a pixel-level task that utilizes historical visual information to predict future perceptions. This self-supervised learning method for visual representation can also be transferred to downstream tasks [21,22]. By first performing self-supervised visual perception prediction learning on the backbone network and then fixing it for use in tasks such as video classification, the difficulties of label acquisition in supervised learning tasks can be addressed.
One of the state-of-the-art methods in this domain is the Pyramidal Predictive Network (PPNV1) proposed by Ling et al. [23,24], which has achieved promising results with its unique temporal pyramid architecture. The PPNV1 model updates through a combination of bottom-up and top-down information flows, using prediction errors for effective feedback connections, distinguishing it from traditional generators [25,26,27,28]. The update frequency of neurons in this model decreases as the network level increases, allowing higher-layer neurons to capture information over a longer time range while reducing computational costs.
However, because PPNV1 was conceived as a cognitive model focused on its counterpart mechanisms in neuroscience, it overlooked the importance of digital signal processing. In this work, we build upon the PPNV1 architecture and propose the Pyramidal Predictive Network V2 (PPNV2).
The scientific contributions of PPNV2 are as follows:
  • Temporal signal processing redesign: We resolve temporal inconsistencies in PPNV1 (e.g., sensory information lag and aliasing from integrated sensory-prediction computations) through redesigned signal pathways, improving interpretability and reducing computational overhead.
  • Anti-aliasing architecture: We mitigate information aliasing via modulation modules (pre/post predictive units) and redesigned low-pass filtering for downsampling/upsampling, enabling smoother Fourier feature reconstruction.
  • Multi-level hybrid training strategy: A novel training framework combines simultaneous multi-level loss calculation and hybrid LPIPS–Euclidean loss [29], enhancing robustness and prediction sharpness.
  • Robotic generalization: We validate PPNV2 on robotic tasks (pedestrian/vehicle prediction, navigation) and align its hierarchical predictive coding with biological principles for real-world deployment.
The rest of this paper is organized as follows. Section 2 provides an overview of the PPNV1 network and other related networks, as well as learning mechanisms that facilitate the development of PPNV2. Section 3 presents a detailed analysis of PPNV1, identifying its strengths and limitations. Section 4, Section 5 and Section 6 discuss the technical aspects of PPNV1, focusing on signal processing, the aliasing problem, and training strategies, respectively. Each of these sections also proposes solutions to address the identified issues and enhance the model’s performance. Section 7 presents the experimental results and evaluates the effectiveness of the proposed improvements. Finally, Section 8 concludes the paper and discusses future research directions.

2. Related Works

2.1. Video Prediction and Future Perception Prediction

Video prediction and future perception prediction have been active research areas in recent years, with numerous approaches proposed to tackle these challenging tasks. Recurrent neural networks (RNNs) and their variants, such as long short-term memory (LSTM) [30] and gated recurrent units (GRUs) [8], have been widely used for modeling temporal dependencies in video data. Srivastava et al. [31] employed LSTMs to learn video representations and predict future frames. Finn et al. [32] proposed a convolutional LSTM-based model for video prediction, which learned to predict future frames by iteratively transforming the input frames. More recently, adversarial training techniques have been applied to video prediction tasks to improve the quality and sharpness of the generated frames. Mathieu et al. [33] introduced a multi-scale adversarial network for video prediction, which utilized a generator and a discriminator to produce realistic future frames. Vondrick et al. [34] proposed a two-stream adversarial network, where one stream generated future frames, and the other stream aimed to distinguish between real and generated frames. Apart from pixel-level video prediction, higher-level future perception prediction tasks, such as predicting object trajectories and scene dynamics, have also gained attention. Alahi et al. [35] proposed the Social LSTM model for predicting pedestrian trajectories in crowded scenes. Gupta et al. [36] introduced the Social GAN, which employed a generative adversarial network to predict socially acceptable paths for multiple agents in a scene.

2.2. Predictive Coding and Hierarchical Models

Predictive coding is a neuroscience-inspired framework that suggests the brain continually generates predictions about incoming sensory input and updates these predictions based on prediction errors [5,6]. This concept has been applied to various domains, including visual perception and neural network architectures. Rao and Ballard [6] proposed a predictive coding model for explaining extra-classical receptive field effects in the visual cortex. Friston [37] extended the predictive coding framework to a more general theory of cortical responses, known as the free energy principle. Hierarchical models have also been influential in understanding visual perception and designing neural network architectures. Felleman and Van Essen [3] proposed a hierarchical model of visual processing in the primate cerebral cortex, which suggested that visual information is processed in a series of stages, with each stage extracting increasingly complex features. This hierarchical organization has inspired the development of deep convolutional neural networks (CNNs) [38], which have achieved remarkable success in various computer vision tasks.

2.3. Pyramid Architectures and Multi-Scale Processing

Pyramid architectures and multi-scale processing have been widely used in computer vision and deep learning to capture information at different scales and resolutions. The Laplacian pyramid [39] is one of the classic examples of multi-scale image representations, which have been used for tasks such as image compression and analysis. In the context of deep learning, pyramid architectures have been employed to improve the performance and efficiency of neural networks. Ke et al. [40] proposed the Multi-Scale Convolutional Neural Network (MSCNN), which utilized a pyramid structure to extract features at different scales for object detection. The Feature Pyramid Network (FPN) [41] introduced a top-down architecture with lateral connections to build high-level semantic feature maps at multiple scales, which improved object detection and segmentation performance. The Pyramid Scene Parsing Network (PSPNet) [42] employed a pyramid pooling module to aggregate context information from different regions, achieving state-of-the-art results in scene parsing tasks.

2.4. Pyramidal Predictive Network (PPNV1)

The Pyramidal Predictive Network (PPNV1) [24,43] is a recently proposed architecture for future video frame prediction, inspired by the predictive coding framework and hierarchical models of visual perception. PPNV1 employs a temporal pyramid architecture, where the update frequency of neurons decreases as the network level increases, allowing higher-layer neurons to capture information over a longer time range while reducing computational costs. The PPNV1 model updates through a combination of bottom-up and top-down information flows, using prediction errors for effective feedback connections. This distinguishes it from traditional generators and enables it to learn efficient and informative representations of video data. Despite its promising results, PPNV1 has limitations in its signal processing and suffers from characteristic artifacts, which can be attributed to its unconventional architecture and update mechanisms. In this work, we build upon the PPNV1 architecture and propose the Pyramidal Predictive Network V2 (PPNV2), which addresses the limitations of its predecessor and incorporates improved signal processing, aliasing reduction techniques, and enhanced training strategies. The aforementioned methods are summarized in Table 1. By leveraging insights from related works in video prediction, predictive coding, hierarchical models, and pyramid architectures, we aim to develop a more robust, efficient, and interpretable model for future perception prediction.

3. PPNV1

The Pyramidal Predictive Network V1 (PPNV1) employs a hierarchical architecture inspired by predictive coding mechanisms (PCMs), combining top-down predictions and bottom-up error correction. Unlike traditional PCMs, which focus on the current stimulus, PPNV1 directly predicts future frames (e.g., $I^{t+1}$ from prior inputs $I^{t}$), with higher network levels generating low-resolution predictions and lower levels refining high-resolution details. This coarse-to-fine approach enables efficient long-term dependency modeling while reducing computational overhead.
The PPNV1 model consists of multiple levels of ConvLSTM units, with each level operating at a different temporal resolution. The predictions from higher levels are propagated downward to guide the predictions at lower levels, while the prediction errors from lower levels are propagated upward to refine the higher-level predictions. The model updates through a combination of bottom-up and top-down information flows, using prediction errors for effective feedback connections, distinguishing it from traditional generators. PPNV1 has shown promising results in future video frame prediction tasks, showcasing its ability to capture long-term dependencies and generate realistic future frames. The model’s architecture is designed to mimic the predictive coding mechanism in the human visual system, where higher-level areas make predictions about lower-level sensory inputs, and prediction errors are used to update the internal representations.
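To make the temporal pyramid concrete, the sketch below enumerates which levels update at each time step; the specific halving of the update frequency per level is an illustrative assumption rather than a value fixed by the description above.

```python
# A minimal sketch of a temporal-pyramid update schedule (assumption: each
# level updates half as often as the level below it; the text only states
# that the update frequency decreases with level).
def active_levels(t: int, num_levels: int) -> list:
    """Return the levels whose predictive units update at time step t."""
    return [l for l in range(num_levels) if t % (2 ** l) == 0]

if __name__ == "__main__":
    for t in range(8):
        print(f"t={t}: updated levels -> {active_levels(t, num_levels=4)}")
    # Level 0 updates every step, level 1 every 2 steps, level 2 every 4 steps, ...
```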
Central to PPNV1 is its encoder–decoder LSTM (EDLSTM) design, which integrates encoder–decoder networks into LSTM units via skip connections. The EDLSTM processes concatenated inputs (predictions, errors, and sensory data) through a compact hidden state (64 channels), reconstructing RGB outputs directly. This contrasts with traditional encoder–LSTM–decoder frameworks, which require double the parameters due to high-dimensional semantic-space computations. The model further enhances efficiency through semantic feature sharing, propagating encoded hidden variables ($v_t^{l+1}$) from higher to lower levels to avoid redundant calculations.
However, despite its success, PPNV1 has limitations in its signal processing and suffers from characteristic artifacts. We observed inconsistencies in the temporal signal processing of PPNV1 (Figure 1). Firstly, the sensory input (green) in the neural network propagates to higher levels along with the prediction error (red), which has been proven necessary. However, there is an issue with the choice of sensory information. The model should transmit the sensory input of the next time step ($f_{l-1}^{t+1}$) instead of the current time step ($f_{l-1}^{t}$). Secondly, the sensory input and prediction error are simply concatenated for calculation, causing information aliasing. Moreover, the predictive unit at higher levels needs to simultaneously predict the prediction error from a lower level, which is contradictory and challenging. We analyze these issues in detail and redesign the signal processing specification in Section 4 to make the model stronger and more interpretable.
Another issue in PPNV1 is the rough splicing of various signals. For instance, the prediction, prediction error, and sensory information from different levels are directly spliced and input into the predictive unit (ConvLSTM) for calculation, causing severe information aliasing and huge computational overhead. In the latest version of the PPN (PPNV2), we propose an alternative design to alleviate these problems by adding a modulation module before and after the predictive unit, allowing only the modulated information to be sent to the predictive unit for calculation. This special network module is designed to modulate the primary input with scaling and deviation tensors computed from the auxiliary input. Additionally, the downsampling and upsampling artifacts have been redesigned to ensure that the network can more easily construct images from Fourier features of low-frequency inputs. This new design enhances the model’s efficiency and interpretability, enabling further improvements in the training strategy. Details are introduced in Section 5.
In addition to the network architecture, input inconsistency between training and testing in temporal sequential tasks remains a challenging topic. Generally, the predicted perceptions need to serve as new inputs to enable continuous prediction, causing inconsistency between training and testing. We observe that the improved Pyramidal Predictive Network can alleviate this problem by computing losses at multiple levels. The new convolutional unit will learn to extract accurate and effective information from “less perfect” data, making the overall model more robust. Furthermore, we propose training the model with LPIPS loss [29]. In future visual perception prediction tasks, LPIPS is usually used as an evaluation metric to assess image similarity, but it is rarely used as a loss function. In this work, it is combined with Euclidean distance loss to generate more accurate and clearer future perception, which is introduced in Section 6.
Despite its innovations, PPNV1 suffers from temporal mismatches (using $f_l^{t}$ instead of $f_l^{t+1}$ for predictions), spectral aliasing from raw signal concatenation, and training–testing discrepancies (autoregressive frame drift). These limitations motivate the architectural refinements in PPNV2. The complete PPNV2 architecture is shown in Figure 2. It offers a more robust and efficient approach to future perception prediction, overcoming the limitations of its predecessor and expanding its potential applications in various robotic domains, including pedestrian prediction, vehicle prediction, and navigation. The improved architecture and training strategies enhance the model’s interpretability, prediction accuracy, and overall prediction quality. By focusing on visual perception prediction, PPNV2 aims to provide a comprehensive framework for anticipating future visual changes, enabling robotic systems to make more informed decisions and adapt to dynamic environments.

4. Signal Flows for Prediction

In this section, we analyze the signal processing aspects of the Pyramidal Predictive Network (PPNV1) [23] and propose improvements to address the identified issues. PPNV1 is a hierarchical predictive coding model that aims to capture the temporal dynamics of video sequences by propagating sensory information and prediction errors across multiple levels. While PPNV1 has shown promising results, we argue that its signal processing mechanisms have certain limitations that hinder its interpretability and performance. Specifically, we identify two main issues in PPNV1’s signal processing: (1) a lag in the upward propagation of sensory information, and (2) the integrated calculation of sensory input and prediction error, leading to information aliasing. These issues make the entire PPNV1 model difficult to interpret and may lead to suboptimal predictions. To address these problems, we propose the Pyramidal Predictive Network V2 (PPNV2), which incorporates improved signal processing techniques to enhance the model’s interpretability and prediction accuracy. Figure 3 illustrates the key differences between PPNV1 and PPNV2 in terms of signal processing.

4.1. Improving the Propagation of Sensory Information

In PPNV1, sensory information is input to the predictive unit (ConvLSTM) for prediction generation and propagated upward to higher levels along with the prediction error. While the necessity of this operation has been confirmed previously [23], we observed a lag in the sensory information used in the upward propagation process.
As depicted in Figure 3a, the sensory input $f_l^{t+1}$ at level $l$ and time step $t+1$ is a higher-level representation of the combination of the sensory input $f_{l-1}^{t}$ and the prediction error $E_{l-1}^{t+1}$ from the lower level. The prediction $P_l^{t}$ at level $l$ and time step $t$ is considered to predict $f_l^{t+1}$. However, this implies that $P_l^{t}$ is predicting the sensory input $f_{l-1}^{t}$ and the prediction error $E_{l-1}^{t+1}$ from the previous time step, which leads to a temporal mismatch.
To address this issue, we propose a modification in PPNV2, where the sensory input $f_{l-1}^{t+1}$ from the lower level at time step $t+1$ is propagated upward instead of $f_{l-1}^{t}$ (Figure 3b). This ensures that the sensory information being propagated aligns with the current time step, eliminating the temporal lag. By propagating $f_{l-1}^{t+1}$, the prediction $P_l^{t}$ at level $l$ can accurately predict the sensory input at the lower level for the same time step, leading to more coherent and meaningful predictions.

4.2. Separating the Processing of Sensory Input and Prediction Error

Another issue in PPNV1 is the integrated calculation of sensory input and prediction error. As shown in Figure 3a, the sensory input $f_{l-1}^{t}$ and the prediction error $E_{l-1}^{t+1}$ from the lower level are concatenated and processed by a convolution unit to obtain the higher-level representation $f_l^{t+1}$. This integrated calculation leads to information aliasing, where the higher-level sensory input becomes a mixture of the lower-level sensory input and the prediction error.
Moreover, predicting the prediction error in this integrated manner is puzzling and difficult. The prediction error $E_{l-1}^{t+1}$ is calculated based on the difference between the prediction $P_{l-1}^{t}$ and the sensory input $f_{l-1}^{t+1}$ at the lower level. However, the generation of $P_{l-1}^{t}$ depends on the higher-level prediction $P_l^{t}$, creating a circular dependency. To mitigate these issues, we propose a separate processing of sensory input and prediction error in PPNV2. As illustrated in Figure 3b, the sensory input $f_{l-1}^{t+1}$ from the lower level is propagated upward independently of the prediction error. The higher-level representation $f_l^{t+1}$ is obtained by processing $f_{l-1}^{t+1}$ with a dedicated convolutional unit. Similarly, the prediction error $E_{l-1}^{t+1}$ is propagated upward separately and processed by another convolutional unit to generate a higher-level prediction error representation. This separation of sensory input and prediction error processing has several benefits. First, it prevents the aliasing of information, ensuring that the higher-level sensory input remains a pure representation of the lower-level sensory input. Second, it resolves the circular dependency issue in predicting the prediction error, as the generation of $P_{l-1}^{t}$ no longer relies on the higher-level prediction $P_l^{t}$. Finally, it improves the interpretability of the model, as the role of each signal becomes clearer and more distinct.
To further address the issue of signal aliasing and facilitate the integration of different input signals, we introduce a modulation module in PPNV2. As shown in Figure 4—Modulate, the modulation module takes two input signals, $x_1$ and $x_2$, and performs a gated integration. The auxiliary signal $x_2$ is processed by convolutional layers and activated by the sigmoid and tanh functions to obtain scaling and shifting matrices. These matrices are then used to modulate the main signal $x_1$ through element-wise multiplication and addition. The modulation module offers several advantages over traditional integration methods, such as concatenation and element-wise addition (Figure 4—Concatenate). First, it prevents the direct leakage of information from the auxiliary signal $x_2$ to the main signal $x_1$. Second, it provides a more stable gradient flow along the $x_1$ path, as the gradients are not directly affected by the values of $x_2$. Third, it allows for a flexible and learnable integration of the two signals, as the scaling and shifting matrices are learned through the convolutional layers. The modulation module is employed in PPNV2 to integrate the sensory input and prediction error signals at each level of the hierarchy. By using this module, we aim to achieve a more effective and interpretable integration of the different signals involved in the predictive coding process.

5. Anti-Aliasing Design

In addition to the signal processing improvements discussed in the previous section, PPNV2 incorporates anti-aliasing techniques to address the issue of information aliasing caused by the careless handling of different signals in PPNV1. As illustrated in Figure 3a, PPNV1 simply concatenates different signals and feeds them into the ConvLSTM predictive unit for computation, which inevitably leads to information aliasing. We argue that different types of information have varying levels of importance and should be treated accordingly. For instance, sensory input, which contains more low-frequency signals, can provide more effective information than prediction errors and deserves more attention. Moreover, the information contained in higher-level predictions ($P_l^{t}$), prediction errors ($E_{l-1}^{t}$), and sensory input ($f_{l-1}^{t}$) is quite different, and it is preferable to calculate these separately instead of splicing them together. Although the convolution unit can theoretically learn to distinguish between them eventually, it is undoubtedly a challenging task. Repeated multiplication of large weight matrices may cause vanishing or exploding gradients. Furthermore, considering the special computation of ConvLSTM (which requires computing four outputs [20]), the concatenation operation in PPNV1 leads to a substantial increase in the number of parameters and computational overhead. To address these issues, we propose using a network module before and after the ConvLSTM predictive unit to process different inputs and only propagate the necessary signals to the ConvLSTM for calculation. The purpose is to mitigate the problem of information aliasing and reduce network parameters and computational overhead as much as possible. In addition to the aliasing caused by the concatenation of different signals, the aliasing of different frequencies within a signal is also a concern. During the downsampling and upsampling processes, high-frequency signals are often mixed with low-frequency signals, which can cause the final generated image to be incoherent and unnatural. To alleviate this problem, we propose redesigning the upsampling and downsampling artifacts with anti-aliasing low-pass filters to ensure that the model can more easily learn low-frequency Fourier features. The following subsections describe the details of these modifications.

5.1. Modulation Module

As shown in Figure 4, we introduce a modulation module before and after the ConvLSTM prediction unit for signal preprocessing and postprocessing. This special module is designed to alleviate the problems of signal aliasing and difficulties in gradient propagation. Unlike the traditional concatenation or addition operations, the modulation module first distinguishes and determines the main signal $x_1$ and the auxiliary signal $x_2$. The design of this module is inspired by the attention mechanism, where the brain responds more strongly when predictions are severely inconsistent with the environment [44,45,46]. For the “ModError” module (Figure 4), the modulation operation can be seen as the process of attention attachment and transfer. The sensory input contains more information that effectively describes the current scene, while the prediction error consists of sparse error signals. As neural networks have been shown to preferentially learn low-frequency features [47,48], it is better to treat the sensory input as the main signal $x_1$ and the prediction error as the auxiliary signal $x_2$. According to Clark et al. [44,49,50], larger prediction errors attract more attention. Therefore, we propose viewing the prediction error as an attention matrix, and the attachment of attention is equivalent to the process of scaling and shifting the sensory input. The calculation of the modulation module is as follows: First, the auxiliary signal $x_2$ is convolved separately, and then the sigmoid and tanh activation functions are applied to obtain the scaling and shift matrices, as shown in Equations (1) and (2):
$m_{sc} = \mathrm{sigmoid}\big(f(x_2, \theta_{sc})\big)$
$m_{sf} = \tanh\big(f(x_2, \theta_{sf})\big)$
where $f(\cdot)$ represents a typical convolutional network, $\theta$ denotes the model parameters, and $m_{sc}$ and $m_{sf}$ are the scaling and shifting matrices, respectively. The sigmoid and tanh functions constrain the matrices to the ranges (0, 1) and (−1, 1), respectively. The final output is computed as shown in Equation (3), where $\alpha$ is a learnable scaling factor. Since the sigmoid function limits the scaling matrix to (0, 1), which only allows downscaling but not upscaling, we recommend adding a learnable scaling factor $\alpha$.
$y = \alpha \cdot m_{sc} \cdot x_1 + m_{sf}$
Similarly, in the postprocessing stage, we use the modulation module to process the higher-level prediction $P_{l+1}^{t+1}$ and the output of the ConvLSTM (Figure 4). In this case, we treat the higher-level prediction as the main signal $x_1$ and the output of the ConvLSTM as the auxiliary signal $x_2$. The downward propagation of the higher-level prediction can be considered as a decoding process, so we take the decoding feature as the main signal.
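A minimal PyTorch sketch of the modulation module defined by Equations (1)–(3); the use of single 3 × 3 convolutions for $f(\cdot; \theta_{sc})$ and $f(\cdot; \theta_{sf})$ and the per-module scalar $\alpha$ are illustrative assumptions, not the exact PPNV2 implementation.

```python
import torch
import torch.nn as nn

class Modulate(nn.Module):
    """Gated modulation of a main signal x1 by an auxiliary signal x2,
    following Equations (1)-(3). The single 3x3 convolutions standing in for
    f(.; theta) are an assumption; the published code may differ."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_sc = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_sf = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.alpha = nn.Parameter(torch.ones(1))  # learnable scaling factor alpha

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        m_sc = torch.sigmoid(self.conv_sc(x2))  # scaling matrix, range (0, 1)
        m_sf = torch.tanh(self.conv_sf(x2))     # shifting matrix, range (-1, 1)
        return self.alpha * m_sc * x1 + m_sf    # Equation (3)

# Example: "ModError" treats the sensory input as x1 and the prediction error as x2:
# mod = Modulate(channels=64); out = mod(sensory_input, prediction_error)
```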

5.2. Downsampling Artifact

The purpose of downsampling is to reduce memory and computation, prevent overfitting, and increase the receptive field. In PPNV1, the conventional method of convolution and pooling is used for downsampling (2 in Figure 3a). Related studies have shown that classifier networks focus more on texture than shape [51], and one of the reasons is the use of pooling. Pooling, especially max pooling, is more sensitive to texture information and has better translation and rotation invariance. Therefore, regardless of how the images are translated or rotated, the classifier network can still recognize them. However, continuous video frame sequences are semantically consistent but differ in pixel space [52]. The essence of future video frame prediction is to learn changes by observing the differences between frames. As a result, the invariance brought by pooling can be disastrous for video prediction tasks, motivating us to seek an alternative that preserves spatial location information as much as possible while downsampling. The first approach that comes to mind is convolutional downsampling, which preserves spatial details better than pooling but can still be improved, for example, by using a low-pass filter for downsampling before performing the convolution. Aliasing, a subtle and critical issue, has recently attracted attention [53,54,55,56]. To prevent high-frequency signals from being mixed with low-frequency signals during the downsampling process, it is better to perform anti-aliasing filtering first. As shown in Figure 5a, we first perform feature extraction on the input signal using convolutional units, followed by low-pass filtering for anti-aliasing and downsampling. Finally, another convolution operation is performed to encode the signal. Next, we focus on the design and implementation of low-pass filters. According to the Nyquist–Shannon theorem [57], to recover a signal without distortion, the sampling frequency should be greater than twice the highest frequency in the signal spectrum. Otherwise, the sampled frequencies will overlap. Therefore, in the case of a fixed sampling frequency, we assume that signals with frequencies higher than half the sampling frequency are negligible, which is usually achieved with a low-pass filter. In this work, we use the Hamming window [58] for the design of the low-pass filter, as described in Equation (4), where $N$ denotes the filter length. The calculation of the coefficients is shown in Equation (5), where $f$ and $f_s$ represent the cutoff and sampling frequencies, respectively. As mentioned earlier, the cutoff frequency $f$ should be between 0 and half the sampling frequency ($f_s/2$), and it needs to be limited to (0, 1) when calculating, which is achieved by $2f/f_s$. Finally, we obtain the low-pass filter by multiplying the Hamming window and the coefficients (Equation (6)).
$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N}\right), \quad 0 \le n \le N-1$
$\lambda(n) = \frac{2f}{f_s}\,\mathrm{sinc}\!\left(\frac{2f}{f_s}\big(n - 0.5(N-1)\big)\right)$
$h(n) = \lambda(n) \cdot w(n)$
We use a low-pass filter of length $N = 25$ and resize it to a 5 × 5 convolution kernel in this paper. The kernel size was validated empirically as a trade-off between computational efficiency and anti-aliasing performance. The parameters of the Hamming window in Equation (4) were also tested empirically. In particular, we define the cutoff frequency $f$ as a learnable parameter, which is important because the ratio of low-frequency to high-frequency signals may differ at different resolutions. In general, the higher the resolution, the higher the proportion of low-frequency signals, so we can limit the bandwidth even further. Taking this a step further, we propose defining different learnable cutoff frequencies $f_i$, $1 \le i \le C$, for each input feature map and using the sigmoid function to constrain them to (0, 1), where $C$ denotes the number of channels. Finally, we obtain a convolution kernel of shape $(C, 1, 5, 5)$, which is depthwise separable and does not fuse features from different channels. It only performs anti-aliasing filtering and downsampling (stride = 2). The traditional convolution calculation is placed after downsampling, resulting in a larger receptive field.
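The sketch below shows one way to realize the learnable depthwise low-pass filtering and stride-2 downsampling of this subsection in PyTorch; the reshaping of the length-25 filter into a 5 × 5 kernel and the omission of the surrounding convolutions are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def hamming_lowpass(cutoff_ratio: torch.Tensor, N: int = 25) -> torch.Tensor:
    """Windowed-sinc low-pass filter of Equations (4)-(6);
    cutoff_ratio corresponds to 2f/f_s and lies in (0, 1)."""
    n = torch.arange(N, dtype=torch.float32)
    w = 0.54 - 0.46 * torch.cos(2 * math.pi * n / N)                     # Hamming window, Eq. (4)
    lam = cutoff_ratio * torch.sinc(cutoff_ratio * (n - 0.5 * (N - 1)))  # coefficients, Eq. (5)
    return lam * w                                                       # Eq. (6)

class AntiAliasDownsample(nn.Module):
    """Depthwise anti-aliasing filter with stride-2 downsampling; the
    per-channel learnable cutoff follows Section 5.2, while reshaping the
    1-D length-25 filter into a 5x5 kernel is an assumed convention."""
    def __init__(self, channels: int, N: int = 25):
        super().__init__()
        self.channels, self.N = channels, N
        self.cutoff_logit = nn.Parameter(torch.zeros(channels))  # sigmoid -> (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ratios = torch.sigmoid(self.cutoff_logit)
        kernel = torch.stack(
            [hamming_lowpass(r, self.N).reshape(5, 5) for r in ratios]
        ).unsqueeze(1)                                            # shape (C, 1, 5, 5)
        return F.conv2d(x, kernel, stride=2, padding=2, groups=self.channels)
```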

5.3. Upsampling Artifact

Following StyleGAN3 [56], ideal upsampling should not modify the continuous representation; its only purpose is to increase the output sampling rate. Similarly, the upsampling operation in PPNV2 consists of two steps. First, we increase the sampling rate to $s_{out} = s_{in} \cdot m$ by interleaving $m-1$ zeros between each pair of input samples ($m$ is usually set to 2), which is the same as in StyleGAN3. The feature map after staggered filling with zeros is shown in Figure 6 (zoom in for a better view), where $x$ refers to the original input features. Next, we perform convolution processing on the result and then apply low-pass filtering. In this work, we replace the bilinear or bicubic 2× upsampling filter with a 7 × 7 depthwise separable convolution approximation and consider the four computational cases shown in Figure 6, which were obtained empirically. The 7 × 7 kernel provides sufficient spatial coverage to interpolate zero-inserted features (Figure 6) while minimizing spectral distortion. In 2× upsampling, we focus on the calculation at different positions in the 2 × 2 box $\begin{bmatrix} 0 & 0 \\ 0 & x \end{bmatrix}$, which involves different convolution parameters and input features ($x$). Taking the first case (Figure 6a) as an example, there are 4 × 4 features involved in the calculation, which can be equivalent to bicubic interpolation in some cases. For the other three cases, fewer features are involved in the calculation, thus saving computational overhead compared to bicubic interpolation but introducing additional parameters. Since the kernel is depthwise separable, the increase in parameters is still tolerable. Given the special function of this convolution, the initialization of the parameters is also important. We use a weighted-average method to initialize the parameters, where the weight of a feature ($x$) is inversely proportional to its distance from the center. Specifically, taking the first case (Figure 6a) as an example again, we need to calculate the interpolation of the zero value in the center (green). There are three types of features with different distances around the center, and we assign different weights to these features. First, the weight is proportional to the inverse of the distance, as shown in Equations (7) and (8), where $w_1$, $w_2$, and $w_3$ represent the weights for the three types of distances ($l_1$, $l_2$, $l_3$), respectively. Second, the sum of all weights should be 1 (Equation (9)), allowing the three types of weights to be calculated (Equation (10)). The parameters for the second case (Figure 6b) and third case (Figure 6c) are calculated in the same way (note that for different cases, $w_n$ refers to different weights). For the last case (Figure 6d), we expect to preserve the original feature $x$, so we set the central parameter to 1 and the others to 0. Finally, the complete initialized kernel is shown in the third row of Figure 6.
$w_1 : w_2 : w_3 = \frac{1}{l_1} : \frac{1}{l_2} : \frac{1}{l_3}$
$l_1 : l_2 : l_3 = \sqrt{2} : \sqrt{3^2 + 1} : \sqrt{3^2 + 3^2}$
$4w_1 + 8w_2 + 4w_3 = 1$
$w_1 = 0.1122, \quad w_2 = 0.0502, \quad w_3 = 0.0374$
The complete downward propagation process is shown in Figure 5b. The input $y$ is first decoded with convolution, which corresponds to the encoding process of upward propagation. Next, upsampling is performed using the method described above, followed by low-pass filtering to eliminate aliasing frequencies above $s/2$. Finally, we use another convolution operation to reconstruct the result, corresponding to the feature extraction process in the upward propagation.
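As a concrete illustration, the snippet below performs the zero-interleaved 2× upsampling and reproduces the distance-weighted initialization of Equations (7)–(10); the placement of the original samples inside each 2 × 2 block follows the layout described above and is otherwise a convention assumption.

```python
import torch

def interleave_zeros(x: torch.Tensor, m: int = 2) -> torch.Tensor:
    """Raise the sampling rate to s_out = s_in * m by interleaving m - 1 zeros
    between input samples (Section 5.3). Original samples are placed at the
    bottom-right of each m x m block, matching the [[0, 0], [0, x]] layout."""
    B, C, H, W = x.shape
    up = x.new_zeros(B, C, H * m, W * m)
    up[:, :, m - 1::m, m - 1::m] = x
    return up

# Distance-weighted initialization for the zero at the centre of a 4 x 4
# neighbourhood of original features (Figure 6a; Equations (7)-(10)).
l1, l2, l3 = 2 ** 0.5, (3 ** 2 + 1) ** 0.5, (3 ** 2 + 3 ** 2) ** 0.5
inv = [1 / l1, 1 / l2, 1 / l3]                    # weights proportional to 1/distance
k = 1.0 / (4 * inv[0] + 8 * inv[1] + 4 * inv[2])  # enforce 4*w1 + 8*w2 + 4*w3 = 1
w1, w2, w3 = (k * v for v in inv)
print(f"w1={w1:.4f}, w2={w2:.4f}, w3={w3:.4f}")   # ~0.1122, 0.0502, 0.0374
```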

6. Improved Training Strategies

In PPNV1, the focus is on constructing neural network models based on predictive coding theories, while the training strategies are less explored. PPNV1 is a typical end-to-end training model that only utilizes the bottom-level predictions to calculate the loss. The careless signal processing makes the entire network model difficult to interpret (as discussed in Section 4), resulting in uncontrollable training of higher levels in the model. Using higher-level prediction errors to calculate losses may even lead to worse results. Additionally, the inconsistency between training and testing in image time-series tasks remains a challenging topic that has not been addressed in PPNV1. To tackle these issues, we propose several improvements in PPNV2. First, benefiting from the clear and improved network architecture, the entire network becomes controllable, allowing us to simultaneously calculate the training loss for higher levels. Second, we further optimize the calculation process of long-term prediction by computing both the encoding loss and the decoding loss simultaneously to enhance the anti-interference ability of neural units. The following subsections provide more details on these improvements.

6.1. Long-Term Prediction Training Strategy

Long-term prediction is a crucial task in future video frame prediction. When predicting further into the future, we need to input the predicted frame instead of the real frame to generate the next video frame, leading to a mismatch between training and testing. The predicted frames are “imperfect”, and the accumulation of prediction errors causes the predictions to become increasingly blurry over time. To alleviate this problem, we propose a tracking training method that uses the predicted frame as input and the real frame as the target during training. First, we take a fixed number $T$ of real sequence frames ($T$ is usually set to 10) as inputs for calculation. We denote the representation of the input frame at time step $t$ and level $l$ as $f_l^{t}$. These real frames can be directly used to compare with the predictions from the previous time step to calculate the prediction error and loss. The calculation of the prediction error is the same as in PPNV1, which is divided into positive and negative errors: $E_l^{t} = C\big(|f_l^{t} - P_l^{t-1}|, |P_l^{t-1} - f_l^{t}|\big)$, where $C$ denotes concatenation of the two matrices. The loss is obtained by calculating the squared Euclidean distance, $L = (f_l^{t} - P_l^{t-1})^2$, which is introduced in the next subsection. Second, we take the last predicted frame $P^{T}$ as a new input to predict $P^{T+1}$, and then continue to predict with $P^{T+1}$ to achieve continuous prediction. To distinguish it from the representation of the real frame, we denote the representation of the predicted frame input at time step $t$ and level $l$ as $\hat{f}_l^{t}$. Importantly, $\hat{f}_l^{t}$ is used to calculate the prediction error $E_l^{t}$ but not the loss, which differs from the case where the real frame is used as input. As mentioned earlier, the predicted frame is “imperfect”, and the loss calculated using the predicted frame might be “false”. Therefore, we must use the real frame to calculate the loss. To calculate the loss for higher levels, we maintain an encoding pathway for real frames during training, which shares parameters with the encoding pathway of the predicted frames (Figure 7). The real frame is only used to provide the target required for loss calculation during training and does not participate in the calculation of prediction generation. This encoding pathway is also closed during testing.
Furthermore, we calculate not only the prediction loss (Figure 7, $L_1$) at each level but also the loss of the encoding process of the predicted frame (Figure 7, $L_2$). The purpose is to enhance the anti-interference ability of the downsampling artifact. We aim to make the convolutional units robust enough to obtain the same features from “imperfect” predicted frames as from real ones, further mitigating the effects of prediction errors.
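The two-stage rollout described above can be sketched as follows; `model.step` and its signature are hypothetical placeholders for the PPNV2 interface, and whether the fed-back prediction is detached from the computation graph is a design choice left open here.

```python
import torch

def rollout(model, real_frames: torch.Tensor, T1: int, T2: int) -> torch.Tensor:
    """Teacher forcing with T1 real frames, then T2 autoregressive steps fed
    with the model's own predictions. `model.step(frame, state)` returning
    (prediction, state) is a hypothetical interface, not the actual PPNV2 API.
    real_frames has shape (B, T1, C, H, W)."""
    state, preds = None, []
    frame = real_frames[:, 0]
    for t in range(T1 + T2):
        pred, state = model.step(frame, state)   # predict the frame at t + 1
        preds.append(pred)
        if t + 1 < T1:
            frame = real_frames[:, t + 1]        # stage 1: real frame as next input
        else:
            frame = pred                         # stage 2: prediction as next input
    return torch.stack(preds, dim=1)             # (B, T1 + T2, C, H, W)
```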

6.2. Loss Function

In this subsection, we introduce the details of the loss function used in PPNV2. We split the computation into two stages during training: first calculating with real frames as input, followed by predicted frames. We assume that the length of the input real frame sequence is $T_1$ and the length of the predicted frame sequence is $T_2$, resulting in a total input length of $T = T_1 + T_2$. First, we compute the prediction loss (Figure 7, $L_1$) at each level and each time step throughout the entire calculation, which can be expressed as
$L_1 = \sum_{t=0}^{T} \sum_{l=0}^{L} \lambda_t \cdot \lambda_l \cdot \left(f_l^{t} - P_l^{t-1}\right)^2$
where $L$ denotes the number of network levels, and $\lambda_t$ and $\lambda_l$ represent the weighting coefficients at time step $t$ and level $l$, respectively. Importantly, we use the squared Euclidean distance to calculate the loss instead of the traditional mean squared error (MSE). The difference is that MSE divides the Euclidean distance by the total number of features, which can cause the loss value to be too small and may result in the gradient of some parameters being ignored. This can lead to the training process converging first and then diverging, which often occurs in low-precision training. Second, we calculate the encoding loss (Figure 7, $L_2$) simultaneously in the second stage to make the convolutional units robust enough to extract similar features from “imperfect” predicted frames as from real frames. The calculation of $L_2$ is defined as
$L_2 = \sum_{t=T_1}^{T} \sum_{l=1}^{L} \lambda_t \cdot \lambda_l \cdot \left(f_l^{t} - \hat{f}_l^{t}\right)^2$
where $f_l^{t}$ and $\hat{f}_l^{t}$ indicate the representations of the real frame and the predicted frame at time step $t$ and level $l$, respectively. In addition, we also use the Learned Perceptual Image Patch Similarity (LPIPS) [29] to calculate the training loss. Zhang et al. propose using deep features to measure the perceptual similarity between two images, which is implemented by a neural network. LPIPS is often used as an evaluation metric for video prediction, but to the best of our knowledge, it has not been used as a function to compute the training loss. In this work, we use the provided pretrained model (with VGG as the backbone) to compute the loss between the predicted frame and the target, which is crucial for sharpening the generated images:
$L_{lpips} = \sum_{t=0}^{T} \lambda_t \cdot f_{lpips}\!\left(f^{t}, P^{t-1}\right)$
where $f_{lpips}$ denotes the LPIPS pretrained model. Finally, the total loss is calculated as the sum of the above three losses:
$L_{total} = L_1 + L_2 + \alpha \cdot L_{lpips}$
Since the loss provided by LPIPS is very small, we multiply it by a coefficient $\alpha$.
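A single-level sketch of the combined loss of Equations (11)–(14), using the publicly available `lpips` package with a VGG backbone as a stand-in for the pretrained LPIPS model; the per-level and per-time-step weights of Equation (17) and the multi-level bookkeeping are omitted for brevity.

```python
import torch
import lpips  # pip install lpips; pretrained perceptual similarity models

lpips_vgg = lpips.LPIPS(net="vgg")  # VGG backbone, as used for the training loss

def squared_euclidean(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Squared Euclidean distance (a sum, deliberately not averaged into an MSE).
    return ((a - b) ** 2).sum()

def total_loss(preds, targets, enc_pred, enc_real, alpha: float) -> torch.Tensor:
    """preds/targets: predicted and real frames, shape (B, T, 3, H, W) in [-1, 1];
    enc_pred/enc_real: encoded features of predicted and real frames (L2 term).
    Returns L1 + L2 + alpha * LPIPS, as in Equation (14)."""
    T = preds.shape[1]
    L1 = sum(squared_euclidean(preds[:, t], targets[:, t]) for t in range(T))
    L2 = sum(squared_euclidean(enc_pred[:, t], enc_real[:, t]) for t in range(T))
    L_lpips = sum(lpips_vgg(preds[:, t], targets[:, t]).sum() for t in range(T))
    return L1 + L2 + alpha * L_lpips
```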

7. Results

7.1. Evaluation with SOTA

Similar to PPNV1 and most video prediction works, we use three kinds of metrics, structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and LPIPS, for quantitative evaluation. The calculations for SSIM and PSNR are shown in Equations (15) and (16). The LPIPS score is calculated using the designated pretrained model (AlexNet as the backbone). Among them, higher scores for SSIM and PSNR and a lower score for LPIPS indicate better results. We validate the proposed model on the KTH [59], Human3.6M, Caltech, and KITTI datasets, and the data preprocessing method is similar to previous works [23,28,60,61] to ensure the fairness of the evaluation. The specific experimental setup of this work is as follows: We use PyTorch 1.12.1 as the platform and Adam as the optimizer to implement the above model. The settings of the hyper-parameters defined in Equation (12) are shown in Equation (17), where $L$ denotes the network level. The value of the weighting factor $\alpha$ defined in Equation (14) is set to the total number of pixels in the input image ($\alpha = C \cdot H \cdot W$); the purpose is to ensure that the magnitude of the LPIPS loss is consistent with the $L_1$ and $L_2$ losses.
$SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
$PSNR = 10 \cdot \log_{10}\!\left(\frac{MAX_I^2}{MSE}\right)$
$\lambda_t = \begin{cases} 0.2, & t = 0 \\ 1.0, & t > 0 \end{cases} \qquad \lambda_l = \begin{cases} 1.0, & l = 0 \\ 1.0/L, & l > 0 \end{cases}$
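For concreteness, the weighting scheme of Equation (17) and the PSNR of Equation (16) can be written as below; the default `max_i = 1.0` assumes images normalized to [0, 1], which is an assumption rather than a setting stated in the text.

```python
import math

def lambda_t(t: int) -> float:
    # Time-step weight from Equation (17): the first step is down-weighted.
    return 0.2 if t == 0 else 1.0

def lambda_l(l: int, L: int) -> float:
    # Level weight from Equation (17): full weight at the bottom level, 1/L above.
    return 1.0 if l == 0 else 1.0 / L

def psnr(mse: float, max_i: float = 1.0) -> float:
    # Peak signal-to-noise ratio, Equation (16).
    return 10.0 * math.log10(max_i ** 2 / mse)
```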
Table 2 and Figure 8 give the quantitative results and visualization examples on the KTH dataset, respectively. The KTH dataset [59] is a benchmark for human action recognition, containing grayscale video sequences of six actions (walking, jogging, running, boxing, hand waving, hand clapping) performed by 25 subjects under varying conditions, widely used to evaluate spatiotemporal models and video prediction frameworks. From the results, it can be observed that the proposed method achieves a substantial improvement in the LPIPS evaluation while maintaining a high SSIM score, which means that it can generate sufficiently clear and sharp images while maintaining high pixel accuracy. Stochastic video prediction methods (such as SVAP-VAE [62]) can generate clearer future frames, so their LPIPS evaluation results are better, but they may be biased towards unconditional generation, resulting in lower pixel accuracy. On the contrary, some methods improve pixel accuracy through averaging, but the generated images are blurry (such as Conv-TT-LSTM [63]).
Table 3 and Figure 9 give the results on the Human3.6M dataset. It can be observed from Table 3 that the proposed method performs much better than other works in terms of long-term prediction, although the performance in the early stages is slightly worse. Reconstructing the person's silhouette and motion is the main difficulty of prediction on this dataset. Since most of the image area is a static background, the model only needs to keep the background unchanged to obtain high pixel accuracy, thus de-emphasizing the generation of the person in the image. Nevertheless, according to Figure 9, the proposed method can better solve the above problem.
Table 4 shows the quantitative evaluation results on the Caltech and KITTI datasets. It can be observed that the proposed method achieves a great improvement on the Caltech dataset. It can also be seen from Figure 10 that our method generates finer images and recovers more details. Although the pixel accuracy (SSIM and PSNR scores) on the KITTI dataset is not satisfactory, we obtain good LPIPS evaluation results, which means better visual performance (Figure 11). Quantitative trends show that PPNV2’s LPIPS degrades linearly at 0.02 per time step (vs. 0.04 for MSPN), indicating stable long-term prediction.

7.2. Ablations and Comparisons

We perform several ablation studies to estimate the effect of each artifact or method proposed above. Each ablation experiment is performed on the basis of the default method with the corresponding artifact removed or replaced. Table 5 and Figure 12 give the quantitative and qualitative results on the KTH dataset, respectively. Firstly, we highlight the superiority of the “Modulate” method by comparing it with the “Add” and “Concat” methods (Section 5.1). We previously proposed this module to better handle the two different input signals. It can be observed that the “Modulate” method outperforms the traditional “Add” and “Concat” methods in both quantitative and qualitative evaluations. In addition, the “Add” and “Concat” methods both experienced crashes (sudden drops in accuracy) during training, while the “Modulate” method did not, indicating that the gradient propagation of this method is indeed more stable.
Whether or not the low-pass filter (Section 5.2) is used has a considerable impact on the prediction quality. It can be observed from Table 5 and Figure 12 that in the absence of a low-pass filter to remove the aliasing signal (“NoFilter”), the long-term prediction performance degrades and the characterization of the human profile is noticeably worse. In addition, we remove the zero-interleaved upsampling method proposed in Section 5.3 and revert to traditional bilinear upsampling. It can be observed that the gain from the proposed method is not very large, and the traditional bilinear upsampling method can also obtain good results. However, the proposed method is more coherent in depicting the color of the person, without gradual fading, which is in line with our original motivation for proposing it. Finally, we also remove the LPIPS loss and only use the Euclidean distance to train the model. It can be observed from the results that although its pixel accuracy (SSIM and PSNR) is high, its visual performance is worse, and the LPIPS evaluation result is also the worst. This shows the significant role of the LPIPS loss in image sharpening.

8. Discussion and Conclusions

In this paper, we presented the Pyramidal Predictive Network V2 (PPNV2), an improved architecture for future video frame prediction that addresses the limitations of its predecessor, PPNV1 [23]. The proposed model incorporates advanced signal processing techniques, anti-aliasing design, and enhanced training strategies to generate sharp, detailed, and visually coherent future frames.
The development of PPNV2 is motivated by the understanding that perception is a predictive process, which can be learned through embodied learning [73,74,75]. This perspective suggests that different agents with various morphologies can adapt to the same predictive architecture, enabling them to perceive and interact with their environment effectively. The hierarchical and predictive coding-inspired design of PPNV2 aligns with this view, as it allows the model to learn multi-scale representations and generate predictions based on the agent’s sensory input.
From a technical perspective, the experimental results demonstrate the effectiveness of PPNV2 in outperforming state-of-the-art methods across various benchmark datasets, including KTH, Human3.6M, Caltech, and KITTI. The model’s ability to generate high-quality predictions while maintaining pixel accuracy and perceptual similarity highlights its potential for real-world applications, such as robotic navigation and autonomous systems.
In the context of modern robotic navigation, predictive perception can serve as a complementary function to avoid hazards and accidents. Just as humans coordinate themselves in society by anticipating the actions and intentions of others, robots equipped with predictive perception capabilities can navigate dynamic environments more safely and efficiently. By predicting future states of the environment, including the motion of obstacles and other agents, robots can plan and execute more informed and proactive actions, reducing the risk of collisions and accidents. The ablation study further validates the importance of each proposed component in PPNV2. The modulation module, inspired by the attention mechanism in the human visual system [44,45,46], proves to be superior in handling different input signals compared to traditional feature integration methods. The low-pass filter and interleaved upsampling technique effectively reduce aliasing and generate more coherent predictions. Moreover, the incorporation of the LPIPS loss [29] significantly improves the perceptual quality of the generated frames, leading to sharper and more realistic results. Despite the impressive performance of PPNV2, there are still several challenges and opportunities for future research. One direction is to explore the integration of PPNV2 with other perception and decision-making modules in robotic systems. For example, the predicted future frames can be used to estimate the likelihood of collisions or to identify potential paths for safe navigation.
In conclusion, the Pyramidal Predictive Network V2 (PPNV2) represents a significant advancement in future video frame prediction, addressing the limitations of previous approaches and incorporating novel techniques to generate high-quality predictions. The model’s design, inspired by the predictive nature of perception and embodied learning, makes it adaptable to different agents and morphologies. The experimental results and ablation study demonstrate the effectiveness of the proposed model and its components, highlighting its potential for improving robotic navigation and reducing the risk of accidents by enabling predictive perception.
The interpretability of the learned representations in PPNV2 is another important aspect to consider. While the proposed model achieves impressive performance, understanding the internal representations and decision-making processes can provide valuable insights into the model’s behavior and potential biases. Techniques such as visualization [76] and attribution methods [77] can be employed to analyze the learned features and identify the most informative regions in the input frames. This understanding can help in designing more transparent and explainable predictive models, which is crucial for building trust and acceptance of autonomous systems in society.
The next steps of PPNV2 could consider both the technical and biological perspectives:
  • Real-time embedded deployment: Adapt PPNV2 for resource-constrained robotic systems by optimizing its hierarchical architecture for real-time inference on embedded hardware (e.g., via quantization or neural architecture search).
  • Closed-loop control integration: Embed PPNV2 within a reinforcement learning framework [78] to enable end-to-end training of perception–action loops, where predictions directly inform control policies.
  • Neurosymbolic interpretability: Combine PPNV2’s hierarchical features with symbolic reasoning modules [79] to generate human-readable explanations of predicted trajectories (e.g., “pedestrian will turn left”).
  • Biologically plausible learning: Collaborate with cognitive science to align PPNV2’s predictive coding mechanisms with empirical neural data (e.g., fMRI or EEG studies [80]), advancing neurorobotic models.

Author Contributions

C.L.: Data Curation, Formal Analysis, Writing—Original Draft; J.Z.: Conceptualization, Funding Acquisition, Methodology; W.L.: Supervision, Resources; R.D.: Writing—Review & Editing, Validation; M.D.: Writing—Review & Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This project is supported by the Germany/Hong Kong Joint Research Scheme sponsored by the Research Grants Council of Hong Kong and the German Academic Exchange Service: G-PolyU505/22, and PolyU Grant: CD5E-P0043422.

Data Availability Statement

Our project is available at https://github.com/ANNMS-Projects/PNVP.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. DiCarlo, J.J.; Zoccolan, D.; Rust, N.C. How does the brain solve visual object recognition? Neuron 2012, 73, 415–434.
  2. Serre, T. Deep learning: The good, the bad, and the ugly. Annu. Rev. Vis. Sci. 2019, 5, 399–426.
  3. Felleman, D.J.; Van Essen, D.C. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1991, 1, 1–47.
  4. Tanaka, K. Inferotemporal cortex and object vision. Annu. Rev. Neurosci. 1996, 19, 109–139.
  5. Friston, K. A theory of cortical responses. Philos. Trans. R. Soc. B Biol. Sci. 2005, 360, 815–836.
  6. Rao, R.P.; Ballard, D.H. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 1999, 2, 79–87.
  7. Wu, B.; Nair, S.; Martin-Martin, R.; Fei-Fei, L.; Finn, C. Greedy hierarchical variational autoencoders for large-scale video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2318–2328.
  8. Wu, H.; Yao, Z.; Wang, J.; Long, M. MotionRNN: A flexible model for video prediction with spacetime-varying motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15435–15444.
  9. Liu, B.; Chen, Y.; Liu, S.; Kim, H.S. Deep learning in latent space for video prediction and compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 701–710.
  10. Lee, S.; Kim, H.G.; Choi, D.H.; Kim, H.I.; Ro, Y.M. Video prediction recalling long-term motion context via memory alignment learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3054–3063.
  11. Jin, B.; Hu, Y.; Tang, Q.; Niu, J.; Shi, Z.; Han, Y.; Li, X. Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4554–4563.
  12. Chatterjee, M.; Ahuja, N.; Cherian, A. A hierarchical variational neural uncertainty model for stochastic video prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9751–9761.
  13. Franceschi, J.Y.; Delasalles, E.; Chen, M.; Lamprier, S.; Gallinari, P. Stochastic latent residual video prediction. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 3233–3246.
  14. Chang, Z.; Zhang, X.; Wang, S.; Ma, S.; Gao, W. STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution Video Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13946–13955.
  15. Morris, B.T.; Trivedi, M.M. Learning, modeling, and classification of vehicle track patterns from live video. IEEE Trans. Intell. Transp. Syst. 2008, 9, 425–437.
  16. Wei, J.; Dolan, J.M.; Litkouhi, B. A prediction- and cost function-based algorithm for robust autonomous freeway driving. In Proceedings of the 2010 IEEE Intelligent Vehicles Symposium, La Jolla, CA, USA, 21–24 June 2010; pp. 512–517.
  17. Finn, C.; Levine, S. Deep visual foresight for planning robot motion. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2786–2793.
  18. Gao, X.; Jin, Y.; Zhao, Z.; Dou, Q.; Heng, P.-A. Future Frame Prediction for Robot-Assisted Surgery. In Information Processing in Medical Imaging; Feragen, A., Sommer, S., Schnabel, J., Nielsen, M., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 533–544. [Google Scholar]
  19. Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Deep learning for precipitation nowcasting: A benchmark and a new model. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  20. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
  21. Han, T.; Xie, W.; Zisserman, A. Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  22. Wang, J.; Jiao, J.; Liu, Y.H. Self-supervised video representation learning by pace prediction. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 504–521. [Google Scholar]
  23. Ling, C.; Zhong, J.; Li, W. Pyramidal Predictive Network: A Model for Visual-Frame Prediction Based on Predictive Coding Theory. Electronics 2022, 11, 2969. [Google Scholar] [CrossRef]
  24. Ling, C.; Li, W.; Zeng, J.; Zhong, J. Combined Deterministic and Stochastic Streams for Visual Prediction Using Predictive Coding. In Proceedings of the 2023 IEEE International Conference on Development and Learning (ICDL), Macau, China, 9–11 November 2023; pp. 467–472. [Google Scholar]
  25. Softky, W.R. Unsupervised pixel-prediction. Adv. Neural Inf. Process. Syst. 1996, 8, 809–815. [Google Scholar]
  26. Deco, G.; Schürmann, B. Predictive coding in the visual cortex by a recurrent network with gabor receptive fields. Neural Process. Lett. 2001, 14, 107–114. [Google Scholar] [CrossRef]
  27. Hollingworth, A. Constructing visual representations of natural scenes: The roles of short-and long-term visual memory. J. Exp. Psychol. Hum. Percept. Perform. 2004, 30, 519. [Google Scholar] [CrossRef]
  28. Lotter, W.; Kreiman, G.; Cox, D. Deep predictive coding networks for video prediction and unsupervised learning. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  29. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  30. Hosseini, M.; Maida, A.S.; Hosseini, M.; Raju, G. Inception-inspired lstm for next-frame video prediction. arXiv 2019, arXiv:1909.05622. [Google Scholar]
  31. Srivastava, N.; Mansimov, E.; Salakhudinov, R. Unsupervised learning of video representations using lstms. In Proceedings of the International Conference on Machine Learning. PMLR, Lille, France, 7–9 July 2015; pp. 843–852. [Google Scholar]
  32. Finn, C.; Goodfellow, I.; Levine, S. Unsupervised learning for physical interaction through video prediction. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  33. Mathieu, M.; Couprie, C.; LeCun, Y. Deep multi-scale video prediction beyond mean square error. arXiv 2015, arXiv:1511.05440. [Google Scholar]
  34. Vondrick, C.; Pirsiavash, H.; Torralba, A. Generating videos with scene dynamics. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  35. Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971. [Google Scholar]
  36. Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2255–2264. [Google Scholar]
  37. Friston, K. The free-energy principle: A unified brain theory? Nat. Rev. Neurosci. 2010, 11, 127–138. [Google Scholar] [CrossRef]
  38. Goodfellow, I.; Bengio, Y.; Courville, A. Deep learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  39. Burt, P.J.; Adelson, E.H. The Laplacian pyramid as a compact image code. In Readings in Computer Vision; Elsevier: Amsterdam, The Netherlands, 1987; pp. 671–679. [Google Scholar]
  40. Ke, T.W.; Maire, M.; Yu, S.X. Multigrid neural architectures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6665–6673. [Google Scholar]
  41. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  42. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  43. Ling, C.; Zhong, J.; Li, W. Predictive Coding Based Multiscale Network with Encoder-Decoder LSTM for Video Prediction. arXiv 2022, arXiv:2212.11642. [Google Scholar]
  44. Friston, K. Learning and inference in the brain. Neural Netw. 2003, 16, 1325–1352. [Google Scholar] [CrossRef]
  45. Friston, K. Hierarchical models in the brain. PLoS Comput. Biol. 2008, 4, e1000211. [Google Scholar]
  46. Hohwy, J.; Roepstorff, A.; Friston, K. Predictive coding explains binocular rivalry: An epistemological review. Cognition 2008, 108, 687–701. [Google Scholar] [CrossRef] [PubMed]
  47. Xu, Z.Q.J.; Zhang, Y.; Xiao, Y. Training behavior of deep neural network in frequency domain. In Proceedings of the International Conference on Neural Information Processing, Vancouver, BC, Canada, 8–14 December 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 264–274. [Google Scholar]
  48. Xu, Z.Q.J.; Zhang, Y.; Luo, T.; Xiao, Y.; Ma, Z. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv 2019, arXiv:1901.06523. [Google Scholar]
  49. Clark, A. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 2013, 36, 181–204. [Google Scholar]
  50. Aitchison, L.; Lengyel, M. With or without you: Predictive coding and Bayesian inference in the brain. Curr. Opin. Neurobiol. 2017, 46, 219–227. [Google Scholar]
  51. Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv 2018, arXiv:1811.12231. [Google Scholar]
  52. Oprea, S.; Martinez-Gonzalez, P.; Garcia-Garcia, A.; Castro-Vargas, J.A.; Orts-Escolano, S.; Garcia-Rodriguez, J.; Argyros, A. A review on deep learning techniques for video prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2806–2826. [Google Scholar]
  53. Azulay, A.; Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv 2018, arXiv:1805.12177. [Google Scholar]
  54. Zou, X.; Xiao, F.; Yu, Z.; Li, Y.; Lee, Y.J. Delving deeper into anti-aliasing in convnets. Int. J. Comput. Vis. 2022, 131, 67–81. [Google Scholar]
  55. Vasconcelos, C.; Larochelle, H.; Dumoulin, V.; Romijnders, R.; Le Roux, N.; Goroshin, R. Impact of aliasing on generalization in deep convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10529–10538. [Google Scholar]
  56. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863. [Google Scholar]
  57. Shannon, C.E. Communication in the presence of noise. Proc. IRE 1949, 37, 10–21. [Google Scholar] [CrossRef]
  58. Madisetti, V. The Digital Signal Processing Handbook; CRC Press: Boca Raton, FL, USA, 1997. [Google Scholar]
  59. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, Cambridge, UK, 23–26 August 2004; Volume 3, pp. 32–36. [Google Scholar]
  60. Straka, Z.; Svoboda, T.; Hoffmann, M. PreCNet: Next Frame Video Prediction Based on Predictive Coding. arXiv 2020, arXiv:2004.14878. [Google Scholar]
  61. Lin, X.; Zou, Q.; Xu, X.; Huang, Y.; Tian, Y. Motion-aware feature enhancement network for video prediction. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 688–700. [Google Scholar]
  62. Lee, A.X.; Zhang, R.; Ebert, F.; Abbeel, P.; Finn, C.; Levine, S. Stochastic adversarial video prediction. arXiv 2018, arXiv:1804.01523. [Google Scholar]
  63. Su, J.; Byeon, W.; Kossaifi, J.; Huang, F.; Kautz, J.; Anandkumar, A. Convolutional tensor-train lstm for spatio-temporal learning. Adv. Neural Inf. Process. Syst. 2020, 33, 13714–13726. [Google Scholar]
  64. Villegas, R.; Yang, J.; Hong, S.; Lin, X.; Lee, H. Decomposing motion and content for natural video sequence prediction. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  65. Oliu, M.; Selva, J.; Escalera, S. Folded recurrent neural networks for future video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 716–731. [Google Scholar]
  66. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  67. Wang, Y.; Gao, Z.; Long, M.; Wang, J.; Philip, S.Y. Predrnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Proceedings of the International Conference on Machine Learning. PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 5123–5132. [Google Scholar]
  68. Jin, B.; Hu, Y.; Zeng, Y.; Tang, Q.; Liu, S.; Ye, J. Varnet: Exploring variations for unsupervised video prediction. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 5801–5806. [Google Scholar]
  69. Wang, Y.; Jiang, L.; Yang, M.H.; Li, L.J.; Long, M.; Fei-Fei, L. Eidetic 3d lstm: A model for video prediction and beyond. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  70. Liu, Z.; Yeh, R.A.; Tang, X.; Liu, Y.; Agarwala, A. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4463–4471. [Google Scholar]
  71. Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Liu, G.; Tao, A.; Kautz, J.; Catanzaro, B. Video-to-Video Synthesis. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  72. Wu, Y.; Gao, R.; Park, J.; Chen, Q. Future video synthesis with object motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5539–5548. [Google Scholar]
  73. Shapiro, L. The embodied cognition research programme. Philos. Compass 2007, 2, 338–346. [Google Scholar]
  74. Smith, L.; Gasser, M. The development of embodied cognition: Six lessons from babies. Artif. Life 2005, 11, 13–29. [Google Scholar]
  75. Hoffmann, M.; Marques, H.; Arieta, A.; Sumioka, H.; Lungarella, M.; Pfeifer, R. Body schema in robotics: A review. IEEE Trans. Auton. Ment. Dev. 2010, 2, 304–324. [Google Scholar]
  76. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part I 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 818–833. [Google Scholar]
  77. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  78. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  79. Bhuyan, B.P.; Ramdane-Cherif, A.; Tomar, R.; Singh, T. Neuro-symbolic artificial intelligence: A survey. Neural Comput. Appl. 2024, 36, 12809–12844. [Google Scholar] [CrossRef]
  80. Kietzmann, T.C.; McClure, P.; Kriegeskorte, N. Deep neural networks in computational neuroscience. BioRxiv 2017, 133504. [Google Scholar]
Figure 1. The architecture of PPNV1. According to the definitions in PPNV1 [23], f_l^t (green), P_l^t (orange), and E_l^t (red) represent the sensory input, prediction, and prediction error at time step t and level l, respectively. In the figure, there is a lag in the upward propagation of sensory input. Specifically, f_l^{t+1} should be a higher-level representation of f_{l-1}^{t+1} rather than f_{l-1}^t. Furthermore, since f_l^{t+1} characterizes E_{l-1}^{t+1} simultaneously, P_l^t can be considered as a prediction of E_{l-1}^{t+1}. However, E_{l-1}^{t+1} is actually indirectly generated by P_l^t, which leads to a contradiction and makes the model difficult to interpret.
Figure 2. The complete architecture of the improved Pyramidal Predictive Network (PPNV2). Similar to PPNV1, f_l^t (green), P_l^t (orange), and E_l^t (red) represent the sensory input, prediction, and prediction error at time step t and level l, respectively. We first improved how sensory input and prediction errors are propagated and computed, making the model more interpretable (introduced in detail in Section 4). Second, we redesigned several characteristic components to address the problem of information aliasing. Here, “US” denotes the upsampling component, in which we replace traditional bilinear or bicubic upsampling by interleaving zeros and then using depthwise separable convolutions to implement the interpolation (Section 5.3). A low-pass filter with a learnable cutoff frequency is used for anti-aliasing filtering. “DS” denotes the downsampling component, in which low-pass filtering is also introduced to prevent high-frequency information from being mixed into low-frequency information (Section 5.2). “ModError” and “ModPred” are special modulation modules designed to better handle different inputs and to make gradient propagation easier and more stable (Section 5.1). Zoom in for a better view.
Figure 3. The signal processing of the Pyramidal Predictive Network has been redesigned. (a) The original model (PPNV1). (b) The improved model (PPNV2). We make several changes to the original structure, as shown in the red boxes in the figure. (i) We adjust the sensory input being propagated upward, using the sensory input of the next time step (f_{l-1}^{t+1}) instead of the current one (f_{l-1}^t) to solve the problem of lagging propagation. (ii) We separate the prediction error from the sensory input and compute them with separate novel downsampling components to prevent information aliasing and make the model more interpretable. (iii) Instead of feeding the different signals directly into the ConvLSTM, we use a network block to process them, made up of “ModError”, “ConvLSTM”, and “ModPred”, as shown in Figure 2.
Figure 4. Modulation module for integrating different input signals. (a) The proposed modulation module takes two input signals, x_1 and x_2, and performs a gated integration. The auxiliary signal x_2 is processed by convolutional layers and activated by the sigmoid and tanh functions to obtain scaling and shifting matrices. These matrices are then used to modulate the main signal x_1 through element-wise multiplication and addition. (b) Comparison of the proposed modulation module with traditional integration methods, namely, concatenation (Concat) and element-wise addition (Add). The modulation module prevents the direct leakage of information from x_2 to x_1 and provides a more stable gradient flow along the x_1 path (shown in red).
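A minimal PyTorch-style sketch of the gated integration described in Figure 4 is given below. The module name, channel arguments, and kernel sizes are assumptions for illustration and not the authors' exact implementation.

```python
# Sketch of the modulation module: the auxiliary signal x2 produces a sigmoid-gated
# scale and a tanh-bounded shift that modulate the main signal x1.
import torch
import torch.nn as nn

class Modulation(nn.Module):
    def __init__(self, ch_main, ch_aux):
        super().__init__()
        self.to_scale = nn.Conv2d(ch_aux, ch_main, kernel_size=3, padding=1)
        self.to_shift = nn.Conv2d(ch_aux, ch_main, kernel_size=3, padding=1)

    def forward(self, x1, x2):
        scale = torch.sigmoid(self.to_scale(x2))  # multiplicative gate in (0, 1)
        shift = torch.tanh(self.to_shift(x2))     # bounded additive term
        return x1 * scale + shift                 # element-wise modulation of x1
```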
Figure 5. Simplified flowchart for downsampling (a) and upsampling (b). We introduce a low-pass filter for anti-aliasing filtering. Importantly, we set learnable cutoff frequencies for each feature map of each input. “ZeroPad” means that the input is interleaved with m − 1 zeros in m-fold upsampling and then interpolated using a depthwise separable convolution.
Figure 6. The convolution interpolation calculation. We focus on the calculation at different positions in the 2 × 2 box [0 0; 0 x], which can be discussed in four different cases (a–d). In particular, the convolutional kernel parameters used in each case are non-overlapping (second row), and we can initialize the parameters for each case according to the distance from each parameter to the center (green). The third row shows the full initialized convolution kernel. Zoom in for a better view.
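The sketch below illustrates the resampling scheme of Figures 5 and 6 for the 2× case, assuming fixed binomial and bilinear-style kernels in place of the paper's learnable per-channel cutoff frequencies; function names and kernel sizes are illustrative.

```python
# Anti-aliased 2x downsampling (blur, then stride) and zero-interleaved 2x upsampling
# (insert zeros, then depthwise interpolation convolution).
import torch
import torch.nn.functional as F

def _binomial3(ch):
    k = torch.tensor([1.0, 2.0, 1.0])
    k2d = torch.outer(k, k) / 16.0                  # 3x3 low-pass kernel, sums to 1
    return k2d.expand(ch, 1, 3, 3).clone()

def blur_downsample(x):
    """Low-pass filter each channel, then subsample by 2 (anti-aliased 'DS')."""
    ch = x.shape[1]
    w = _binomial3(ch).to(x)
    x = F.conv2d(x, w, padding=1, groups=ch)        # depthwise blur
    return x[:, :, ::2, ::2]

def interleave_upsample(x):
    """Interleave zeros between samples, then interpolate with a depthwise conv ('US')."""
    b, c, h, w = x.shape
    up = x.new_zeros(b, c, 2 * h, 2 * w)
    up[:, :, ::2, ::2] = x                          # m-fold interleaving with m-1 zeros
    k = torch.tensor([[0.25, 0.5, 0.25],
                      [0.5,  1.0, 0.5 ],
                      [0.25, 0.5, 0.25]])           # bilinear-style initialization by distance to center
    w_k = k.expand(c, 1, 3, 3).clone().to(x)
    return F.conv2d(up, w_k, padding=1, groups=c)
```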
Figure 7. Improved training strategy in PPNV2 by calculating encoding and decoding losses. The decoding loss L_1, which is the prediction loss at each level, is obtained by comparing the current prediction P_l^t with the real sensory representation f_l^{t+1} of the next time step. The encoding loss L_2 is only calculated when the predicted frame is used as input. We maintain an additional encoding pathway for the real frame to obtain the target for calculating the loss, which shares the encoding pathway (DS_f) of the predicted frame.
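A schematic of how the two losses in Figure 7 could be combined is sketched below. The per-level feature lists and the mean-squared formulation are assumptions made for illustration, not the authors' exact objective.

```python
# Sketch of the two-part training objective: a per-level decoding loss and an
# encoding loss applied only when a predicted frame is fed back as input.
import torch.nn.functional as F

def decoding_loss(preds, feats_next):
    """Compare each level's prediction P_l^t with the real representation f_l^{t+1}."""
    return sum(F.mse_loss(p, f) for p, f in zip(preds, feats_next))

def encoding_loss(enc_from_pred, enc_from_real):
    """Pull the encoding of the predicted frame (through the shared pathway DS_f)
    toward the encoding of the real frame; targets are detached."""
    return sum(F.mse_loss(a, b.detach()) for a, b in zip(enc_from_pred, enc_from_real))
```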
Figure 8. Visualization examples on the KTH dataset. We use 10 frames as input to predict the next 30 frames.
Figure 9. Visualization examples on the Human3.6M dataset. We use 10 frames as input to predict the next 5 frames. The other results are obtained from [43]. Boxes highlight blurry predictions occurring at different time steps. Zoom in for a better view.
Figure 10. Visualization examples on the Caltech dataset. We use 10 frames as input to predict the next frame. Zoom in for a better view.
Figure 11. Visualization examples on the KITTI dataset. We use 10 frames as input to predict the next 5 frames. In each group, the first row shows the ground truth and the second row shows the prediction.
Figure 12. Ablations and comparisons on the KTH dataset: qualitative evaluation results. Shown in the figure are visualization examples of the predicted future 30 frames (10 → 40).
Table 1. Comparison of existing studies and justification for PPNV2.

Approach | Key Contribution | Shortcomings | PPNV2 Differentiation
RNNs/LSTMs [31,32] | Temporal modeling with recurrent architectures | Blurry predictions due to MSE loss; limited multi-scale reasoning; high computational cost | Pyramid architecture for multi-scale processing; hybrid LPIPS + Euclidean loss [29]; depthwise separable convolutions
Adversarial Methods [33,34] | Sharp frame generation via GANs | Training instability; poor temporal consistency; mode collapse | Deterministic predictive coding framework; anti-aliasing modules for temporal coherence; no adversarial training required
Social Models [35,36] | Socially aware trajectory prediction | Limited to trajectory-level prediction; cannot generate pixel-level frames; domain-specific constraints | Unified pixel- and feature-level prediction; generalizable to multiple robotic domains
Predictive Coding [5,6] | Neuroscience-inspired error correction | Theoretical frameworks lack efficient implementations; no multi-scale hierarchy | Practical pyramid implementation; multi-level error propagation
Pyramid Architectures [41,42] | Multi-scale feature extraction | Static spatial processing; no temporal dynamics | Temporal pyramid structure; frequency-aware down/upsampling
PPNV1 [43] | Temporal pyramid predictive coding | Signal aliasing artifacts; suboptimal training strategy; information lag in propagation | Modulation modules for signal refinement; multi-level simultaneous training; redesigned anti-aliasing filters
Table 2. Quantitative evaluation on the KTH dataset. The metrics are averaged over the predicted frames. Red and blue indicate the best and second-best results, respectively. Due to computational resource constraints, we run only one trial for each method, but we use fixed seeds for data splits and model initialization.

Method | 10 → 20 (SSIM↑ / PSNR↑ / LPIPS↓) | 10 → 40 (SSIM↑ / PSNR↑ / LPIPS↓)
MCNet [64] | 0.804 / 25.95 / - | 0.73 / 23.89 / -
fRNN [65] | 0.771 / 26.12 / - | 0.678 / 23.77 / -
PredRNN [66] | 0.839 / 27.55 / - | 0.703 / 24.16 / -
PredRNN++ [67] | 0.865 / 28.47 / - | 0.741 / 25.21 / -
VarNet [68] | 0.843 / 28.48 / - | 0.739 / 25.37 / -
SAVP-VAE [62] | 0.852 / 27.77 / 8.36 | 0.811 / 26.18 / 11.33
E3D-LSTM [69] | 0.879 / 29.31 / - | 0.810 / 27.24 / -
STMF [11] | 0.893 / 29.85 / 11.81 | 0.851 / 27.56 / 14.13
Conv-TT-LSTM [63] | 0.907 / 28.36 / 13.34 | 0.882 / 26.11 / 19.12
LMC-Memory [10] | 0.894 / 28.61 / 13.33 | 0.879 / 27.50 / 15.98
PPNet [23] | 0.886 / 31.02 / 13.12 | 0.821 / 28.37 / 23.19
MSPN [43] | 0.881 / 31.87 / 7.98 | 0.831 / 28.86 / 14.04
PPNV2 (Ours) | 0.893 / 32.05 / 4.76 | 0.833 / 28.97 / 8.93
Table 3. Quantitative evaluation on the Human3.6M dataset. The best results are marked in bold.

Method | Metric | T = 2 | T = 4 | T = 6 | T = 8 | T = 10
fRNN [65] | PSNR | 27.58 | 26.10 | 25.06 | 24.26 | 23.66
fRNN [65] | SSIM | 0.9000 | 0.8885 | 0.8799 | 0.8729 | 0.8675
fRNN [65] | LPIPS | 0.0515 | 0.0530 | 0.0540 | 0.0539 | 0.0542
MAFENet [61] | PSNR | 31.36 | 28.38 | 26.61 | 25.47 | 24.61
MAFENet [61] | SSIM | 0.9663 | 0.9528 | 0.9414 | 0.9326 | 0.9235
MAFENet [61] | LPIPS | 0.0151 | 0.0219 | 0.0287 | 0.0339 | 0.0419
MSPN [43] | PSNR | 31.95 | 29.19 | 27.46 | 26.44 | 25.52
MSPN [43] | SSIM | 0.9687 | 0.9577 | 0.9478 | 0.9382 | 0.9293
MSPN [43] | LPIPS | 0.0146 | 0.0271 | 0.0384 | 0.0480 | 0.0571
PPNV2 (Ours) | PSNR | 32.07 | 30.08 | 28.81 | 28.12 | 27.55
PPNV2 (Ours) | SSIM | 0.9645 | 0.9566 | 0.9510 | 0.9461 | 0.9421
PPNV2 (Ours) | LPIPS | 0.0169 | 0.0239 | 0.0288 | 0.0337 | 0.0381
Table 4. Quantitative evaluation on the Caltech and KITTI datasets. The best results are marked in bold.

Method | Caltech, 10 → 15 (SSIM / PSNR / LPIPS) | KITTI, 10 → 15 (SSIM / PSNR / LPIPS)
MCNet [64] | 0.705 / - / 37.34 | 0.555 / - / 37.39
PredNet [28] | 0.753 / - / 36.03 | 0.475 / - / 62.95
Voxel Flow [70] | 0.711 / - / 28.79 | 0.426 / - / 41.59
Vid2vid [71] | 0.751 / - / 20.14 | - / - / -
FVSOMP [72] | 0.756 / - / 16.50 | 0.608 / - / 30.49
PPNet [23] | 0.812 / 21.3 / 14.83 | 0.617 / 18.24 / 31.07
MSPN [43] | 0.818 / 23.88 / 10.98 | 0.629 / 19.44 / 32.10
PPNV2 (Ours) | 0.865 / 25.44 / 5.287 | 0.621 / 19.32 / 15.45
Table 5. Ablations and comparisons on the KTH dataset: quantitative evaluation results. The best results are marked in bold.

Ablation | 10 → 20 (SSIM↑ / PSNR↑ / LPIPS↓) | 10 → 40 (SSIM↑ / PSNR↑ / LPIPS↓)
Default | 0.893 / 32.05 / 4.76 | 0.833 / 28.97 / 8.93
Add | 0.886 / 31.87 / 5.08 | 0.817 / 28.46 / 9.75
Concat | 0.888 / 31.92 / 5.16 | 0.820 / 28.71 / 9.56
NoFilter | 0.882 / 31.70 / 5.21 | 0.808 / 28.39 / 9.71
Bilinear | 0.887 / 31.85 / 5.10 | 0.816 / 28.38 / 9.37
NoLPIPS | 0.889 / 31.99 / 11.45 | 0.824 / 28.79 / 20.41
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
