## **1. Introduction**

Predicting the future from past observations is a key problem across science. Constructing models that predict future observations is a fundamental method of making and testing scientific discoveries. Understanding and predicting dynamics has been a fundamental goal of physics for centuries. In engineering, devices often have to predict the future of their environment to perform efficiently. Living organisms need to make inferences about their environment and its future for survival.

Real-world systems often generate complex dynamics that can be hard to compute. A biological organism typically will not have the computational capacity to perfectly represent the environment. In science, measurements have limited precision, limiting the precision with which one can hope to make predictions. In these settings, the main goal will typically be to get good predictions at low computational cost.

This motivates the study of models that try to extract those key features of past observations that are most relevant to predicting the future. A general information-theoretic framework for this problem is provided by Predictive Rate–Distortion [1,2], also known as the past-future information bottleneck [3]. The Predictive Rate–Distortion trade-off seeks to find an encoding of past observations that is maximally informative about future observations, while satisfying a bound on the amount of information that has to be stored. More formally, this framework trades off prediction loss in the future, formalized as cross-entropy, against the cost of representing the past, formalized as the mutual information between the past observations and the compressed representations of the past. Due to its information-theoretic nature, this framework is extremely general and applies to processes across vastly different domains. It has been applied to linear dynamical systems [3,4], but is equally meaningful for discrete dynamical systems [2]. For biological systems that make predictions about their environment, this corresponds to placing an information-theoretic constraint on the computational cost used for conditioning actions on past observations [5].

The problem of determining encodings that optimize the Predictive Rate–Distortion trade-off has been solved for certain specific kinds of dynamics, namely for linear dynamic systems [3] and for processes whose predictive dynamic is represented exactly by a known, finite Hidden Markov Model [2]. However, real-world processes are often more complicated. When the dynamics are known, a representing Hidden Markov Model may still be extremely large or even infinite, making general-purpose automated computation difficult. Even more importantly, the underlying dynamics are often not known exactly. An organism typically does not have access to the exact dynamics of its surroundings. Similarly, the exact distribution of sentences in a natural language is not known exactly, precluding the application of methods that require an exact model. Such processes are typically only available implicitly, through a finite sample of trajectories.

Optimal Causal Filtering (OCF, Still et al. [6]) addresses the problem of estimating Predictive Rate–Distortion from a finite sample of observation trajectories. It does so by constructing a matrix of observed frequencies of different pairs of past and future observations. However, such a method faces a series of challenges [2]. One is the curse of dimensionality: modeling long dependencies requires storing an exponential number of observations, which quickly becomes intractable for current computation methods. This exponential growth is particularly problematic for processes with a large state space. For instance, the number of distinct words in a human language as found in large-scale text data easily exceeds 1 × 10<sup>5</sup>, making storing and manipulating counts of longer word sequences very challenging. A second challenge is overfitting: when deploying a predictive model constructed via OCF on new data to predict upcoming observations, such a model can only succeed when the past sequences occurred in the sample to which OCF was applied. This is because OCF relies on counts of full past and future observation sequences; it does not generalize to unseen past sequences.

Extrapolating to unseen past sequences is possible in traditional time series models representing processes that take continuous values; however, such methods are less easily applied to discrete sequences such as natural language. Recent research has seen a flurry of interest in flexible nonlinear function approximators, and in particular recurrent neural networks, which can handle sequences with discrete observations. Such machine learning methods provide generic models of sequence data. They form the basis of the state of the art in natural language prediction, by a clear and significant margin [7–10]. They have also been successfully applied to modeling many other kinds of time series found across disciplines [11–18].

We propose Neural Predictive Rate–Distortion (NPRD) to estimate Predictive Rate–Distortion when only a finite set of sample trajectories is given. We use neural networks both to encode past observations into a summary code, and to predict future observations from it. The universal function approximation capabilities of neural networks enable such networks to capture complex dynamics, with computational cost scaling only linearly with the length of observed trajectories, compared to the exponential cost of OCF. When deploying on new data, such a neural model can generalize seamlessly to unseen sequences, and generate appropriate novel summary encodings on the fly. Recent advances in neural variational inference [19,20] allow us to construct predictive models that provide almost optimal predictive performance at a given rate, and to estimate the Predictive Rate–Distortion trade-off from such networks. Our method can be applied to sample trajectories in an off-the-shelf manner, without prior knowledge of the underlying process.

In Section 2, we formally introduce Predictive Rate–Distortion, and discuss related notions of predictive complexity. In Section 3, we describe the prior method of Optimal Causal Filtering (OCF). In Section 4, we describe our method NPRD. In Section 5, we validate NPRD on processes whose Predictive Rate–Distortion is analytically known, showing that it finds essentially optimal predictive models. In Section 6, we apply NPRD to data from five natural languages, providing the first estimates of Predictive Rate–Distortion of natural language.

## **2. Predictive Rate–Distortion**

We consider stationary discrete-time stochastic processes $(X_t)_{t \in \mathbb{Z}}$, taking values in a state space *S*. Given a reference point in time, say *t* = 0, we are interested in the problem of predicting the future $X_{\ge 0} = (X_0, X_1, X_2, \dots)$ from the past $X_{< 0} = (\dots, X_{-2}, X_{-1})$. In general, unless the observations $X_t$ are independent, predicting the future of the process accurately will require taking into account the past observations. There is a trade-off between the accuracy of prediction and how much information about the past is taken into account. On one extreme, not taking the past into account at all, one cannot take advantage of the dependencies between past and future observations. On the other extreme, considering the entirety of the past observations $X_{< 0}$ can require storing large and potentially unbounded amounts of information. This trade-off between information storage and prediction accuracy is referred to as Predictive Rate–Distortion (PRD) [2]. The term rate refers to the amount of past information being taken into account, while distortion refers to the degradation in prediction compared to optimal prediction from the full past.

The problem of Predictive Rate–Distortion has been formalized by a range of studies. A principled and general formalization is provided by applying the Information Bottleneck idea [2,6,21]: Following [2], we write $\overleftarrow{X}$ for the past $X_{< 0}$ and $\overrightarrow{X}$ for the future $X_{\ge 0}$. We consider random variables *Z*, called codes, that summarize the past and are used by the observer to predict the future. Formally, *Z* needs to be independent of the future $\overrightarrow{X}$ conditional on the past $\overleftarrow{X}$: in other words, *Z* does not provide any information about the future except what is contained in the past. Symbolically:

$$Z \perp \overrightarrow{X} \mid \overleftarrow{X}. \tag{1}$$

This is equivalent to the requirement that $Z \leftrightarrow \overleftarrow{X} \leftrightarrow \overrightarrow{X}$ be a Markov chain. This formalizes the idea that the code is computed by an observer from the past, without having access to the future. Predictions are then made based only on *Z*, without additional access to the past $\overleftarrow{X}$.

The rate of the code *Z* is the mutual information between *Z* and the past: $\mathrm{I}[Z, \overleftarrow{X}]$. By the Channel Coding Theorem, this describes the channel capacity that the observer requires in order to transform past observations into the code *Z*.

The distortion is the loss in predictive accuracy when predicting from *Z*, relative to optimal prediction from the full past $\overleftarrow{X}$. In the Information Bottleneck formalization, this is the amount of mutual information between past and future that is *not* captured by *Z* [22]:

$$\mathrm{I}[\overleftarrow{X}, \overrightarrow{X} \mid Z]. \tag{2}$$

Due to the Markov condition, the distortion measure satisfies the relation

$$\mathrm{I}[\overleftarrow{X}, \overrightarrow{X} \mid Z] = \mathrm{I}[\overleftarrow{X}, \overrightarrow{X}] - \mathrm{I}[Z, \overrightarrow{X}], \tag{3}$$

i.e., it captures how much less information *Z* carries about the future $\overrightarrow{X}$ than the full past $\overleftarrow{X}$ does. For a fixed process $(X_t)_t$, choosing *Z* to minimize the distortion is equivalent to maximizing the mutual information between the code and the future:

$$\mathrm{I}[Z, \overrightarrow{X}]. \tag{4}$$

We will refer to (4) as the predictiveness of the code *Z*. The rate–distortion trade-off then chooses *Z* to minimize distortion at bounded rate:

$$\min_{Z \colon \mathrm{I}[\overleftarrow{X}, Z] \le d} \mathrm{I}[\overleftarrow{X}, \overrightarrow{X} \mid Z] \tag{5}$$

or—equivalently—maximize predictiveness at bounded rate:

$$\max_{Z \colon \mathrm{I}[\overleftarrow{X}, Z] \le d} \mathrm{I}[Z, \overrightarrow{X}]. \tag{6}$$

Equivalently, for each *λ* ≥ 0, we study the problem

$$\max_{Z} \left( \mathrm{I}[Z, \overrightarrow{X}] - \lambda \cdot \mathrm{I}[\overleftarrow{X}, Z] \right), \tag{7}$$

where the scope of the maximization is the class of all random variables *Z* such that $Z \leftrightarrow \overleftarrow{X} \leftrightarrow \overrightarrow{X}$ is a Markov chain.

The objective (7) is equivalent to the Information Bottleneck [21], applied to the past and future of a stochastic process. The coefficient *λ* indicates how strongly a high rate $\mathrm{I}[\overleftarrow{X}, Z]$ is penalized; higher values of *λ* result in lower rates and thus lower values of predictiveness.

The largest achievable predictiveness $\mathrm{I}[Z, \overrightarrow{X}]$ is equal to $\mathrm{I}[\overleftarrow{X}, \overrightarrow{X}]$, which is known as the excess entropy of the process [23]. Due to the Markov condition (1) and the Data Processing Inequality, the predictiveness of a code *Z* is always upper-bounded by its rate:

$$\mathrm{I}[Z, \overrightarrow{X}] \le \mathrm{I}[\overleftarrow{X}, Z]. \tag{8}$$

As a consequence, when *λ* ≥ 1, (7) is always optimized by a trivial *Z* with zero rate and zero predictiveness. When *λ* = 0, any lossless code optimizes the problem. Therefore, we will be concerned with the situation where *λ* ∈ (0, 1).
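The claim for *λ* ≥ 1 follows directly from the bound (8): for any admissible code *Z*,

$$\mathrm{I}[Z, \overrightarrow{X}] - \lambda \cdot \mathrm{I}[\overleftarrow{X}, Z] \;\le\; (1 - \lambda) \cdot \mathrm{I}[\overleftarrow{X}, Z] \;\le\; 0 \qquad (\lambda \ge 1),$$

so the objective (7) is at most zero, and a constant *Z* attains this value with zero rate.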

#### *2.1. Relation to Statistical Complexity*

Predictive Rate–Distortion is closely related to Statistical Complexity and the ε-machine [24,25]. Given a stationary process $(X_t)_t$, its causal states are the equivalence classes of semi-infinite pasts $\overleftarrow{X}$ that induce the same conditional probability over semi-infinite futures $\overrightarrow{X}$: two pasts $\overleftarrow{x}$, $\overleftarrow{x}'$ belong to the same causal state if and only if $P(x_{0 \dots k} \mid \overleftarrow{x}) = P(x_{0 \dots k} \mid \overleftarrow{x}')$ holds for all finite sequences $x_{0 \dots k}$ ($k \in \mathbb{N}$). Note that this definition is not measure-theoretically rigorous; such a treatment is provided by Löhr [26].

The causal states constitute the state set of a Hidden Markov Model (HMM) for the process, referred to as the ε-machine [24]. The statistical complexity of a process is the state entropy of the ε-machine. Statistical complexity can be computed easily if the ε-machine is analytically known, but estimating it empirically from time series data is very challenging and seems to require additional assumptions about the process [27].

Marzen and Crutchfield [2] show that Predictive Rate–Distortion can be computed when the ε-machine is analytically known, by proving that the problem is equivalent to compressing causal states, i.e., equivalence classes of pasts, to predict the causal states of the backwards process, i.e., equivalence classes of futures. Furthermore, Still et al. [6] show that, in the limit *λ* → 0, the code *Z* that optimizes Predictive Rate–Distortion (7) turns into the causal states.
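As a concrete illustration (ours, not part of the original exposition), causal states can be recovered for a simple process by grouping finite pasts that induce the same conditional distribution over the next symbol. The sketch below does this exactly for the Golden Mean Process, a standard two-state ε-machine example; the choice of process and all names are ours:

```python
from itertools import product
from collections import defaultdict

# epsilon-machine of the Golden Mean Process: state A emits '0' or '1'
# with probability 1/2 (emitting '1' leads to B); state B must emit '0'.
T = {('A', '0'): ('A', 0.5), ('A', '1'): ('B', 0.5), ('B', '0'): ('A', 1.0)}
pi = {'A': 2 / 3, 'B': 1 / 3}  # stationary distribution over hidden states

def seq_prob(seq, start):
    """Probability of emitting seq from hidden state start, plus the end state."""
    p, s = 1.0, start
    for x in seq:
        if (s, x) not in T:
            return 0.0, None
        s, q = T[(s, x)]
        p *= q
    return p, s

# P(next = '1' | past) for every possible past of length 3
cond = {}
for past in product('01', repeat=3):
    p_past, p_next1 = 0.0, 0.0
    for s0 in pi:
        p, s_end = seq_prob(past, s0)
        if p == 0.0:
            continue
        p_past += pi[s0] * p
        p1, _ = seq_prob('1', s_end)
        p_next1 += pi[s0] * p * p1
    if p_past > 0:
        cond[''.join(past)] = round(p_next1 / p_past, 6)

# causal states: pasts grouped by their conditional future distribution
states = defaultdict(list)
for past, p1 in sorted(cond.items()):
    states[p1].append(past)
```

For this process, the grouping recovers exactly two causal states: pasts ending in 0, after which the next symbol is 1 with probability 1/2, and pasts ending in 1, after which the next symbol is surely 0 (pasts containing "11" have probability zero and drop out).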

#### *2.2. Related Work*

There are related models that represent past observations by extracting those features that are relevant for prediction. Predictive State Representations [28,29] and Observable Operator Models [30] encode past observations as sets of predictions about future observations. Rubin et al. [31] study agents that trade the cost of representing information about the environment against the reward they can obtain by choosing actions based on the representation. Relatedly, Still [1] introduces a Recursive Past Future Information Bottleneck where past information is compressed repeatedly, not just at one reference point in time.

As discussed in Section 2.1, estimating Predictive Rate–Distortion is related to the problem of estimating statistical complexity. Clarke et al. [27] and Still et al. [6] consider the problem of estimating statistical complexity from finite data. While statistical complexity is hard to identify from finite data in general, Clarke et al. [27] introduce certain restrictions on the underlying process that make this more feasible.

## **3. Prior Work: Optimal Causal Filtering**

The main prior method for estimating Predictive Rate–Distortion from data is Optimal Causal Filtering (OCF, Still et al. [6]). This method approximates Predictive Rate–Distortion using two approximations. First, it replaces semi-infinite pasts and futures with bounded-length contexts, i.e., pairs of finite past contexts $\overleftarrow{X}^M := X_{-M} \dots X_{-1}$ and future contexts $\overrightarrow{X}^M := X_{0} \dots X_{M-1}$ of some finite length *M*. (It is not crucial that past and future contexts have the same length, and indeed Still et al. [6] do not assume this; we assume equal length throughout this paper for simplicity of our experiments, though nothing depends on this.) The PRD objective (7) then becomes (9), aiming to predict length-*M* finite futures from summary codes *Z* of length-*M* finite pasts:

$$\max_{Z \colon Z \perp \overrightarrow{X}^M \mid \overleftarrow{X}^M} \left( \mathrm{I}[Z, \overrightarrow{X}^M] - \lambda \cdot \mathrm{I}[\overleftarrow{X}^M, Z] \right). \tag{9}$$

Second, OCF estimates information measures directly from the observed counts of pairs $(\overleftarrow{X}^M, \overrightarrow{X}^M)$ using the plug-in estimator of mutual information. With such an estimator, the problem in (9) can be solved using a variant of the Blahut–Arimoto algorithm [21], obtaining an encoder $P(Z \mid \overleftarrow{X}^M)$ that maps each observed past sequence $\overleftarrow{X}^M$ to a distribution over a (finite) set of code words *Z*.
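For concreteness, the following is a minimal sketch (ours, not code from [6] or [21]) of such a self-consistent Information Bottleneck iteration on an empirical joint matrix, using the convention β = 1/λ; the toy joint distribution and all names are made up for illustration:

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats, with the 0 log 0 = 0 convention."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / np.maximum(q[m], 1e-12))))

def ib_blahut_arimoto(p_xy, beta, n_codes=2, iters=300, seed=0):
    """Self-consistent IB iteration on a joint P(past, future) matrix;
    beta plays the role of 1/lambda in objective (9)."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y_x = p_xy / p_x[:, None]                        # P(future | past)
    p_z_x = rng.dirichlet(np.ones(n_codes), len(p_x))  # initial P(z | past)
    for _ in range(iters):
        p_z = np.maximum(p_x @ p_z_x, 1e-12)           # marginal P(z)
        # decoder P(future | z), obtained via Bayes' rule
        p_y_z = (p_z_x * p_x[:, None]).T @ p_y_x / p_z[:, None]
        # re-estimate the encoder P(z | past)
        d = np.array([[kl(p_y_x[i], p_y_z[z]) for z in range(n_codes)]
                      for i in range(len(p_x))])
        p_z_x = p_z[None, :] * np.exp(-beta * d)
        p_z_x /= p_z_x.sum(axis=1, keepdims=True)
    p_z = np.maximum(p_x @ p_z_x, 1e-12)
    rate = sum(p_x[i] * kl(p_z_x[i], p_z) for i in range(len(p_x)))
    return p_z_x, rate

# toy joint over 3 observed pasts and 2 observed futures (made-up counts)
p_xy = np.array([[4.0, 1.0], [1.0, 4.0], [2.0, 2.0]])
p_xy /= p_xy.sum()
encoder, rate = ib_blahut_arimoto(p_xy, beta=5.0)
```

At large β the iteration tends to assign predictively distinct pasts to different code words; at small β it collapses toward a single code word, tracing out points on the trade-off curve.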

Two main challenges have been noted in prior work. First, solving the problem for a finite empirical sample leads to overfitting, overestimating the amount of structure in the process. Still et al. [6] address this by subtracting an asymptotic correction term that becomes valid in the limit of large *M* and *λ* → 0, when the codebook $P(Z \mid \overleftarrow{X}^M)$ becomes deterministic, and which allows them to select a deterministic codebook of appropriate complexity. This leaves open how to obtain estimates outside of this regime, when the codebook can be far from deterministic.

The second challenge is that OCF requires the construction of a matrix whose rows and columns are indexed by the observed past and future sequences [2]. Depending on the topological entropy of the process, the number of such sequences can grow as $|A|^M$, where *A* is the set of observed symbols, and processes of interest often do show this exponential growth [2]. This is drastic in the case of natural language, where *A* contains thousands of words.

A further challenge is that OCF is infeasible if the number of required codewords is too large, again because it requires constructing a matrix whose rows and columns are indexed by the codewords and observed sequences. Given that storing and manipulating matrices larger than $10^5 \times 10^5$ is currently not feasible, a setting where $\mathrm{I}[\overleftarrow{X}, Z] > \log 10^5 \approx 11.5$ nats cannot be captured with OCF.

## **4. Neural Estimation via Variational Upper Bound**

We now introduce our method, Neural Predictive Rate–Distortion (NPRD), to address the limitations of OCF using parametric function approximation: whereas OCF constructs a codebook mapping between observed sequences and codes, we use general-purpose function approximators to compute the representation *Z* from the past and to estimate a distribution over future sequences from *Z*. In particular, we will use recurrent neural networks, which are known to provide good models of sequences from a wide range of domains; our method is also applicable to other types of function approximators.

This has two main advantages, addressing the limitations of OCF: first, unlike OCF, function approximators can discover generalizations across similar sequences, enabling the method to compute good codes *Z* even for past sequences that were not seen previously. This is of paramount importance in settings where the state space is large, such as the set of words of a natural language. Second, the cost of storing and evaluating the function approximators scales *linearly* with the length of observed sequences, both in space and in time, as opposed to the exponential memory demand of OCF. This is crucial for modeling long dependencies.

#### *4.1. Main Result: Variational Bound on Predictive Rate–Distortion*

We first describe the general method, without yet committing to a specific family of function approximators. We construct a bound on Predictive Rate–Distortion and optimize this bound within a parametric family of function approximators to obtain an encoding *Z* that is close to optimal for the nonparametric objective (7).

As in OCF (Section 3), we assume that a set of finite sample trajectories $x_{-M} \dots x_{M-1}$ is available, and we aim to compress pasts of length *M* to predict futures of length *M*. To carry this out, we restrict the PRD objective (7) to such finite-length pasts and futures:

$$\max_{Z \colon Z \perp \overrightarrow{X}^M \mid \overleftarrow{X}^M} \left( \mathrm{I}[Z, \overrightarrow{X}^M] - \lambda \cdot \mathrm{I}[\overleftarrow{X}^M, Z] \right). \tag{10}$$

It will be convenient to equivalently rewrite (10) as

$$\min_{Z \colon Z \perp \overrightarrow{X}^M \mid \overleftarrow{X}^M} \left[ \mathrm{H}[\overrightarrow{X}^M \mid Z] + \lambda \cdot \mathrm{I}[\overleftarrow{X}^M, Z] \right], \tag{11}$$

where $\mathrm{H}[\overrightarrow{X}^M \mid Z]$ is the prediction loss. Note that minimizing the prediction loss is equivalent to maximizing the predictiveness $\mathrm{I}[Z, \overrightarrow{X}^M]$, since $\mathrm{I}[Z, \overrightarrow{X}^M] = \mathrm{H}[\overrightarrow{X}^M] - \mathrm{H}[\overrightarrow{X}^M \mid Z]$ and $\mathrm{H}[\overrightarrow{X}^M]$ does not depend on *Z*.

When deploying such a predictive code *Z*, two components have to be computed: a distribution $P(Z \mid \overleftarrow{X}^M)$ that encodes past observations into a code *Z*, and a distribution $P(\overrightarrow{X}^M \mid Z)$ that decodes the code *Z* into predictions about the future.

Let us assume that we already have some encoding distribution

$$Z \sim P_\phi(Z \mid \overleftarrow{X}^M), \tag{12}$$

where *φ* is the encoder, expressed in some family of function approximators. The encoder transduces an observation sequence $\overleftarrow{X}^M$ into the parameters of the distribution $P_\phi(\cdot \mid \overleftarrow{X}^M)$. From this encoding distribution, one can obtain the optimal decoding distribution over future observations via Bayes' rule:

$$P(\overrightarrow{X}^M \mid Z) = \frac{P(\overrightarrow{X}^M, Z)}{P(Z)} = \frac{\mathbb{E}_{\overleftarrow{X}^M} P(\overrightarrow{X}^M, Z \mid \overleftarrow{X}^M)}{\mathbb{E}_{\overleftarrow{X}^M} P_\phi(Z \mid \overleftarrow{X}^M)} \stackrel{(*)}{=} \frac{\mathbb{E}_{\overleftarrow{X}^M} P(\overrightarrow{X}^M \mid \overleftarrow{X}^M)\, P_\phi(Z \mid \overleftarrow{X}^M)}{\mathbb{E}_{\overleftarrow{X}^M} P_\phi(Z \mid \overleftarrow{X}^M)}, \tag{13}$$

where (∗) uses the Markov condition $Z \perp \overrightarrow{X}^M \mid \overleftarrow{X}^M$. However, neither of the two expectations in the last expression of (13) is tractable, as they require summation over exponentially many sequences, and algorithms (e.g., dynamic programming) to compute this sum efficiently are not available in general. For the same reason, the rate $\mathrm{I}[\overleftarrow{X}^M, Z]$ of the code $Z \sim P_\phi(Z \mid \overleftarrow{X}^M)$ is also generally intractable.

Our method is to introduce additional functions, also expressed using function approximators, that approximate some of these intractable quantities: first, we use a parameterized probability distribution *q* as an approximation to the intractable marginal $P(Z) = \mathbb{E}_{\overleftarrow{X}^M} P_\phi(Z \mid \overleftarrow{X}^M)$:

$$q(Z) \text{ approximates } P(Z) = \mathbb{E}_{\overleftarrow{X}^M} P_\phi(Z \mid \overleftarrow{X}^M). \tag{14}$$

Second, to approximate the decoding distribution $P(\overrightarrow{X}^M \mid Z)$, we introduce a parameterized decoder *ψ* that maps points $Z \in \mathbb{R}^N$ to probability distributions $P_\psi(\overrightarrow{X}^M \mid Z)$ over future observations $\overrightarrow{X}^M$:

$$P_\psi(\overrightarrow{X}^M \mid Z) \text{ approximates } P(\overrightarrow{X}^M \mid Z) \tag{15}$$

for each code $Z \in \mathbb{R}^N$. Crucially, $P_\psi(\overrightarrow{X}^M \mid Z)$ will be easy to compute efficiently, unlike the exact decoding distribution $P(\overrightarrow{X}^M \mid Z)$.

If we fix a stochastic process $(X_t)_{t \in \mathbb{Z}}$ and an encoder *φ*, then the following two bounds hold for *any* choice of the decoder *ψ* and the distribution *q*:

**Proposition 1.** *The loss incurred when predicting the future from Z via ψ upper-bounds the true conditional entropy of the future given Z, when predicting using the exact decoding distribution (13):*

$$-\mathbb{E}_{\overleftarrow{X}^M, \overrightarrow{X}^M}\, \mathbb{E}_{Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)} \left[ \log P_\psi(\overrightarrow{X}^M \mid Z) \right] \ge \mathrm{H}[\overrightarrow{X}^M \mid Z]. \tag{16}$$

*Furthermore, equality is attained if and only if $P_\psi(\overrightarrow{X}^M \mid Z) = P(\overrightarrow{X}^M \mid Z)$.*

**Proof.** By Gibbs' inequality:

$$\begin{aligned} -\mathbb{E}_{\overleftarrow{X}^M, \overrightarrow{X}^M}\, \mathbb{E}_{Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)} \left[ \log P_\psi(\overrightarrow{X}^M \mid Z) \right] &\ge -\mathbb{E}_{\overleftarrow{X}^M, \overrightarrow{X}^M}\, \mathbb{E}_{Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)} \left[ \log P(\overrightarrow{X}^M \mid Z) \right] \\ &= \mathrm{H}[\overrightarrow{X}^M \mid Z]. \end{aligned}$$

**Proposition 2.** *The KL Divergence between $P_\phi(Z \mid \overleftarrow{X}^M)$ and q(Z), averaged over pasts $\overleftarrow{X}^M$, upper-bounds the rate of Z:*

$$\mathbb{E}_{\overleftarrow{X}^M} \left[ \mathrm{D}_{\mathrm{KL}}\big(P_\phi(Z \mid \overleftarrow{X}^M) \,\|\, q(Z)\big) \right] = \mathbb{E}_{\overleftarrow{X}^M}\, \mathbb{E}_{Z \mid \overleftarrow{X}^M} \log \frac{P_\phi(Z \mid \overleftarrow{X}^M)}{q(Z)} \ge \mathbb{E}_{\overleftarrow{X}^M}\, \mathbb{E}_{Z \mid \overleftarrow{X}^M} \log \frac{P_\phi(Z \mid \overleftarrow{X}^M)}{P(Z)} = \mathrm{I}[\overleftarrow{X}^M, Z]. \tag{17}$$

*Equality is attained if and only if q(Z) is equal to the true marginal $P(Z) = \mathbb{E}_{\overleftarrow{X}^M} P_\phi(Z \mid \overleftarrow{X}^M)$.*

**Proof.** The two equalities follow from the definitions of KL Divergence and Mutual Information. To show the inequality, we again use Gibbs' inequality:

$$-\mathbb{E}_{\overleftarrow{X}^M}\, \mathbb{E}_{Z \mid \overleftarrow{X}^M} \log q(Z) = -\mathbb{E}_{Z} \log q(Z) \ge -\mathbb{E}_{Z} \log P(Z) = -\mathbb{E}_{\overleftarrow{X}^M}\, \mathbb{E}_{Z \mid \overleftarrow{X}^M} \log P(Z).$$

Here, equality holds if and only if *q*(*Z*) = *P*(*Z*), proving the second assertion.
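The equality case of Proposition 2 can be verified numerically in a small discrete example (our own sanity check; the three-past encoder below is arbitrary): when *q* is the true marginal *P*(*Z*), the averaged KL term coincides with the mutual information.

```python
import numpy as np

# toy discrete setting: 3 pasts, binary code Z, arbitrary encoder P(z | past)
p_x = np.array([0.5, 0.3, 0.2])                         # P(past)
p_z_x = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # P(z | past)

p_z = p_x @ p_z_x   # true marginal P(z), playing the role of q(Z)

# left-hand side of (17) with q = P(Z): E_past[ KL(P(z | past) || P(z)) ]
avg_kl = np.sum(p_x[:, None] * p_z_x * np.log(p_z_x / p_z[None, :]))

# mutual information I[past, Z] computed from the joint distribution
p_joint = p_x[:, None] * p_z_x
mi = np.sum(p_joint * np.log(p_joint / (p_x[:, None] * p_z[None, :])))
```

The two quantities agree exactly, as they must whenever *q* equals the true marginal.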

We now use the two propositions to rewrite the Predictive Rate–Distortion objective (11) in a way amenable to using function approximators, which is our main theoretical result, and the foundation of our proposed estimation method:

**Corollary 1** (Main Result)**.** *The Predictive Rate–Distortion objective (18)*

$$\min_{Z \colon Z \perp \overrightarrow{X}^M \mid \overleftarrow{X}^M} \left[ \mathrm{H}[\overrightarrow{X}^M \mid Z] + \lambda\, \mathrm{I}[\overleftarrow{X}^M, Z] \right] \tag{18}$$

*is equivalent to*

$$\min_{\phi, \psi, q}\; \mathbb{E}_{\overleftarrow{X}^M, \overrightarrow{X}^M} \left[ -\mathbb{E}_{Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)} \left[ \log P_\psi(\overrightarrow{X}^M \mid Z) \right] + \lambda \cdot \mathrm{D}_{\mathrm{KL}}\big(P_\phi(Z \mid \overleftarrow{X}^M) \,\|\, q(Z)\big) \right], \tag{19}$$

*where φ*, *ψ*, *q range over all triples of the appropriate types described above.*

*From a solution to (19), one obtains a solution to (18) by setting $Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)$. The rate of this code is given as follows:*

$$\mathrm{I}[\overleftarrow{X}^M, Z] = \mathbb{E}_{\overleftarrow{X}^M} \left[ \mathrm{D}_{\mathrm{KL}}\big(P_\phi(Z \mid \overleftarrow{X}^M) \,\|\, q(Z)\big) \right] \tag{20}$$

*and the prediction loss is given by*

$$\mathrm{H}[\overrightarrow{X}^M \mid Z] = -\mathbb{E}_{\overleftarrow{X}^M, \overrightarrow{X}^M}\, \mathbb{E}_{Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)} \left[ \log P_\psi(\overrightarrow{X}^M \mid Z) \right]. \tag{21}$$

**Proof.** By the two propositions, the term inside the minimization in (19) is an upper bound on (18), and takes on that value if and only if $P_\phi(\cdot \mid \overleftarrow{X}^M)$ equals the distribution of the *Z* optimizing (18), and *ψ*, *q* are as described in those propositions.

Note that the right-hand sides of (20) and (21) can both be estimated efficiently using Monte Carlo samples from $Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)$.

If *φ*, *ψ*, *q* are not exact solutions to (19), the two propositions guarantee that we still have bounds on rate and prediction loss for the code *Z* generated by *φ*:

$$\mathrm{I}[\overleftarrow{X}^M, Z] \le \mathbb{E}_{\overleftarrow{X}^M} \left[ \mathrm{D}_{\mathrm{KL}}\big(P_\phi(Z \mid \overleftarrow{X}^M) \,\|\, q(Z)\big) \right], \tag{22}$$

$$\mathrm{H}[\overrightarrow{X}^M \mid Z] \le -\mathbb{E}_{\overleftarrow{X}^M, \overrightarrow{X}^M}\, \mathbb{E}_{Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)} \left[ \log P_\psi(\overrightarrow{X}^M \mid Z) \right]. \tag{23}$$

To carry out the optimization (19), we restrict *φ*, *ψ*, *q* to a powerful parametric family of function approximators, within which we optimize the objective with gradient descent. While the resulting solutions may not be exact solutions to the nonparametric objective (19), they still satisfy the bounds (22) and (23), and, if the family of approximators is sufficiently rich, can come close to turning these bounds into the equalities (20) and (21).

#### *4.2. Choosing Approximating Families*

For our method of Neural Predictive Rate–Distortion (NPRD), we choose the approximating families for the encoder *φ*, the decoder *ψ*, and the distribution *q* to be certain types of neural networks that are known to provide strong and general models of sequences and distributions.

For *φ* and *ψ*, we use recurrent neural networks with Long Short-Term Memory (LSTM) cells [32], widely used for modeling sequential data across different domains. We parameterize the distribution $P_\phi(Z \mid \overleftarrow{X}^M)$ as a Gaussian whose mean and variance are computed from the past $\overleftarrow{X}^M$: we use an LSTM network to compute a vector $h \in \mathbb{R}^k$ from the past observations $\overleftarrow{X}^M$, and then compute

$$Z \sim \mathcal{N}\big(W_\mu h,\; (W_\sigma h)^2 I_{k \times k}\big), \tag{24}$$

where $W_\mu, W_\sigma \in \mathbb{R}^{k \times k}$ are parameters. While we found Gaussians sufficiently flexible for *φ*, more powerful encoders could be constructed using more flexible parametric families, such as normalizing flows [19,33].

For the decoder *ψ*, we use a second LSTM network to compute a sequence of vector representations $g_t = \psi(Z, X_{0 \dots t-1})$ ($g_t \in \mathbb{R}^k$) for $t = 0, \dots, M-1$. We compute predictions using the softmax rule

$$P_\psi(X_t = s_i \mid X_{0 \dots t-1}, Z) \propto \exp\big((W_o\, g_t)_i\big) \tag{25}$$

for each element $s_i$ of the state space $S = \{s_1, \dots, s_{|S|}\}$, where $W_o \in \mathbb{R}^{|S| \times k}$ is a parameter matrix to be optimized together with the parameters of the LSTM networks.
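In outline, a single encode/sample/decode step of Equations (24) and (25) looks as follows (a NumPy sketch of ours; the vectors `h` and `g_t` stand in for the actual LSTM computations, and all dimensions and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k, vocab = 8, 5   # code dimension k and state-space size |S| (illustrative)

h = rng.normal(size=k)   # stand-in for the encoder LSTM summary of the past
W_mu, W_sigma = rng.normal(size=(k, k)), rng.normal(size=(k, k))

# reparameterized sample from Eq. (24): Z = mu + sigma * eps, eps ~ N(0, I),
# with standard deviation |W_sigma h| so the variance is (W_sigma h)^2
mu, sigma = W_mu @ h, np.abs(W_sigma @ h)
Z = mu + sigma * rng.normal(size=k)

g_t = np.tanh(Z)         # stand-in for the decoder LSTM state g_t
W_o = rng.normal(size=(vocab, k))

# softmax rule of Eq. (25): distribution over the next symbol
logits = W_o @ g_t
p_next = np.exp(logits - logits.max())
p_next /= p_next.sum()
```

The subtraction of `logits.max()` before exponentiating is the usual numerically stable softmax; it leaves the normalized distribution unchanged.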

For *q*, we choose the family of Neural Autoregressive Flows [20]. This is a parametric family of distributions that allows efficient estimation of the probability density and its gradients. It generalizes a range of prior methods [19,33,34], offering efficient estimation while surpassing them in expressivity.

#### *4.3. Parameter Estimation and Evaluation*

We optimize the parameters of the neural networks expressing *φ*, *ψ*, *q* for the objective (19) using backpropagation and Adam [35], a standard and widely used gradient-based optimization method for neural networks. During optimization, we approximate the gradient by taking a single sample of *Z* (24) per sample trajectory $X_{-M}, \dots, X_{M-1}$, using the reparameterized gradient estimator introduced by Kingma and Welling [36]. This results in an unbiased estimator of the gradient of (19) w.r.t. the parameters of *φ*, *ψ*, *q*.

Following standard practice in machine learning, we split the data set of sample time series into three partitions (training set, validation set, and test set). We use the training set to optimize the parameters as described above. After every pass through the training set, the objective (19) is evaluated on the validation set using a Monte Carlo estimate with one sample *Z* per trajectory; optimization terminates once the value on the validation set no longer decreases.
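The stopping rule can be sketched generically as follows (our own illustration; `step` and `validate` are hypothetical callbacks standing in for one pass through the training set and for the validation-set Monte Carlo estimate of (19)):

```python
def train_with_early_stopping(step, validate, max_epochs=100):
    """Run training passes until the validation objective stops decreasing."""
    best = float('inf')
    for _ in range(max_epochs):
        step()               # one pass through the training set
        val = validate()     # Monte Carlo estimate of objective (19)
        if val >= best:      # no improvement on the validation set: stop
            break
        best = val
    return best

# toy usage: gradient descent on f(theta) = theta^2 (stands in for training)
theta = [5.0]
def step():
    theta[0] -= 0.3 * 2 * theta[0]
def validate():
    return theta[0] ** 2
best = train_with_early_stopping(step, validate)
```

Monitoring a held-out estimate rather than the training objective guards against the overfitting discussed for OCF in Section 3.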

After optimizing the parameters on a set of observed trajectories, we estimate rate and prediction loss on the test set. Given parameters for *φ*, *ψ*, *q*, we evaluate the PRD objective (19), the rate bound (22), and the prediction loss bound (23) on the test set by taking, for each time series $X_{-M} \dots X_{M-1} = \overleftarrow{X}^M \overrightarrow{X}^M$ from the test set, a single sample $z \sim \mathcal{N}(\mu, \sigma^2)$ and computing Monte Carlo estimates for the rate

$$\mathbb{E}_{\overleftarrow{X}^M} \left[ \mathrm{D}_{\mathrm{KL}}\big(P_\phi(Z \mid \overleftarrow{X}^M) \,\|\, q(Z)\big) \right] \approx \frac{1}{N} \sum_{X_{-M} \dots X_{M-1} \in \text{TestData}} \log \frac{p_{\mathcal{N}(\mu, \sigma^2)}(z)}{q(z)}, \tag{26}$$

where $p_{\mathcal{N}(\mu, \sigma^2)}(z)$ is the Gaussian density with *μ* and *σ*<sup>2</sup> computed from $\overleftarrow{X}^M$ as in (24) and *N* is the number of trajectories in the test set, and for the prediction loss

$$-\mathbb{E}_{Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)} \left[ \log P_\psi(\overrightarrow{X}^M \mid Z) \right] \approx -\frac{1}{N} \sum_{X_{-M} \dots X_{M-1} \in \text{TestData}} \log P_\psi(\overrightarrow{X}^M \mid z). \tag{27}$$

Thanks to (22) and (23), these estimates are guaranteed to be *upper bounds* on the true rate and prediction loss achieved by the code $Z \sim \mathcal{N}(\mu, \sigma^2)$, up to the sampling error introduced into the Monte Carlo estimation by sampling *z* and by the finite size of the test set.
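As a sanity check (ours, not from the original exposition), the estimator (26) can be compared against the closed-form Gaussian KL divergence in a one-dimensional setting, taking *q* to be a standard normal and fixing arbitrary stand-in encoder outputs μ = 1, σ = 0.5:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_normal(z, mu, sigma):
    """Log-density of N(mu, sigma^2) at z."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

mu, sigma = 1.0, 0.5                       # assumed encoder outputs for one past
z = rng.normal(mu, sigma, size=200_000)    # one code sample per trajectory

# Monte Carlo rate estimate in the style of (26), with q(Z) a standard normal
rate_mc = np.mean(log_normal(z, mu, sigma) - log_normal(z, 0.0, 1.0))

# closed-form KL( N(mu, sigma^2) || N(0, 1) ) for comparison
rate_exact = 0.5 * (sigma ** 2 + mu ** 2 - 1.0) - np.log(sigma)
```

With this many samples the Monte Carlo estimate matches the closed-form value, about 0.818 nats, to within sampling error.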

It is important to note that this sampling error is different from the overfitting issue affecting OCF: our Equations (26) and (27) provide *unbiased* estimators of upper bounds, whereas overfitting *biases* the values obtained by OCF. Given that Neural Predictive Rate–Distortion provably provides upper bounds on rate and prediction loss (up to sampling error), one can objectively compare the quality of different estimation methods: among methods that provide upper bounds, the one that provides the lowest such bound for a given problem is the one giving results closest to the true curve.
