Estimating Predictiveness

Given the estimate for prediction loss, we estimate predictiveness $I[Z, \overrightarrow{X}^M]$ with the following method. We use the encoder network that computes the code vector $h$ (24) to also estimate the marginal probability of the past observation sequence, $P_\eta(\overleftarrow{X}^M)$. $P_\eta$ has support over sequences of length $M$. Similar to the decoder $\psi$, we use the vector representations $f_t \in \mathbb{R}^k$ computed by the LSTM encoder after processing $X_{-M \ldots t}$ for $t = -M, \ldots, -1$, and then compute predictions using the softmax rule

$$P_\eta(X_{t+1} = s_i \mid X_{-M \ldots t}) \propto \exp\left((W_{o'} f_t)_i\right),\tag{28}$$

where $W_{o'} \in \mathbb{R}^{|S| \times k}$ is another parameter matrix.
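
As a concrete illustration, the following minimal PyTorch sketch implements a marginal model of this form: the encoder LSTM's hidden states play the role of the $f_t$, and a single linear layer plays the role of $W_{o'}$. The names (`encoder_lstm`, `W_o_prime`, `log_P_eta`) and the choice to score the first symbol from the zero initial state are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes: vocabulary |S|, hidden dimension k, block length M.
vocab_size, k, M = 10, 64, 16

embed = nn.Embedding(vocab_size, k)
encoder_lstm = nn.LSTM(k, k, batch_first=True)    # shared with the encoder network
W_o_prime = nn.Linear(k, vocab_size, bias=False)  # plays the role of W_{o'}

def log_P_eta(past):
    """log P_eta(past) for a batch of past blocks X_{-M...-1}, shape [B, M].

    The state f_t after reading X_{-M...t} scores the next symbol X_{t+1}
    via softmax(W_{o'} f_t); the first symbol is scored from the zero
    initial state (an assumption of this sketch), and the chain-rule
    log-probabilities are summed.
    """
    B = past.size(0)
    f, _ = encoder_lstm(embed(past[:, :-1]))       # f_t for all but the last step
    f0 = torch.zeros(B, 1, k)                      # state before any input
    logits = W_o_prime(torch.cat([f0, f], dim=1))  # [B, M, |S|]
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, past.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum(dim=1)                     # [B]
```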

Because we consider stationary processes, the length-$M$ future block $\overrightarrow{X}^M$ has the same marginal distribution as the length-$M$ past block $\overleftarrow{X}^M$, so the cross-entropy under $P_\eta$ of $\overrightarrow{X}^M$ is equal to the cross-entropy of $\overleftarrow{X}^M$ under the same encoding distribution: $-\mathbb{E}_{\overleftarrow{X}^M \overrightarrow{X}^M} \log P_\eta(\overrightarrow{X}^M) = -\mathbb{E}_{\overleftarrow{X}^M \overrightarrow{X}^M} \log P_\eta(\overleftarrow{X}^M)$. Using this observation, we estimate the predictiveness $I[Z, \overrightarrow{X}^M] = H[\overrightarrow{X}^M] - H[\overrightarrow{X}^M \mid Z]$ by the difference between the corresponding cross-entropies on the test set [37]:

$$-\mathbb{E}_{\overleftarrow{X}^M \overrightarrow{X}^M}\left[\log P_\eta\left(\overleftarrow{X}^M\right) - \log P_\psi\left(\overrightarrow{X}^M \,\middle|\, Z\right)\right],\tag{29}$$

which we approximate using Monte Carlo sampling on the test set as in (26) and (27).
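
As a sketch of how this estimate might be computed, the snippet below averages $\log P_\psi(\overrightarrow{X}^M \mid Z) - \log P_\eta(\overleftarrow{X}^M)$ over held-out past/future pairs, drawing a few code samples $Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)$ per pair. The helpers `encode`, `decode_log_prob`, and `log_P_eta`, standing in for the encoder $P_\phi$, the decoder $P_\psi$, and the marginal model above, are hypothetical.

```python
import torch

def estimate_predictiveness(test_pairs, encode, decode_log_prob, log_P_eta,
                            n_samples=16):
    """Monte Carlo estimate of (29): the average over test pairs of
    log P_psi(future | Z) - log P_eta(past).

    test_pairs: iterable of (past, future) LongTensor batches, each [B, M].
    encode(past): draws a code sample Z ~ P_phi(. | past).
    decode_log_prob(future, z): returns log P_psi(future | Z = z), shape [B].
    log_P_eta(past): returns log P_eta(past), shape [B].
    """
    total, count = 0.0, 0
    with torch.no_grad():
        for past, future in test_pairs:
            # Average over code samples, as in the estimates (26) and (27).
            log_psi = torch.stack(
                [decode_log_prob(future, encode(past)) for _ in range(n_samples)]
            ).mean(dim=0)
            total += (log_psi - log_P_eta(past)).sum().item()
            count += past.size(0)
    return total / count  # estimate of I[Z, future block]
```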

In order to optimize the parameters of $P_\eta$, we add the cross-entropy term $-\mathbb{E}_{\overleftarrow{X}^M} \log P_\eta(\overleftarrow{X}^M)$ to the PRD objective (19) during optimization, so that the full training objective becomes:

$$\min_{\phi,\, \psi,\, q,\, \eta} \; \mathbb{E}_{\overleftarrow{X}^M \overrightarrow{X}^M}\left[ -\mathbb{E}_{Z \sim P_\phi(\cdot \mid \overleftarrow{X}^M)}\left[\log P_\psi\left(\overrightarrow{X}^M \,\middle|\, Z\right)\right] + \lambda \cdot \mathrm{D}_{\mathrm{KL}}\left[P_\phi(Z \mid \overleftarrow{X}^M) \,\middle\|\, q(Z)\right] - \log P_\eta\left(\overleftarrow{X}^M\right)\right].\tag{30}$$
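
For concreteness, a per-batch loss of this form could be implemented as below, assuming Gaussian codes with a standard-normal marginal $q(Z)$ so that the KL term has a closed form; this parameterization is an assumption of the sketch, not necessarily the paper's choice of $q$. `encoder_stats`, `decode_log_prob`, and `log_P_eta` are hypothetical stand-ins for $P_\phi$, $P_\psi$, and the marginal model sketched earlier.

```python
import torch

def training_loss(past, future, lam, encoder_stats, decode_log_prob, log_P_eta):
    """One evaluation of the augmented objective (30) on a batch.

    encoder_stats(past): mean and log-variance of P_phi(Z | past), each [B, d].
    decode_log_prob(future, z): log P_psi(future | Z = z), shape [B].
    log_P_eta(past): log P_eta(past), shape [B].
    lam: the trade-off parameter lambda.
    """
    mu, log_var = encoder_stats(past)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterized sample
    recon = -decode_log_prob(future, z)            # -E_Z [log P_psi(future | Z)]
    # Closed-form KL[N(mu, sigma^2) || N(0, I)] for the assumed q(Z) = N(0, I).
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=1)
    marginal = -log_P_eta(past)                    # -log P_eta(past)
    return (recon + lam * kl + marginal).mean()
```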

Again, by Gibbs' inequality and Propositions 1 and 2, this objective is minimized when $P_\eta$ represents the true distribution over length-$M$ blocks $P(\overleftarrow{X}^M)$, $P_\phi(Z \mid \overleftarrow{X}^M)$ describes an optimal code for the given $\lambda$, $q$ is its marginal distribution, and $P_\psi(\overrightarrow{X}^M \mid Z)$ is the Bayes-optimal decoder. For approximate solutions to this augmented objective, the inequalities (22) and (23) will also remain true, due to Propositions 1 and 2.
