#### Recovering Causal States

Does NPRD lead to interpretable codes *Z*? To answer this, we further investigated the NPRD approximation to the Random Insertion Process (RIP), obtained in the previous paragraph. The ε-machine was computed by Marzen and Crutchfield [2] and is given in Figure 2 (left). The process has three causal states: State *A* represents those pasts where the future starts with 1<sup>2*k*</sup>0 (*k* = 0, 1, 2, ...)—these are the pasts ending in either 001 or 1011<sup>2*m*</sup> (*m* = 0, 1, 2, ...). State *B* represents those pasts ending in 10—the future has to start with 01 or 11. State *C* represents those pasts ending in either 00 or 01—the future has to start with 1<sup>2*k*+1</sup>0 (*k* = 0, 1, 2, ...).

The analytical solution to the Predictive Rate–Distortion problem was computed by Marzen and Crutchfield [2]. At *λ* > 0.5, the optimal solution collapses *A* and *B* into a single codeword, while all three states are mapped to separate codewords for *λ* ≤ 0.5.

Does NPRD recover this picture? We applied PCA to samples from *Z* computed at two different values of *λ*, *λ* = 0.25 and *λ* = 0.6. The first two principal components of *Z* are shown in Figure 3. Samples are colored by the causal states corresponding to the pasts of the trajectories that were encoded into the respective points by NPRD. On the left, obtained at *λ* = 0.6, the states A and B are collapsed, as expected. On the right, obtained at *λ* = 0.25, the three causal states are reflected as distinct modes of *Z*. Note that, at finite *M*, a fraction of pasts is ambiguous between the green and blue causal states; these are colored in black and NPRD maps them into a region between the modes corresponding to these states.
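The projection in Figure 3 can be reproduced with standard tooling. The following is a minimal sketch, assuming a matrix `codes` holding the sampled code vectors *Z* (one row per sample) and a parallel list `state_labels` giving the causal state of each underlying past; both names are placeholders rather than part of our implementation.

```python
# Minimal sketch: project sampled codes Z to 2D and color by causal state.
# `codes` (N x d array of sampled Z vectors) and `state_labels` (length-N list
# of causal-state names such as "A", "B", "C") are assumed inputs.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_code_space(codes, state_labels):
    # Fit PCA on the sampled codes and keep the first two components.
    pcs = PCA(n_components=2).fit_transform(codes)
    # One scatter per causal state, so each state gets its own color and legend entry.
    for state in sorted(set(state_labels)):
        mask = np.array([s == state for s in state_labels])
        plt.scatter(pcs[mask, 0], pcs[mask, 1], s=5, label=str(state))
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend(title="causal state")
    plt.show()
```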

**Figure 2.** Recovering the ε-machine from NPRD. **Left**: The ε-machine of the Random Insertion Process, as described by [2]. **Right**: After computing a code *Z* from a past *x*<sub>−15···−1</sub>, we recorded which of the three clusters the code moves to when appending the symbol 0 or 1 to the past sequence. The resulting transitions mirror those in the ε-machine.

**Figure 3.** Applying Principal Component Analysis to 5000 sampled codes *Z* for the Random Insertion Process, at *λ* = 0.6 (**left**) and *λ* = 0.25 (**right**). We show the first two principal components. Samples are colored according to the states in the ε-machine. There is a small number of samples from sequences that, at *M* = 15, cannot be uniquely attributed to any of the states (ambiguous between A and C); these are indicated in black.

In Figure 2 (right), we record, for each of the three modes, to which cluster the distribution of the code *Z* shifts when a symbol is appended. We restrict to those strings that have nonzero probability for RIP (no code will ever be needed for other strings). For comparison, we show the ε-machine computed by Marzen and Crutchfield [2]. Comparing the two diagrams shows that NPRD effectively recovers the ε-machine: the three causal states are represented by the three different modes of *Z*, and the effect of appending a symbol also mirrors the state transitions of the ε-machine.
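The bookkeeping behind Figure 2 (right) can be sketched as follows. The snippet assumes a callable `encode` that maps a length-*M* past to its code vector, and an array `centroids` containing one representative vector per mode (obtained, for example, by running k-means on the sampled codes); these names are illustrative and not part of our implementation.

```python
# Sketch of the procedure behind Figure 2 (right): assign each code to its
# nearest mode, append a symbol to the past, re-encode, and record the transition.
from collections import Counter
import numpy as np

def nearest_mode(code, centroids):
    # Index of the centroid closest to this code vector.
    dists = np.linalg.norm(centroids - code, axis=1)
    return int(np.argmin(dists))

def record_transitions(pasts, encode, centroids):
    # For each past (a length-M sequence with nonzero probability under RIP),
    # find its mode, then append 0 and 1 in turn (sliding the window by one)
    # and record which mode the re-encoded past lands in.
    transitions = Counter()
    for past in pasts:
        source = nearest_mode(encode(past), centroids)
        for symbol in (0, 1):
            new_past = list(past[1:]) + [symbol]
            target = nearest_mode(encode(new_past), centroids)
            transitions[(source, symbol, target)] += 1
    return transitions
```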

#### A Process with Many Causal States

We have seen that NPRD recovers the correct trade-off, and the structure of the causal states, in processes with a small number of causal states. How does it behave when the number of causal states is very large? In particular, is it capable of extrapolating to causal states that were never seen during training?

We consider the following process, which we will call COPY3: *X*<sub>−15</sub>, ..., *X*<sub>−1</sub> are independent uniform draws from {1, 2, 3}, and *X*<sub>1</sub> = *X*<sub>−1</sub>, ..., *X*<sub>15</sub> = *X*<sub>−15</sub>. This process deviates slightly from our usual setup since we define it only for *t* ∈ {−15, ..., 15}, but it is well suited to investigating this question: the number of causal states is 3<sup>15</sup> ≈ 14 million. With exactly the same setup as for the EVEN and RIP processes, NPRD achieved essentially zero distortion on unseen data, even though the number of training samples (3 million) was far lower than the number of distinct causal states. However, we found that, in this setup, NPRD overestimated the rate. When we increased the number of training samples from 3 million to 6 million, NPRD recovered codebooks that achieved both almost zero distortion and almost optimal rate on fresh samples (Figure 4). Even then, the number of distinct causal states is more than twice the number of training samples. These results demonstrate that, by using function approximation, NPRD is capable of extrapolating to unseen causal states, encoding and decoding appropriate codes on the fly.
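For concreteness, here is a minimal sampler for COPY3 as just defined; the function name and the choice of plain Python are illustrative only.

```python
# Minimal sampler for the COPY3 process: 15 i.i.d. uniform draws from {1, 2, 3}
# for the past, and a future that mirrors the past (X_1 = X_{-1}, ..., X_15 = X_{-15}).
import random

def sample_copy3(m=15):
    # Past: X_{-m}, ..., X_{-1}, in temporal order.
    past = [random.choice([1, 2, 3]) for _ in range(m)]
    # Future: X_1, ..., X_m, i.e., the past read backwards.
    future = past[::-1]
    return past, future

# The number of distinct pasts (= causal states) is 3**15 = 14,348,907.
```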

**Figure 4.** Rate–Distortion for the COPY3 process. We show NPRD samples, and the resulting upper bound in red. The gray line represents the analytical curve.

Note that one could easily design an optimal decoder and encoder for COPY3 by hand—the point of this experiment is to demonstrate that NPRD is capable of inducing such a codebook purely from data, in a general-purpose, off-the-shelf manner. This contrasts with OCF: without optimizations specific to the task at hand, a direct application of OCF would require brute-force storing of all 14 million distinct pasts and futures.
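To make the contrast concrete, the sketch below spells out such a hand-built codebook. The rate calculation assumes that rate is measured as the mutual information between the code and the past (here simply the entropy of the code); the function names are illustrative, not part of our implementation.

```python
# Sketch of a hand-designed lossless codebook for COPY3: the code Z is simply
# the 15 observed past symbols, and the decoder mirrors them to predict the future.
import math

def encode(past):
    # The code Z is the observed past itself.
    return tuple(past)

def decode(z):
    # The optimal prediction of the future is the mirrored past.
    return list(z)[::-1]

# With this codebook the distortion is zero; assuming rate = I[Z; past], the rate
# is H[Z] = 15 * ln 3 nats, since Z is a deterministic function of a uniform past
# over 3**15 outcomes.
rate_in_nats = 15 * math.log(3)   # roughly 16.5 nats
```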

#### **6. Estimating Predictive Rate–Distortion for Natural Language**

We consider the problem of estimating rate–distortion for natural language. Natural language has been a testing ground for information-theoretic ideas since Shannon's work. Much interest has been devoted to estimating the entropy rate of natural language [10,47–49]. Indeed, the information density of language has been linked to human processing effort and to language structure. The word-by-word information content has been shown to impact human processing effort as measured both by per-word reading times [50–52] and by brain signals measured through EEG [53,54]. Consequently, prediction is a fundamental component across theories of human language processing [54]. Relatedly, the Uniform Information Density and Constant Entropy Rate hypotheses [55–57] state that languages order information in sentences and discourse so that the entropy rate stays approximately constant.

The relevance of prediction to human language processing makes the *difficulty* of prediction another interesting aspect of language complexity: Predictive Rate–Distortion describes how much memory of the past humans need to maintain to predict future words accurately. Beyond the entropy rate, it thus forms another important aspect of linguistic complexity.

Understanding the complexity of prediction in language holds promise for a deeper understanding of the nature of language as a stochastic process, and of human language processing. Long-range correlations in text have long been a subject of study [58–63]. Recently, Dębowski [64] has studied the excess entropy of language across long-range discourses, aiming to better understand the nature of the stochastic processes underlying language. Koplenig et al. [65] show a link between traditional linguistic notions of grammatical structure and the information contained in word forms and word order. The idea that predicting future words creates a need to represent the past well also forms a cornerstone of theories of how humans process sentences [66,67].

We study prediction over the words within individual sentences. As in the previous experiments, we limit our computations to sequences of length 30, already improving over OCF by an order of magnitude. One motivation is that, when directly estimating PRD, computational cost has to increase with the length of the sequences considered, making the consideration of sequences of hundreds or thousands of words computationally infeasible. Another motivation is that we are ultimately interested in Predictive Rate–Distortion as a model of memory in human processing of grammatical structure, formalizing psycholinguistic models of how humans process individual sentences [66,67], and linking to studies of the relation between information theory and grammar [65].

#### *6.1. Part-of-Speech-Level Language Modeling*

We first consider the problem of predicting English on the level of Part-of-Speech (POS) tags, using the Universal POS tagset [68]. This is a simplified setting where the vocabulary is small (20 word types), and one can hope that OCF will produce reasonable results. We use the English portions of the Universal Dependencies Project [69] tagged with Universal POS Tags [68], consisting of approximately 586 K words. We used the training portions to estimate NPRD and OCF, and the validation portions to estimate the rate–distortion curve. We used NPRD to generate 350 codebooks for values of *λ* sampled from [0, 0.4]. We were only able to run OCF for *M* ≤ 3, as the number of sequences exceeds 10<sup>4</sup> already at *M* = 4.
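The practical limit on OCF can be checked directly by counting the distinct POS-tag contexts of each length in the training portion. The following is a minimal sketch, assuming a flat list `pos_tags` of Universal POS tags for the training corpus (the name is a placeholder).

```python
# Sketch: count distinct POS-tag contexts of length M, to see how quickly the
# number of sequences OCF must tabulate grows with M.
from collections import Counter

def count_contexts(pos_tags, max_m=5):
    # pos_tags: flat list of POS tags, e.g., ["DET", "NOUN", "VERB", ...]
    for m in range(1, max_m + 1):
        contexts = Counter(tuple(pos_tags[i:i + m])
                           for i in range(len(pos_tags) - m + 1))
        print(f"M = {m}: {len(contexts):,} distinct contexts observed")
```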

The PRD curve is shown in Figure 5 (left). In the setting of low rate and high distortion, NPRD and OCF (blue, *M* = 1, 2, 3) show essentially identical results. This holds true until *I*[*Z*, →*X*] ≈ 0.7, at which point the bounds provided by OCF deteriorate, showing the effects of overfitting. NPRD continues to provide estimates at greater rates.

Figure 5 (center) shows rate as a function of log(1/*λ*). Recall that *λ* is the trade-off parameter from the objective function (7). In Figure 5 (right), we show the predictiveness, i.e., the mutual information with the future, as a function of log(1/*λ*). As *λ* → 0, NPRD (red, *M* = 15) continues to discover structure, while OCF (blue, plotted for *M* = 1, 2, 3) exhausts its capacity.

Note that NPRD reports rates of 15 nats and more when modeling with very low distortion. A discrete codebook would need over 3 million distinct codewords for a code of such a rate, exceeding the size of the training corpus (about 500 K words), replicating what we found for the COPY3 process: Neural encoders and decoders can use the geometric structure of the code space to encode generalizations across different dimensions, supporting a very large effective number of distinct possible codes. Unlike discrete codebooks, the geometric structure makes it possible for neural encoders to construct appropriate codes 'on the fly' on new input.
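As a back-of-the-envelope check on this figure (our own arithmetic; the notation *N*(*R*) is introduced here purely for illustration): a discrete codebook realizing a rate of *R* nats needs on the order of e<sup>*R*</sup> distinct codewords, so

```latex
N(R) \approx e^{R}, \qquad N(15) = e^{15} \approx 3.27 \times 10^{6},
```

which already exceeds the roughly 5 × 10<sup>5</sup> word tokens of training data.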

**Figure 5.** **Left**: Rate–Predictiveness for English POS modeling. **Center** and **Right**: Rate (**center**) and Predictiveness (**right**) for English POS modeling, as a function of − log *λ*. As *λ* → 0, NPRD (red, *M* = 15) continues to discover structure, while OCF (blue, plotted for *M* = 1, 2, 3) exhausts its capacity.
