#### *6.2. Discussion*

Let us now consider the curves in Figure 5 in more detail. Fitting parametric curves to these empirical PRD curves, we find the surprising result that the statistical complexity of English sentences at the POS level appears to be unbounded.

The rate-predictiveness curve (left) shows that, at low rates, predictiveness is approximately proportional to the rate. At greater degrees of predictiveness, the rate grows faster and faster, whereas predictiveness seems to asymptote to ≈ 1.1 nats. The asymptote of predictiveness can be identified with the mutual information between past and future observations, $E_0 := \mathrm{I}[\overleftarrow{X}, \overrightarrow{X}]$, which is a lower bound on the excess entropy. The rate should asymptote to the statistical complexity. Judging by the curve, natural language, at the time scale we are measuring in this experiment, has a statistical complexity much higher than its excess entropy: at the highest rate measured by NPRD in our experiment, the rate is about 20 nats, whereas predictiveness is about 1.1 nats. If these values are correct, then, due to the convexity of the rate-predictiveness curve, statistical complexity exceeds the excess entropy by a factor of at least 20/1.1 ≈ 18. Note that this picture agrees qualitatively with the OCF results, which suggest a lower bound on the ratio of at least 2.5/0.6 > 4.

Now, turning to the other plots in Figure 5, we observe that rate increases at least linearly with $\log \frac{1}{\lambda}$, whereas predictiveness again asymptotes. This is in qualitative agreement with the picture gained from the rate-predictiveness curve.

Let us consider this more quantitatively. Based on Figure 5 (center), we make the ansatz that the map from $\log \frac{1}{\lambda}$ to the rate $R := \mathrm{I}[\overleftarrow{X}, Z]$ is superlinear:

$$R = \alpha \left( \log \frac{1}{\lambda} \right)^{\beta},\tag{33}$$

with *α* > 0, *β* > 1. We fitted $R \approx \left(\log \frac{1}{\lambda}\right)^{1.7}$ (i.e., *α* = 1, *β* = 1.7). Equivalently,

$$\frac{1}{\lambda} = \exp\left(\frac{1}{\alpha^{1/\beta}} R^{1/\beta}\right). \tag{34}$$

From this, we can derive expressions for the rate $R := \mathrm{I}[\overleftarrow{X}, Z]$ and the predictiveness $P := \mathrm{I}[Z, \overrightarrow{X}]$ as follows. For the solution of Predictive Rate–Distortion (10), we have

$$
\frac{\partial P}{\partial \theta} - \lambda \frac{\partial R}{\partial \theta} = 0,\tag{35}
$$

where *θ* is the codebook defining the encoding distribution $P(Z \mid \overleftarrow{X})$, and thus

$$
\lambda = \frac{\partial P}{\partial R}.\tag{36}
$$

Our ansatz therefore leads to the equation

$$\frac{\partial P}{\partial R} = \exp\left(-\frac{1}{\alpha^{1/\beta}} R^{1/\beta}\right). \tag{37}$$

Qualitatively, this says that predictiveness *P* asymptotes to a finite value, whereas rate *R*—which should asymptote to the statistical complexity—is unbounded.

**Figure 6.** Interpolated values for POS-level prediction of English (compare Figure 5).

Equation (37) has the solution

$$P = C - \alpha\beta \cdot \Gamma\left(\beta, \left(R/\alpha\right)^{1/\beta}\right),\tag{38}$$

where $\Gamma$ is the upper incomplete Gamma function. Since $\lim_{R \to \infty} P = C$, the constant *C* has to equal the maximally possible predictiveness $E_0 := \mathrm{I}[\overleftarrow{X}, \overrightarrow{X}]$.
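As a check (an added verification step, using the standard identity $\frac{\partial}{\partial x}\Gamma(\beta, x) = -x^{\beta-1} e^{-x}$ for the upper incomplete Gamma function), differentiating (38) with the chain rule recovers (37):

$$\frac{\partial P}{\partial R} = -\alpha\beta \cdot \left(-\left(\tfrac{R}{\alpha}\right)^{\frac{\beta-1}{\beta}} e^{-(R/\alpha)^{1/\beta}}\right) \cdot \frac{1}{\alpha\beta}\left(\tfrac{R}{\alpha}\right)^{\frac{1-\beta}{\beta}} = \exp\left(-\frac{1}{\alpha^{1/\beta}}\, R^{1/\beta}\right).$$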

Given the values fitted above (*α* = 1, *β* = 1.7), we found that *E*0 = 1.13 yielded a good fit. Using (33), this can be extended without further parameters to the third curve in Figure 5. Resulting fits are shown in Figure 6.
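Concretely, the extension uses the fact that, by (33), $(R/\alpha)^{1/\beta} = \log\frac{1}{\lambda}$, so that (38) can be written directly as a function of $\log\frac{1}{\lambda}$ with no additional parameters:

$$P = E_0 - \alpha\beta\,\Gamma\!\left(\beta, \log\frac{1}{\lambda}\right).$$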

Note that there are other possible ways of fitting these curves; we have described a simple one that requires only two parameters *α* > 0, *β* > 1, in addition to a guess for the maximal predictiveness $E_0$. In any case, the results show that, for natural language, predictiveness grows approximately linearly with rate at small rates, whereas at higher rates the rate explodes while the returns in predictiveness diminish.
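As an illustration of this fitting procedure, the sketch below first fits $\alpha$, $\beta$ in Equation (33) from rate values across the $\lambda$-sweep, and then fits $E_0$ in Equation (38); the numerical arrays are placeholder values for illustration only, not the data underlying Figure 6.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import gamma, gammaincc

def upper_gamma(s, x):
    # Upper incomplete Gamma function: Gamma(s, x) = Q(s, x) * Gamma(s)
    return gammaincc(s, x) * gamma(s)

def rate_ansatz(log_inv_lam, alpha, beta):
    # Equation (33): R = alpha * (log 1/lambda)^beta
    return alpha * log_inv_lam ** beta

def predictiveness_curve(R, alpha, beta, E0):
    # Equation (38): P = E0 - alpha*beta * Gamma(beta, (R/alpha)^(1/beta))
    return E0 - alpha * beta * upper_gamma(beta, (R / alpha) ** (1.0 / beta))

# Placeholder sweep results: (log(1/lambda), rate, predictiveness) triples from NPRD runs.
log_inv_lam = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
rates       = np.array([0.3, 1.0, 3.2, 6.5, 10.6, 15.4, 21.0])
preds       = np.array([0.20, 0.45, 0.75, 0.93, 1.03, 1.08, 1.11])

# Step 1: fit alpha, beta from rate vs. log(1/lambda), Equation (33).
(alpha_hat, beta_hat), _ = curve_fit(rate_ansatz, log_inv_lam, rates, p0=[1.0, 1.5])

# Step 2: with alpha, beta fixed, fit E0 from the rate-predictiveness pairs, Equation (38).
(E0_hat,), _ = curve_fit(
    lambda R, E0: predictiveness_curve(R, alpha_hat, beta_hat, E0),
    rates, preds, p0=[1.0])

print(f"alpha = {alpha_hat:.2f}, beta = {beta_hat:.2f}, E0 = {E0_hat:.2f}")
```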

#### *6.3. Word-Level Language Modeling*

In Section 6.1, we applied NPRD to the problem of predicting English at the level of part-of-speech tags. We found that the resulting curves were described well by Equation (37). We now consider the more realistic problem of prediction at the level of words, using data from multiple languages. This problem is much closer to the task faced by a human comprehending text, who has to encode prior observations so as to minimize prediction loss on the upcoming words. We will examine whether Equation (37) describes the resulting trade-off in this more realistic setting, and whether it holds across languages.

We followed a standard setup for recurrent neural language modeling; the hyperparameters are shown in Table A1. Following standard practice in neural language modeling, we restricted the observation space to the $10^4$ most frequent words; other words were replaced by their part-of-speech tag. We did this for simplicity and to stay close to common practice in natural language processing; NPRD could deal with unbounded state spaces through a range of more sophisticated techniques, such as subword modeling and character-level prediction [70,71].
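A minimal sketch of this preprocessing step is shown below; the corpus format (sentences as lists of word/POS-tag pairs) and function names are assumptions made for illustration, not the paper's actual pipeline.

```python
from collections import Counter

VOCAB_SIZE = 10_000  # keep only the 10^4 most frequent word types

def build_vocab(tagged_sentences):
    """tagged_sentences: iterable of sentences, each a list of (word, pos_tag) pairs."""
    counts = Counter(word for sent in tagged_sentences for word, _ in sent)
    return {word for word, _ in counts.most_common(VOCAB_SIZE)}

def restrict_to_vocab(tagged_sentences, vocab):
    """Replace every out-of-vocabulary word by its part-of-speech tag."""
    return [[word if word in vocab else tag for word, tag in sent]
            for sent in tagged_sentences]

# Toy usage: with a vocabulary lacking "perambulates", the rare word becomes its POS tag.
corpus = [[("the", "DET"), ("cat", "NOUN"), ("perambulates", "VERB")]]
print(restrict_to_vocab(corpus, vocab={"the", "cat"}))
# [['the', 'cat', 'VERB']]
```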

We used data from five diverse languages. For English, we used the Wall Street Journal portion of the Penn Treebank [72], a standard benchmark for language modeling, containing about 1.2 million tokens. For Arabic, we pooled all relevant portions of the Universal Dependencies treebanks [73–75], obtaining 1 million tokens. We applied the same method to construct a Russian corpus [76], obtaining 1.2 million tokens. For Chinese, we used the Chinese Dependency Treebank [77], containing 0.9 million tokens. For Japanese, we used the first 2 million words from a large processed corpus of Japanese business text [78]. For all these languages, we used the predefined splits into training, validation, and test sets.

For each language, we sampled about 120 values of $\log \lambda$ uniformly from [−6, 0] and applied NPRD to these. The resulting curves are shown in Figures 7 and 8, together with fitted curves resulting from Equation (37). As can be seen, the curves are qualitatively very similar across languages to what we observed in Figure 6: in all languages, rate initially scales linearly with predictiveness, but diverges as predictiveness approaches its supremum $E_0$. As a function of $\log \frac{1}{\lambda}$, rate grows at a slightly superlinear speed, confirming our ansatz (33).
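For concreteness, a sweep of this kind can be set up as in the sketch below; the `train_nprd` call is a hypothetical placeholder for training one NPRD model per value of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)

# About 120 values of log(lambda) drawn uniformly from [-6, 0], as described above.
log_lambdas = rng.uniform(-6.0, 0.0, size=120)

for lam in np.exp(log_lambdas):
    # train_nprd(corpus, lam)  # hypothetical: one NPRD run per trade-off parameter
    pass
```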

**Figure 7.** Word-level results.

**Figure 8.** Word-level results (cont.).

These results confirm our findings from Section 6.1. At the time scale of individual sentences, Predictive Rate–Distortion of natural language appears to quantitatively follow Equation (37). NPRD reports rates up to ≈ 60 nats, more than ten times the largest values of predictiveness. On the other hand, the growth of rate with predictiveness is relatively gentle in the low-rate regime. We conclude that the prediction of upcoming words in natural language can be approximated with small memory capacity, but that more accurate prediction requires very fine-grained memory representations.

#### *6.4. General Discussion*

Our analysis of PRD curves for natural language suggests that human language is characterized by very high and perhaps infinite statistical complexity, beyond its excess entropy. In a similar vein, Dębowski [64] has argued that the excess entropy of connected texts in natural language is infinite (in contrast, our result is for isolated sentences). If the statistical complexity of natural language is indeed infinite, then statistical complexity is not sufficiently fine-grained as a complexity metric for characterizing natural language.

We suggest that the PRD curve may form a more natural complexity metric for highly complex processes such as language. Among processes with infinite statistical complexity, some will have a gentle PRD curve, meaning that they can be well approximated at low rates, while others will have a steep curve, meaning that they cannot. We conjecture that, although natural language may have infinite statistical complexity, it has a gentler PRD curve than other processes with this property, meaning that achieving a reasonable approximation of the predictive distribution does not require inordinate memory resources.
