#### **Appendix A. Hyperparameters**

All hyperparameter choices are shown in Table A1. We defined separate hyperparameter ranges for Sections 5.2, 6.1, and 6.3. Since the analytically known processes in Section 5.2 are arguably less complex than natural language, we allowed larger and more powerful models for natural language (Sections 6.1 and 6.3), in particular for word-level modeling (Section 6.3).

In Table A1, hyperparameters are organized into four groups. The first group contains the dimensions of the input embedding and of the recurrent LSTM states. The second group contains regularization parameters: we apply dropout [79] to the input and output layers. The third group concerns the optimization procedure [35]. The fourth group covers the Neural Autoregressive Flows used to approximate the marginal *q* [20]: the length of the flow (1, 2, ...), the type (DSF/Deep Sigmoid Flow or DDSF/Deep Dense Sigmoid Flow), the dimension of the flow (an integer), and the number of layers in the flow (1, 2, ...). Larger dimensions, more layers, and longer flows yield more expressive models; however, they are computationally more expensive.
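
As a concrete summary of these four groups, the sketch below collects them into a single configuration object. All names and default values are illustrative assumptions, not the actual settings from Table A1:

```python
from dataclasses import dataclass

@dataclass
class NPRDHyperparameters:
    """Hypothetical grouping of the hyperparameters described above."""
    # Group 1: model dimensions
    embedding_dim: int = 256      # input embedding dimension
    lstm_dim: int = 512           # recurrent LSTM state dimension

    # Group 2: regularization (dropout [79] on input and output layers)
    input_dropout: float = 0.1
    output_dropout: float = 0.1

    # Group 3: optimization [35]
    learning_rate: float = 1e-3

    # Group 4: Neural Autoregressive Flow approximating the marginal q [20]
    flow_length: int = 2          # number of flow steps (1, 2, ...)
    flow_type: str = "DSF"        # "DSF" or "DDSF"
    flow_dim: int = 64            # dimension of the flow
    flow_layers: int = 2          # layers per flow step (1, 2, ...)
```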

Training used early stopping on the development set, as described in Section 4.3. Models were trained on a TITAN Xp graphics card. On the Even Process and the Random Insertion Process, NPRD took a median of 10 min to process 3M training samples. OCF took less than one minute at *M* ≤ 5; however, it does not scale to larger values of *M*. On English word-level modeling, training took a median of nine epochs (maximum 467) and five minutes (maximum 126 min).
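
A minimal sketch of such an early-stopping loop is shown below. The callables `train_one_epoch` and `evaluate`, and the `patience` threshold, are hypothetical stand-ins; the actual stopping criterion is the one described in Section 4.3:

```python
def train_with_early_stopping(model, train_one_epoch, evaluate, patience=3):
    """Train until the development-set loss stops improving.

    `train_one_epoch(model)` performs one pass over the training data;
    `evaluate(model)` returns the development-set objective.
    """
    best_dev_loss = float("inf")
    bad_epochs = 0
    while bad_epochs < patience:
        train_one_epoch(model)
        dev_loss = evaluate(model)
        if dev_loss < best_dev_loss:
            best_dev_loss, bad_epochs = dev_loss, 0
        else:
            bad_epochs += 1
    return model, best_dev_loss
```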


**Table A1.** NPRD hyperparameters. See Appendix A for a description of the parameters.

#### **Appendix B. Alternative Modeling Choices**

In this section, we investigate the trade-offs involved in alternative modeling choices. It may be possible to use simpler function approximators for *φ*, *ψ*, and *q*, or smaller context window sizes *M*, without substantially harming accuracy.

First, we investigated the performance of a simple fixed approximation to the marginal *q*: a diagonal unit-variance Gaussian, as is common in the literature on Variational Autoencoders [36]. We show results in Figure A1. Even with this fixed approximation to *q*, NPRD provides estimates not far from the analytical curve. However, comparison with the results obtained from full NPRD (Figure 1) shows that a flexible parameterized approximation still provides a considerably better fit.
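
Under this fixed choice, the rate term of the objective has the familiar closed form of the Gaussian KL divergence used in VAEs [36]. Below is a minimal sketch, assuming a PyTorch-style encoder that outputs a mean and a log-standard-deviation per code dimension:

```python
import torch

def gaussian_rate(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats, summed over code dimensions.

    With a fixed standard-normal marginal q, the rate term reduces to this
    closed form; full NPRD instead learns q with a normalizing flow and
    estimates the rate by sampling.
    """
    sigma_sq = torch.exp(2.0 * log_sigma)
    return 0.5 * torch.sum(mu ** 2 + sigma_sq - 2.0 * log_sigma - 1.0, dim=-1)
```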

Second, we investigated whether the use of recurrent neural networks is necessary. As recurrent models such as LSTMs process their input sequentially, they cannot be fully parallelized, raising the question of whether they can be replaced by parallelizable models. Specifically, we considered Quasi-Recurrent Neural Networks (QRNNs), which combine convolutions with a weak form of recurrence and have shown strong results on language modeling benchmarks [80]. We replaced the LSTM encoder and decoder with QRNNs and fit NPRD to the Even Process and the Random Insertion Process. We found that, when using QRNNs, NPRD consistently fitted codes with zero rate for both processes, indicating that the QRNN was failing to extract useful information from the past. We also found that the cross-entropies of the estimated marginal distribution $P_\eta(\overleftarrow{X}^M)$ were considerably worse than when using LSTMs or simple RNNs. We conjecture that, due to their convolutional nature, QRNNs cannot model such processes effectively in principle: QRNNs extract representations by pooling embeddings of words or *n*-grams occurring in the past. When modeling, e.g., the Even Process, the occurrence of specific *n*-grams is not informative about whether the length of the last block is even or odd; this requires information about the positions in which these *n*-grams occur, which is not available to the QRNN, but which is generally available to a more general autoregressive model such as an LSTM.
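
To make the parity argument concrete, the sketch below samples from the Even Process and checks the constraint on the next symbol. The sampler is a simplified illustration (the probability `p` is an assumption, not the process parameters used in Section 5.2):

```python
import random

def sample_even_process(n: int, p: float = 0.5) -> list[int]:
    """Sample n symbols such that runs of 1s between 0s always have even length.

    Emitting 1s strictly in pairs is a simple way to realize the Even
    Process's support; `p` controls the frequency of 0s.
    """
    out: list[int] = []
    while len(out) < n:
        if random.random() < p:
            out.append(0)
        else:
            out.extend([1, 1])  # 1s only ever appear in pairs
    return out[:n]

history = sample_even_process(20)
# Whether a 0 may occur next depends only on the parity of the current run
# of 1s -- positional information that pooling n-gram occurrences (as a
# QRNN does) cannot recover:
run = 0
for x in reversed(history):
    if x != 1:
        break
    run += 1
print(history, "-> next must be 1" if run % 2 == 1 else "-> next is unconstrained")
```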

Third, we varied *M*, comparing *M* = 5 and *M* = 10 to the results obtained with *M* = 15. Results are shown in Figure A2. At *M* = 5, NPRD provides estimates similar to OCF (Figure 1). At *M* = 10, the estimates are close to the analytical curve; nonetheless, *M* = 15 yields clearly more accurate results.

**Figure A1.** Rate–Distortion for the Even Process (**left**) and the Random Insertion Process (**right**), estimated using a simple diagonal unit-variance Gaussian approximation for *q*. Gray lines: analytical curves; red dots: multiple runs of NPRD (>200 samples); red line: trade-off curve computed from NPRD runs. Compare Figure 1 for results from full NPRD.

**Figure A2.** Rate–Distortion for the Even Process (**left**) and the Random Insertion Process (**right**), varying *M* = 5 (blue), 10 (red), 15 (green). Gray lines: analytical curves; dots: multiple runs of NPRD; lines: trade-off curves computed from NPRD runs, in the corresponding colors. Compare Figure 1 for results from full NPRD.

#### **Appendix C. Sample Runs on English Text**

In Figure A3, we provide four sample outputs from English word-level modeling with three values of log(1/*λ*) (1.0, 3.0, 5.0), corresponding to low (1.0), medium (3.0), and high (5.0) rates (compare Figure 7). We obtained these samples by taking the first 32 sequences $\overleftarrow{X}^M \overrightarrow{X}^M$ (at *M* = 15) from the Penn Treebank validation set and selecting the four examples where the variation in cross-entropy at *X*<sub>0</sub> between the three models was largest.
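
A sketch of this selection step, with placeholder cross-entropy values standing in for the actual model outputs:

```python
import numpy as np

# Cross-entropies (in nats) assigned to the first future word X_0 of each
# of the 32 validation sequences by the three models
# (log(1/lambda) = 1, 3, 5). Random values are placeholders only.
xent = np.random.rand(32, 3)

spread = xent.max(axis=1) - xent.min(axis=1)  # variation across the three rates
chosen = np.argsort(-spread)[:4]              # the four sequences with largest spread
```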

Across these samples, the codes generated at log(1/*λ*) = 5 show lower cross-entropy on the first future observation *X*<sub>0</sub>, as these codes have higher rates. For instance, in the first sample, the cross-entropy on the first word *jersey* is lowest for the code with the highest rate; indeed, this word is presumably strongly predicted by the preceding sequence *...sen. bill bradley of new*. Codes with higher rates are better at encoding such predictive information from the past.

**Past:** within minutes after the stock market closed friday , i called sen. bill bradley of new

**Figure A3.** Four example outputs from English word-level modeling, with low rate (log(1/*λ*) = 1; red, dotted), medium rate (log(1/*λ*) = 3; green, dashed), and high rate (log(1/*λ*) = 5; blue, solid). For each sample, we provide the prior context $\overleftarrow{X}^M$ (**top**) and the per-word cross-entropies (in nats) on the future words $\overrightarrow{X}^M$ (**bottom**).
