Neural Predictive Rate–Distortion

For all experiments, we used *M* = 15. Neural networks have hyperparameters, such as the number of units and the step size used in optimization, which affect the quality of approximation depending on properties of the dataset and the function being approximated. Given that NPRD provably provides upper bounds on the PRD objective (18), one can in principle identify the best hyperparameters for a given process by choosing the combination that leads to the lowest estimated upper bound. As a computationally more efficient method, we defined a range of plausible hyperparameters based on both experience reported in the literature and considerations of computational efficiency. These hyperparameters are discussed in Appendix A. For each of the processes that we experimented on, we then randomly sampled combinations of *λ* and these hyperparameters on which to run NPRD. We implemented the model using PyTorch [46].
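
The random sampling of configurations described above can be illustrated with a minimal sketch. The search space below is purely hypothetical (the ranges actually used are those of Appendix A), the names `SEARCH_SPACE`, `sample_config`, and `run_random_search` are introduced only for illustration, and drawing *λ* uniformly from [0, 1] is an assumption. The `evaluate` argument stands in for a full NPRD training run that returns the estimated upper bound.

```python
import random

# Hypothetical search space; the actual ranges are those given in Appendix A.
SEARCH_SPACE = {
    "hidden_units": [32, 64, 128, 256],
    "learning_rate": [1e-4, 3e-4, 1e-3],
}

def sample_config(rng):
    """Draw one random combination of hyperparameters and lambda."""
    config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    # Assumption: lambda is drawn uniformly from [0, 1].
    config["lam"] = rng.uniform(0.0, 1.0)
    return config

def run_random_search(evaluate, n_runs, rng):
    """Call `evaluate` (a stand-in for one NPRD run) on n_runs random configs."""
    runs = []
    for _ in range(n_runs):
        config = sample_config(rng)
        runs.append((config, evaluate(config)))
    return runs
```

Because the objective depends on *λ*, estimated bounds from such a search are only directly comparable between runs that share the same *λ*.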

#### *5.2. Analytically Tractable Problems*

We first test NPRD on two processes where the Predictive Rate–Distortion trade-off is analytically tractable. The Even Process [2] is the process of 0/1 IID coin flips, conditioned on all blocks of consecutive ones having even length. Its complexity and excess entropy are both ≈ 0.63 nats. It has infinite Markov order, and Marzen and Crutchfield [2] find that OCF (at *M* = 5) performs poorly. The true Predictive Rate–Distortion curve was computed in [2] using the analytically known ε-machine. The Random Insertion Process [2] consists of sequences of uniform coin flips *X*<sub>*t*</sub> ∈ {0, 1}, subject to the constraint that, if *X*<sub>*t*−2</sub> was a 0, then *X*<sub>*t*</sub> has to be a 1.
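
To make the two definitions concrete, the following sketch samples trajectories from both processes. It is an illustration, not the code used for the experiments. The Even Process sampler uses the standard two-state hidden Markov representation (state A emits 0 or 1 with equal probability, state B always emits the second 1 of a pair), started from its stationary distribution (2/3, 1/3). The Random Insertion sampler follows the constraint exactly as stated above; its initialization with two ones and the `burn_in` parameter are assumptions made here so that the returned window is approximately stationary.

```python
import random

def sample_even_process(length, rng):
    """Sample one trajectory from the two-state HMM of the Even Process."""
    # State "A": emit 0 (stay in A) or 1 (move to B), each with probability 1/2.
    # State "B": emit 1 with certainty and return to A, closing a pair of ones.
    # The stationary distribution over (A, B) is (2/3, 1/3).
    state = "A" if rng.random() < 2 / 3 else "B"
    out = []
    for _ in range(length):
        if state == "A":
            if rng.random() < 0.5:
                out.append(0)
            else:
                out.append(1)
                state = "B"
        else:
            out.append(1)
            state = "A"
    return out

def sample_random_insertion(length, rng, burn_in=100):
    """Sample coin flips where X_t is forced to 1 whenever X_{t-2} was 0."""
    # Assumption: start from two ones and discard a burn-in prefix.
    seq = [1, 1]
    for _ in range(burn_in + length):
        seq.append(1 if seq[-2] == 0 else rng.randrange(2))
    return seq[-length:]
```

In a finite window, only blocks of ones that are bounded by zeros on both sides are guaranteed to have even length in the Even Process; a block cut off by the start or end of the trajectory may be odd.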

We applied NPRD to these processes by training on 3M random trajectories of length 30, and using 3000 additional trajectories for validation and test data. For each process, we ran NPRD 1000 times for random choices of *λ* ∈ [0, 1]. Due to computational constraints, when running OCF, we limited sample size to 3000 trajectories for estimation and as held-out data. Following Marzen and Crutchfield [2], we ran OCF for *M* = 1, ..., 5.

The resulting estimates are shown in Figure 1, together with the analytical rate–distortion curves computed by Marzen and Crutchfield [2]. Individual runs of NPRD show variation (red dots), but most runs lead to results close to the analytical curve (gray line), and strongly surpass the curves computed by OCF at *M* = 5. Bounding the trade-off curve using the sets of runs of NPRD results in a close fit (red line) to the analytical trade-off curves.
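
One plausible way to turn a cloud of runs into a single bounding curve is to keep only the runs that are not dominated by any other run. The sketch below assumes that each run is summarized as a hypothetical `(rate, distortion)` pair in which lower is better on both axes; it is not necessarily the exact procedure used to draw the red line in Figure 1.

```python
def lower_envelope(runs):
    """Return the non-dominated (rate, distortion) pairs, sorted by rate."""
    frontier = []
    best = float("inf")
    # Sorting by rate (ties broken by distortion) lets one pass suffice:
    # a run is kept only if it improves on the best distortion seen so far.
    for rate, distortion in sorted(runs):
        if distortion < best:
            frontier.append((rate, distortion))
            best = distortion
    return frontier
```

For example, among the runs `(0.1, 0.9)`, `(0.2, 0.95)`, and `(0.3, 0.5)`, the second is discarded because the first achieves lower distortion at a lower rate.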

**Figure 1.** Rate–Distortion for the Even Process (**left**) and the Random Insertion Process (**right**). Gray lines: analytical curves; red dots: multiple runs of NPRD; red line: trade-off curve computed from NPRD runs; blue: OCF for *M* ≤ 5.
