OCF

As discussed in Section 3, OCF is affected by overfitting and will systematically overestimate the predictiveness achieved at a given rate [6]. To address this problem, we follow the evaluation method used for NPRD, evaluating rate and predictiveness on held-out test data. We partition the available time series data into a training and a test set. We use the training set to create the encoder $P(Z \mid \stackrel{M\leftarrow}{X})$ using the Blahut–Arimoto algorithm as described by Still et al. [6]. We then use the held-out test set to estimate rate and prediction loss. This method not only enables fair comparison between OCF and NPRD, but also provides a more realistic evaluation by focusing on the performance of the code *Z* when deployed on new data samples. For rate, we use the same variational bound that we use for NPRD, stated in Proposition 2:

$$\frac{1}{N} \sum\_{\stackrel{M\leftarrow}{X} \in \text{Test Data}} \mathcal{D}\_{\text{KL}}\left(P(Z \mid \stackrel{M\leftarrow}{X}) \,\middle\|\, s(Z)\right), \tag{31}$$

where $P(Z \mid \stackrel{M\leftarrow}{X})$ is the encoder created by the Blahut–Arimoto algorithm, and *s*(*Z*) is the marginal distribution of *Z* on the training set. *N* is the number of sample time series in the test data. In the limit of enough training data, when *s*(*Z*) matches the actual population marginal of *Z*, (31) is an unbiased estimate of the rate. We estimate the prediction loss on the future observations as the empirical cross-entropy, i.e., the variational bound stated in Proposition 1:

$$-\frac{1}{N} \sum\_{(\stackrel{M\leftarrow}{X},\, \stackrel{\rightarrow M}{X}) \in \text{Test Data}} \mathbb{E}\_{Z \sim P(\cdot \mid \stackrel{M\leftarrow}{X})} \log P(\stackrel{\rightarrow M}{X} \mid Z), \tag{32}$$

where $P(\stackrel{\rightarrow M}{X} \mid Z)$ is the decoder obtained from the Blahut–Arimoto algorithm on the training set. Thanks to Propositions 1 and 2, these quantities provide upper bounds on the true rate and prediction loss, up to sampling error introduced by the finiteness of the held-out data. Again, sampling error does not bias the results in either direction, unlike overfitting, which introduces a systematic bias.
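For concreteness, the following sketch fits an encoder of this kind on a matrix of training counts using the standard iterative information-bottleneck updates; the exact procedure of Still et al. [6] may differ in its details, and all function and variable names here are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def fit_ocf_encoder(counts, n_codes, beta, n_iter=200, seed=0, eps=1e-12):
    """Fit an encoder P(Z | past) on training counts via iterative
    information-bottleneck updates (a sketch; names are illustrative).

    counts[i, j] is the number of times past window i was followed by
    future window j in the training data.
    """
    rng = np.random.default_rng(seed)
    joint = counts / counts.sum()                    # empirical P(past, future)
    p_past = joint.sum(axis=1)                       # P(past)
    p_fut = joint / (p_past[:, None] + eps)          # P(future | past), row-wise

    enc = rng.random((counts.shape[0], n_codes))     # random init of P(Z | past)
    enc /= enc.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_z = p_past @ enc                           # marginal P(Z)
        # Decoder P(future | z) = sum_past P(future | past) P(past | z).
        p_past_z = (enc * p_past[:, None]) / (p_z[None, :] + eps)
        dec = p_past_z.T @ p_fut                     # shape: (n_codes, n_fut)
        # Encoder update: P(z | past) ∝ P(z) exp(-beta * KL(P(fut|past) || P(fut|z))).
        kl = np.stack([
            np.sum(p_fut * np.log((p_fut + eps) / (dec[z] + eps)), axis=1)
            for z in range(n_codes)
        ], axis=1)                                   # shape: (n_past, n_codes)
        kl -= kl.min(axis=1, keepdims=True)          # stabilize the exponent
        enc = p_z[None, :] * np.exp(-beta * kl)
        enc /= enc.sum(axis=1, keepdims=True)

    return enc, dec, p_past @ enc                    # encoder, decoder, s(Z)
```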
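Given the fitted encoder, decoder, and training marginal *s*(*Z*), the held-out estimates (31) and (32) can then be computed directly. This sketch assumes the same illustrative names, and maps previously unseen sequences to the pseudo-sequence *ω* described below.

```python
import numpy as np

def evaluate_heldout(test_pairs, enc, dec, s_z, past_index, fut_index, eps=1e-12):
    """Estimate the rate (31) and prediction loss (32) on held-out
    (past, future) pairs (a sketch; data layout is an assumption)."""
    rate = loss = 0.0
    for past, fut in test_pairs:
        # Previously unseen sequences fall back to ω, the last row/column.
        i = past_index.get(past, len(past_index))
        j = fut_index.get(fut, len(fut_index))
        p_z = enc[i]                                 # encoder row P(Z | past)
        # (31): KL(P(Z | past) || s(Z)), with s(Z) the training-set marginal.
        rate += np.sum(p_z * np.log((p_z + eps) / (s_z + eps)))
        # (32): -E_{Z ~ P(. | past)} log P(future | Z).
        loss -= np.sum(p_z * np.log(dec[:, j] + eps))
    n = len(test_pairs)
    return rate / n, loss / n
```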

Held-out test data may contain sequences that did not occur in the training data. Therefore, we add a pseudo-sequence *ω*, adding pseudo-observations $(\omega, \stackrel{\rightarrow M}{X})$, $(\stackrel{M\leftarrow}{X}, \omega)$, and $(\omega, \omega)$ for all observed sequences $\stackrel{M\leftarrow}{X}$, $\stackrel{\rightarrow M}{X}$ to the matrix of observed counts that serves as the input to the Blahut–Arimoto algorithm. These pseudo-observations are assigned a pseudo-count *γ* in this matrix of observed counts; we found that values ranging from 0.0001 to 1.0 yielded essentially the same results. When evaluating the codebook on held-out data, previously unseen sequences are mapped to *ω*.
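As an illustration, the following sketch builds such a smoothed count matrix; the function name, the pair-based data layout, and the default value of *γ* are assumptions for the example, not details from [6].

```python
import numpy as np

def smoothed_counts(train_pairs, gamma=0.01):
    """Build the matrix of observed counts with a pseudo-sequence ω
    (a sketch; names and layout are illustrative).

    ω occupies the last row and column; held-out sequences absent from
    the training data are mapped to that index at evaluation time.
    """
    pasts = sorted({p for p, _ in train_pairs})
    futs = sorted({f for _, f in train_pairs})
    past_index = {p: i for i, p in enumerate(pasts)}
    fut_index = {f: j for j, f in enumerate(futs)}
    counts = np.zeros((len(pasts) + 1, len(futs) + 1))
    for past, fut in train_pairs:
        counts[past_index[past], fut_index[fut]] += 1
    counts[-1, :] = gamma   # pseudo-observations (ω, future) and (ω, ω)
    counts[:, -1] = gamma   # pseudo-observations (past, ω)
    return counts, past_index, fut_index
```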
