**6. Discussion**

#### *6.1. Computational versus Cognitive Methods*

In parallel with the cognitive approach, computational approaches have also been used to estimate an upper bound on the entropy rate of natural language. Such an approach requires a language model, and previous estimates have been obtained with, for example, *n*-gram language models [2], compression algorithms [3,4], and neural language models [5,7]. In particular, Brown et al. [2] constructed a word-level *n*-gram language model and obtained *h* = 1.63 bpc, whereas Takahira et al. [4] conducted a compression experiment using gigabyte-scale newspaper corpora and obtained an estimate of *h* = 1.32 bpc.

In addition to compression algorithms and *n*-gram language models, recent works have employed neural language models, which potentially have higher capacities for accurately predicting future characters. Recently, Dai et al. [7] reported *h* = 1.08 bpc using Transformer-XL on text8, a collection of natural language text taken from Wikipedia and cleaned so that it contains only the 26 alphabetic characters and the space, corresponding to the setting of Shannon's experiment. That value is smaller than our estimate, suggesting that humans may not be able to outperform computational models in character-guessing games. Nevertheless, it is worth considering the differences in the conditions of the experiments.
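For concreteness, the computational estimate is the average negative log-probability per character that the model assigns to the text. The following is a minimal sketch of this calculation, assuming a hypothetical character-level model exposing a `prob(ch, context)` method (this interface is illustrative and not the API of any of the cited works):

```python
import math

def bits_per_character(model, text, context_len=512):
    """Estimate the cross-entropy (an upper bound on h) in bits per character.

    `model.prob(ch, context)` is assumed to return the model's probability
    of character `ch` given the preceding context -- a hypothetical interface.
    """
    total_bits = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - context_len):i]
        p = model.prob(ch, context)   # probability assigned to the true character
        total_bits += -math.log2(p)   # ideal code length in bits
    return total_bits / len(text)     # average bits per character
```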

The primary factor is the context length. The model of Dai et al. [7] utilized context lengths of several hundred characters to obtain its results. The high performance of neural language models can be explained, at least partially, by their ability to exploit long contexts. However, humans *are* also able to utilize long contexts, at least as long as *n* ≈ 10², to improve their prediction performance, whereas our experiment used context lengths of up to *n* = 70 to obtain the upper bound for *h*.

Furthermore, while a cognitive experiment obtains the upper bound of the entropy rate from the number of guesses, a computational model's estimate is calculated from the probability the model assigns to the correct character. With the full distribution at hand, the upper bound from a computational model can be evaluated more tightly and precisely. Designing an experiment that incorporates longer context lengths and character probability distributions is a direction of research that may be pursued in future work.
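To illustrate the difference, a Shannon-style cognitive experiment records only the rank of the guess at which the subject reaches the correct character, and the entropy of that rank distribution yields the upper bound. A minimal sketch, following Shannon's classic upper bound (the input `guess_ranks` is hypothetical example data):

```python
import math
from collections import Counter

def guessing_upper_bound(guess_ranks):
    """Upper bound on the entropy rate (bits per character) from guess ranks.

    `guess_ranks` lists, for each trial, the rank (1, 2, ...) at which the
    subject guessed the correct character. Following Shannon (1951), the
    entropy of this rank distribution upper-bounds the entropy rate.
    """
    counts = Counter(guess_ranks)
    n = len(guess_ranks)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Example: most characters guessed on the first try, a few on later tries.
print(guessing_upper_bound([1, 1, 1, 2, 1, 3, 1, 1, 2, 1]))
```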

#### *6.2. Application to Other Languages and Words*

This work focused on English, which is the most studied language within the context of entropy rate estimation. Shannon's experiment is applicable to other languages if the alphabet size of the writing system is comparable with that of English.

In contrast, for ideographic languages such as Chinese and Japanese, which have much larger alphabet sizes, it is practically impossible to conduct Shannon's experiment. A prediction could involve thousands of trials until a subject reaches the correct character. Therefore, a new experimental design is required to estimate the entropy rate for these languages with large alphabet sizes.

Such an experimental setting would also be applicable to estimating the entropy rate at the word level, which would be interesting to investigate via a cognitive approach. Humans generate text partly word by word and partly character by character (sound by sound). Thus, such an analysis could reveal new information about linguistic communication channels, including their distortions, as studied in [26,27].

#### *6.3. Nature of h Revealed by Cognitive Experimentation*

Based on previous work, the good fit of an ansatz extrapolation function, the assumption that *h* ≥ 0, and what we consider reliable data points, we arrived at *h* = 1.22 bpc.

There is more than one way, however, to investigate the true value of *h*. Figure 4 shows that the data points for larger *n* fall below the estimated ansatz, suggesting that the values may tend toward zero as *n* grows. Indeed, a function without an *h* term (i.e., *h* = 0) would fit reasonably well if the upper bound is evaluated only with relatively small values of *n*, such as *n* ≤ 70. Overall, our analysis does not rule out the possibility of a zero entropy rate.
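The comparison between the two variants can be made concrete with a least-squares fit. Below is a minimal sketch assuming a Hilberg-style ansatz of the form f(n) = h + A·n^(β−1); this functional form and the data arrays `n_vals` and `f_vals` are illustrative placeholders, not the exact function or measurements of our analysis:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: context lengths and upper-bound estimates (bpc).
n_vals = np.array([1, 2, 5, 10, 20, 40, 70], dtype=float)
f_vals = np.array([4.1, 3.5, 2.9, 2.4, 2.0, 1.7, 1.5])

def ansatz(n, h, A, beta):
    """Hilberg-style ansatz: f(n) = h + A * n**(beta - 1)."""
    return h + A * n ** (beta - 1.0)

def ansatz_h0(n, A, beta):
    """The same ansatz with the entropy-rate term fixed to h = 0."""
    return A * n ** (beta - 1.0)

# Enforce h >= 0 via parameter bounds, mirroring the assumption in the text.
p_full, _ = curve_fit(ansatz, n_vals, f_vals, p0=[1.0, 3.0, 0.5],
                      bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 1.0]))
p_h0, _ = curve_fit(ansatz_h0, n_vals, f_vals, p0=[3.0, 0.5])

print("h estimate with full ansatz:", p_full[0])
print("h = 0 variant parameters (A, beta):", p_h0)
```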

One observation from this work that highlights the importance of sample size is that the data points are dispersed, so statistical margins must be considered. Hence, *h* should be regarded as having a distribution rather than a single value. One such analysis was described in Section 5.
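One generic way to obtain such a distribution is to bootstrap the fit: resample the data points with replacement, refit the ansatz, and collect the resulting *h* estimates. A minimal sketch under the same hypothetical data and ansatz as above (this illustrates the idea only and is not the exact procedure of Section 5):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data and ansatz, as in the previous sketch.
n_vals = np.array([1, 2, 5, 10, 20, 40, 70], dtype=float)
f_vals = np.array([4.1, 3.5, 2.9, 2.4, 2.0, 1.7, 1.5])

def ansatz(n, h, A, beta):
    return h + A * n ** (beta - 1.0)

rng = np.random.default_rng(0)
h_samples = []
for _ in range(1000):
    # Resample (n, f) pairs with replacement and refit the ansatz.
    idx = rng.integers(0, len(n_vals), size=len(n_vals))
    try:
        p, _ = curve_fit(ansatz, n_vals[idx], f_vals[idx],
                         p0=[1.0, 3.0, 0.5],
                         bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 1.0]))
        h_samples.append(p[0])
    except RuntimeError:  # the fit may fail on degenerate resamples
        continue

h_samples = np.array(h_samples)
print("bootstrap mean of h:", h_samples.mean())
print("95% interval:", np.percentile(h_samples, [2.5, 97.5]))
```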
