*3.3. Experimental Outcomes*

The last row of Table 1 provides the summary for the cognitive experiment. We collected 172,954 observations from 683 different subjects, whose residences were limited to the United States, Canada, Great Britain, and Australia. The mean of the maximum values of *n* for each experimental session was 87.51. The mean number of observations collected for *n* ≤ 70 was 1954.86.

These numbers are by far the largest collected for this type of experiment [1,9–11], in terms of both the total number of observations and the number of subjects. While these quantities were fixed in previous works, they varied in our experiment due to the use of Mechanical Turk.

Figure 2 shows the number of samples acquired for different context lengths *n* − 1. As the context length *n* − 1 increased, the number of observations decreased because, as mentioned in the previous section, a session ended once the number of guesses reached the maximum allowed for a phrase. For up to *n* = 70, over 85% of the subjects made guesses. Beyond *n* = 70, however, the number of subjects making guesses decreased quickly. As we discuss later, having a large number of observations is crucial for acquiring a good estimate of the entropy rate within a statistically reasonable margin.

**Figure 2.** The number of observations collected for the predictions made for the *n*-th character. The vertical line indicates *n* = 70, which provided the minimum direct estimate of *h*<sup>exp</sup><sub>min</sub> = 1.407 in our experiment.

#### *3.4. Human Prediction Accuracy with Respect to Context Length*

Shannon [1] originally reported that the upper bound decreases with respect to the context length for up to *n* = 100. This result implies that a human is able to improve their prediction performance with more context. However, the later experiment by [11] disagreed with Shannon's [1], as they reported that the upper bound did not decrease for *n* ≥ 32. Thus, the question remains as to whether longer contextual phrases help humans to predict future characters more accurately. We therefore examined whether the prediction performance of subjects improved with a longer contextual phrase length, based on all observations collected.

Figure 3 shows the probability that a subject provided the correct *n*-th character with their first guess. At *n* = 1 (i.e., the subject was asked to predict the first character of a phrase with no context given), the probability was below 20%. The probability improved greatly from *n* = 1 to *n* = 2, as it reached above 50% for *n* = 2. As *n* increased to *n* = 100, the probability roughly monotonically increased to nearly 80%. Based on this result, a subject improves their accuracy in predicting the next character as the context length *n* increases, at least up to *n* = 100, which supports Shannon's claim.

**Figure 3.** The probability that the subject needed only one guess to make the correct prediction of the *n*-th character.

This result also suggests that the subjects of our experiment performed reasonably well, which addresses a major concern that observations collected in an online experimental setting might be of low quality.

#### *3.5. The Datapoints of the Bounds for n*

Using all of the observations, the upper and lower bounds can be estimated with Equation (5) for every *n*. The number of collected observations varies with respect to *n*, as shown in Figure 2. Figure 4 shows the plots of the upper and lower bounds computed for *n* = 1, 2, ... , 70 using all of the collected observations. The blue plot indicates the upper bound, whereas the red plot shows the lower bound. For the upper bound, the blue plot exhibits a decreasing tendency, although the values fluctuate along with *n*. Our main interest lies in the upper bound.

**Figure 4.** The plots of the upper bound (**blue**) and the lower bound (**red**) acquired from all observations and their extrapolations via ansatz functions of *f*1 (dashed lines).

Plots of both bounds have large fluctuations for *n* > 70 due to the decrease in the sample size for large *n*, which will be examined later in Section 5.1. The minimum experimental value of the upper bound was *h*<sup>exp</sup><sub>min</sub> ≡ 1.407 bpc, located at *n* = 70. Since this is the minimum of the direct experimental values, any computed entropy rate larger than this would appear to be invalid. In the remainder of this paper, the observations collected up to *n* = 70 are utilized.

#### **4. Extrapolation of the Bounds with an Ansatz Function**

As mentioned in the Introduction, the other drawback of the previous studies utilizing the cognitive approach to the entropy rate lies in their not extrapolating the experimental values. Specifically, in the previous cognitive experiments [1,10,11], the reported entropy rate values were the direct upper bounds at the largest *n* used, such as *n* = 100 in [1].

As the entropy rate, by definition, is the value of *Fn* with *n* tending to infinity, its upper and lower bounds, as *n* tends to infinity, must be considered and can be examined via some extrapolation functions.
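In symbols, the quantity being extrapolated is the limit

$$h = \lim_{n \to \infty} F_n,$$

so any candidate extrapolation function must approach a finite constant as *n* grows; this constant is the *h* term appearing in the ansatz functions of the next section.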

#### *4.1. Ansatz Functions*

As the mathematical nature of a natural language time series is unknown, such a function can only be an ansatz function. The first ansatz function was proposed by Hilberg [23], who hypothesized that the entropy rate decreases according to the power function with respect to *n* based on the experimental results of Shannon [1]. This function is as follows:

$$f\_1(n) = An^{\beta - 1} + h, \qquad \qquad \beta < 1. \tag{6}$$

Originally, this function was proposed without the *h* term. There have been theoretical arguments as to whether *h* = 0 [2–5,7,24,25]; therefore, a function with the *h* term was considered in this work.

Takahira et al. [4] suggested another possibility that modifies the function *f*1(*n*) slightly, which is as follows:

$$f\_2(n) = \exp\left(A n^{\beta - 1} + h\right). \tag{7}$$

They observed that the stretched exponential function *f*2(*n*) leads to a smaller value of *h* by roughly 0.2 bpc in a compression experiment for English characters.

Schürmann and Grassberger [3] introduced another function *f*3(*n*) based on their experimental results:

$$f\_3(n) = An^{\beta - 1} \log n + h. \tag{8}$$

These three ansatz functions *f*1, *f*2, and *f*3 will be evaluated based on their fit to the data points discussed in the previous section. For *f*1 and *f*3, *h* is the estimated value at infinite *n*, whereas in the case of *f*2, the estimated value of the upper and lower bounds at infinity is *e*<sup>*h*</sup>.
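The three ansatz functions can be written down directly from Equations (6)–(8); the following is a minimal sketch, with illustrative parameter values rather than fitted ones.

```python
import math

def f1(n, A, beta, h):
    """Hilberg-style power-law ansatz of Equation (6): A * n^(beta-1) + h."""
    return A * n ** (beta - 1) + h

def f2(n, A, beta, h):
    """Stretched-exponential ansatz of Equation (7); tends to exp(h)."""
    return math.exp(A * n ** (beta - 1) + h)

def f3(n, A, beta, h):
    """Power law with a logarithmic correction, Equation (8)."""
    return A * n ** (beta - 1) * math.log(n) + h

# For beta < 1 the power term vanishes as n grows, so f1 and f3
# approach h while f2 approaches exp(h).
print(f1(10 ** 8, 2.0, 0.5, 1.4))  # close to 1.4
```

Note that the logarithmic factor in *f*3 does not change the limit: for *β* < 1, the power-law decay dominates log *n*.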

#### *4.2. Comparison among Ansatz Functions Using All Estimates*

Every ansatz function was fitted to the plots of the upper and lower bounds via the Levenberg–Marquardt algorithm for minimizing the squared error. The ansatz functions' fits to the data points mentioned in Section 3.5 are shown in Figure 4 for *f*1 and in Figure A1 in Appendix A for *f*2 and *f*3.

For *f*1 and *f*2, the fits converged well, and the errors were moderate. The root-mean-square error of *f*1 was 0.044, quite close to that of *f*2, which was 0.043. Both entropy rate estimates also converged to similar values of *h*; namely, *h* = 1.393 and *h* = 1.353 bpc, respectively, for the upper bounds. The values of *β* were 0.484 and 0.603 for *f*1 and *f*2, respectively, suggesting monotonic decay in both cases.
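The fitting procedure described above can be sketched with SciPy's Levenberg–Marquardt implementation (`method="lm"` in `scipy.optimize.curve_fit`). The data below are synthetic stand-ins generated from *f*1 with known parameters plus noise, not the experimental bound values.

```python
import numpy as np
from scipy.optimize import curve_fit

def f1(n, A, beta, h):
    # Hilberg ansatz of Equation (6): A * n^(beta - 1) + h
    return A * n ** (beta - 1) + h

# Synthetic stand-in data: f1 with known parameters plus mild noise.
rng = np.random.default_rng(0)
n = np.arange(1, 71, dtype=float)
y = f1(n, 2.5, 0.5, 1.4) + rng.normal(0.0, 0.03, size=n.size)

# method="lm" selects the Levenberg-Marquardt least-squares algorithm.
params, _ = curve_fit(f1, n, y, p0=[1.0, 0.5, 1.0], method="lm")
A_hat, beta_hat, h_hat = params

# Root-mean-square error of the fit, as reported for f1 and f2 above.
rmse = float(np.sqrt(np.mean((f1(n, *params) - y) ** 2)))
print(round(h_hat, 3), round(rmse, 3))
```

In this synthetic setting, the fitted `h_hat` recovers the asymptote used to generate the data; on the real bound points, the recovered *h* is the extrapolated entropy rate estimate.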

On the other hand, *f*3 presented some problems. The function did not fit well, and the error was 0.069. Most importantly, *f*3's extrapolated upper bound was *h* = 1.573 bpc, which is larger than the minimum experimental value *h*<sup>exp</sup><sub>min</sub> = 1.407 bpc considered in Section 3.5.

This tendency of *f*3 to overestimate the value *h* may be the result of *f*3(*n*) having been designed based on the convergence of the entropy rate of some random sequence. Therefore, a suitable ansatz function would be either *f*1 or *f*2. As seen, they provide similar results, which is consistent with the original observation provided in [4]. Consequently, we focus on *f*1, the most conventional ansatz, in the following section.

#### **5. Analysis via the Bootstrap Technique**

Section 2.3 mentioned that the scale of our experiment was significantly larger than the scales used in previous experiments [1,9,11]. The large number of observations allowed us to investigate the effect of the number of observations via the bootstrap technique, which uses subsets of the experimental samples.

#### *5.1. The Effect of the Sample Size*

*B* sets of observations, each of which includes *S* records of experimental sessions, were sampled without replacement, so that no record appears twice within a set. Let *S* be referred to as the *sample size* in the following discussion. As defined in Section 2.1, a record of an experimental session consists of a series of the numbers of guesses for each context of length *n* − 1 produced by the same subject for a phrase.

For each set, the upper bound for every *n* is computed as the rightmost term in Equation (5), and the acquired set of points is extrapolated with the ansatz function *f*1. This yields *B* different values of *h*. In addition to their mean value, it is reasonable to examine the interval between bounds for the entropy rate estimate; we define these bounds via fixed percentiles of the *B* values of *h*. We set *B* = 1000 and acquired the means and the upper/lower 5% percentile bounds for different values of *S*.
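The resampling scheme above can be sketched as follows. The function names and the per-set estimator are placeholders: in the paper, each set yields *h* by fitting the ansatz *f*1 to the set's bound points, whereas this sketch simply averages toy records.

```python
import random
import statistics

def estimate_h(subset):
    # Placeholder estimator; in the paper, each set yields h by fitting
    # the ansatz f1 to the set's upper-bound points.
    return statistics.mean(subset)

def bootstrap_bounds(records, S, B=1000, seed=0):
    """Draw B sets of S records (each set drawn without replacement),
    estimate h on each set, and return the mean of the B estimates
    together with their lower/upper 5% percentile bounds."""
    rng = random.Random(seed)
    estimates = sorted(estimate_h(rng.sample(records, S)) for _ in range(B))
    lower = estimates[int(0.05 * B)]
    upper = estimates[int(0.95 * B) - 1]
    return statistics.mean(estimates), lower, upper

# Toy records: the spread of the bootstrap estimates shrinks as S grows.
r = random.Random(1)
records = [r.gauss(1.4, 0.3) for _ in range(2000)]
m100, lo100, hi100 = bootstrap_bounds(records, S=100)
m1000, lo1000, hi1000 = bootstrap_bounds(records, S=1000)
print(hi100 - lo100, hi1000 - lo1000)
```

Even in this toy setting, the 5% percentile interval at *S* = 100 is several times wider than at *S* = 1000, mirroring the behavior reported in Table 3.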

Figure 5 shows the histograms of *h* values for *S* = 100, 500, 1000, and 1500. At *S* = 100, the estimated values vary widely, and the 5% percentile bounds are *h* = 1.124 bpc and *h* = 1.467 bpc, as shown in Table 3. The previous experiments, including Shannon's study [1,9,11], used a maximum of *S* = 100 observations for certain values of *n*. Our results suggest that the values reported by these works have large intervals around them and should not be considered general results.

Furthermore, for small *S*, the estimated values also tend to be biased towards smaller values. The mean value at *S* = 100 was *h* = 1.340 bpc, which is about 0.07 bpc smaller than the value *h* = 1.412 bpc obtained for *S* = 1000. This underestimation occurs because events with small probabilities are unlikely to be sampled when the sample size is small. Such low-probability events contribute to increasing the entropy; when their contributions are missed, the estimate tends to be smaller than its true value. Consequently, Shannon's original experiment could have underestimated the upper bound.
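This small-sample bias can be illustrated with the plug-in (maximum-likelihood) entropy estimator on a toy distribution; this simulation is an illustration of the general phenomenon, not the paper's estimator or data.

```python
import math
import random
from collections import Counter

def plugin_entropy(sample):
    """Plug-in (maximum-likelihood) entropy estimate in bits."""
    counts = Counter(sample)
    total = len(sample)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# A Zipf-like distribution over 64 symbols, with many rare symbols.
symbols = list(range(64))
weights = [1.0 / (i + 1) for i in symbols]
z = sum(weights)
true_h = -sum(w / z * math.log2(w / z) for w in weights)

rng = random.Random(0)

def mean_estimate(sample_size, trials=200):
    """Average plug-in estimate over repeated samples of a given size."""
    total = 0.0
    for _ in range(trials):
        total += plugin_entropy(rng.choices(symbols, weights=weights, k=sample_size))
    return total / trials

h_small = mean_estimate(30)
h_large = mean_estimate(3000)
print(h_small, h_large, true_h)  # h_small < h_large < true_h
```

With only 30 samples, most of the rare symbols are never observed, so the estimate falls well below the true entropy; at 3000 samples, the bias shrinks but remains negative.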

These observations suggest that a large sample size is necessary to obtain convergence of the upper bound. As observed in the values reported in Table 3, the histograms in Figure 5, and the red data points and shaded area in Figure 6, the difference between the upper and lower 5% percentile bounds decreases with a larger sample size *S*. At *S* = 1000, the difference between the bounds is smaller than 0.1 bpc, which is a reasonably acceptable margin of error.

**Figure 5.** Histograms for the estimated values of the upper bound of the entropy rate *h* for different sample sizes. (**a**) *S* = 100; (**b**) *S* = 500; (**c**) *S* = 1000; (**d**) *S* = 1500.

**Figure 6.** The estimated upper bounds with ansatz function *f*1 using: (1) 1000 experimental sessions with the best prediction performances (**blue**), and (2) all experimental sessions (**red**), with the values reported in Table 3. The blue and red points indicate the mean values for the *B* = 1000 sets, and the shaded areas indicate the 5% percentile bounds.

**Table 3.** The means and the 5% percentile-bound intervals for the upper bound of *h* found by using the ansatz function *f*1 for *S* = 100, 500, 1000, and 1500. The number of sets is *B* = 1000. The error is large for small sample sizes, such as *S* = 100, for which the difference between the 5% percentile upper and lower bounds is larger than 0.3 bpc. This difference decreases with increasing *S* and eventually becomes smaller than 0.1 bpc for *S* ≥ 1000.


#### *5.2. The Effect of Variation on Subjects' Estimation Performances*

Our experiment was conducted with anonymous subjects and was therefore less controlled than an in-laboratory experiment. Such factors could bias the entropy rate estimate; this potential bias is examined in this section.

Although the residences of the participants were limited to native-English-speaking countries, as mentioned in Section 3.3, we could not control the native tongues of our participants. Although our phrases were extracted from the *Wall Street Journal* and their terms and expressions were easy to understand, even for non-natives (see Table 2), the results might still be biased. In addition, the experiment was not supervised on site; therefore, the subjects' conditions could have varied.

In principle, the entropy rate measures the maximal predictability of the text. Therefore, each estimated value should be obtained based on the maximal performance of the subjects. Here, we consider estimating the entropy rate with only the best-performing experimental sessions. We first defined the performance of an experimental session as the average number of guesses required to predict the succeeding character *Xn*. The experimental sessions for which the maximal *n* was less than 70 were filtered out in order to keep the sample size the same for all *n* = 1, . . . , 70.

Next, the experimental sessions were sorted by performance, and the *S* = 1000 best sessions were selected. Note that this value of *S* was necessary for obtaining convergence, as seen in the previous section.
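The filtering and selection steps can be sketched as follows; the session records and field names here are hypothetical stand-ins for the experiment logs.

```python
# Hypothetical session records: each holds the number of guesses the
# subject needed at every character position of one phrase.
sessions = [
    {"id": i, "guesses": [((i * 7 + k) % 5) + 1 for k in range(75)]}
    for i in range(3000)
]

MAX_N = 70  # sessions must cover n = 1..70 (Section 5.2)
S = 1000    # sample size needed for convergence (Section 5.1)

# Filter out sessions whose maximal n is less than 70.
eligible = [s for s in sessions if len(s["guesses"]) >= MAX_N]

def performance(session):
    # Average number of guesses over n = 1..70; lower means better.
    return sum(session["guesses"][:MAX_N]) / MAX_N

# Sort by performance and keep the S best sessions.
best = sorted(eligible, key=performance)[:S]
print(len(best))
```

Truncating every session at *n* = 70 before averaging keeps the performance measure comparable across sessions of different lengths.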

We evaluated the mean and 5% percentile bounds of the best-performing set by measuring the upper bound *h* from *B* = 1000 sets of *S* = 100, 150, 200, ... , 1000 sub-samples. At *S* = 1000, there is only one possible set; therefore, *h* can take just one value. The results are shown in Figure 6. The blue data points in the middle show the means, and the blue-shaded areas around them show the intervals contained within the 5% percentile bounds. Similar to the results for all experimental sessions (shown as red data points and a red-shaded area), the widths of the intervals are quite large for small sample sizes, such as *S* = 100, and decrease towards *S* = 1000. The mean value of the upper bound increased with respect to *S*, which is also similar to the result for all experimental sessions.

Using just the selected experimental sessions, the final estimated value converged to *h* ≈ 1.22 bpc, which is smaller than the value estimated using all experimental sessions, the minimum experimental value *h*<sup>exp</sup><sub>min</sub>, and the values acquired by previous cognitive experiments.
