#### **1. Introduction**

Entropy rates *h* of natural languages have been used to investigate the complexity underlying these languages. The entropy rate of a sequence measures the amount of information per character [1] and implies that the number of possible sequences of length *n* is $2^{hn}$.
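As a rough illustration of this counting argument, take *h* = 1.3 bpc (Shannon's estimate, discussed below) and *n* = 10: the number of plausible 10-character English strings is then on the order of

$$2^{hn} = 2^{1.3 \times 10} = 2^{13} \approx 8.2 \times 10^{3},$$

which is far smaller than the $27^{10} \approx 2.1 \times 10^{14}$ raw combinations of 26 letters plus the space character.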

Following the development of information theory and the growing abundance of data resources, recent studies have used computational approaches to find the entropy rates of natural languages. Starting from the first attempt, made by [2] with a three-gram, word-level language model, various compression algorithms have been utilized [3,4], and the most recent study makes use of a state-of-the-art neural language model [5]. Such computational attempts have a drawback, however: the computation of *h* requires a language model with which to predict the probability distribution of every character. As a result, the value of *h* reflects not only the complexity of the language but also the performance of the model. Indeed, in natural language processing, such an estimate of *h* serves as an indicator of the goodness of fit of a language model [6]. Recently reported decreases in the upper bound of *h*, for which the current minimum for English is 1.08 bpc [7], thus simply highlight improvements in the computational models.

Originally, Shannon's study [1], and some of the work that followed [8–11], used cognitive methods to estimate the entropy rate *h*. The original scientific interest in *h* concerned the complexity of human language, and from this perspective, the performance of a computational model should play no part in obtaining a value of *h*.

The studies based on cognitive approaches can be reconsidered from two perspectives. First, they all relied on limited-scale experiments. In all of these studies, a subject was asked to predict the *n*-th character given the preceding *n* − 1 characters. According to [11], Shannon's spouse was his only subject, and even the most recent cognitive study [11] relied on just eight subjects. Experimenting at such a small scale raises the question of the statistical validity of the acquired estimate.

Second, none of the cognitive approaches considered the limit with respect to the context length *n*. Although by definition the entropy rate must be evaluated at infinite *n*, the reported values were obtained at some finite *n*. In Shannon [1], the value *h* = 1.3 bits per character (bpc) for English was obtained at *n* = 100, and Moradi et al. [11] concluded that the estimated value does not decrease for *n* ≥ 32 and reported *h* ≈ 1.6 bpc. Extrapolating to the limit, however, requires a large number of observations in order to capture the dependence of the estimate on *n* well.
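For intuition, such an extrapolation can be pictured as a curve-fitting problem. The sketch below is purely illustrative: the ansatz $F\_n \approx h + A n^{-\beta}$ and the synthetic data points are our assumptions, not the functional form or measurements of any of the cited studies.

```python
# Minimal sketch of extrapolating F_n to its limit h.
# The ansatz F_n ~ h + A * n**(-beta) and the data are illustrative
# assumptions, not the method used in the cited studies.
import numpy as np
from scipy.optimize import curve_fit

def ansatz(n, h, A, beta):
    """Hypothetical decay of F_n toward the entropy rate h."""
    return h + A * n ** (-beta)

# Synthetic observations: context lengths n and made-up F_n values (bpc).
n_obs = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
f_obs = np.array([4.1, 3.3, 2.7, 2.2, 1.9, 1.7, 1.6])

(h_hat, A_hat, beta_hat), _ = curve_fit(ansatz, n_obs, f_obs, p0=(1.0, 3.0, 0.5))
print(f"extrapolated h = {h_hat:.2f} bpc")
```

The point of the sketch is only that the fitted limit is sensitive to noise in the observations at large *n*, which is why many samples per context length are needed.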

To that end, we conducted a large-scale cognitive test to estimate the entropy rate *h* of English through Amazon Mechanical Turk (AMT), a crowdsourcing service offered by Amazon that allowed us to gather a large number of participants in a short time and at a reasonable cost. We focused on English to allow a fair comparison with Shannon [1] and other works; other languages may well have different entropy rates, as can be seen in the comparison made in [4]. We collected a total of 172,954 character predictions from 683 different subjects. To the best of our knowledge, this experiment was more than twice as large as any previous study. At such a scale, the effects of factors that may influence the estimation of the entropy rate can be examined. Our analysis implies that Shannon's original experiment had an insufficient sample size to produce a convergent estimate. We finally obtained *h* ≈ 1.22 bpc for English, which is smaller than Shannon's original result of *h* = 1.3 bpc.

#### **2. Entropy Rate Estimation**

#### *2.1. Entropy Rate and n-Gram Entropy*

#### **Definition 1. Shannon entropy**

*Let $X$ be a stochastic process $\{X\_t\}\_{t=1}^{\infty}$, where each element belongs to a finite character set $\mathcal{X}$. Let $X\_i^j = X\_i, X\_{i+1}, \ldots, X\_{j-1}, X\_j$ for $i < j$, and let $P(X\_i^j)$ be the probability of $X\_i^j$. The Shannon entropy $H(X\_1^n)$ of a stochastic process is defined as*

$$H(X\_1^n) = -\sum\_{X\_1^n} P(X\_1^n) \log P(X\_1^n). \tag{1}$$
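For concreteness, Equation (1) can be evaluated directly whenever the distribution $P$ is known. The following self-contained sketch (the toy distributions are made-up examples) computes it in Python:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a probability distribution, per Equation (1)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four equally likely blocks: H = log2(4) = 2 bits.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
# A skewed distribution carries less information per draw.
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))      # -> ~1.36 bits
```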

#### **Definition 2. Entropy rate**

*The entropy rate h of a stochastic process X is defined as*

$$h = \lim\_{n \to \infty} \frac{1}{n} H(X\_1^n),\tag{2}$$

if such a value exists [12]. The entropy rate *h* is the average amount of information per element in a sequence of infinite length.
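For instance, if $X$ were an i.i.d. process drawing uniformly from the 27-character alphabet of English letters plus the space character (a deliberately unrealistic baseline), then $H(X\_1^n) = n \log\_2 27$, and Equation (2) would give

$$h = \lim\_{n \to \infty} \frac{1}{n} \, n \log\_2 27 = \log\_2 27 \approx 4.75 \text{ bpc}.$$

The gap between this baseline and estimates near 1.3 bpc reflects how much statistical structure English imposes on character sequences.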

In the following, let $F\_n$ denote the prediction complexity of $X\_n$ given $X\_1^{n-1}$, defined as follows:

$$F\_n \equiv H(X\_n | X\_1^{n-1}).\tag{3}$$

In other words, $F\_n$ quantifies the average uncertainty of the $n$-th character given a character string of length $n - 1$. If the stochastic process $X$ is stationary, $F\_n$ converges to the entropy rate $h$ as $n$ tends to infinity [12]:

$$h = \lim\_{n \to \infty} F\_n.\tag{4}$$
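Equations (3) and (4) suggest a simple plug-in estimator when text can be sampled directly: estimate the conditional distribution of the $n$-th character from $n$-gram counts and average its log-loss. The sketch below is illustrative only; counting estimators of this kind need vastly more text than the toy string used here as $n$ grows.

```python
import math
from collections import Counter

def f_n(text, n):
    """Plug-in estimate of F_n = H(X_n | X_1^{n-1}) from n-gram counts."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    total = sum(ngrams.values())
    # F_n = -sum over x_1^n of P(x_1^n) * log2 P(x_n | x_1^{n-1})
    return -sum(
        (count / total) * math.log2(count / contexts[gram[:-1]])
        for gram, count in ngrams.items()
    )

# Toy corpus; genuine estimates require far more text than this.
text = "the cat sat on the mat and the cat sat on the hat "
for n in (1, 2, 3):
    print(n, round(f_n(text, n), 3))
```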

In this work, $h$ was estimated via $F\_n$. A human *subject* was given the $n - 1$ characters $X\_1^{n-1}$ and asked to predict the next character $X\_n$. We aimed to collect a large number of predictions from many subjects. For a given subject and phrase, a *sample* denotes the *prediction* of $X\_n$ given a particular $X\_1^{n-1}$.

An *experimental session* is defined as a subject and phrase pair. In every experimental session, the subject first predicts $X\_1$, then $X\_2$ given $X\_1$, then $X\_3$ given $X\_1^2$, then $X\_4$ given $X\_1^3$, and so on, up to $X\_n$ given $X\_1^{n-1}$. Therefore, in an experimental session, a number of observations are acquired for a given phrase, with the maximum number of observations being the character length of the phrase.
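As a concrete rendering of this protocol, the sketch below (hypothetical helper code, not the interface actually deployed on AMT) enumerates the (prefix, target) observations that one experimental session yields for a phrase:

```python
def session_samples(phrase):
    """Yield one (prefix, target) pair per character of the phrase.

    The subject sees the prefix X_1^{n-1} and predicts the target X_n,
    so a phrase of length L yields at most L observations.
    """
    for n in range(1, len(phrase) + 1):
        yield phrase[:n - 1], phrase[n - 1]

for prefix, target in session_samples("the cat"):
    print(repr(prefix), "->", repr(target))
```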
