#### **1. Introduction**

At a very basic level, the quantitative study of natural languages is about counting words: if a word occurs very often in one text but not in a second one, then we conclude that this difference might have some kind of significance for classifying both texts [1]. If a word occurs very often after another word, then we conclude that this might have some kind of significance in speech and language processing [2]. In both examples, we can use the gained knowledge to make informed predictions "with accuracy better than chance" [3], thus leading us to information theory quite naturally. If we consider each word type *i* = 1, 2, ... , *K* as one distinct symbol, then we can count how often each word type appears in a document or text *t* and call the resulting word token frequency *f*<sub>*i*</sub>. We can then represent *t* as a distribution of word frequencies. In order to quantify the amount of information contained in *t*, we can calculate the Gibbs–Shannon entropy of this distribution as [4]:

$$H(p) = -\sum_{i=1}^{K} p_i \log_2(p_i) \tag{1}$$

where *p*<sub>*i*</sub> = *f*<sub>*i*</sub>/*N* is the maximum likelihood estimator of the probability of *i* in *t* for a database of $N = \sum_{i=1}^{K} f_i$ tokens. In [5], word entropies are estimated for more than 1000 languages. The results are then interpreted in light of information-theoretic models of communication, in which it is argued that word entropy constitutes a basic property of natural languages. *H*(*p*) can be interpreted as the average number of guesses required to correctly predict the type of a word token that is randomly sampled from the entire text base (more precisely, [4], Section 5.7, shows that the expected number of guesses *EG* satisfies *H*(*p*) ≤ *EG* < *H*(*p*) + 1). In the present paper, we analyze the lexical dynamics of the German weekly news magazine *Der Spiegel* (consisting of *N* = 236,743,042 word tokens, *K* = 4,009,318 different word types, and 365,514 articles that were published between 1947 and 2017; details on the database and preprocessing are presented in Section 2). If the only knowledge we possessed about the database were *K*, the number of different word types, then we would need on average *H*<sub>max</sub> = log<sub>2</sub>(*K*) = log<sub>2</sub>(4,009,318) ≈ 21.93 guesses to correctly predict the word type. Calculating *H* for our database based on Equation (1) using the corresponding probabilities for each *i* yields 12.28 bits. The difference between *H*<sub>max</sub> and *H*(*p*) is defined as information in [3]. Thus, knowledge of the non-uniform word frequency distribution gives us approximately 9.65 bits of information; or, put differently, we save on average almost 10 guesses when predicting the word type.
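To make the computation concrete, the following minimal Python sketch evaluates Equation (1) and the information gain *H*<sub>max</sub> − *H*(*p*); the toy token list is purely illustrative and not drawn from the actual database.

```python
# Minimal sketch of Equation (1): word entropy from token counts.
import math
from collections import Counter

def shannon_entropy(tokens):
    """Gibbs-Shannon entropy (in bits) of the word frequency distribution."""
    freqs = Counter(tokens)                 # token frequency f_i per word type i
    n = sum(freqs.values())                 # N = total number of tokens
    return -sum((f / n) * math.log2(f / n) for f in freqs.values())

tokens = "die katze jagt die maus die katze schlaeft".split()  # toy example
h = shannon_entropy(tokens)
h_max = math.log2(len(set(tokens)))         # H_max = log2(K), uniform baseline
print(f"H = {h:.2f} bits, H_max = {h_max:.2f} bits, "
      f"information = {h_max - h:.2f} bits")
```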

To quantify the (dis)similarity between two different texts or databases, word entropies can be used to calculate the so-called Jensen–Shannon divergence [6]:

$$D(p,q) := H\left(\frac{p+q}{2}\right) - \frac{1}{2}H(p) - \frac{1}{2}H(q) \tag{2}$$

where *p* and *q* are the (relative) word frequencies of the two texts and *p* + *q* is calculated by concatenating both texts. From a Bayesian point of view, *D*(*p*,*q*) can be interpreted as the expected amount of information gained from sampling one word token from the concatenation of both texts regarding the question of which of the two texts the word token belongs to [7]. If the two texts are identical, *D*(*p*,*q*) = 0, because sampling a word token does not provide any information regarding which text the token belongs to. If, on the other hand, the two texts do not have a single word type in common, then sampling one word token is enough to determine from which text the token comes, and correspondingly, *D*(*p*,*q*) = 1. The Jensen–Shannon divergence has already been applied to measure stylistic influences in the evolution of literature [8], cultural and institutional changes [9,10], and the dynamics of lexical evolution [11,12], and to quantify changing corpus compositions [13].
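A corresponding sketch of Equation (2), reusing `shannon_entropy` from the sketch above; following the description above, the mixture is obtained by concatenating the two token lists (which presupposes texts of comparable length).

```python
# Minimal sketch of Equation (2); shannon_entropy is defined above.
def jensen_shannon(tokens_p, tokens_q):
    """D(p, q) in bits, with the mixture obtained by concatenation."""
    h_mix = shannon_entropy(tokens_p + tokens_q)   # H((p + q) / 2)
    return (h_mix
            - 0.5 * shannon_entropy(tokens_p)
            - 0.5 * shannon_entropy(tokens_q))

print(jensen_shannon("a b a b".split(), "a b a b".split()))  # 0.0: identical texts
print(jensen_shannon("a a a a".split(), "b b b b".split()))  # 1.0: no shared types
```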

Perhaps the most intriguing aspect of word frequency distributions is the fact that they can be described remarkably well by a simple relationship known as Zipf's law [14]: if one assigns rank *r* = 1 to the most frequent word (type), rank *r* = 2 to the second most frequent word, and so on, then the frequency of a word and its rank *r* are related as follows:

$$p(r) \propto r^{-\gamma} \tag{3}$$

where the exponent γ is a parameter that has to be determined empirically. An estimation of γ by maximum likelihood (as described in [15]) for our database yields 1.10. However, when analyzing word frequency distributions, the main obstacle is that all quantities basically vary systematically with the sample size, i.e., the number of word tokens in the database [16,17]. To visualize this, we randomly arranged the order of all articles of our database. This step was repeated 10 times in order to create 10 different versions of our database. For each version, we estimate *H* and γ after every *n* = 2<sup>*k*</sup> consecutive tokens, where *k* = 6, 7, ... , log<sub>2</sub>(*N*) = 27. Figure 1 shows a Simpson's Paradox [18] for the resulting data: an apparently strong positive relationship between *H* and γ is observed across all datapoints (Spearman ρ = 0.99). However, when the sample size is kept constant, this relationship completely changes: if the correlation between *H* and γ is calculated for each *k*, the results indicate a strong negative relationship (ρ ranges between −0.98 and −0.64 with a median of −0.92). The reason for this apparent contradiction is the fact that both *H* and γ monotonically increase with the sample size. When studying word frequency distributions quantitatively, it is essential to take this dependence on the sample size into account [16].
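The following sketch illustrates one way such a maximum likelihood estimate can be computed for a finite Zipf distribution over ranks; the actual estimation procedure follows [15] and may differ in detail, and the frequency vector below is fabricated for illustration.

```python
# Sketch: maximum-likelihood estimation of the Zipf exponent (Equation (3)).
import numpy as np
from scipy.optimize import minimize_scalar

def zipf_mle(freqs):
    """Estimate gamma from token frequencies (given in any order)."""
    f = np.sort(np.asarray(freqs, dtype=float))[::-1]  # rank 1 = most frequent
    ranks = np.arange(1, len(f) + 1)

    def neg_log_likelihood(gamma):
        log_z = np.log(np.sum(ranks ** -gamma))        # normalization constant
        return -np.sum(f * (-gamma * np.log(ranks) - log_z))

    return minimize_scalar(neg_log_likelihood, bounds=(0.1, 5.0),
                           method="bounded").x

# frequencies generated to follow p(r) ~ r^-1.1, so the estimate is close to 1.10
print(round(zipf_mle([1000, 466, 298, 217, 170]), 2))
```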

**Figure 1.** A Simpson's Paradox for word frequency distributions. Here, the word entropy *H* and the exponent of the Zipf distribution γ are estimated after every *n* = 2<sup>*k*</sup> consecutive tokens, where *k* = 6, 7, ... , log<sub>2</sub>(*N*) for 10 different random re-arrangements of the database; each dot corresponds to one observed value. The blue line represents a locally weighted regression of *H* on γ (with a bandwidth of 0.8). It indicates a strong positive relationship between *H* and γ (Spearman ρ = 0.99). However, when the sample size is held constant, this relationship completely changes, as indicated by the orange lines that correspond to separate locally weighted regressions of *H* on γ for each *k*. Here, the results indicate a strong negative relationship between *H* and γ (ρ ranges between −0.98 and −0.64 with a median of −0.92). The reason for this apparent contradiction is the fact that both *H* and γ monotonically increase with the sample size.

Another important aspect of word distributions is the fact that word frequencies vary over many orders of magnitude, as visualized in Figure 2. On the one hand, Figure 2a shows that there are very few word types that occur very often. For example, the 100 most frequent word types account for more than 40% of all word occurrences. Typically, many of those word types are function words [16] expressing grammatical relationships, such as adpositions or conjunctions. On the other hand, Figure 2b shows that there are a large number of word types with a very low frequency of occurrence. For example, more than 60% of all word types occur only once, and less than 3% of all word types have a frequency of occurrence of more than 100 in our database. Many of those low-frequency words are content words that carry the meaning of a sentence, e.g., nouns, (lexical) verbs, and adjectives. In addition to the sample size dependence outlined above, it is important to take this broad range of frequencies into account when quantitatively studying word frequency distributions [19].

**Figure 2.** Visualization of the word frequency distribution of our database. Cumulative distribution (in %) as a function of (**a**) the rank and (**b**) the word frequency.

In this context, it was recently demonstrated that generalized entropies of order α, also called Havrda–Charvat–Lindhard–Nielsen–Aczél–Daróczy–Tsallis entropies [20], offer novel and interesting opportunities to quantify the similarity of symbol sequences [21,22]. The generalized entropy of order α can be written as:

$$H_\alpha(p) = \frac{1}{\alpha - 1} \left( 1 - \sum_{i=1}^{K} p_i^\alpha \right) \tag{4}$$

where α is a free parameter. For α = 1, the standard Gibbs–Shannon entropy is recovered. Correspondingly, a generalization of the standard Jensen–Shannon divergence (Equation (2)) can be obtained by replacing *H* (Equation (1)) with *H*<sub>α</sub> (Equation (4)), thus leading to a spectrum of divergence measures *D*<sub>α</sub>, parametrized by α [22]. For the analysis of the statistical properties of natural languages, this parameter is highly interesting because, as demonstrated by [21,22], varying the α-parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum: if α is increased (decreased), then the weight of the most frequent words is increased (decreased). As pointed out by an anonymous reviewer, a similar idea was already reported in the work of Tanaka-Ishii and Aihara [23], who studied a different formulation of generalized entropy, the so-called Rényi entropy of order α [24]. Because we are especially interested in using generalized entropies to quantify the (dis)similarity between two different texts or databases, following [21,22], we chose to focus on the Havrda–Charvat–Lindhard–Nielsen–Aczél–Daróczy–Tsallis generalization instead of the Rényi formulation: a divergence measure based on the latter can become negative for α > 1 [25], while the corresponding divergence measure based on the former can be shown to be strictly non-negative [20,22]. In addition, *D*<sub>α</sub>(*p*,*q*) is the square of a metric for α ∈ (0, 2], i.e., (i) *D*<sub>α</sub>(*p*,*q*) ≥ 0, (ii) *D*<sub>α</sub>(*p*,*q*) = 0 ⇐⇒ *p* = *q*, (iii) *D*<sub>α</sub>(*p*,*q*) = *D*<sub>α</sub>(*q*,*p*), and (iv) √*D*<sub>α</sub> obeys the triangle inequality [7,20,22].
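A minimal sketch of *H*<sub>α</sub> (Equation (4)) and the resulting generalized divergence *D*<sub>α</sub>, following the construction of [21,22]; the two probability vectors below are hypothetical and assumed to be aligned over the same vocabulary.

```python
# Sketch of the generalized entropy (Equation (4)) and divergence D_alpha.
import numpy as np

def h_alpha(p, alpha):
    """Generalized entropy of order alpha; alpha = 1 recovers Equation (1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # 0 * log(0) := 0 by convention
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log2(p))
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

def d_alpha(p, q, alpha):
    """Generalized Jensen-Shannon divergence between aligned distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return (h_alpha((p + q) / 2, alpha)
            - 0.5 * h_alpha(p, alpha) - 0.5 * h_alpha(q, alpha))

p = np.array([0.7, 0.2, 0.1, 0.0])                 # hypothetical text p
q = np.array([0.1, 0.2, 0.3, 0.4])                 # hypothetical text q
for a in (0.25, 0.75, 1.00, 1.50, 2.00):
    print(f"alpha = {a:.2f}: D = {d_alpha(p, q, a):.4f}")
```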

In addition, [21] estimated the size of the database that is needed to obtain reliable estimates of generalized divergences. For instance, [21] showed that only the 100 most frequent words contribute to *H*<sub>α</sub> and *D*<sub>α</sub> for α = 2.00, while all other words are practically irrelevant. This number quickly grows as α decreases. For example, database sizes of *N* ≈ 10<sup>8</sup> tokens are needed for a robust estimation of the standard Jensen–Shannon divergence (Equation (2)), i.e., for α = 1.00. This connection makes the approach of [21,22] particularly interesting in relation to the systematic influence of the sample size demonstrated above (cf. Figure 1).

In this study, this approach is systematically and empirically evaluated by analyzing the lexical dynamics of the *Der Spiegel* periodical. The remainder of the paper is structured as follows: In the next section, details on the database and preprocessing are given (Section 2). In Sections 3.1 and 3.2, the dependence of both *H*<sub>α</sub> and *D*<sub>α</sub> on the sample size is tested for different α-parameters. This section is followed by a case study, in which we demonstrate that the influence of sample size makes it difficult to quantify lexical dynamics and language change, and also show that standard sampling approaches do not solve this problem (Section 3.3). The paper ends with some concluding remarks regarding the consequences of the results for the statistical analysis of languages (Section 4).

#### **2. Materials and Methods**

In the present study, we used all 365,514 articles that were published in the German weekly news magazine *Der Spiegel* between January 1947, when the magazine was first published, and December 2017. To read in and tokenize the texts, we used the *TreeTagger* with a German parameter file [26]. All characters were converted to lowercase. Punctuation and cardinal numbers (both treated as separate words by the TreeTagger) were removed. However, from a linguistic point of view, changes in the usage frequencies of punctuation marks and cardinal numbers are also interesting. For instance, a frequency increase of the full stop could be indicative of decreases in syntactic complexity [15]. In Appendix A, we therefore present and discuss additional results in which punctuation and cardinal numbers were not removed from the data.
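The following rough sketch mimics these preprocessing steps (lowercasing, removal of punctuation and cardinal numbers); note that a naive whitespace tokenizer merely stands in for the actual *TreeTagger* pipeline [26], so it is illustrative only.

```python
# Rough stand-in for the preprocessing pipeline; the actual tokenization
# and tagging were done with the TreeTagger and a German parameter file.
import re

def preprocess(text):
    tokens = [t.lower() for t in text.split()]      # naive tokenizer + lowercase
    # drop punctuation-only tokens and cardinal numbers, as described above
    return [t for t in tokens
            if not re.fullmatch(r"[\W_]+|\d+(?:[.,]\d+)*", t)]

print(preprocess("Der SPIEGEL erscheint seit 1947 , zuerst in Hannover ."))
# -> ['der', 'spiegel', 'erscheint', 'seit', 'zuerst', 'in', 'hannover']
```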

In total, our database consists of *N* = 236,743,042 word tokens and *K* = 4,009,318 different word types.

Motivated by the studies of [21,22], we chose the following five α-values to study the empirical behavior of generalized entropies and generalized divergences: 0.25, 0.75, 1.00, 1.50, and 2.00. To highlight that varying α makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum, we take advantage of the fact that *H*<sub>α</sub> can be written as a sum over different words, where each individual word type *i* contributes:

$$\begin{cases} \dfrac{p_i^\alpha - \frac{1}{K}}{1 - \alpha}, & \text{for } \alpha \neq 1.00 \\[2ex] -p_i \log_2(p_i), & \text{for } \alpha = 1.00 \end{cases} \tag{5}$$

In Table 1, we divided the word types into different groups according to their token frequency (column 1). The number of word types in each group *g* = 1, 2, ... , *G* is given in column 2. For each group, column 3 presents three randomly chosen examples.


**Table 1.** Contribution (in %) of word types with different token frequencies as a function of α \*.

\* Values are rounded for illustration purposes only throughout this paper.

This implies that the relative contribution *C*(*g*) per group can be calculated as (see also [21], Equation (5)):

$$C(g) = \begin{cases} \dfrac{\sum_{i \in g} p_i^\alpha}{\sum_{i=1}^{K} p_i^\alpha}, & \text{for } \alpha \neq 1.00 \\[3ex] \dfrac{-\sum_{i \in g} p_i \log_2(p_i)}{-\sum_{i=1}^{K} p_i \log_2(p_i)}, & \text{for } \alpha = 1.00 \end{cases} \tag{6}$$

where the sums in the numerators run over all word types *i* belonging to group *g*.

Columns 4–8 of Table 1 show the relative contribution (in %) of each group to *H*<sub>α</sub> as a function of α. For lower values of α, *H*<sub>α</sub> is dominated by word types with lower token frequencies. For instance, hapax legomena, i.e., word types that occur only once, contribute almost half of *H*<sub>α=0.25</sub>. For larger values of α, only the most frequent words contribute to *H*<sub>α</sub>. For example, the 27 word types with a token frequency of more than 1,000,000 contribute more than 92% to *H*<sub>α=2.00</sub>. Because words in different frequency ranges have different grammatical and pragmatic properties, varying α makes it possible to study different aspects of the word frequency spectrum [21].
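A sketch of how Equation (6) can be evaluated; the frequency dictionary and the hapax group below are toy stand-ins for the actual frequency groups of Table 1.

```python
# Sketch of Equation (6): relative contribution C(g) of a frequency group.
import numpy as np

def contribution(freqs, group, alpha):
    """freqs: dict mapping word type -> token frequency; group: set of types."""
    n = sum(freqs.values())
    p = {w: f / n for w, f in freqs.items()}
    if np.isclose(alpha, 1.0):
        term = {w: -pi * np.log2(pi) for w, pi in p.items()}
    else:
        term = {w: pi ** alpha for w, pi in p.items()}
    return sum(term[w] for w in group) / sum(term.values())

freqs = {"der": 120, "die": 100, "haus": 5, "baum": 1, "quark": 1}  # toy counts
hapax = {w for w, f in freqs.items() if f == 1}    # word types occurring once
for a in (0.25, 1.00, 2.00):
    share = 100 * contribution(freqs, hapax, a)    # large for low alpha,
    print(f"alpha = {a:.2f}: hapax share = {share:.1f}%")  # tiny for high alpha
```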

As written above, we are interested in testing the dependence of both *H*<sub>α</sub> and *D*<sub>α</sub> on the sample size for the different α-values. Let us note that each article in our database can be described by different attributes, e.g., publication date, subject matter, length, category, or author. Of course, this list of attributes is not exhaustive but can be freely extended depending on the research objective. In order to balance the articles' characteristics across the corpus, we prepared 10 versions of our database, each with a different random arrangement of the order of all articles. To study the convergence of *H*<sub>α</sub>, we computed *H*<sub>α</sub> after every *n* = 2<sup>*k*</sup> consecutive tokens for each version, where *k* = 6, 7, ... , log<sub>2</sub>(*N*) = 27. For *D*<sub>α</sub>, we compared the first *n* = 2<sup>*k*</sup> word tokens with the last *n* = 2<sup>*k*</sup> word tokens of each version of our database. Here, *k* = 6, 7, ... , 26. For instance, for *k* = 26, the first 67,108,864 word tokens are compared with the last 67,108,864 word tokens by calculating the generalized divergence between both "texts" for different α-values. Because the article order was randomized, it can be inferred that, random fluctuations aside, any systematic differences are caused by differences in the sample size.
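A sketch of this convergence experiment, reusing `h_alpha` from the sketch above; `articles` is a hypothetical list of token lists standing in for the database.

```python
# Sketch of the convergence experiment: shuffle the article order, then
# estimate H_alpha after every n = 2^k consecutive tokens (h_alpha as above).
import math
import random
from collections import Counter

def convergence_curve(articles, alpha, k_min=6, seed=0):
    articles = list(articles)                       # copy before shuffling
    random.Random(seed).shuffle(articles)           # one random arrangement
    stream = [tok for art in articles for tok in art]
    k_max = int(math.log2(len(stream)))
    curve = []
    for k in range(k_min, k_max + 1):
        n = 2 ** k
        counts = Counter(stream[:n])                # first n consecutive tokens
        p = [f / n for f in counts.values()]
        curve.append((n, h_alpha(p, alpha)))
    return curve                                    # (sample size, H_alpha) pairs
```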

As outlined above, our initial research interest concerned the use of generalized entropies and divergences in order to measure lexical change rates at specific ranges of the word frequency spectrum. To this end, we used the publication date of each article on a monthly basis to create a diachronic version of our database. Figure 3 visualizes the corpus size *N*<sub>*t*</sub> for each *t*, where each monthly observation is identified by a variable containing the year *y* = 1947, 1948, ... , 2017 and the month *m* = 1, 2, ... , 12.

**Figure 3.** Sample size of the database as a function of time. The gray line depicts the raw data, while the orange line shows a symmetric 25-month moving-average smoother that highlights the central tendency of the series at each point in time.

Instead of calculating the generalized Jensen–Shannon divergences for two different texts *p* and *q*, *D*<sub>α</sub> was calculated for successive moments in time, i.e., *D*<sub>α</sub>(*t*, *t* − 1), in order to estimate the rate of lexical change at a given time point *t* [11,12]. For instance, *D*<sub>α</sub> at *y* = 2000 and *m* = 1 represents the generalized divergence for a corresponding α-value between all articles that were published in January 2000 and those published in December 1999. The resulting series of month-to-month changes could then be analyzed in a standard time-series analysis framework. For example, we can test whether the series exhibits any large-scale tendency to change over time. A series with a positive trend increases over time, which would be indicative of an increasing rate of lexical change. It would also be interesting to look at the first differences of the series, as an upward trend here in addition to an upward trend in the actual series would mean that the rate of lexical change is increasing at an increasing rate.
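A sketch of the resulting month-to-month series, reusing `d_alpha` from the sketch above; `monthly` is a hypothetical chronologically ordered list of `Counter` objects holding the token counts of all articles of one month.

```python
# Sketch of the series D_alpha(t, t-1) over consecutive months.
import numpy as np

def divergence_series(monthly, alpha):
    series = []
    for prev, curr in zip(monthly, monthly[1:]):
        vocab = sorted(set(prev) | set(curr))       # align the two vocabularies
        p = np.array([prev[w] for w in vocab], dtype=float)  # Counter -> 0 if absent
        q = np.array([curr[w] for w in vocab], dtype=float)
        series.append(d_alpha(p / p.sum(), q / q.sum(), alpha))
    return series                                    # one value per month t > 1
```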

However, because the sample size clearly varies as a function of time (cf. Figure 3), it was essential to rule out the possibility that this variation systematically influences the results. Therefore, we generated a second version of this diachronic database in which we first randomly arranged the order of all articles again. We then used the first *N*<sub>*t*=1</sub> words of this version of the database to generate a new corpus that has the same length (in words) as the original corpus at *t* = 1 but in which the diachronic signal is destroyed. We then proceeded and used the next *N*<sub>*t*=2</sub> words to generate a corpus that has the same length as the original corpus at *t* = 2, and so on. For example, the length of a concatenation of all articles that were published in *Der Spiegel* in January 1947 is 94,716 word tokens. Correspondingly, our comparison corpus at this point in time also consisted of 94,716 word tokens, but the articles of which it consisted could belong to any point in time between 1947 and 2017. We then computed all *D*<sub>α</sub>(*t*, *t* − 1) values for both the original version of our database and the version with a destroyed diachronic signal. We tentatively called this a "litmus test", because it determines whether our results can be attributed to real diachronic changes or whether there is a systematic bias due to the varying sample sizes.

*Statistical analysis*: To test whether *H*<sub>α</sub> and *D*<sub>α</sub> vary as a function of the sample size without making any assumptions regarding the functional form of the relationship, we used the non-parametric Spearman correlation coefficient, denoted as ρ. It assesses whether there is a monotonic relationship between two variables and is computed as Pearson's correlation coefficient on the ranks and average ranks of the two variables. The significance of the observed coefficient was determined by Monte Carlo permutation tests in which the observed values of the sample size are randomly permuted 10,000 times. The null hypothesis is that *H*<sub>α</sub>/*D*<sub>α</sub> does not vary with the sample size. If this is the case, then the sample size becomes arbitrary and can thus be randomly re-arranged, i.e., permuted. Let *c* denote the number of times the absolute ρ-value of the derived dataset is *greater than or equal to* the absolute ρ-value computed on the original data. A corresponding coefficient was labeled as "statistically significant" if *c* < 10, i.e., *p* < 0.001. In cases where *l*, i.e., the number of datapoints, was lower than or equal to 7, an exact test over all *l*! permutations was calculated. Here, let *c*\* denote the number of times the absolute ρ-value of the derived dataset is *greater than* the absolute ρ-value computed on the original data. A coefficient was labeled as "statistically significant" if *c*\*/*l*! < 0.001.
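A sketch of the Monte Carlo variant of this permutation test; the exact test over all *l*! permutations for *l* ≤ 7 is omitted for brevity.

```python
# Sketch of the Monte Carlo permutation test for the Spearman correlation.
import numpy as np
from scipy.stats import spearmanr

def permutation_test(sample_sizes, values, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = abs(spearmanr(sample_sizes, values).correlation)
    c = 0
    for _ in range(n_perm):
        permuted = rng.permutation(sample_sizes)    # null: sample size is arbitrary
        if abs(spearmanr(permuted, values).correlation) >= observed:
            c += 1
    return observed, c                              # "significant" if c < 10
```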

*Data availability and reproducibility*: All datasets used in this study are available in Dataverse (https://doi.org/10.7910/DVN/OP9PRL). For copyright and license reasons, each actual word type is replaced by a unique numerical identifier. Regarding further data access options, please contact the corpus linguistics department at the Institute for the German Language (IDS) (korpuslinguistik@ids-mannheim.de). In the spirit of reproducible science, one of the authors (A.K.) first analyzed the data using Stata and prepared a draft. Another author (S.W.) then used the draft and the available datasets to reproduce all the results using R. The results of this replication and the code (Stata and R) required to reproduce all the results presented in this paper are available in Dataverse (https://doi.org/10.7910/DVN/OP9PRL).
