**4. Discussion**

In this paper, we explored the possibilities of using generalized entropies to analyze the lexical dynamics of natural language data. Using the α-parameter to automatically magnify differences between texts at specific scales of the corresponding word frequency spectrum is attractive, as it promises a more objective selection method than, e.g., that of [8], who use a pre-compiled list of content-free words, or of [12], who analyzes differences within different part-of-speech classes.
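To illustrate what magnification at specific scales means in practice, the following is a minimal R sketch (R being one of the languages of our replication code). It assumes the Tsallis-type formulation of the generalized entropy used in the main text (cf. [17]; see the comment in the code) and a synthetic Zipfian vocabulary in place of the actual corpus; the estimates on nested subsamples also preview the sample-size dependence discussed next.

```r
# Minimal sketch (assumption: the Tsallis-type formulation used in the
# main text, cf. [17]): H_alpha(p) = (sum(p_i^alpha) - 1) / (1 - alpha),
# with the Shannon entropy (natural logarithm) as the alpha -> 1 limit.
generalized_entropy <- function(freq, alpha) {
  p <- freq[freq > 0] / sum(freq)   # relative frequencies of the word types
  if (abs(alpha - 1) < 1e-12) {
    return(-sum(p * log(p)))        # Shannon limit for alpha -> 1
  }
  (sum(p^alpha) - 1) / (1 - alpha)
}

# Toy corpus: 2^20 word tokens drawn from a Zipfian distribution
# over 10^4 word types (a stand-in for real data).
set.seed(42)
n_types <- 1e4
zipf_p  <- (1 / seq_len(n_types)) / sum(1 / seq_len(n_types))
tokens  <- sample.int(n_types, size = 2^20, replace = TRUE, prob = zipf_p)

# Small alpha magnifies the contribution of the many rare types; large
# alpha that of the few frequent ones. Estimates on nested subsamples
# illustrate how strongly the low-alpha estimates depend on the size n.
for (n in 2^(12:20)) {
  freq <- as.numeric(table(tokens[seq_len(n)]))
  cat(sprintf("n = %7d | H_0.25 = %8.1f | H_1.00 = %6.3f | H_2.00 = %6.4f\n",
              n,
              generalized_entropy(freq, 0.25),
              generalized_entropy(freq, 1.00),
              generalized_entropy(freq, 2.00)))
}
```

In a run like this, the α = 0.25 estimates keep growing with *n* (they are dominated by the many rare types), whereas the α = 2.00 estimates stabilize quickly; this is precisely the pattern the following paragraphs return to.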

In line with other studies [17,23,27–29], the results demonstrate that it is essential for the analysis of natural language to always take into account the systematic influence of the sample size. With the exception of *H*α for α = 2.00 at larger sample sizes, all quantities based on generalized entropies seem to covary strongly with the sample size (also see [23] for similar results based on Rényi's formulation of generalized entropies). In his monograph on word frequency distributions, Baayen [16] introduces the two fundamental methodological issues in lexical statistics:

> The sample size crucially determines a great many measures that have been proposed as characteristic text constants. However, the values of these measures change systematically as a function of the sample size. Similarly, the parameters of many models for word frequency distribution [sic!] are highly dependent on the sample size. This property sets lexical statistics apart from most other areas in statistics, where an increase in the sample size leads to enhanced accuracy and not to systematic changes in basic measures and parameters. [...] The second issue concerns the theoretical assumption [...] that words occur randomly in texts. This assumption is an obvious simplification that, however, offers the possibility of deriving useful formulae for text characteristics. The crucial question, however, is to what extent this simplifying assumption affects the reliability of the formulae when applied to actual texts and corpora. (p. 1)

The main message of this paper is that these two fundamental issues also pose a strong challenge to the application of information theory in the quantitative study of natural language signals. In addition, the results of the case study (cf. Section 3.3) indicate that the two fundamental issues apparently interact. As mentioned above, numerous studies have used the Jensen–Shannon divergence or related measures without an explicit "Litmus test". Let us mention two examples from our own research:

- the study presented in [12];
- the study presented in [15].
Both studies are based on data from the Google Books Ngram corpora, made available by [30]. These corpora contain yearly token frequencies for each word type in over 8 million books, i.e., 6% of all books ever published [31]. To avoid a potential systematic bias due to strongly changing corpus sizes, random samples of equal size were drawn from the data in both [12] and [15]. However, as demonstrated in Section 3.3, this simplification is apparently problematic: for the statistical structure of the resulting word frequency distribution, it seems to make a difference whether we randomly sample *N* word tokens or keep the first *N* word tokens. It is worth pointing out again that, without the "Litmus test", the interpretation of the results presented in Section 3.3 would have been completely different, because randomly drawing word tokens from the data does not seem to break the sample-size dependence. It is an empirical question whether the results presented in [12,15] and other comparable papers would pass a "Litmus test". In light of the results presented in this paper, we are rather skeptical, thus echoing the call of [22] that it is "essential to clarify what is the role of finite-size effects in the reported conclusions, in particular in the (typical) case that database sizes change over time" (p. 8). One could even go so far as to ask whether relative frequencies that are compared between databases of different sizes are systematically affected by the varying database sizes.

However, the test scheme as we introduced it presupposes access to the full-text data. Due to copyright constraints, for instance, access to the Google Books Ngram data is restricted to token frequencies for words (and phrases) that occur at least 40 times in the corpus; an analogous "Litmus test" is therefore not possible there. At our institute, we are fortunate to have access to the full-text data of our database. Notwithstanding, copyright and licensing are a major issue here as well [32]. To solve this problem for our study, we replaced each word type with a unique numerical identifier, as explained in Section 3.3 (see the sketch below). For our focus of research, such a pseudonymization strategy is fine. However, there are many scenarios where, depending on the research objective, the actual word strings matter, making it necessary to develop a different access and publication strategy. It goes without saying that, in all cases, full-text access is the best option.
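To make the contrast at the heart of the "Litmus test" concrete, here is a minimal R sketch. It reuses `generalized_entropy()` from the sketch above and assumes the generalized Jensen–Shannon divergence of [17] (see the comment in the code); `tokens_t1` and `tokens_t` are hypothetical token streams for two consecutive years, and the pseudonymization step mirrors the identifier replacement described in Section 3.3.

```r
# Minimal sketch of the two sampling schemes and the pseudonymization step.
# Assumption: the generalized Jensen-Shannon divergence of [17],
#   D_alpha(p, q) = H_alpha((p + q) / 2) - (H_alpha(p) + H_alpha(q)) / 2,
# with generalized_entropy() as defined in the sketch above. Relative
# frequencies may be passed directly, because the function renormalizes.
d_alpha <- function(p, q, alpha) {
  generalized_entropy((p + q) / 2, alpha) -
    (generalized_entropy(p, alpha) + generalized_entropy(q, alpha)) / 2
}

# Relative frequencies of two token vectors over their joint vocabulary.
joint_freqs <- function(tok_p, tok_q) {
  types <- union(tok_p, tok_q)
  list(p = tabulate(match(tok_p, types), length(types)) / length(tok_p),
       q = tabulate(match(tok_q, types), length(types)) / length(tok_q))
}

# The two reduction schemes contrasted in Section 3.3:
first_n  <- function(tokens, n) tokens[seq_len(n)]  # keep the first n tokens
random_n <- function(tokens, n) sample(tokens, n)   # draw n tokens at random

# Pseudonymization: replace each word string with a numeric identifier
# (in order of first appearance), as in Section 3.3.
pseudonymize <- function(tokens) match(tokens, unique(tokens))

# Hypothetical usage for two consecutive years t - 1 and t:
#   f <- joint_freqs(first_n(tokens_t1, n), first_n(tokens_t, n))
#   d_alpha(f$p, f$q, alpha = 2)
#   f <- joint_freqs(random_n(tokens_t1, n), random_n(tokens_t, n))
#   d_alpha(f$p, f$q, alpha = 2)
```

Under a scheme like this, a series of *D*α(*t*, *t* − 1) values can be computed per year pair for both reduction schemes; checking whether the resulting signal merely tracks the (minimum) sample size *n* is then the kind of diagnostic the "Litmus test" is meant to provide.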

While the peculiarities of word frequency distributions make the analysis of natural language data more difficult than that of many other empirical phenomena, we hope that our analyses (especially the "Litmus test") also demonstrate that textual data offer novel possibilities for answering research questions. Put differently, natural language data contain a lot of information that can be harnessed. For example, two reviewers pointed out that it could make sense to develop a method that recovers an unbiased lexico-dynamical signal by removing the "Litmus test" signal from the original signal. This is an interesting avenue for future research.

**Supplementary Materials:** The replication data and code (Stata and R) required to reproduce all the results presented in this paper are available in Dataverse (https://doi.org/10.7910/DVN/OP9PRL).

**Author Contributions:** Conceptualization, A.K., S.W., and C.M.-S.; methodology and study design, A.K.; data preparation and management, A.K. and S.W.; data analysis—original analysis, A.K.; data analysis—replication, S.W.; visualization, A.K.; writing—original draft preparation, A.K.; writing—review and editing, A.K., S.W., and C.M.-S. All the authors have read and approved the final manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We thank Sarah Signer for proofreading. We also thank Gerardo Febres and the two anonymous reviewers for their insightful feedback.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Appendix A Inclusion of Punctuation and Cardinal Numbers.**

Here, punctuation and numbers are included. This version of the database consists of *N* = 286,729,999 word tokens and *K* = 4,056,122 different word types. Table A1 corresponds to Table 1. Because (especially) punctuation symbols have very high token frequencies, the contribution of the highest frequency groups increases when punctuation is not removed from the database. However, the results are still qualitatively very similar. Table A2 corresponds to Table 2. For α ≤ 1.50, removing punctuation does not qualitatively affect the results. However, for α = 2.00, except for *n* = 2<sup>24</sup>, none of the correlation coefficients pass the permutation test. Again, this indicates that α = 2.00 is a pragmatic choice when calculating *H*α. However, it also demonstrates that the conceptual decision to remove punctuation/cardinal numbers can affect the results. Table A3 corresponds to Table 3. The results are not qualitatively affected by the exclusion of punctuation/cardinal numbers. The same conclusion can be drawn for Table A4, which corresponds to Table 4.


**Table A1.** Contribution of word types with different token frequencies as a function of α.

**Table A2.** Spearman correlation between the sample size and *H*α for different α-values \*.


\* An asterisk indicates that the corresponding correlation coefficient passed the permutation test at *p* < 0.001. For minimum sample sizes above 2<sup>20</sup>, an exact permutation test is calculated.


**Table A3.** Spearman correlation between the sample size and *D*α for different α-values \*.

\* An asterisk indicates that the corresponding correlation coefficient passed the permutation test at *p* < 0.001. For minimum sample sizes above 2<sup>20</sup>, an exact permutation test is calculated.

**Table A4.** Spearman correlation between the sample size and *D*α(*t*, *t* − 1) for the original data and for the "Litmus test" for α = 1.00 and α = 2.00.


\* An asterisk indicates that the corresponding correlation coefficient passed the permutation test at *p* < 0.001.

#### **Appendix B Replication of Table 2 for a Different Formulation of Generalized Entropy.**

Here, we replicate Table 2 for a different formulation of generalized entropy, the so-called Rényi entropy of order α [24]; it can be written as:

$$H'\_{\alpha}(p) = \frac{1}{1 - \alpha} \log\_2 \left( \sum\_{i=1}^{K} p\_i^{\alpha} \right) \tag{A1}$$
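A direct transcription of Equation (A1) into R reads as follows (a minimal sketch; at α = 1, where Equation (A1) is undefined, the Shannon entropy in bits is returned as the limit):

```r
# Renyi entropy of order alpha (Equation (A1)), in bits.
renyi_entropy <- function(freq, alpha) {
  p <- freq[freq > 0] / sum(freq)   # relative frequencies of the word types
  if (abs(alpha - 1) < 1e-12) {
    return(-sum(p * log2(p)))       # Shannon limit for alpha -> 1
  }
  log2(sum(p^alpha)) / (1 - alpha)
}

# Sanity check: for a uniform distribution over K = 1024 word types,
# H'_alpha = log2(K) = 10 bits for every alpha.
renyi_entropy(rep(1, 1024), alpha = 0.25)  # 10
renyi_entropy(rep(1, 1024), alpha = 2.00)  # 10
```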


**Table A5.** Spearman correlation between the sample size and *H*′α for different α-values \*.

\* An asterisk indicates that the corresponding correlation coefficient passed the permutation test at *p* < 0.001. For minimum sample sizes above 2<sup>20</sup>, an exact permutation test is calculated.
