*3.3. Case Study*

As previously outlined, our initial idea was to use generalized divergences to measure the rate of lexical change at specific ranges of the word frequency spectrum. In what follows, we estimate the rate by calculating *D*α for successive months, i.e., *D*α(*t*, *t* − 1). To rule out a potential systematic influence of the varying sample size, we also calculated *D*α(*t*, *t* − 1) for our comparison corpus in which the diachronic signal was destroyed ("Litmus test").
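For concreteness, the following minimal Python sketch shows how *D*α(*t*, *t* − 1) could be computed for two successive monthly corpora. It assumes that *H*α in Equation (2) is a Tsallis-type generalized entropy of order α (an assumption on our part; it reduces to the Shannon entropy for α → 1), and all function and variable names are illustrative rather than part of the original analysis pipeline.

```python
from collections import Counter
import numpy as np

def generalized_entropy(p, alpha):
    """Generalized entropy of order alpha (Tsallis-type family, assumed here
    for the H_alpha of Equation (2)); reduces to Shannon entropy for alpha -> 1."""
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

def d_alpha(tokens_t, tokens_prev, alpha):
    """D_alpha(t, t-1): generalized Jensen-Shannon divergence between the
    word-frequency distributions of two successive monthly corpora."""
    c_p, c_q = Counter(tokens_t), Counter(tokens_prev)
    vocab = sorted(set(c_p) | set(c_q))                  # shared vocabulary
    p = np.array([c_p[w] for w in vocab], dtype=float)
    q = np.array([c_q[w] for w in vocab], dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)                                    # equal-weight mixture
    return (generalized_entropy(m, alpha)
            - 0.5 * generalized_entropy(p, alpha)
            - 0.5 * generalized_entropy(q, alpha))
```

For the series shown in Figure 5, such a function would be applied to every pair of consecutive months, both for the original data and, in exactly the same way, for the comparison corpus of the "Litmus test".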

For α, we chose 2.00 and 1.00. On the one hand, the analyses of [21,22] and our analysis presented above indicate that α = 2.00 seems to be the most robust choice. On the other hand, we chose α = 1.00, i.e., the original Jensen–Shannon divergence, because, as explained above, it has already been employed in the context of analyzing natural language data without explicitly testing the potential influence of varying sample sizes. Figure 5 shows our results. If we only looked at the plots on the left side (blue lines), the results would look very interesting, as there is a clear indication that the rate of lexical change decreases as a function of time for both α = 1.00 and α = 2.00. However, looking at the plots in the middle reveals that a very similar pattern emerges for the comparison data. For our "Litmus test", we destroyed all diachronic information except for the varying sample sizes; nevertheless, our conclusions would have been more or less identical. Interestingly, the patterns in Figure 5 clearly resemble the pattern of the sample size in Figure 3 (in reverse order) and thus suggest a negative association between *D*α(*t*, *t* − 1) and the sample size. To test this observation, we calculated the Spearman correlation between the sample size and *D*α(*t*, *t* − 1) for both α = 1.00 and α = 2.00 and ran a permutation test. Table 4, row 1, shows that there is a significant, strong negative correlation between the sample size and *D*α for both α = 1.00 and α = 2.00. Rows 2–5 present different approaches to removing the sample size dependence of *D*α. In row 2, we extended Equation (2) to allow for unequal sample sizes, i.e., *Np* ≠ *Nq*, as suggested by ([22], Appendix A); here:

$$D_\alpha^{\pi}(p,q) = H_\alpha\!\left(\pi_p\, p + \pi_q\, q\right) - \pi_p H_\alpha(p) - \pi_q H_\alpha(q), \quad \text{where } \pi_p = \frac{N_p}{N_p + N_q} \text{ and } \pi_q = \frac{N_q}{N_p + N_q} \tag{7}$$
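A minimal sketch of the natural-weights variant in Equation (7), reusing the `Counter`, NumPy, and `generalized_entropy` helpers from the sketch above (and again assuming a Tsallis-type *H*α), could look as follows:

```python
def d_alpha_pi(tokens_t, tokens_prev, alpha):
    """D^pi_alpha(p, q) of Equation (7): the mixture and the entropy terms are
    weighted by the relative sample sizes ("natural weights") instead of 1/2.
    Assumes Counter, np, and generalized_entropy from the previous sketch."""
    c_p, c_q = Counter(tokens_t), Counter(tokens_prev)
    n_p, n_q = sum(c_p.values()), sum(c_q.values())
    pi_p, pi_q = n_p / (n_p + n_q), n_q / (n_p + n_q)
    vocab = sorted(set(c_p) | set(c_q))
    p = np.array([c_p[w] for w in vocab], dtype=float) / n_p
    q = np.array([c_q[w] for w in vocab], dtype=float) / n_q
    m = pi_p * p + pi_q * q
    return (generalized_entropy(m, alpha)
            - pi_p * generalized_entropy(p, alpha)
            - pi_q * generalized_entropy(q, alpha))
```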

**Figure 5.** *D*α(*t*, *t* − 1) as a function of time for α = 1.00 and α = 2.00. Lines represent a symmetric 25-month window moving-average smoother highlighting the central tendency of the series at each point in time. Left: results for the original data in blue. Middle: results for the "Litmus" data in orange. Right: superimposition of both the original and the "Litmus" data.
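The smoothed lines in Figures 5 and 6 can be reproduced, for instance, with a centered rolling mean; in the following sketch, `d_series` is an illustrative placeholder for the monthly *D*α(*t*, *t* − 1) values as a pandas Series indexed by month.

```python
import pandas as pd

# d_series: monthly D_alpha(t, t-1) values as a pandas Series indexed by month.
# Empty placeholder here; in practice it would hold the output of d_alpha().
d_series = pd.Series(dtype=float)

# Symmetric 25-month window moving average, as used for the lines in Figure 5:
smoothed = d_series.rolling(window=25, center=True).mean()
```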

Row 2 of Table 4 demonstrates that this "natural weights" approach does not qualitatively affect the results: there is still a significant, strong negative correlation between the sample size and *D*πα for both α = 1.00 and α = 2.00. Another approach is to increase the sample size (if possible). To this end, we aggregated the articles at the annual level instead of the monthly level. On average, the annual corpora are *N* = 3,334,409.04 word tokens long, compared to *N* = 277,867.42 word tokens for the monthly data. Row 3 of Table 4 shows that increasing the sample size does not remove the influence of the sample size either. Another standard approach [15,22] is to randomly draw *Nmin* word tokens from the monthly databases, where *Nmin* is the size of the smallest monthly corpus, here *Nmin* = 75,819 (June 1947). To our surprise, row 4 of Table 4 reveals that this "random draw" approach also does not break the sample size dependence. While the absolute values of the correlation coefficients for both α = 1.00 and α = 2.00 are smaller for the original data than for the comparison data, all four coefficients are significantly different from 0 (at *p* < 0.001), indicating that the "random draw" approach fails the "Litmus test". As a last idea, we decided to truncate each monthly corpus after *Nmin* word tokens. The difference between this "cut-off" approach and the "random draw" approach is that the latter assumes that words occur randomly in texts, whereas truncating the data after *Nmin* word tokens respects the syntactic and semantic coherence and the discourse structure at the text level [16,17]. On the one hand, row 5 of Table 4 demonstrates that this approach mostly solves the problem: all four coefficients are small, and only one coefficient is significantly different from zero, and that coefficient is positive. This suggests that the "cut-off" approach passes the "Litmus test". On the other hand, it is worth pointing out that we lose a lot of information with this approach. For example, the largest monthly corpus is *N* = 507,542 word tokens long (October 2000); with the "cut-off" approach, more than 85% of those word tokens are not used to calculate *D*α(*t*, *t* − 1).
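A minimal sketch of the two equalization strategies, assuming `tokens` is the list of word tokens of one monthly corpus in its original text order (names and the fixed seed are illustrative):

```python
import random

N_MIN = 75_819  # size of the smallest monthly corpus (June 1947)

def random_draw(tokens, n_min=N_MIN, seed=0):
    """'Random draw': sample n_min word tokens without replacement, which
    implicitly treats words as occurring randomly in the text."""
    return random.Random(seed).sample(tokens, n_min)

def cut_off(tokens, n_min=N_MIN):
    """'Cut-off': keep the first n_min word tokens in their original order,
    preserving syntactic/semantic coherence and discourse structure."""
    return tokens[:n_min]
```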

**Table 4.** Spearman correlation between the sample size and *D*α(*t*, *t* − 1) for the original data and for the "Litmus test" for α = 1.00 and α = 2.00.


\* An asterisk indicates that the corresponding correlation coefficient passed the permutation test at *p* < 0.001.
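As a rough sketch of the significance test reported in Table 4 (the exact permutation scheme and the number of permutations are assumptions on our part), a permutation test compares the observed Spearman coefficient with a null distribution obtained by shuffling one of the two series:

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_permutation_test(sample_sizes, d_values, n_perm=10_000, seed=0):
    """Two-sided permutation test for the Spearman correlation between the
    monthly sample sizes and D_alpha(t, t-1)."""
    rng = np.random.default_rng(seed)
    d_values = np.asarray(d_values, dtype=float)
    rho_obs, _ = spearmanr(sample_sizes, d_values)
    null = np.empty(n_perm)
    for i in range(n_perm):
        rho_perm, _ = spearmanr(sample_sizes, rng.permutation(d_values))
        null[i] = rho_perm
    # two-sided p-value: share of shuffled coefficients at least as extreme
    p_value = np.mean(np.abs(null) >= abs(rho_obs))
    return rho_obs, p_value
```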

While the resulting pattern in Figure 6 might be indicative of an interesting lexico-dynamical process, especially for α = 1.00, what is more important in the present context is that both blue lines look completely different from the corresponding blue lines in Figure 5. Thus, in line with the analysis above (cf. Section 3.2), we concluded that the systematic sample size dependence of *D*α is far from being practically irrelevant. On the contrary, the analyses presented in this section demonstrate once more why it is essential to account for the sample size dependence of lexical statistics.

**Figure 6.** *D*α(*t*, *t* − 1) as a function of time for α = 1.00 and α = 2.00. Here, each monthly corpus is truncated after *Nmin* = 75,819 word tokens. Lines represent a symmetric 25-month window moving-average smoother highlighting the central tendency of the series at each point in time. Left: results for the original data in blue. Middle: results for the "Litmus" data in orange. Right: superimposition of both the original and the "Litmus" data.
