#### *4.1. Marginal Distributions*

Let us now study the word-length distribution, *f*(ℓ), shown in Figure 2. The distribution is clearly unimodal (with its maximum at ℓ = 7), and although it has been previously modeled as a lognormal [12], we get a nearly perfect fit using a gamma distribution,

$$f(\ell) = \frac{\lambda}{\Gamma(\gamma)} (\lambda \ell)^{\gamma - 1} e^{-\lambda \ell},\tag{1}$$

with shape parameter *γ* = 11.10 ± 0.02 and inverse scale (rate) parameter *λ* = 1.439 ± 0.003 (where the uncertainty corresponds to one standard deviation, and Γ(*γ*) denotes the gamma function). Notice that, for large lengths, this implies an exponential decay (asymptotically, strictly speaking). There is, however, an important difference between the lognormal distribution proposed in [13] and the gamma distribution found here: the former refers to the length of tokens, whereas in our case we deal with the length of types (a token and its type have, of course, the same length, but the relative number of tokens and types differs with length). This distinction was already made by Herdan [12], who used the terms occurrence distribution and dictionary distribution, and proposed that both were lognormal. In the caption of Figure 2 we provide the log-likelihoods of both the gamma and lognormal fits, concluding that the gamma distribution yields a better fit for the "dictionary distribution" of word lengths. The fit is especially good in the range ℓ > 2.

**Figure 2.** Probability mass function *f*(ℓ) of type length, together with gamma and lognormal fits. Note that the majority of types have lengths between 4 and 13, and that *f*(ℓ) is roughly constant between 5 and 10. The superiority of the gamma fit is visually apparent, and is confirmed by its log-likelihood, −872,175.2, compared to −876,535.1 for the lognormal (a discrete gamma distribution slightly improves the fit, but the simple continuous case is enough for our purposes). The parameters resulting from the gamma fit are given in the text; those of the lognormal are *μ* = 1.9970 ± 0.0005 and *σ* = 0.3081 ± 0.0003.
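As an aside for reproducibility, a minimal sketch of this kind of model comparison in Python is given below. This is an illustration, not the code used for this work: the array `lengths`, with one entry per type, is an assumed input, and the synthetic placeholder data must be replaced by the actual type lengths of the corpus.

```python
# Minimal sketch: compare gamma and lognormal fits (continuous approximation) to
# type lengths by maximum likelihood. The placeholder data below are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lengths = rng.gamma(shape=11.1, scale=1 / 1.439, size=100_000)  # placeholder input

# Gamma fit: fixing loc=0 leaves shape (gamma) and scale (1/lambda) as free parameters.
g_shape, _, g_scale = stats.gamma.fit(lengths, floc=0)
ll_gamma = stats.gamma.logpdf(lengths, g_shape, loc=0, scale=g_scale).sum()

# Lognormal fit: fixing loc=0 leaves sigma (shape) and exp(mu) (scale) as free parameters.
ln_sigma, _, ln_scale = stats.lognorm.fit(lengths, floc=0)
ll_lognorm = stats.lognorm.logpdf(lengths, ln_sigma, loc=0, scale=ln_scale).sum()

print(f"gamma:     shape={g_shape:.2f}, rate={1 / g_scale:.3f}, logL={ll_gamma:.1f}")
print(f"lognormal: mu={np.log(ln_scale):.4f}, sigma={ln_sigma:.4f}, logL={ll_lognorm:.1f}")
```

Fixing `floc=0` keeps two free parameters in each model, matching Equation (1) and the two-parameter lognormal, so the log-likelihoods are directly comparable.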

Regarding the other marginal distribution, the word-frequency distribution *f*(*n*) represented in Figure 3, we find that, as expected, Zipf's law is fulfilled, with *β* = 1.94 ± 0.03 for *n* ≥ 1.9 × 10<sup>5</sup> (almost three orders of magnitude); see Table 1. Another power-law regime in the bulk, as in [16], is found to hold for only one and a half orders of magnitude, from *a* ≈ 400 to *b* ≈ 14,000, with exponent *α* = 1.41 ± 0.005; see Table 2. Note that although the truncated power law for the bulk of the distribution spans a much shorter range than the one for the tail (1.5 orders of magnitude compared to almost 3), the former contains many more data (about 50,000 types compared to ~1000); see Tables 1 and 2 for the precise figures. Note also that the two power-law regimes for the frequency translate into two exponential regimes for *m* (the logarithm of *n*).

**Figure 3.** Probability mass function *f*(*n*) of type frequency (this is a marginal distribution with respect to *f*(ℓ, *n*)). The results of the power-law fits are also shown. The fit of a truncated continuous power law, maximizing the number of data, yields *α* = 1.41; the fit of an untruncated discrete power law yields *β* = 1.94.

**Table 1.** Results of the fitting of a discrete untruncated power law to the conditional distributions *f*(*n*|ℓ), denoted by a fixed value of ℓ, and to the marginal distribution *f*(*n*), denoted by the range 1 ≤ ℓ ≤ 20. *N* is the total number of types, *n*<sub>max</sub> is the frequency of the most frequent type, *c* is the lower cut-off of the fit, ordermag is log<sub>10</sub>(*n*<sub>max</sub>/*c*), *N<sub>c</sub>* is the number of types with *n* ≥ *c*, *β* is the resulting fitting exponent, *σ<sub>β</sub>* is its standard deviation, and *p* is the *p*-value of the fit. For the conditional distributions, the possible fits are restricted to the range *n* > ⟨*n*<sup>2</sup>|ℓ⟩/⟨*n*|ℓ⟩. The fit proceeds by sweeping 50 values of *c* per order of magnitude and using 1000 Monte Carlo simulations for the calculation of *p*. Of all the fits with *p* ≥ 0.20 for a given ℓ, the one with the smallest *c* is selected. Outside the range 5 ≤ ℓ ≤ 14, the number of types in the tail (below 10) is too low to yield a meaningful fit.
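To make this fitting procedure concrete, here is a minimal sketch, not the actual code used here: the function names, the bounded search interval for *β*, and the truncated-support sampler are assumptions. It performs the maximum-likelihood fit of a discrete untruncated power law above a cut-off *c*, with a Monte Carlo *p*-value based on the Kolmogorov-Smirnov distance.

```python
# Sketch: MLE of a discrete power law f(n) ~ n**(-beta) for n >= c, plus a
# Monte Carlo p-value from the Kolmogorov-Smirnov (KS) distance.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function

def fit_beta(data, c):
    """MLE of beta on the tail n >= c."""
    tail = np.asarray(data, dtype=float)
    tail = tail[tail >= c]
    nll = lambda b: b * np.log(tail).sum() + tail.size * np.log(zeta(b, c))
    return minimize_scalar(nll, bounds=(1.05, 4.0), method="bounded").x, tail

def ks_statistic(tail, beta, c):
    """KS distance between the empirical CDF and the fitted discrete power-law CDF."""
    values = np.unique(tail)                       # sorted unique observed frequencies
    emp = np.searchsorted(np.sort(tail), values, side="right") / tail.size
    theo = 1.0 - zeta(beta, values + 1) / zeta(beta, c)
    return np.abs(emp - theo).max()

def mc_pvalue(tail, beta, c, n_sims=1000, support_max=10**6, rng=None):
    """Fraction of synthetic power-law samples with a KS distance at least as large."""
    rng = np.random.default_rng() if rng is None else rng
    support = np.arange(c, support_max)            # truncated support (approximation;
    probs = support.astype(float) ** (-beta)       # an exact sampler would invert the CDF)
    probs /= probs.sum()
    d_emp = ks_statistic(tail, beta, c)
    worse = 0
    for _ in range(n_sims):
        sim = rng.choice(support, size=tail.size, p=probs)
        b_sim, sim_tail = fit_beta(sim, c)         # refit each synthetic sample
        worse += ks_statistic(sim_tail, b_sim, c) >= d_emp
    return worse / n_sims
```

On top of this, the full procedure sweeps 50 values of *c* per order of magnitude and keeps, among the fits with *p* ≥ 0.20, the one with the smallest *c*.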


**Table 2.** Results of the fitting of a truncated power law to the conditional distributions *f*(*n*|ℓ), denoted by a fixed value of ℓ, and to the marginal distribution *f*(*n*), denoted by the range 1 ≤ ℓ ≤ 20. *N* is the total number of types; *a* and *b* are the lower and upper cut-offs of the fit, respectively; *N<sub>ab</sub>* is the number of types with *a* ≤ *n* ≤ *b*; *α* is the resulting fitting exponent; *σ<sub>α</sub>* is its standard deviation; and *p* is the *p*-value of the fit. The fit of a continuous power law is attempted in the range *n* < 0.1⟨*n*<sup>2</sup>|ℓ⟩/⟨*n*|ℓ⟩, sweeping 20 values of *a* and *b* per order of magnitude and using 1000 Monte Carlo simulations for the calculation of *p*. Of all the fits with *p* ≥ 0.20 for a given ℓ, the one with the largest *b*/*a* is selected, except for *f*(*n*), where the one with the largest *N<sub>ab</sub>* is used.
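Analogously, a minimal sketch of the maximum-likelihood fit of a continuous power law truncated to [*a*, *b*] follows (again illustrative; the names and the bounded search interval are assumptions).

```python
# Sketch: MLE of a continuous power law f(n) ~ n**(-alpha) restricted to a <= n <= b.
import numpy as np
from scipy.optimize import minimize_scalar

def fit_alpha(data, a, b):
    """MLE of alpha for the truncated continuous power law on [a, b]."""
    x = np.asarray(data, dtype=float)
    x = x[(x >= a) & (x <= b)]
    log_sum = np.log(x).sum()

    def nll(alpha):
        # log of the normalizing constant of the truncated density (alpha = 1 as a limit)
        if np.isclose(alpha, 1.0):
            log_norm = np.log(np.log(b / a))
        else:
            log_norm = np.log((a**(1.0 - alpha) - b**(1.0 - alpha)) / (alpha - 1.0))
        return alpha * log_sum + x.size * log_norm

    return minimize_scalar(nll, bounds=(0.5, 4.0), method="bounded").x, x.size
```

In the full procedure, *a* and *b* are swept (20 values per order of magnitude) and, among the fits with *p* ≥ 0.20, the one with the largest *b*/*a* (or the largest *N<sub>ab</sub>* for the marginal distribution) is kept; the *p*-value is obtained by Monte Carlo simulation of the truncated law, as in the previous sketch.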


#### *4.2. Power Laws and Scaling Law for the Conditional Distributions*

As mentioned, the conditional word-frequency distributions *f*(*n*|ℓ) are of substantial relevance. In Figure 4, we display some of these functions, and it turns out that *n* is broadly distributed for each value of ℓ (roughly in the same qualitative way as without conditioning on the value of ℓ). Remarkably, the results of a scaling analysis [21,29], depicted in Figure 5, show that all the different *f*(*n*|ℓ) (for 3 ≤ ℓ ≤ 14) share a common shape, with a scale determined by a scale parameter in frequency. Indeed, rescaling *n* as *n*⟨*n*|ℓ⟩/⟨*n*<sup>2</sup>|ℓ⟩ and *f*(*n*|ℓ) as *f*(*n*|ℓ)⟨*n*<sup>2</sup>|ℓ⟩<sup>2</sup>/⟨*n*|ℓ⟩<sup>3</sup>, where the first and second empirical moments, ⟨*n*|ℓ⟩ and ⟨*n*<sup>2</sup>|ℓ⟩, are also conditioned on the value of ℓ, we obtain an impressive data collapse, valid over ~7 orders of magnitude in *n*, which allows us to write the scaling law

$$f(n|\ell) \simeq \frac{\langle n|\ell\rangle^3}{\langle n^2|\ell\rangle^2}\, g\!\left(\frac{n\langle n|\ell\rangle}{\langle n^2|\ell\rangle}\right) \text{ for } 3 \le \ell \le 14,$$

where the key point is that the scaling function *g*(...) is the same function for any value of ℓ. For ℓ > 14 the statistics are low and the fulfilment of the scaling law becomes uncertain. Defining the scale parameter *θ*(ℓ) = ⟨*n*<sup>2</sup>|ℓ⟩/⟨*n*|ℓ⟩, we get alternative expressions for the same scaling law,

$$f(n|\ell) \simeq \frac{\langle n|\ell\rangle}{\theta^2(\ell)}\, g\!\left(\frac{n}{\theta(\ell)}\right) \propto \frac{1}{\theta^{\alpha}(\ell)}\, g\!\left(\frac{n}{\theta(\ell)}\right) \text{ for } 3 \le \ell \le 14,$$

where constants of proportionality have been reabsorbed into *g*, and the scale parameter has to be understood as proportional to a characteristic scale of the conditional distributions (i.e., *θ* is the characteristic scale, up to a constant factor; it is the relative change of *θ* that will matter for us). The reason why these relations hold is the power-law dependence between the moments and the scale parameter when a scaling law is fulfilled: ⟨*n*|ℓ⟩ ∝ *θ*<sup>2−*α*</sup> and ⟨*n*<sup>2</sup>|ℓ⟩ ∝ *θ*<sup>3−*α*</sup> for 1 < *α* < 2; see [21,29].
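Schematically, the moment-based rescaling behind the data collapse of Figure 5 can be sketched as follows (a minimal illustration; the input format, a list of (length, frequency) pairs with one pair per type, is an assumption):

```python
# Sketch: for each length, build the empirical pmf of n and rescale both axes with
# the conditional moments <n|l> and <n^2|l>.
import numpy as np
from collections import Counter

def collapse(types, lengths=range(3, 15)):
    """`types` is assumed to be a list of (length, frequency) pairs, one per type."""
    curves = {}
    for ell in lengths:
        n = np.array([freq for l, freq in types if l == ell], dtype=float)
        if n.size == 0:
            continue
        m1, m2 = n.mean(), (n**2).mean()                # <n|l> and <n^2|l>
        counts = Counter(n)                              # empirical pmf of n given l
        x = np.array(sorted(counts))
        f = np.array([counts[v] for v in x]) / n.size
        # rescaled frequency n<n|l>/<n^2|l> and rescaled pmf f(n|l)<n^2|l>^2/<n|l>^3
        curves[ell] = (x * m1 / m2, f * m2**2 / m1**3)
    return curves
```

Plotting all the returned curves on common logarithmic axes should reproduce a collapse like that of Figure 5 whenever the scaling law holds.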

**Figure 4.** Probability mass functions *f*(*n*|ℓ) of frequency *n* conditioned to a fixed value of length ℓ, for several values of ℓ. Distributions are shown twice: all together and individually, displaced along the vertical axis by factors 10<sup>−3</sup>, 10<sup>−4</sup>, ..., down to 10<sup>−8</sup>, for the sake of clarity of the power-law fits, represented by dark continuous lines.

**Figure 5.** Word-frequency probability mass functions *f*(*n*|ℓ) conditioned to a fixed value of length, rescaled by the corresponding ratios of powers of the moments, as a function of the rescaled frequency, for all values of length from 3 to 14. The data collapse guarantees the fulfilment of a scaling law.

The data collapse also unveils more clearly the functional form of the scaling function *g*, allowing us to fit its power-law shape in two different ranges. The scaling function turns out to be compatible with a double power-law distribution, i.e., a (long) power law for *n*/*θ* < 0.1 with exponent *α* ≈ 1.4 and another (short) power law for *n*/*θ* > 1 with exponent *β* ≈ 2.75; in one formula,

$$g(y) \propto \begin{cases} 1/y^{1.4} & \text{for } y \ll 1, \\ 1/y^{2.75} & \text{for } y > 1, \end{cases} \tag{2}$$

for *y* = *n*/*θ*. In other words, there is a (smooth) change of exponent (a change of log-log slope) at a value *n* ≈ *Cθ*(ℓ), with the proportionality constant *C* taking some value between 0.1 and 1 (as the transition from one regime to the other is smooth, there is no well-defined value of *C* that separates both). Fitting power laws to those ranges, we get the results shown in Tables 1 and 2. Note that *Cθ*(ℓ) can be understood as the characteristic scale of *f*(*n*|ℓ) mentioned before, and can also be called a frequency crossover.

Nevertheless, although the power-law regime for intermediate frequencies (*n* < 0.1*θ*) is very clear, the validity of the other power law (the one for large frequencies) is questionable, in the sense that the power law provides an "acceptable" fit but other distributions could do an equally good job, due to the limited range spanned by the tail (less than one order of magnitude). Our main reason to fit a power law to the large-frequency regime is the comparison with Zipf's law (*β* ≈ 2), and, as we see, the resulting value of *β* for *f*(*n*|ℓ) turns out to be rather large (the values of *β* for all the *f*(*n*|ℓ) turn out to be statistically compatible with *β* = 2.75). In addition, we will show in the next subsection that the high-frequency behavior of the conditional distributions (power law or not) has nothing to do with Zipf's law.

#### *4.3. Brevity Law and Possible Origin of Zipf's Law*

Coming back to the scaling law, its fulfillment has an important consequence: it is the scale parameter *θ*(ℓ), and not the conditional mean ⟨*n*|ℓ⟩, that sets the scale of the conditional distributions *f*(*n*|ℓ). Figure 6 represents the brevity law in terms of the scale parameter as a function of ℓ (the conditional mean value is also shown, for comparison, superimposed on maps of *f*(*n*, ℓ) and *f*(*n*|ℓ)). Note that the authors of [13] dealt with the conditional mean, finding an exponential decay ⟨*n*|ℓ⟩ ∝ 2<sup>6−0.6ℓ</sup>. Using our corpus (which is certainly different), we find that such an exponential decay for the mean is valid in a range of ℓ between 1 and 5, approximately. In contrast, the scale parameter *θ* shows an approximate power-law decay from about ℓ = 6 to 15, with an exponent *δ* around 3 (or 2.8, to be more precise), i.e.,

$$
\theta(\ell) \propto \frac{1}{\ell^{\delta}}
$$

(note that Herdan assumed this exponent to be 2.4, with no clear empirical support [12]). Beyond ℓ = 15, the decay of *θ*(ℓ) is much faster. Nevertheless, these results are somewhat qualitative.
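A minimal sketch of this estimation is given below (same assumed input format as above; the fitting range and the plain least-squares fit in log-log scale are illustrative choices, not the exact procedure of this work).

```python
# Sketch: estimate theta(l) = <n^2|l>/<n|l> per length and the brevity exponent delta
# from a least-squares fit of log(theta) versus log(l) over the quasi-power-law range.
import numpy as np

def brevity_exponent(types, ell_min=6, ell_max=15):
    """`types` is assumed to be a list of (length, frequency) pairs, one per type."""
    ells, thetas = [], []
    for ell in range(ell_min, ell_max + 1):
        n = np.array([freq for l, freq in types if l == ell], dtype=float)
        if n.size:
            ells.append(ell)
            thetas.append((n**2).mean() / n.mean())    # scale parameter theta(l)
    slope = np.polyfit(np.log(ells), np.log(thetas), 1)[0]
    return -slope                                      # delta: theta(l) ~ l**(-delta)
```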

With these limitations, we could write a new version of the scaling law as

$$f(n|\ell) \simeq \ell^{\delta\alpha}\, g\!\left(\ell^{\delta} n\right), \tag{3}$$

where the proportionality constant between *θ* and ℓ<sup>−*δ*</sup> has been reabsorbed into the scaling function *g*. The corresponding data collapse is shown in Figure 7, for 5 ≤ ℓ ≤ 14. Despite the rough approximation provided by the power-law decay of *θ*(ℓ), the data collapse in terms of scaling law (3) is nearly excellent for *δ* = 2.8. This version of the scaling law provides a clean formulation of the brevity law: the characteristic scale of the distribution of *n* conditioned to the value of ℓ decays with increasing ℓ as 1/ℓ<sup>*δ*</sup>; i.e., the larger ℓ, the shorter the conditional distribution *f*(*n*|ℓ), as quantified by the exponent *δ*.
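For completeness, the corresponding rescaling of Figure 7, with powers of ℓ instead of moments, can be sketched in the same way (same assumed input as before; *δ* = 2.8 and *α* ≈ 1.4 as estimated in the text):

```python
# Sketch: same collapse as before, but with the scale factors of Equation (3).
import numpy as np
from collections import Counter

def collapse_by_length(types, delta=2.8, alpha=1.4, lengths=range(5, 15)):
    curves = {}
    for ell in lengths:
        n = np.array([freq for l, freq in types if l == ell], dtype=float)
        if n.size == 0:
            continue
        counts = Counter(n)
        x = np.array(sorted(counts))
        f = np.array([counts[v] for v in x]) / n.size
        # rescale the frequency by l**delta and the pmf by l**(-delta*alpha)
        curves[ell] = (x * ell**delta, f * ell**(-delta * alpha))
    return curves
```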

However, in addition to a new understanding of the brevity law, the scaling law in terms of ℓ provides, as a by-product, an empirical explanation of the origin of Zipf's law. In the regime of ℓ in which the scaling law is approximately valid, i.e., for ℓ<sub>1</sub> ≤ ℓ ≤ ℓ<sub>2</sub>, we can obtain the distribution of frequency as a mixture of conditional distributions (by the law of total probability),

$$f(n) = \int_{\ell_1}^{\ell_2} f(n|\ell)\, f(\ell)\, d\ell$$

(where we take a continuous approximation, replacing the sum over ℓ by an integral; this is essentially a mathematical rephrasing). Substituting the scaling law and introducing the change of variables *x* = ℓ<sup>*δ*</sup>*n*, we get

$$f(n) = \int_{\ell_1}^{\ell_2} \ell^{\delta\alpha}\, g\!\left(\ell^{\delta} n\right) f(\ell)\, d\ell \propto \int_{\ell_1^{\delta} n}^{\ell_2^{\delta} n} \left(\frac{x}{n}\right)^{\alpha} g(x)\, \frac{x^{-1 + 1/\delta}}{n^{1/\delta}}\, dx$$

$$= \frac{1}{n^{\alpha + 1/\delta}} \int_{\ell_1^{\delta} n}^{\ell_2^{\delta} n} x^{\alpha - 1 + 1/\delta}\, g(x)\, dx,$$

where we have also taken advantage of the fact that, in the region of interest, *f*(ℓ) can be considered (as a rough approximation) constant.
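For clarity, the change of variables can be spelled out: with *x* = ℓ<sup>*δ*</sup>*n* at fixed *n*,

$$\ell = \left(\frac{x}{n}\right)^{1/\delta}, \qquad d\ell = \frac{1}{\delta}\,\frac{x^{1/\delta - 1}}{n^{1/\delta}}\,dx, \qquad \ell^{\delta\alpha} = \left(\frac{x}{n}\right)^{\alpha},$$

which produces the integration limits ℓ<sub>1</sub><sup>*δ*</sup>*n* and ℓ<sub>2</sub><sup>*δ*</sup>*n* and the prefactor 1/*n*<sup>*α*+1/*δ*</sup>, with the remaining constant 1/*δ* absorbed into the proportionality.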

**Figure 6.** Estimated value of the scale parameter *θ* of the conditional frequency distributions (*θ* = ⟨*n*<sup>2</sup>|ℓ⟩/⟨*n*|ℓ⟩) as a function of type length ℓ, together with the conditional mean value ⟨*n*|ℓ⟩. A decaying power law with exponent 2.8, shown as a guide to the eye, is close to the values of the scale parameter for 6 ≤ ℓ ≤ 13. The curves are superimposed on the values of the joint distribution *f*(*n*, ℓ) in the **top panel** and on the conditional distribution *f*(*n*|ℓ) in the **bottom panel**. Notice that in the latter case both axes are logarithmic. The darker the green color, the higher the value of *f*(*n*, ℓ) and *f*(*n*|ℓ).

From here, we can see that when the frequency is small (*n* ≪ *θ*(ℓ<sub>2</sub>)), the integration limits are also small, and then the last integral scales with *n* as *n*<sup>1/*δ*</sup> (because *g*(*x*) ∝ 1/*x<sup>α</sup>* there), which implies that we recover a power law with exponent *α* for *f*(*n*), i.e., *f*(*n*) ∝ 1/*n<sup>α</sup>*. However, for larger frequencies (*n* above *θ*(ℓ<sub>2</sub>) but below *θ*(ℓ<sub>1</sub>)), the integral no longer scales with *n* but can instead be considered constant, and then we get Zipf's law as

$$f(n) \propto \frac{1}{n^{\alpha + 1/\delta}}.$$

This means that Zipf's exponent can be obtained from the values of the intermediate-frequency power-law conditional exponent *α* and the brevity exponent *δ* as

$$
\beta_z = \alpha + \frac{1}{\delta},
$$

where we have introduced the subscript *z* in *β* to stress that this is the *β* exponent appearing in Zipf's law, corresponding to the marginal distribution *f*(*n*), and to distinguish it from the one of the conditional distributions, which we may call *β<sub>c</sub>*. Note then that *β<sub>c</sub>* plays no role in the determination of *β<sub>z</sub>*, and, in fact, the scaling function does not need to have a power-law tail in order to obtain Zipf's law. This sort of argument is similar to the one used in statistical seismology [32], but in that case the scaling law was elementary (i.e., *θ* = ⟨*n*|ℓ⟩).
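This tail-independence can be illustrated numerically with a synthetic sketch (not the authors' computation): the mixture is built with *θ*(ℓ) ∝ ℓ<sup>−*δ*</sup>, a constant *f*(ℓ), and a scaling function whose large-argument decay is exponential rather than a power law; the measured tail exponent of the mixture still comes out close to *α* + 1/*δ*. The length range is exaggerated because the finite-range correction decays only as *n*<sup>1/*δ*</sup>.

```python
# Synthetic illustration: mixture of conditional distributions with theta(l) ~ l**(-delta),
# constant f(l), and a scaling function g with an exponential (non-power-law) tail.
# The mixture still develops a power-law tail with exponent close to alpha + 1/delta.
import numpy as np

alpha, delta = 1.4, 2.8
ell = np.linspace(2.0, 1000.0, 5000)          # exaggerated length range (see text above)
d_ell = ell[1] - ell[0]

def g(y):
    return y**(-alpha) * np.exp(-y)           # power law at small y, exponential cutoff

def f_mix(n):
    # f(n) ~ int l**(delta*alpha) g(l**delta * n) dl, with f(l) taken as constant
    return np.sum(ell**(delta * alpha) * g(ell**delta * n)) * d_ell

n_vals = np.logspace(-6, -4, 25)              # well inside the "Zipf regime" of the mixture
slope = np.polyfit(np.log(n_vals), np.log([f_mix(n) for n in n_vals]), 1)[0]
print(f"measured tail exponent: {-slope:.2f}; alpha + 1/delta = {alpha + 1/delta:.2f}")
```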

**Figure 7.** Same as Figure 5, from ℓ = 5 to 14, changing the scale factors from combinations of powers of the moments (⟨*n*|ℓ⟩ and ⟨*n*<sup>2</sup>|ℓ⟩) to powers of the length (specifically, ℓ<sup>−*δ*</sup> and ℓ<sup>*δα*</sup>). The collapse signals the fulfilment of a scaling law. Two decreasing power laws with exponents 1.43 and 2.76 are shown as straight lines, for comparison.

We can check the previous exponent relation using the empirical values of the exponents. We do not have a unique measure of *α*, but from Table 2 we see that its value for the different *f*(*n*|ℓ) is quite well defined. Taking the harmonic mean of the values for 4 ≤ ℓ ≤ 14, we get ᾱ = 1.43, which together with *δ* = 2.8 leads to *β<sub>z</sub>* ≈ 1.79, not far from the ideal Zipf value *β<sub>z</sub>* = 2 and closer to the empirical value *β<sub>z</sub>* = 1.94. The reason to calculate the harmonic mean of the exponents is that it is the maximum-likelihood outcome when untruncated power-law datasets are put together [33]; when the power laws are truncated, the result is close to the untruncated case as long as the range *b*/*a* is large.
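In code, this consistency check is a few lines (the *α* values below are placeholders standing in for the per-length results of Table 2):

```python
# Consistency check of beta_z = alpha_bar + 1/delta, with alpha_bar the harmonic mean
# of the conditional exponents (placeholder values below).
import numpy as np

alphas = np.array([1.40, 1.42, 1.43, 1.44, 1.45])     # placeholder conditional exponents
alpha_bar = alphas.size / np.sum(1.0 / alphas)         # harmonic mean, as in [33]
delta = 2.8
print(f"alpha_bar = {alpha_bar:.2f}, beta_z = {alpha_bar + 1 / delta:.2f}")
```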
