**2. Preliminary Considerations**

Given a sample of natural language (a text, a fragment of speech, or, in general, a corpus), any word type (i.e., each unique word) has an associated word length, which we measure in number of characters (as we deal with a written corpus), and an associated absolute word frequency, which is the number of occurrences of the word type in the corpus under consideration (i.e., the number of tokens of the type). We denote these two random variables as *ℓ* and *n*, respectively.

Zipf's law of word frequency is written as a power-law relation between *f*(*n*) and *n* [6], i.e.,

$$f(n) \propto \frac{1}{n^{\beta}} \text{ for } n \ge c,$$

where *f*(*n*) is the empirical probability mass function of the word frequency *n*, the symbol ∝ denotes proportionality, *β* is the power-law exponent, and *c* is a lower cut-off below which the law loses its validity (so, Zipf's law is a high-frequency phenomenon). The exponent *β* typically takes values close to 2. When very large corpora are analyzed (made from many different texts and authors), another (additional) power-law regime appears at smaller frequencies [16,17],

$$f(n) \propto \frac{1}{n^{\alpha}} \text{ for } a \le n \le b,$$

with *α* a new power-law exponent, smaller than *β*, and with *a* and *b* the lower and upper cut-offs, respectively (with *a* < *b* < *c*). This second power law is not identified with Zipf's law.
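As a rough illustration of how such an exponent can be estimated, the following sketch applies the standard approximate maximum-likelihood formula for a discrete power law with a known lower cut-off; the function name and the synthetic data are ours, and the full fitting procedure we actually use is the one described in Section 3.

```python
import numpy as np

def zipf_beta_mle(freqs, c):
    """Approximate maximum-likelihood estimate of the exponent beta of a
    discrete power law f(n) ~ 1/n^beta, restricted to frequencies n >= c.
    Uses the standard continuous-approximation formula."""
    n = np.asarray(freqs, dtype=float)
    n = n[n >= c]
    return 1.0 + len(n) / np.sum(np.log(n / (c - 0.5)))

# Toy check on synthetic "frequencies" drawn from a power law with beta = 2
# (inverse-transform sampling of a continuous power law, rounded up):
rng = np.random.default_rng(0)
sample = np.ceil(10.0 * (1.0 - rng.random(50_000)) ** (-1.0 / (2.0 - 1.0)))
print(zipf_beta_mle(sample, c=10))  # expected to be close to 2
```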

On the other hand, the law of word lengths [12] proposes a lognormal distribution for the empirical probability mass function of word lengths, that is,

$$f(\ell) \sim \text{LN}(\mu, \sigma^2),$$

where LN denotes a lognormal distribution whose associated normal distribution has mean *μ* and variance *σ*<sup>2</sup> (note that, with the lognormal assumption, it would seem that one is taking a continuous approximation for *f*(*ℓ*); nevertheless, discreteness of *f*(*ℓ*) is still possible just by redefining the normalization constant). The present paper challenges the lognormal law for *f*(*ℓ*). Finally, the brevity law [14] can be summarized as

$$\text{corr}(\ell, n) < 0,$$

where corr(*ℓ*, *n*) is a correlation measure between *ℓ* and *n*, such as, for instance, the Pearson, Spearman, or Kendall correlation.
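For concreteness, the three correlation measures, together with a moment-based fit of the lognormal word-length law, can be computed as in the sketch below; the toy type list is ours, standing in for the corpus dictionary of Section 3.

```python
import numpy as np
from scipy import stats

# Toy type list: one (length, frequency) pair per word type; real values
# would come from the corpus dictionary described in Section 3.
rng = np.random.default_rng(1)
lengths = rng.binomial(19, 0.35, size=10_000) + 1          # lengths 1..20
freqs = np.ceil(rng.pareto(1.5, size=10_000) * 100 / lengths) + 1

# Brevity law: all three measures should come out negative here.
for name, fn in [("Pearson", stats.pearsonr),
                 ("Spearman", stats.spearmanr),
                 ("Kendall", stats.kendalltau)]:
    print(f"{name}: corr(length, freq) = {fn(lengths, freqs)[0]:.3f}")

# Moment-based fit of the lognormal word-length law (continuous
# approximation): mu and sigma of the associated normal distribution.
log_l = np.log(lengths)
print(f"mu = {log_l.mean():.3f}, sigma = {log_l.std(ddof=1):.3f}")
```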

We claim that a more complete approach to the relationship between word length and word frequency can be obtained from the joint probability distribution *f*(*ℓ*, *n*) of both variables, together with the associated conditional distributions *f*(*n*|*ℓ*). To be more precise, *f*(*ℓ*, *n*) is the joint probability mass function of type length and frequency, and *f*(*n*|*ℓ*) is the probability mass function of type frequency conditioned on a fixed length. Naturally, the word-frequency distribution *f*(*n*) and the word-length distribution *f*(*ℓ*) are just the two marginal distributions of *f*(*ℓ*, *n*).

The relationships between these quantities are

$$f(\ell) = \sum\_{n=1}^{\infty} f(\ell, n),$$

$$f(n) = \sum\_{\ell=1}^{\infty} f(\ell, n),$$

$$f(\ell, n) = f(n|\ell)f(\ell).$$

Note that, for sampling reasons, we will not use in this paper the equivalent relation *f*(*ℓ*, *n*) = *f*(*ℓ*|*n*)*f*(*n*): *n* takes many more different values than *ℓ*, so for fixed values of *n* there may not be enough statistics to obtain *f*(*ℓ*|*n*). Obviously, all probability mass functions fulfil normalization,

$$\sum\_{\ell=1}^{\infty} \sum\_{n=1}^{\infty} f(\ell, n) = \sum\_{n=1}^{\infty} f(n|\ell) = \sum\_{\ell=1}^{\infty} f(\ell) = \sum\_{n=1}^{\infty} f(n) = 1.$$
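A minimal sketch of how these relations translate into computation, assuming a hypothetical list of (length, frequency) pairs, one per type, follows; the normalization checks mirror the equation above.

```python
from collections import Counter

# Hypothetical type list: one (length, frequency) pair per word type.
types = [(3, 500), (3, 500), (3, 40), (5, 40), (5, 12), (7, 12)]

N = len(types)
f_joint = {k: v / N for k, v in Counter(types).items()}   # f(ell, n)

f_len, f_freq = Counter(), Counter()                      # marginals
for (ell, n), p in f_joint.items():
    f_len[ell] += p                                       # f(ell)
    f_freq[n] += p                                        # f(n)

# Conditional distribution f(n|ell) = f(ell, n) / f(ell):
f_cond = {(ell, n): p / f_len[ell] for (ell, n), p in f_joint.items()}

# Normalization checks, as in the equation above:
assert abs(sum(f_joint.values()) - 1) < 1e-12
assert abs(sum(f_len.values()) - 1) < 1e-12
assert abs(sum(f_freq.values()) - 1) < 1e-12
for ell in f_len:
    assert abs(sum(p for (l, _), p in f_cond.items() if l == ell) - 1) < 1e-12
```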

We stress that, in our framework, each type yields one instance of the bivariate random variable (*ℓ*, *n*), in contrast to an alternative approach in which it is each token that gives one instance of the (perhaps different) random variables; see [7]. The choice of approach has important consequences for the formulation of Zipf's law, as is well known [7], and for the formulation of the word-length law (as is less well known [12]). Moreover, our bivariate framework is certainly different from that in [18], where the frequency was understood as a four-variate distribution with the random variables taking 26 values from *a* to *z*, and also from the generalization in [19].

**3. Corpus and Statistical Methods**

We investigate the joint probability distribution of word-type length and frequency empirically, using all English books in the recently presented Standardized Project Gutenberg Corpus [20], which comprises more than 40,000 books in English, with a total number of tokens equal to 2,016,391,406 and a total number of types equal to 2,268,043. We disregard types with *n* < 10 (relative frequency below 5 × 10<sup>−9</sup>) and also those not composed exclusively of the 26 usual letters from *a* to *z* (capital letters were previously transformed to lower-case). This sub-corpus is further reduced by the elimination of types with length above 20 characters, to avoid typos and "spurious" words (among the eliminated types with *n* ≥ 10 we only find three true English words: *incomprehensibilities*, *crystalloluminescence*, and *nitrosodimethylaniline*). This reduces the numbers of tokens and types, respectively, to 2,010,440,020 and 391,529. Thus, all we need for our study is the list of all types (a dictionary) together with their absolute frequencies *n* and their lengths *ℓ* (measured in number of characters).
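A minimal sketch of this preprocessing, assuming a generic token stream rather than the actual Gutenberg pipeline (the function name and defaults are illustrative), could read:

```python
import re
from collections import Counter

ONLY_AZ = re.compile(r"^[a-z]+$")

def build_type_list(tokens, min_freq=10, max_len=20):
    """Reduce a token stream to the {type: frequency} dictionary used in
    the analysis: lower-case everything, keep only types made of the 26
    letters a-z, and drop rare and overlong types."""
    counts = Counter(t.lower() for t in tokens)
    return {w: n for w, n in counts.items()
            if n >= min_freq and len(w) <= max_len and ONLY_AZ.match(w)}

# Toy usage; the real input is the token stream of the Standardized
# Project Gutenberg Corpus [20]:
toy = ["The", "the", "the", "cat", "cat", "sat", "O'Neill"] * 5
print(build_type_list(toy))  # {'the': 15, 'cat': 10}
```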

Power-law distributions are fitted to the empirical data using the version for discrete random variables of the method for continuous distributions outlined in [21] and developed in Refs. [22,23], which is based on maximum-likelihood estimation and the Kolmogorov–Smirnov goodness-of-fit test. Acceptable (i.e., non-rejectable) fits require *p*-values not below 0.20, which are computed with 1000 Monte Carlo simulations. Complete details for the discrete case are available in Refs. [6,24]. This method is similar in spirit to the one by Clauset et al. [25], but avoids some of the important problems that the latter presents [26,27]. Histograms are drawn to provide visual intuition for the shape of the empirical probability mass functions and for the adequacy of the fits; in the case of *f*(*n*|*ℓ*) and *f*(*n*), we use logarithmic binning [22,28]. Nevertheless, the computation of the fits does not make use of the graphical representation of the distributions.
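The following simplified sketch conveys the spirit of the procedure (maximum likelihood plus a Kolmogorov–Smirnov test with Monte Carlo *p*-values). It fixes the lower cut-off, truncates the support, and defaults to fewer simulations than the 1000 quoted above, so it is an illustration under our own assumptions, not the full method of Refs. [22,23].

```python
import numpy as np
from scipy.special import zeta
from scipy.optimize import minimize_scalar

def fit_beta(data, c):
    """MLE of beta for the discrete power law p(n) = n^-beta / zeta(beta, c),
    defined for n >= c; zeta(s, q) is the Hurwitz zeta function."""
    data = np.asarray(data, dtype=float)
    data = data[data >= c]
    nll = lambda b: len(data) * np.log(zeta(b, c)) + b * np.sum(np.log(data))
    return minimize_scalar(nll, bounds=(1.05, 5.0), method="bounded").x

def ks_stat(data, beta, c, n_max=10**6):
    """Kolmogorov-Smirnov distance between the empirical CDF and the
    fitted CDF, with the fitted distribution truncated at n_max."""
    support = np.arange(c, n_max + 1, dtype=float)
    pmf = support ** (-beta)
    cdf = np.cumsum(pmf / pmf.sum())
    emp = np.searchsorted(np.sort(data), support, side="right") / len(data)
    return np.max(np.abs(emp - cdf))

def power_law_pvalue(data, c, n_sims=100, n_max=10**6, seed=0):
    """Monte Carlo p-value: simulate from the fitted law and refit each
    synthetic sample, counting how often its KS distance exceeds the
    empirical one."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    data = data[data >= c]
    beta = fit_beta(data, c)
    d_emp = ks_stat(data, beta, c, n_max)
    support = np.arange(c, n_max + 1, dtype=float)
    pmf = support ** (-beta)
    pmf /= pmf.sum()
    d_sim = np.empty(n_sims)
    for i in range(n_sims):
        sim = rng.choice(support, size=len(data), p=pmf)
        d_sim[i] = ks_stat(sim, fit_beta(sim, c), c, n_max)
    return beta, d_emp, float(np.mean(d_sim >= d_emp))
```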

On the other hand, the theory of scaling analysis, following the authors of [21,29], allows us to compare the shape of the conditional distributions *f*(*n*|*ℓ*) for different values of *ℓ*. This theory has proven to be a very powerful tool in quantitative linguistics, having allowed previous research to show that the shape of the word-frequency distribution does not change as a text increases in length [30,31].
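As an illustration of the data-collapse idea behind scaling analysis, the sketch below rescales each conditional distribution by its mean, under the common ansatz *f*(*n*|*ℓ*) ≈ ⟨*n*⟩<sup>−1</sup> *g*(*n*/⟨*n*⟩); the binning routine, the toy data, and the choice of the mean as scale parameter are our assumptions, not necessarily those of Refs. [21,29].

```python
import numpy as np
import matplotlib.pyplot as plt

def log_binned_pmf(data, bins_per_decade=5):
    """Estimate a probability mass function with logarithmic binning [22,28]."""
    lo, hi = np.log10(data.min()), np.log10(data.max())
    edges = np.logspace(lo, hi, int((hi - lo) * bins_per_decade) + 2)
    counts, edges = np.histogram(data, bins=edges)
    centers = np.sqrt(edges[:-1] * edges[1:])        # geometric bin centers
    dens = counts / (np.diff(edges) * len(data))
    keep = counts > 0
    return centers[keep], dens[keep]

# Toy conditional samples: frequencies n of all types of length ell, built
# so that (approximately) only the scale of n decreases with ell.
rng = np.random.default_rng(2)
freqs_by_len = {ell: np.ceil(rng.pareto(1.4, 5000) * 200 / ell) + 9
                for ell in (2, 5, 9, 14)}

for ell, n in freqs_by_len.items():
    mean = n.mean()
    x, dens = log_binned_pmf(n)
    # Rescaling both axes by the conditional mean: if the curves collapse
    # onto a single one, f(n|ell) has the same shape for all ell.
    plt.loglog(x / mean, dens * mean, "o-", label=f"length {ell}")

plt.xlabel("n / <n>")
plt.ylabel("<n> f(n|length)")
plt.legend()
plt.show()
```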
