**2. Results**

Our results are based on an analysis of the Glissando corpus [40]. For illustration, Table 2 summarises some general characteristics of this oral corpus (for more details, see Section 4, Materials and Methods). For the phonetic inventory of Spanish and Catalan, only phonemes that actually appear in the Glissando corpus have been taken into account; phonemes that may appear in other linguistic varieties of both languages [41] are not considered.

In the next subsections we provide a systematic study of each of the six laws detailed above in the Glissando corpus. Table 3 summarizes the fitted exponents and parameters for all the linguistic laws explored in this work at the level of words, whereas Table 4 does the same at the phonemic level, with the exception, as explained in the following section, of the lognormality law for Catalan and Spanish. As we will see, linguistic laws are again recovered with only slight differences with respect to English [4], along with some technical details that are worth discussing for each law.

**Table 2.** Main characteristics of Glissando. This table summarises the main characteristics of the Glissando corpus [40] across linguistic levels for both Catalan and Spanish. For reference, a comparison to the Buckeye corpus (English) is provided [4,42,43]. We report the number of linguistic elements (phonemes, words and breath groups (BG)), specifying the number of different linguistic elements (types) and the total number (tokens). Since time duration distributions at all linguistic levels are usually heavy-tailed [4], we use the median duration (instead of the mean) as a reference.


**Table 3.** Summary of exponents and parameters for the case of words. Results are in reasonably good agreement with those found for English in [4]. Note that an actual fit of the lognormality law in Spanish and Catalan was not carried out due to low-resolution problems of the Glissando corpus; however, we verified that this law also holds (see the text).


**Table 4.** Summary of exponents and parameters for the case of phonemes. Results are in reasonably good agreement with those found for English in [4]. Note that an actual fit of the lognormality law in Spanish and Catalan was not carried out due to low-resolution problems of the Glissando corpus; however, we verified that this law also holds (see the text).


#### *2.1. Lognormality Law and Low-Resolution Effects*

Recently [4], after exploring the most common plausible families of probability distributions using maximum likelihood estimation (MLE) [44], compelling statistical evidence showed that time durations in speech in an English corpus are lognormally distributed across linguistic scales (phonemes, words, and BGs), and that this regularity is robust for individual speakers. Moreover, a generative mechanism able to explain the stability of the lognormal law across linguistic scales was also proposed [4], suggesting that such regularity is indeed universal, hence proposing the so-called lognormality law. Here we explore whether this new law holds in two additional languages, Catalan and Spanish. Empirical results for the time duration distribution *P*(*t*) of phonemes, words, and breath groups (BGs) are depicted in the main plots of the top panels of Figure 1. Since a lognormal distribution appears normal (Gaussian) in linear-logarithmic axes, we have logarithmically rescaled the time duration variable *t* as

$$t' = \frac{\log(t) - \langle \log(t) \rangle}{\sigma(\log(t))}.$$

Accordingly, if *P*(*t*) is lognormal, then $P(t')$ is a standard Gaussian $\mathcal{N}(0, 1)$ regardless of the linguistic scale. This fact is numerically checked in the inset panels of Figure 1.
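The rescaling above can be illustrated with a minimal sketch. It uses synthetic lognormal durations as a stand-in for corpus data; the function name `log_rescale` and the sampling parameters are assumptions made for illustration and are not taken from the original analysis.

```python
import numpy as np

def log_rescale(durations):
    """Return t' = (log t - <log t>) / std(log t) for positive durations."""
    log_t = np.log(np.asarray(durations, dtype=float))
    return (log_t - log_t.mean()) / log_t.std()

# Synthetic stand-in for one linguistic level (hypothetical parameters).
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=-2.5, sigma=0.5, size=50_000)  # seconds
t_prime = log_rescale(durations)

# Compare the empirical density of t' with the standard Gaussian density.
hist, edges = np.histogram(t_prime, bins=40, range=(-4, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
gauss = np.exp(-0.5 * centers**2) / np.sqrt(2 * np.pi)
max_gap = np.abs(hist - gauss).max()
print(f"largest gap between histogram and N(0,1) density: {max_gap:.3f}")
```

Applying the same function to phoneme, word and BG durations separately would place all three levels on a common axis, which is the kind of collapse the inset panels display.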

Overall, the data approximately collapse to the Gaussian shape, hence validating the lognormality law. However, there are small deviations, and these are notably stronger for phonemes at short timescales (note that only the right branch of the Gaussian is recovered and deviations are found at $t' < 0$). In what follows we argue that this is an artifact due to the finite precision and lower-bound resolution of the Catalan and Spanish corpus, rather than a genuine linguistic effect.

First, note that the lower-bound segmentation time in the Glissando corpus is 30 ms, and the corpus has a precision (granularity) of 10 ms. The lower-bound segmentation time precludes us from experimentally observing the left end of the phoneme time duration distribution. Furthermore, these artifacts can propagate up to a higher scale (i.e., to words): since every phoneme lasts at least 30 ms, a word made of *k* phonemes lasts at least 30*k* ms. Accordingly, words with a duration of 30, 40 or 50 ms are always composed of a single phoneme, words with a duration of 60, 70 or 80 ms have one or two phonemes, and so on.

**Figure 1.** Lognormality law for time duration. (**outer panels**) Time duration distribution of phonemes (orange), words (blue) and BGs (green) for the Glissando corpus: Catalan (**top left**) and Spanish (**top right**). For comparison, in the bottom left panel we show the results for English from the Buckeye corpus (extracted from [4]), where Buckeye has finer statistics (higher resolution) than Glissando. A coarsened version of the English corpus, developed to be comparable with Glissando's resolution, is plotted in the bottom right panel (see the text for details). (**inset panels**) Collapse of all distributions after the time rescaling $t' = (\log(t) - \langle \log(t) \rangle)/\mathrm{std}(\log(t))$ (where $\mathrm{std}(\log(t))$ stands for the standard deviation of the random variable $\log t$). If time durations at all levels comply with a lognormal distribution, then the collapsed data should approach a standard Gaussian $\mathcal{N}(0, 1)$ (solid line), in good agreement with the results. Small deviations found in Catalan and Spanish are similarly found in the coarsened version of English, from which we conclude that such deviations are mainly due to finite-precision and lower-bound detectability effects, and that the lognormality law otherwise holds.

In order to verify that these low-resolution issues are indeed underpinning the deviations from the pure lognormality law, we have added the following experiment. The Buckeye corpus (English) has higher resolution and precision than Glissando and is, therefore, free from these issues (also, the Buckeye corpus has larger sample sizes than Glissando, see Table 2). Indeed, for the Buckeye corpus, compliance with the lognormality law has recently been found to be excellent (see bottom left panel of Figure 1). We thus proceed to construct a coarsened, low-resolution version of the Buckeye corpus, comparable to the particularities of the Glissando corpus under study, by rounding up time durations in the Buckeye data to a precision of 10 ms and by further setting the minimum observable time duration (the lower limit of segmentation) to 30 ms (we do not reproduce further limitations, such as the fact that words shorter than 60 ms are always composed of one phoneme in Glissando). The resulting time duration distributions of phonemes, words, and BGs in this coarsened Buckeye corpus are plotted in the bottom right panel of Figure 1. Interestingly, deviations from the lognormality law similar to the ones found in the Glissando corpus are now recovered in the low-resolution version of the Buckeye corpus. This evidence supports our hypothesis that the lognormality law indeed holds well in Catalan and Spanish, although it might not be fully observable at the phoneme level in the Glissando corpus. Furthermore, this analysis flags an important issue: low-resolution effects such as low precision and a too-large lower limit of segmentation time can induce important deviations and hinder the observation of the true, underlying distribution.
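A minimal sketch of this kind of coarsening is given below. The function `coarsen` and its parameters are hypothetical names. Two details are assumptions, since the text does not fix them: durations are rounded to the nearest 10 ms tick, and durations below the 30 ms bound can be either discarded (treated as unobservable) or clamped to the bound, controlled by the `clamp` flag.

```python
import numpy as np

def coarsen(durations_s, precision_s=0.01, lower_bound_s=0.03, clamp=False):
    """Round durations (in seconds) to a fixed precision and enforce a lower bound.

    clamp=False discards durations below the bound; clamp=True sets them to
    the bound. Which variant matches the original procedure is an assumption.
    """
    # Work in integer ticks of `precision_s` to avoid floating-point edge cases.
    ticks = np.rint(np.asarray(durations_s, dtype=float) / precision_s).astype(int)
    min_ticks = int(round(lower_bound_s / precision_s))
    if clamp:
        ticks = np.maximum(ticks, min_ticks)
    else:
        ticks = ticks[ticks >= min_ticks]
    return ticks * precision_s

# Toy durations in seconds (illustrative values, not corpus data).
toy = [0.012, 0.027, 0.034, 0.058, 0.101]
dropped = coarsen(toy)               # 12 ms is discarded, the rest are rounded
clamped = coarsen(toy, clamp=True)   # 12 ms is raised to 30 ms instead
print(dropped, clamped)
```

Either variant removes the left end of the distribution, which is the feature that matters for the comparison with Glissando.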

To further investigate these effects, it is worth noting that the origin of the lognormality law has recently been discussed mathematically in terms of a stochastic model [4]. Suppose that phoneme time durations can be modeled by a random variable *Y*, which is lognormally distributed. Since words can be understood as concatenations of phonemes, the time duration of words can be modeled by a random variable $Z = \sum_{i=1}^{n} Y_i$, where each $Y_i$ follows in principle a different lognormal distribution and *n* is yet another random variable which describes the number of phonemes making up a word. Whereas the central limit theorem predicts that *Z* is asymptotically normal when *n* is large, when *n* is small and under some additional conditions *Z* is well approximated by a lognormal distribution [4], thereby explaining why the time duration of words is indeed found to be lognormally distributed in practice. Now, how would *Z* be distributed if we imposed on its sampling the artifacts detected in Glissando, such as a large lower-bound detectability threshold, finite precision, or a smallish sample size? To illustrate these effects, we have run a numerical test where we initially sample words of duration *Z*, constructed by concatenating phonemes with time duration *Y*, where *Y* = exp(*X*) and *X* is a Gaussian random variable. *Y* is therefore lognormal, and if we log-rescale it as $\tilde{Y} = [\log Y - \langle \log Y \rangle]/\mathrm{std}(\log Y)$ (where $\mathrm{std}(\log Y)$ stands for the standard deviation of the random variable $\log Y$) we should recover a standard Gaussian $\mathcal{N}(0, 1)$. This distribution is shown (black curve) in the left panel of Figure 2, whereas the case of words is plotted in the right panel of the same figure, approximately recovering again the lognormal shape (standard Gaussian in rescaled units).
Then, we have repeated the same experiment and 'lowered its resolution' by imposing the following: (i) *Y* is rounded to two decimal digits, imitating the precision of 10 ms found in Glissando; (ii) any synthetic phoneme shorter than the lower bound, *Y* < 0.03 s, is forced to have the minimal allowed duration, *Y* = 0.03 s. Results for this low-resolution version of the original experiment are shown as purple curves in Figure 2. In particular, we can see how the lognormal shape of the phoneme time duration distribution is significantly affected at shorter timescales, and how such issues propagate to the word case at short timescales. The phenomenology is similar to what we found by comparing the results on the Buckeye corpus (English) with the same results on a low-resolution version of the Buckeye corpus (bottom panels of Figure 1). All in all, this provides yet additional evidence explaining why the lognormality law might not be fully observable across all linguistic scales if the corpus has these kinds of limitations.
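The numerical test described above can be sketched as follows. The values μ = −3 and σ = 2 are those quoted in the caption of Figure 2. The distribution of the number of phonemes per word is approximated empirically in the original experiment; here it is replaced by a hypothetical 1 + Poisson(3) choice, and the skewness of the rescaled log-durations is used as an assumed, simple summary of the deviation from the Gaussian shape.

```python
import numpy as np

rng = np.random.default_rng(1)
MU, SIGMA = -3.0, 2.0   # values quoted in the Figure 2 caption
TAU = 0.03              # lower-bound detectability in seconds
N_WORDS = 20_000        # hypothetical sample size

def log_rescale(x):
    lx = np.log(x)
    return (lx - lx.mean()) / lx.std()

def skew(u):
    # Third moment of already standardised data (zero for a Gaussian).
    return np.mean(u**3)

# Hypothetical stand-in for the empirical phonemes-per-word distribution.
n_phonemes = 1 + rng.poisson(3.0, size=N_WORDS)
starts = np.concatenate(([0], np.cumsum(n_phonemes)[:-1]))

# High-resolution experiment: lognormal phonemes, summed into words.
y_hi = np.exp(rng.normal(MU, SIGMA, size=n_phonemes.sum()))
z_hi = np.add.reduceat(y_hi, starts)

# Low-resolution version: round to two decimals, then clamp to TAU.
y_lo = np.maximum(np.round(y_hi, 2), TAU)
z_lo = np.add.reduceat(y_lo, starts)

floor_fraction = np.mean(np.isclose(y_lo, TAU))
print(f"phonemes stuck at the 30 ms floor: {floor_fraction:.2f}")
print(f"phoneme skewness, high vs low resolution: "
      f"{skew(log_rescale(y_hi)):.2f} vs {skew(log_rescale(y_lo)):.2f}")
print(f"word skewness, high vs low resolution: "
      f"{skew(log_rescale(z_hi)):.2f} vs {skew(log_rescale(z_lo)):.2f}")
```

With these parameters a large share of the synthetic phonemes ends up at the floor value, so the left branch of the rescaled distribution is replaced by a single spike, in line with the truncation described for the purple curves.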

**Figure 2.** Lognormality law truncation. (**left**) Rescaled log-time duration distribution of synthetic 'phonemes' $P(\tilde{Y})$, estimated by (i) sampling *Y* = exp(*X*), where *X* is normally distributed, $X \sim \mathcal{N}(\mu, \sigma^2)$ with *μ* = −3, *σ* = 2, and then (ii) rescaling $\tilde{Y} = [\log Y - \langle \log Y \rangle]/\mathrm{std}(\log Y)$ (where $\mathrm{std}(\log Y)$ stands for the standard deviation of the random variable $\log Y$). If *Y* is lognormal, then $\tilde{Y} \sim \mathcal{N}(0, 1)$. (**right**) Rescaled log-time duration distribution of synthetic 'words' $P(\tilde{Z})$, obtained using the stochastic model of [4] by concatenating *n* phonemes, where *n* is another random variable whose distribution is approximated empirically. As in the left panel, if *Z* is lognormal, then $\tilde{Z} \sim \mathcal{N}(0, 1)$. In both panels, the black curve is the original, high-resolution experiment whereas the purple curve is the result of (i) reducing the precision by rounding off to two decimal digits, (ii) reducing the sample size to match the differences between Buckeye and Glissando, and (iii) imposing a lower-bound detectability *τ* = 0.03 s (akin to the 30 ms of Glissando), such that all synthetically generated phonemes with a duration *Y* < *τ* are set to 0.03 s. Whereas lognormality is recovered in the original experiment, this shape is smeared out as soon as the lower-bound detectability threshold and other low-resolution artifacts are imposed, thereby explaining why the lognormality law might not be fully observable in Glissando.

#### *2.2. Zipf's Law for Words and Yule Distribution for Phonemes*

Results for Zipf's law are reported in Figure 3. The estimates of the exponent *α* obtained for word frequencies by applying the methodology of Clauset et al. [45,46] are in agreement with those previously found [4] for the second regime in English (see Table 3), with *α* ≈ 1.41 (Spanish and English) and *α* ≈ 1.42 (Catalan), pointing to the robustness of the law also in speech. However, in the case of phonemes, whereas a Yule distribution can be fitted following the MLE method [44], fits are not very good (perhaps due to lack of statistics) and there are some slight differences between the distribution parameters of Catalan, Spanish and English (see Table 4). We conclude at this point that the Yule shape might not be universal for phoneme distributions, and this hypothesis should be carefully revisited.
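For illustration only, the core step of a maximum likelihood fit of a discrete power law $p(x) \propto x^{-\alpha}$ for $x \ge x_{\min}$ can be sketched as below. This is a generic sketch and not the authors' pipeline: the function name `fit_discrete_power_law` is hypothetical, the data are synthetic, the selection of $x_{\min}$ and the goodness-of-fit tests of [45,46] are omitted, and how the rank-frequency data are arranged into the sample `x` is not specified in this section.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import zeta

def fit_discrete_power_law(x, x_min=1):
    """MLE of alpha for p(x) = x**(-alpha) / zeta(alpha, x_min), x >= x_min."""
    x = np.asarray(x)
    x = x[x >= x_min]
    n, sum_log = x.size, np.log(x).sum()

    def neg_log_likelihood(alpha):
        # zeta(alpha, x_min) is the Hurwitz zeta normalisation constant.
        return n * np.log(zeta(alpha, x_min)) + alpha * sum_log

    result = minimize_scalar(neg_log_likelihood, bounds=(1.01, 5.0),
                             method="bounded")
    return result.x

# Synthetic check: draw from a zeta (discrete power-law) distribution.
rng = np.random.default_rng(2)
sample = rng.zipf(2.0, size=20_000)
alpha_hat = fit_discrete_power_law(sample, x_min=1)
print(f"estimated exponent: {alpha_hat:.3f}")
```

The synthetic exponent 2.0 is chosen only to keep the check well behaved; it is unrelated to the values of *α* reported in Table 3.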

**Figure 3.** Zipf's law. Log-log frequency-rank plot of phonemes (orange squares) and words (blue circles) for the case of Catalan (**left**) and Spanish (**right**). Words are fitted to a power law distribution following [45,46], leading to $x_{\min} = 1$ and nearly identical slopes for both languages. Phonemes are fitted to a Yule distribution using maximum likelihood estimation (MLE).
