*2.4. Brevity Law*

A mathematical formulation of the brevity law was developed in [4] based on the information-theoretic principle of compression [32,49], covering the cases where the size of the units is expressed in symbolic or in physical units. In Figure 5 we report the results obtained for the brevity law in the Glissando corpus for the case of words (left panel for Catalan, right panel for Spanish). Raw data (light grey) were fitted to the theoretical exponential law (see Table 1), and the best fit is depicted as a red dashed line. Binned data (blue dots) are overlaid to allow a visual comparison with the fit.

When word size is measured in physical units (word duration), the best exponential fit to the raw data (red dashed line) accurately matches the binned data. The fitting parameters are similar for both languages, *λ* ≈ 23.8 (Catalan) and *λ* ≈ 24.1 (Spanish) (to be compared with *λ* ≈ 20.6 for English in the Buckeye Corpus [4]), and the Spearman correlations are significant. Note that the binned data deviate from the red dashed line at short timescales: we argue that these deviations are related to the finite-precision and resolution issues discussed before, which propagate into (short) words.
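As a rough illustration of the procedure described above (an exponential fit to raw data, compared against a binning), the following Python sketch works on synthetic data. The fitting method (ordinary least squares on the logarithm of the frequency), the equal-width binning, the function names `fit_exponential` and `bin_means`, and the synthetic decay rate of 24 are all assumptions made for illustration; the text does not specify how the fits in [4] or in Figure 5 were computed.

```python
import math
import random


def fit_exponential(sizes, freqs):
    """Least-squares fit of log(f) = a - lam * size. Returns (lam, a)."""
    pairs = [(s, math.log(f)) for s, f in zip(sizes, freqs) if f > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x


def bin_means(sizes, freqs, n_bins=10):
    """Equal-width bins over size. Returns (bin centre, mean frequency)
    for every non-empty bin."""
    lo, hi = min(sizes), max(sizes)
    if hi == lo:
        raise ValueError("all sizes are identical; cannot bin")
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, f in zip(sizes, freqs):
        i = min(int((s - lo) / width), n_bins - 1)  # clamp the maximum value
        sums[i] += f
        counts[i] += 1
    return [(lo + (i + 0.5) * width, sums[i] / counts[i])
            for i in range(n_bins) if counts[i]]


# Synthetic stand-in for (duration, frequency) pairs: NOT corpus data.
rng = random.Random(0)
durations = [rng.uniform(0.05, 0.6) for _ in range(500)]
frequencies = [math.exp(-24.0 * d) * math.exp(rng.gauss(0.0, 0.3))
               for d in durations]

lam_hat, intercept = fit_exponential(durations, frequencies)
binned = bin_means(durations, frequencies)
print(f"estimated lambda: {lam_hat:.2f}")
```

The fit uses the raw points only; the binned means are computed separately and serve purely as a visual check, mirroring the role of the blue dots in Figure 5.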

When word size is measured in symbolic units (i.e., in the number of phonemes and number of characters), the law is again recovered (inset panels of Figure 5). Interestingly, the mathematical formulation of this law assigns a specific interpretation to the exponent *λ* when units are measured in symbolic space (i.e., when a code is available): the exponent in this case is always bounded, 0 ≤ *λ* ≤ 1, and quantifies the deviation of the language under study from compression optimality, where the closer to 1 the closer to optimality [4]. Out of the three languages, results suggest that Spanish (*λ*D = 0.56 for phonemes and *λ*D = 0.60 for characters) is the closest to optimality by a small margin, followed by English (*λ*D = 0.5 for phonemes and *λ*D = 0.6 for characters) and Catalan (*λ*D = 0.49 for phonemes and *λ*D = 0.53 for characters).

In the case of phoneme duration the statistics, and thus the fit, are much poorer, especially for Catalan (recall the finite-precision and resolution issues of the Glissando corpus). Nevertheless, Spearman correlations are significant for both Spanish and Catalan (Figure 6), although the correlation is stronger for Spanish (−0.54) than for Catalan (−0.3).
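For readers unfamiliar with the statistic, the Spearman correlation quoted above is the Pearson correlation computed on the ranks of the two variables. The sketch below is a minimal pure-Python version with average ranks for ties; the function names and the five (duration, frequency) pairs are hypothetical and are not taken from the Glissando corpus. It does not compute a significance level.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Hypothetical mean durations (assumed seconds) and frequencies.
durations = [0.05, 0.07, 0.09, 0.12, 0.15]
frequencies = [900, 1200, 400, 300, 100]
rho = spearman(durations, frequencies)
print(f"Spearman rho: {rho:.2f}")
```

A negative value indicates that longer units tend to be less frequent, which is the qualitative content of the brevity law; it says nothing by itself about whether the decay is exponential.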

**Figure 5.** Brevity law: words (Catalan on the left panel and Spanish on the right panel). Red dashed lines are fits to the exponential law *f* ∼ exp(−*λℓ*), where *ℓ* is the word size, which can be measured in physical units (mean duration, **outer panels**) or in symbolic units (number of phonemes or number of characters, inset panels). See the text and Table 3 for data fits and interpretation. Blue dots are the result of a data binning. Note that the fits are performed on the raw data, but the resulting exponential shape accurately matches the binned data within a range (deviations occur for shorter sizes, where the resolution and finite-precision issues of the Glissando corpus are important). For Catalan, the Spearman test gives a consistent negative correlation of −0.27 across the three formulations, while for Spanish the correlation is slightly stronger in physical units (−0.25) than in symbolic units (−0.22).

**Figure 6.** Brevity law: phonemes (Catalan on the left panel and Spanish on the right panel). Red dashed lines are fits to the exponential law *f* ∼ exp(−*λℓ*), where *ℓ* is the phoneme size measured in physical units (mean duration). Orange squares are the result of a data binning. The Spearman test gives negative correlations in both cases (−0.3 for Catalan, −0.54 for Spanish), but the data sample is too small to evaluate the agreement with the exponential law.
