**3. Results**

Using the methodology described above, we have coarse-grained the words of a written corpus, first, into the 34 grammatical classes shown in Table 1. This process is illustrated by Equation (2). The resulting symbolic series was binarized to create samples akin to spin glasses, a well-studied model from statistical mechanics that allows us to use powerful mathematical tools on our problem. This process was then repeated at several levels of coarse-graining as words were further lumped into abstract grammatical categories (e.g., as in Equation (4)). At each level of description, the inferred spin glass model plays the role of a grammar that constrains, in a probabilistic fashion, how word classes can follow each other in a text. These mathematical tools from spin glass theory allow us to test grammars from different description levels against each other, as will become clear below.

In spin glasses, a collection of little magnets (or spins) is arranged in space. We say that a magnet is in state $\sigma = 1$ if its north pole is pointing upwards and in state $\sigma = -1$ if it is pointing downwards (these are equivalent to the 1s and 0s in our word samples). Two of these little magnets interact through their magnetic fields. These fields build up a force that tends to align both spins in the same direction, whichever it is, just as two magnets in your hand try to fall along a specific direction with respect to each other. On top of this, the spins can interact with an external magnetic field—as if bringing in a much bigger magnet whose orientation cannot be controlled. This external field tends to align the little spins along its fixed, preferred direction. Given the spin states $\sigma_1$ and $\sigma_2$, the energy of their interaction with the external magnetic field and with each other can be written as

$$\begin{aligned} E(\sigma_1, \sigma_2) &= -\frac{1}{2} \left( 2h_1 \sigma_1 + \sigma_1 J_{12} \sigma_2 + \sigma_2 J_{21} \sigma_1 + 2h_2 \sigma_2 \right) \\ &= -\frac{1}{2} \left( J_{11} \sigma_1 + \sigma_1 J_{12} \sigma_2 + \sigma_2 J_{21} \sigma_1 + J_{22} \sigma_2 \right). \end{aligned} \tag{15}$$

$J_{12}$ and $J_{21}$ (with $J_{12} = J_{21}$) denote the strength of the interaction between the spins, and $J_{11} \equiv 2h_1$ and $J_{22} \equiv 2h_2$ denote the interaction of each spin with the external field. The terms $h_1$ and $h_2$ are also known as biases. If the spins are aligned with each other and with the external field, the resulting energy is the lowest possible. Each misalignment increases the energy of the system. In physics, states with lower energy are more probable. Statistical mechanics allows us to write precisely the likelihood of finding this system in each of its four possible states ($\{1, 1\}$, $\{1, -1\}$, $\{-1, 1\}$, and $\{-1, -1\}$):

$$P(\sigma_1, \sigma_2) = \frac{e^{-\beta E(\sigma_1, \sigma_2)}}{Z}, \tag{16}$$

where *β* = 1/*T* is the inverse of the temperature. The term

$$\begin{aligned} Z &= e^{-\beta E(1,1)} + e^{-\beta E(1,-1)} + e^{-\beta E(-1,1)} + e^{-\beta E(-1,-1)} \\ &= \sum_{\sigma_1, \sigma_2 = \pm 1} e^{-\beta E(\sigma_1, \sigma_2)} \end{aligned} \tag{17}$$

is known as the partition function and is a normalizing factor that guarantees that the probability distribution in Equation (16) is well defined. 
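
To make Equations (15)–(17) concrete, the following minimal sketch computes the energy of each of the four states and their Boltzmann probabilities. The parameter values ($h_1$, $h_2$, $J_{12}$, $\beta$) are arbitrary illustrative choices, not values taken from this study.

```python
# A minimal numerical sketch of Equations (15)-(17): the energy of two coupled
# spins in an external field and the Boltzmann probabilities of the four states.
# The parameter values below are arbitrary illustrative choices.
import itertools
import numpy as np

h1, h2 = 0.3, -0.1   # biases (interaction with the external field)
J12 = 0.5            # spin-spin coupling (J12 = J21)
beta = 1.0           # inverse temperature, beta = 1/T

def energy(s1, s2):
    # Equation (15): E = -1/2 (2*h1*s1 + s1*J12*s2 + s2*J21*s1 + 2*h2*s2)
    return -0.5 * (2 * h1 * s1 + 2 * J12 * s1 * s2 + 2 * h2 * s2)

states = list(itertools.product([1, -1], repeat=2))
weights = np.array([np.exp(-beta * energy(s1, s2)) for s1, s2 in states])
Z = weights.sum()          # partition function, Equation (17)
probs = weights / Z        # Equation (16)

for (s1, s2), p in zip(states, probs):
    print(f"sigma = ({s1:+d}, {s2:+d})  E = {energy(s1, s2):+.3f}  P = {p:.3f}")
```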

Back to our text corpus in its binary representation, we know the empirical frequency $F^\lambda$ with which each of the possible spin configurations shows up—we just need to read it from our corpus. We can treat our collection of 0s and 1s as if they were ±1 samples of a spin glass, and attempt to infer the $\beta^\lambda$ and $J^\lambda$ which (through a formula similar to Equation (16)) most faithfully reproduce the observed sample frequencies. The superscript in $\beta^\lambda$ and $J^\lambda$ indicates that they change with the level of coarse-graining. Inferring those $\beta^\lambda$ and $J^\lambda$ amounts to finding the MaxEnt model at that coarse-grained level. As mentioned above, MaxEnt models are convenient because they are the models that introduce the fewest extra hypotheses given some observations. In other words, if we infer the MaxEnt model for some $\lambda$, any other model with the same coarse-graining would be introducing spurious hypotheses that are not suggested by the data. To infer MaxEnt models, we used Minimum Probability Flow Learning (MPFL [50]), a fast and reliable method that infers the $J^\lambda$ given a sufficiently large sample.
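
The toy sketch below illustrates what a Minimum Probability Flow fit can look like for a small, fully connected spin model. It is not the authors' implementation, nor the MPFL code of [50]; the ±1 samples are random placeholders standing in for the binarized corpus, and the sketch only shows the flow objective (a sum over single-spin-flip neighbours) being minimized numerically.

```python
# A toy sketch of Minimum Probability Flow learning for a small Ising-type
# model, in the spirit of the MPFL inference step described here. The +/-1
# samples are random placeholders; beta is absorbed into J for simplicity.
import numpy as np
from scipy.optimize import minimize

def energy(samples, J):
    # E(x) = -1/2 (sum_i J_ii x_i + sum_{i != j} x_i J_ij x_j), as in Eq. (15)
    quad = np.einsum('nk,kl,nl->n', samples, J - np.diag(np.diag(J)), samples)
    lin = samples @ np.diag(J)
    return -0.5 * (lin + quad)

def mpf_objective(J_flat, samples, N):
    J = J_flat.reshape(N, N)
    J = 0.5 * (J + J.T)                  # keep the coupling matrix symmetric
    E0 = energy(samples, J)
    K = 0.0
    for i in range(N):                   # neighbours: single spin flips
        flipped = samples.copy()
        flipped[:, i] *= -1
        K += np.exp(0.5 * (E0 - energy(flipped, J))).sum()
    return K / samples.shape[0]

rng = np.random.default_rng(0)
N = 6
samples = rng.choice([-1, 1], size=(500, N))       # placeholder data
res = minimize(mpf_objective, np.zeros(N * N), args=(samples, N),
               method='L-BFGS-B')
J_hat = 0.5 * (res.x.reshape(N, N) + res.x.reshape(N, N).T)
print(J_hat.round(2))
```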

Each grammatical class is represented by $N^\lambda$ spins at the $\lambda$-th coarse-graining. This implies, as we know, that our samples consist of $2N^\lambda$ spins. MPFL returns a matrix $J^\lambda$ of size $2N^\lambda \times 2N^\lambda$. This matrix embodies our abstract, probabilistic grammar (and plays the role of Ψ in Figure 1). Each entry $J^\lambda_{kk'}$ of this matrix tells us the interaction energy between the $k$-th and $k'$-th bits in a sample (with $k, k' = 1, \dots, 2N^\lambda$). However, each grammatical class is represented not by one spin, but by a configuration of spins that has only one 1. To obtain the interaction energies between grammatical classes (rather than between spins), we need to compute

$$V^{\lambda}(c^{\lambda}_{j}, c^{\lambda}_{j'}) = \frac{1}{2} \sum_{k, k'} \sigma^{\lambda}_{j,k} J^{\lambda}_{kk'} \sigma^{\lambda}_{j',k'}. \tag{18}$$

This energy in turn tells us the frequency with which we should observe each pair of word classes according to the model:

$$P^{\lambda}(c^{\lambda}_{j}, c^{\lambda}_{j'}) = \frac{1}{Z^{\lambda}} e^{\beta^{\lambda} V^{\lambda}(c^{\lambda}_{j}, c^{\lambda}_{j'})}. \tag{19}$$
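
As an illustration of Equations (18) and (19), the sketch below recovers class-class energies and pair probabilities from a coupling matrix. The one-hot ±1 encoding of classes, the random placeholder $J^\lambda$, and the value of $\beta^\lambda$ are assumptions made for this example only.

```python
# A hedged sketch of Equations (18)-(19): class-class interaction energies and
# pair probabilities from an inferred coupling matrix. Each grammatical class
# is encoded here as a one-hot spin configuration (+1 for its own spin, -1
# elsewhere); J_lambda is a random placeholder, not an inferred grammar.
import numpy as np

rng = np.random.default_rng(1)
N_lambda = 4                                   # number of classes at this level
J_lambda = rng.normal(size=(2 * N_lambda, 2 * N_lambda))
J_lambda = 0.5 * (J_lambda + J_lambda.T)       # symmetric placeholder couplings
beta_lambda = 1.0

def spin_config(j, N):
    """One-hot +/-1 encoding of class j over N spins (an assumed convention)."""
    s = -np.ones(N)
    s[j] = 1.0
    return s

def V(j, jp):
    # Equation (18): energy of class j (first word) paired with class jp
    # (second word), read off the cross block of J_lambda (cf. Eq. (20)).
    s1 = spin_config(j, N_lambda)
    s2 = spin_config(jp, N_lambda)
    cross = J_lambda[:N_lambda, N_lambda:]
    return 0.5 * s1 @ cross @ s2

# Equation (19): model probability of each ordered class pair.
V_matrix = np.array([[V(j, jp) for jp in range(N_lambda)]
                     for j in range(N_lambda)])
weights = np.exp(beta_lambda * V_matrix)
P_pairs = weights / weights.sum()              # Z_lambda normalises over all pairs
print(P_pairs.round(3))
```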

We inferred MaxEnt models for the most fine-grained level of description ($\chi^0$, as given by the grammatical classes in Table 1), as well as for every intermediate level $\chi^\lambda$. Figure 2a shows the emerging spin-spin interactions for $\lambda = 15$, which consists of only 19 (versus the original 34) grammatical classes. This matrix presents a clear block structure:

$$J^{\lambda} = \begin{bmatrix} 2h^{\lambda} & \overrightarrow{\partial}^{\lambda} \\ \overleftarrow{\partial}^{\lambda} & 2\bar{h}^{\lambda} \end{bmatrix}. \tag{20}$$

The diagonal blocks ($2h^\lambda$ and $2\bar{h}^\lambda$) represent the interactions between all spins that define, separately, the first and second words in each sample. As our corpus becomes infinitely large, $h^\lambda \to \bar{h}^\lambda$. These terms do not capture the interaction between grammatical classes. In the spin-glass analogy, they are equivalent to the interaction of each word with the external magnet that biases the presence of some grammatical classes over others. Such biases affect the frequencies $P^\lambda(c^\lambda_j)$ with which individual classes show up, but not the frequency with which they are paired up. Therefore, $h^\lambda$ and $\bar{h}^\lambda$ do not give us much syntactic information.

More interesting for us are the interaction terms stored in $\overrightarrow{\partial}^\lambda$ and $\overleftarrow{\partial}^\lambda$. The inference method used guarantees that $\overrightarrow{\partial}^\lambda = (\overleftarrow{\partial}^\lambda)^T$. It is from these terms that we can compute the part of $V^\lambda(c^\lambda_j, c^\lambda_{j'})$ (shown in Figure 2b) that pertains to pairwise interaction alone (i.e., the energy of the spin system when we discount the interaction with the external field). $V^\lambda(c^\lambda_j, c^\lambda_{j'})$ encodes the energy of two word classes when they are placed next to each other in a text. The order in which words follow each other is relevant; therefore, this matrix is not symmetric. These energies reflect some of the rules of English. For example, the first row (labeled "E, M") is a class that has lumped together the existential "there" (as in "there is" and "there are") with all modal verbs. These tend to be followed by a verb in English; thus, the matrix entry coding for "E, M"|verb (marked in red) is much lower than most entries for any other "E, M"|· pair. The blue square encompasses verbs, nouns, and determiners. Although the differences there are very subtle, the energies reflect that it is more likely to see a noun after a determiner than the other way around, and also that it is less likely to see a verb after a determiner.

It is not straightforward to compare all energies because they are affected by the raw frequency with which pairs of words show up in a text. In that sense, our corpus size might sample some pairings insufficiently, so that their energies do not reflect proper English use. On the other hand, classes such as nouns, verbs, and determiners occur so often (and so often combined with each other) that they present very low energies compared with other possible pairs. This makes comparison by visual inspection more difficult.

It is possible to use $V^\lambda(c^\lambda_j, c^\lambda_{j'})$ to generate a synthetic text $\tilde{T}^\lambda$ and evaluate its energy $E^0(\tilde{T}^\lambda)$ using the most fine-grained model $J^0$. If the coarse-grained model $V^\lambda(c^\lambda_j, c^\lambda_{j'})$ retains a lot of the original structure, the generated text will fit gracefully within the rules dictated by $J^0$—just as magnets falling into place. Such texts would present a very low energy when evaluated by $J^0$. If the coarse-grained model has erased much of the original structure, the synthetic text will present odd pairings. These would feel similar to magnets that we are forcing into a wrong disposition, therefore resulting in a large energy when $J^0$ is used. In other words, this energy reflects how accurate each coarse-grained model is.
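
A hedged sketch of this evaluation step follows: it samples synthetic class pairs from a coarse-grained model, maps them back to fine-grained classes, and averages an energy under the finest model. The uniform back-mapping, the convention that the text's energy is minus the average pairwise interaction of Equation (18), and every input (the pair probabilities, the fine-grained energies $V^0$, and the membership lists) are invented placeholders for illustration, not the data or exact procedure of this study.

```python
# A hedged sketch of the evaluation step: sample a synthetic "text" of class
# pairs from a coarse-grained model, back-map it to fine-grained classes, and
# score it with the finest model. All inputs below are invented placeholders.
import numpy as np

rng = np.random.default_rng(2)

P_pairs_coarse = np.array([[0.4, 0.1], [0.2, 0.3]])   # coarse-level pair probs (Eq. 19)
V0 = rng.normal(size=(4, 4))                          # placeholder fine-grained energies
members = [[0, 1], [2, 3]]                            # coarse class -> fine classes

def sample_pairs(P, n_pairs):
    flat = P.ravel() / P.sum()
    idx = rng.choice(P.size, size=n_pairs, p=flat)
    return np.column_stack(np.unravel_index(idx, P.shape))

def back_map(coarse_class):
    # Assumption: each coarse class is mapped back uniformly over its members.
    return rng.choice(members[coarse_class])

pairs = sample_pairs(P_pairs_coarse, n_pairs=1000)
fine_pairs = np.array([[back_map(a), back_map(b)] for a, b in pairs])

# Assumed convention: energy of the text is minus the mean pairwise interaction,
# so better-fitting texts score lower.
E0 = -np.mean([V0[a, b] for a, b in fine_pairs])
print(f"E0 of synthetic text: {E0:.3f}")
```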

That accuracy is one of the targets in our MOO problem, in which we attempt to retain as much information as possible with models that are as simple as possible. To quantify the second target, simplicity, we turn to entropy. The simplest possible model generates words that fall into any class of $\chi^0$ randomly and uniformly, thus presenting the largest entropy possible. More complex models, in their attempt to remain accurate, introduce constraints on how the words in the coarse-grained model must be mapped back into the classes available in $\chi^0$. That operation would be the reverse of $\pi^\lambda$. This reverse mapping, however, cannot be carried out without error because the coarse-graining erases information. Entropy measures the amount of information that has been erased, and therefore how simple the model has been made.
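
The sketch below illustrates one plausible way to quantify this, assuming simplicity is measured as the Shannon entropy of the fine-grained class distribution obtained by spreading each coarse class uniformly over its members; the partition and frequencies are illustrative placeholders, not those of the corpus.

```python
# A minimal sketch of the simplicity measure, under the assumption that it is
# the Shannon entropy of the fine-grained classes recovered from a coarse-
# grained text by a uniform back-mapping. Inputs are illustrative placeholders.
import numpy as np

members = [[0, 1, 2], [3, 4]]            # pi_lambda: coarse class -> fine classes
coarse_freq = np.array([0.7, 0.3])       # how often each coarse class appears

# Uniform back-mapping spreads each coarse class evenly over its members.
fine_prob = np.zeros(5)
for c, fs in enumerate(members):
    fine_prob[fs] = coarse_freq[c] / len(fs)

S = -np.sum(fine_prob * np.log(fine_prob))
S_max = np.log(5)                        # uniform over all fine-grained classes
print(f"S = {S:.3f} nats (maximum possible: {S_max:.3f})")
```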

Figure 3b shows the energy $E^0(\tilde{T}^\lambda)$ and entropy $S^0(\tilde{T}^\lambda)$ for synthetic texts generated with the whole range of coarse-grainings explored. In terms of Pareto optimality, we expect our models to have as low an energy as possible while having the largest entropy compatible with each energy—just as thermodynamic systems do. Such models would simultaneously optimize their simplicity and accuracy. Within the sample, some of these models are Pareto dominated (crosses in Figure 3b) by others. This means that, for each of those models, at least one other model exists that is simpler and more accurate at the same time. These models are suboptimal regarding both optimization targets, so we do not need to bother with them. The non-dominated ones (marked by circles in Figure 3b) capture better descriptions in both senses (accuracy and simplicity). They are such that we cannot move from one to another without improving one optimization target and worsening the other. They embody the best trade-off possible (limited, of course, by all the approximations made in this paper), and we cannot choose one model over the others without introducing some degree of artificial preference for either simplicity or accuracy.
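
The dominance test itself is simple to state in code. The sketch below marks a model as dominated when another model has lower-or-equal energy and higher-or-equal entropy, with at least one strict inequality; the $(E, S)$ values are invented placeholders, not those plotted in Figure 3b.

```python
# A small sketch of the Pareto-dominance test that separates the circles from
# the crosses in a plot like Figure 3b. The (E, S) values are placeholders.
import numpy as np

E = np.array([1.0, 0.8, 0.9, 0.5, 0.7])     # accuracy target: lower is better
S = np.array([2.0, 1.5, 1.9, 1.0, 1.6])     # simplicity target: higher is better

def is_dominated(i):
    # Model i is dominated if some other model is at least as good on both
    # targets and strictly better on at least one of them.
    better_or_equal = (E <= E[i]) & (S >= S[i])
    strictly_better = (E < E[i]) | (S > S[i])
    mask = better_or_equal & strictly_better
    mask[i] = False
    return mask.any()

pareto_front = [i for i in range(len(E)) if not is_dominated(i)]
print("Non-dominated models:", pareto_front)
```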

In statistical mechanics the energy and entropy of a system are brought together by the free energy:

$$F = E - \hat{T} S = E - S/\hat{\beta}. \tag{21}$$

Here, $\hat{T}$ plays a role akin to a temperature and $\hat{\beta}$ plays the role of its inverse. We note that $\hat{\beta} \neq \beta$ to indicate that this temperature and inverse temperature are different from the ones in Equation (19). Those temperatures control how often a word shows up given a model, whereas $\hat{\beta}$ controls how appropriate each level of description is. When $\hat{\beta}$ is low (and $\hat{T}$ is large), a minimum free energy in Equation (21) is attained by maximizing the entropy rather than minimizing the energy. That is, low $\hat{\beta}$ selects for simpler descriptions. When $\hat{\beta}$ is large (and $\hat{T}$ is small), we prefer models with lower energy, i.e., higher accuracy.
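
The selection mechanism can be illustrated in a few lines: for a given $\hat{\beta}$ we evaluate Equation (21) on each candidate model and keep the minimizer. The $(E, S)$ values below are again invented placeholders.

```python
# A short sketch of how beta_hat selects among Pareto-optimal grammars: for
# each candidate model, evaluate F = E - S / beta_hat (Equation (21)) and keep
# the minimiser. The (E, S) values are invented placeholders.
import numpy as np

E = np.array([1.0, 0.9, 0.7, 0.5])          # energies of Pareto-optimal models
S = np.array([2.0, 1.9, 1.6, 1.0])          # their entropies

for beta_hat in [0.2, 1.0, 5.0]:
    F = E - S / beta_hat
    best = int(np.argmin(F))
    print(f"beta_hat = {beta_hat:>4}: model {best} minimises F (F = {F[best]:.2f})")
```

Low $\hat{\beta}$ picks the highest-entropy (simplest) model in this toy example, while large $\hat{\beta}$ picks the lowest-energy (most accurate) one, matching the behaviour described above.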

By varying $\hat{\beta}$ we visit the range of models available, i.e., we visit the collection of Pareto optimal grammars (circles in Figure 3b). In statistical mechanics, by varying the temperature of a system we visit a series of states of matter (that is, we put, e.g., a glass of water at different temperatures and observe how its volume and pressure change). At some relevant points, called phase transitions, the states of matter change radically, e.g., water freezes swiftly at 0 degrees Celsius and boils right at 100 degrees Celsius. The geometry of Pareto optimal states of matter tells us when such transitions occur [25–29].

Similarly, the geometric disposition of Pareto optimal models in Figure 3b tells us when a drastic change in our best description is needed as we vary $\hat{\beta}$. Relevant phase transitions are given by cavities and salient points along the Pareto optimal solutions. In this first approach, we observe several cavities. More interesting, perhaps, is the possibility that our Pareto optimal models might fall along a straight line; one has been added as a guideline in Figure 3b. Although there are obvious deviations from it, such a description might be feasible at large. Straight lines in this plot are interesting because they indicate the existence of special critical points [28,37,46–48]. In the next section, we discuss what criticality might mean in this context.
