**2. Methods**

#### *2.1. Corpus Description and Preparation*

We took a sample of 49 newspaper articles from the Corpus of Contemporary American English [39]. The articles were selected so that they did not contain foreign (non-English) words or symbols. We replaced every punctuation mark that indicated the end of a sentence with a period and removed all other punctuation marks except for apostrophes indicating a contraction (e.g., "don't") or a genitive (e.g., "someone's"). Ideally, we would like to use raw texts and see Pareto optimal grammars emerge from them. These should also include instructions about how alien symbols or words (loosely speaking, any items that are not proper to the English language, e.g., French terms, accent marks, etc.) are treated. However, these are rather minor details. Effective grammars should first specify how their own words are articulated.
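A minimal sketch of this normalization step, assuming only Python's standard `re` module (the helper name `clean_text` and the particular regular expressions are illustrative choices, not the exact pipeline used), could look as follows:

```python
import re

def clean_text(raw: str) -> str:
    """Normalize punctuation: sentence-ending marks become periods, everything
    else is dropped except apostrophes inside words ("don't", "someone's")."""
    text = re.sub(r"[.?!]+", ".", raw)              # any sentence-ending mark -> a single period
    text = re.sub(r"[^\w\s.']", " ", text)          # drop all other punctuation
    text = re.sub(r"(?<!\w)'|'(?!\w)", " ", text)   # keep only word-internal apostrophes
    return re.sub(r"\s+", " ", text).strip()

print(clean_text('He said: "Don\'t stop!" Then, someone\'s dog barked...'))
# He said Don't stop. Then someone's dog barked.
```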

Our most basic level of analysis will already be a coarse-grained one. Again, ideally, we would present our methods with texts in which each word is explicitly spelled out. Our blind techniques should then infer grammatical classes (if any were useful) based on how different words correlate. For example, we expect that our blind methods would be able, at some point, to group all nouns together based on their syntactic regularities. While this is possible, it is very time- and resource-consuming for the demonstration intended here. Therefore, we preprocessed our corpus using Python's Natural Language Toolkit [40] to map every word into one of the $N_G = 34$ grammatical classes shown in Table 1. We then substituted every word in the corpus by its grammatical class. The resulting texts constitute the symbolic dynamics that we analyze.
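A minimal sketch of this tagging step with NLTK (assuming the default `punkt` tokenizer and `averaged_perceptron_tagger` models; resource names differ slightly across NLTK versions):

```python
import nltk

# One-off downloads of the default tokenizer and tagger models
# (resource names assumed; newer NLTK releases use slightly different names).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "green colorless ideas sleep furiously ."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)            # [(word, grammatical class), ...]

# Keep only the class sequence: this is the symbolic dynamics we analyze.
symbols = [tag for _, tag in tagged]
print(symbols)                           # e.g., ['JJ', 'JJ', 'NNS', 'VBP', 'RB', '.']
```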



#### *2.2. Word Embeddings and Coarse-Graining*

We would like to explore the most general grammars possible. However, as advanced above, to make some headway we restrict ourselves to grammar models that encode a tongue's rules in a probabilistic way, telling us how likely it is that words follow each other in a text. Even within this narrower class there is an inscrutably large number of possibilities depending, e.g., on how far back we look into a sentence to determine the next word's likelihood, on whether we build intermediate phrases to keep track of the symbolic dynamics in a hierarchical way, etc. Here, we only attempt to predict the next word given the current one. We will also restrict ourselves to maximum entropy (*MaxEnt*) models, which are the models that introduce the fewest additional assumptions given a series of observations [37,41–49]. We explain this kind of model in the next subsection. First, we need to introduce some notation and a suitable encoding of our corpus so that we can manipulate it mathematically.
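Before introducing the MaxEnt machinery, note that this restriction amounts to working with the statistics of consecutive symbols only. A toy sketch of that restriction (a plain maximum-likelihood bigram estimate, not the MaxEnt model itself, which is described in the next subsection; the helper name `bigram_probabilities` is illustrative):

```python
from collections import Counter, defaultdict

def bigram_probabilities(symbols):
    """Maximum-likelihood estimate of P(next | current) from a symbol sequence."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(symbols, symbols[1:]):
        pair_counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for current, counts in pair_counts.items()
    }

# Toy symbolic dynamics over three coarse word classes.
sequence = ["noun", "verb", "cat_3", "noun", "verb", "noun", "cat_3"]
print(bigram_probabilities(sequence)["noun"])   # {'verb': 0.666..., 'cat_3': 0.333...}
```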

We use a one-hot embedding, which substitutes each word in a text by a binary string that consists of all zeros and exactly one 1. The position of the 1 indicates the class of word that we are dealing with. Above, we illustrated several levels of coarse-graining. At a very fundamental level, each word represents a class of its own. Our vocabulary in the simple example sentence "green colorless ideas sleep furiously" consists of

$$\chi_{\text{words}} \equiv \{ \text{ideas}, \text{sleep}, \text{green}, \text{colorless}, \text{furiously} \} \tag{7}$$

which in its binary form becomes

$$\hat{\chi}_{\text{words}} = \{10000, 01000, 00100, 00010, 00001\}.\tag{8}$$

We also illustrated a level of coarse-graining in which nouns and verbs are retained, but all other words are grouped together in a third category (Equation (4)). The corresponding vocabulary

$$\chi \equiv \{ \text{noun}, \text{verb}, \text{cat}_3 \} \tag{9}$$

becomes, through the one-hot embedding:

$$\hat{\chi} = \{100, 010, 001\}.\tag{10}$$
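A minimal sketch of this one-hot embedding for the vocabulary in Equation (9), using NumPy (the helper name `one_hot_embedding` is illustrative):

```python
import numpy as np

def one_hot_embedding(vocabulary):
    """Map each unique symbol to a binary vector with a single 1 at its index."""
    identity = np.eye(len(vocabulary), dtype=int)
    return {symbol: identity[j] for j, symbol in enumerate(vocabulary)}

# The coarse-grained vocabulary of Equation (9).
chi = ["noun", "verb", "cat_3"]
embedding = one_hot_embedding(chi)
print(embedding["verb"])   # [0 1 0], the second vector in Equation (10)
```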

Throughout this paper, we will denote by $\chi^\lambda$ the vocabulary (set of unique symbols) at a description level $\lambda$, and by $\hat{\chi}^\lambda$ its one-hot representation. We will call $c_j^\lambda \in \chi^\lambda$, with $j \in \{1, \ldots, N^\lambda\}$, each of the $N^\lambda$ unique symbols at description level $\lambda$. Each of these symbols stands for an abstract class of words, which might or might not correspond to actual grammatical classes in the standard literature. The binary representation of each class is correspondingly denoted by $\sigma_j^\lambda \in \hat{\chi}^\lambda$.

To explore models of different complexity we start with all the grammatical classes outlined in Table 1 and proceed by lumping categories together. We will elaborate a probabilistic grammar for each level of coarse-graining. Later, we will compare the performance of all descriptions. In lumping grammatical classes together, some choices are more effective than others. For example, it seems wise to group comparative and superlative adverbs earlier than nouns and verbs. We expect the former to behave more similarly than the latter, and therefore to lose less descriptive power when treating both comparative and superlative adverbs as one class. In future versions of this work, we intend to explore arbitrary lumping strategies. Here, to produce results within a less demanding computational framework, we use an informed shortcut. We build the maximum entropy model of the least coarse-grained description (which, again, in this paper consists of the grammatical classes in Table 1). Through some manipulations explained below, this model allows us to extract correlations between a current word and the next one (illustrated in Figure 2). These correlations allow us to build a dendrogram (Figure 3a) based on how similarly different grammatical classes behave, as sketched below.
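A sketch of how such a dendrogram can be obtained with SciPy's hierarchical clustering, assuming a symmetric dissimilarity matrix between classes (the numerical values below are placeholders standing in for distances derived from the pairwise interaction energies, not measured data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Placeholder symmetric dissimilarity matrix between four grammatical classes.
classes = ["NN", "VB", "JJ", "RB"]
dissimilarity = np.array([
    [0.0, 0.9, 0.4, 0.7],
    [0.9, 0.0, 0.8, 0.5],
    [0.4, 0.8, 0.0, 0.6],
    [0.7, 0.5, 0.6, 0.0],
])

# Average-linkage clustering; the merge order suggests which classes to lump first.
Z = linkage(squareform(dissimilarity), method="average")
tree = dendrogram(Z, labels=classes, no_plot=True)
print(tree["ivl"])   # leaf ordering of the classes in the dendrogram
print(Z)             # each row: the two merged clusters, merge distance, new cluster size
```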

**Figure 2.** Interactions between spins and word classes. (**a**) A first, crude model with spins contains more information than we need for the kind of calculations that we wish to perform right now. (**b**) A reduced version of that model gives us an interaction energy between words or classes of words. These potentials capture some non-trivial features of English syntax, e.g., the existential "there" in "there is" or modal verbs (marked E and M, respectively) have a lower interaction energy if they are followed by verbs. Interjections present a fairly large interaction energy with any other word, perhaps as a consequence of their independence within sentences.

**Figure 3.** Pareto optimal maximum entropy models of human language. Among all the models that we try out, we prefer those that are Pareto optimal in energy minimization and entropy maximization. (**a**) These reveal a hierarchy of models in which different word classes group together at different levels. The clustering reveals a series of grammatical classes that belong together owing to the statistical properties of the symbolic dynamics, such as possessives and determiners, which appear close to adjectives. (**b**) A first approximation to the Pareto front of the problem. Future implementations will try out more grammatical classes and produce better-quality Pareto fronts, establishing whether phase transitions or criticality are truly present.

This dendrogram suggests an order in which to merge the different classes, which is just a good guess. There are many reasons why the hierarchy emerging from the dendrogram might not be the best coarse-graining. We will explore more exhaustive possibilities in the future. In any case, this scheme defines a series of functions $\pi^\lambda$ (which play the role of $\pi$ in Figure 1) that map the elements of the most fine-grained vocabulary $\chi^0 \equiv \chi_{\text{grammar}}$ (as defined by the classes in Table 1) into a series of increasingly coarse-grained and abstract categories $\chi^\lambda$, with $\lambda = 1, \ldots, N_G - 1$ indicating how many categories have been merged at that level.
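A sketch of how a sequence of merges defines such maps $\pi^\lambda$, assuming the merge order is read off the dendrogram (the helper `build_projections` and the "+"-joined category labels are illustrative conventions):

```python
def build_projections(fine_classes, merge_order):
    """Build the maps pi^lambda implied by a sequence of pairwise merges.

    merge_order lists the pairs of category labels to lump together, one merge
    per coarse-graining level (here a hypothetical order, e.g., read off the
    dendrogram of Figure 3a).
    """
    projections = []                         # projections[l - 1] plays the role of pi^l
    current = {c: c for c in fine_classes}   # level 0: every class maps to itself
    for a, b in merge_order:
        merged = a + "+" + b                 # label of the newly lumped category
        current = {c: (merged if cat in (a, b) else cat) for c, cat in current.items()}
        projections.append(dict(current))
    return projections

fine = ["NN", "NNS", "VB", "VBD"]
pis = build_projections(fine, [("NN", "NNS"), ("VB", "VBD")])
print(pis[0]["NNS"])   # 'NN+NNS' at level lambda = 1
print(pis[1]["VBD"])   # 'VB+VBD' at level lambda = 2
```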
