#### *2.1. The Corpora*

Parallel corpora are a valuable resource for many NLP tasks and for linguistic studies. Translated documents preserve, to a certain extent, the same meaning and functions across languages, which allows the morphological and typological features of languages to be analyzed and compared.

We used two different parallel corpora that are available for a wide set of languages. On the one hand, we used a portion of the Parallel Bible Corpus [15]; in particular, a subset of 1150 parallel verses that overlap across 47 languages (the selection of languages and the pre-processing of this dataset were part of the Interactive Workshop on Measuring Language Complexity (IWMLC 2019), http://www.christianbentz.de/MLC2019_index.html). These languages are part of the WALS 100-language sample, a selection of typologically diverse languages [16] (https://wals.info/languoid/samples/100).

On the other hand, we used the JW300 parallel corpus, which compiles magazine articles in many languages [17] (these articles were originally obtained from the Jehovah's Witnesses website https://www.jw.org). In this case, we extracted a subset of 68 parallel magazine articles that overlap across 133 languages. Table 1 summarizes general information about the corpora.


**Table 1.** General information about the parallel corpora.

We ran the experiments on both corpora independently. The two parallel corpora share 25 languages. This common set of languages was useful for comparing the complexity rankings obtained with our measures, i.e., for testing whether our complexity measures are consistent across different corpora.

It is important to mention that no sentence alignment was applied to the corpora. The Bible corpus was already aligned at the verse level, while the JW300 corpus was aligned only at the document level. However, for the aim of our experiments, alignment annotation (at the sentence or verse level) was not required.

#### *2.2. Type-Token Relationship (TTR)*

The type-token relationship (TTR) has proven to be a simple, yet effective, way to quantify the morphological complexity of a language using relatively small corpora [14]. It has also shown a high correlation with other types of complexity measures, such as paradigm-based approaches that are based on typological information databases [11].

Morphologically rich languages produce many different word forms (types) in a text; this is what measures such as TTR capture. From a linguistic perspective, Joan Bybee [18] affirms that "the token frequency of certain items in constructions [i.e., words] as well as the range of types [. . . ] determines representation of the construction as well as its productivity".

TTR can be influenced by the size of a text (Heaps' law) or even by the domain of a corpus [19,20]. Some alternatives for making TTR more comparable include normalizing the text size or applying logarithms; however, Covington and McFall [19] argue that these strategies are not fully successful, and they propose the Moving-Average Type-Token Ratio (MATTR) instead. On the other hand, using parallel corpora has been shown to be a simple way to make TTR more comparable across languages [21,22]. In principle, translations preserve the same meaning across languages; therefore, the texts do not need to have exactly the same length in tokens.
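For illustration, a minimal sketch of MATTR follows. It assumes a pre-tokenized input and a fixed window length (500 tokens here, an arbitrary choice for this sketch); it is an illustration of the idea, not the implementation of [19]:

```python
def mattr(tokens: list[str], window: int = 500) -> float:
    """Moving-Average Type-Token Ratio: the mean TTR over all
    contiguous windows of fixed length, which removes the
    dependence of plain TTR on total text length."""
    if len(tokens) <= window:
        # Text shorter than one window: fall back to plain TTR.
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)
```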

We calculated the TTR for a corpus by simply using Equation (1), where #*types* is the number of different word types in the corpus (vocabulary size) and #*tokens* is the total number of word tokens in the corpus. Values closer to 1 would represent greater complexity. This simple way of measuring TTR, without any normalization, has been used in similar works [11,22,23].

$$\text{TTR} = \frac{\text{\#types}}{\text{\#tokens}} \tag{1}$$

We use this measure as a simple way to approach the e-complexity dimension; i.e., different morphosyntactic distinctions, and their productivity, could be reflected in the distribution of types and tokens over a corpus.
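As a concrete illustration, the following sketch computes Equation (1) for a raw text. The tokenization (word characters only) and the lowercasing are assumptions made for this sketch; the exact tokenization applied to the corpora is not specified in this section:

```python
import re

def ttr(text: str) -> float:
    """Type-token ratio (Equation (1)): number of distinct word
    forms divided by the total number of word tokens."""
    # Lowercasing and regex tokenization are choices made for
    # this sketch, not necessarily those used on the corpora.
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens)

print(ttr("the cat saw the cats"))  # 4 types / 5 tokens = 0.8
```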

#### *2.3. Entropy Rate of a Sub-Word Language Model*

Entropy, as a measure of unpredictability, is a useful tool for quantifying different linguistic phenomena, in particular the complexity of morphological systems [9,12,24].

Our method aims to reflect the predictability of the internal structure of words in a language. We conjecture that morphological processes that are irregular/suppletive, unproductive, etc., will increase the entropy of a model that predicts the probability of sequences of morphs/sub-word units within a word.

To do this, we estimate a stochastic matrix *P* in which each cell contains the transition probability between two sub-word units in the language (see the toy example in Table 2). These probabilities are estimated using the corpus and a neural language model that we will describe below.

**Table 2.** Toy example of a stochastic matrix using the trigrams contained in the word 'cats'. The symbols # and $ indicate the beginning and end of a word, respectively.


We calculate the stochastic matrix *P* as shown in Equation (2):

$$P = \left[ p_{ij} \right], \qquad p_{ij} = p(w_j \mid w_i) \tag{2}$$

where $w_i$ and $w_j$ are sub-word units. We used a neural probabilistic language model to estimate this probability function.
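To make the construction of *P* concrete, the sketch below estimates the transition probabilities with simple maximum-likelihood counts over overlapping character trigrams, as in the toy example of Table 2. Note that this replaces the neural language model with plain counting, and the choice of character trigrams as the sub-word units is an assumption made for the sketch:

```python
from collections import Counter, defaultdict

def transition_matrix(words: list[str], n: int = 3) -> dict:
    """MLE estimate of p(w_j | w_i) between consecutive overlapping
    character n-grams within a word, with '#' and '$' marking the
    beginning and end of a word as in Table 2."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "#" + word + "$"
        grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
        for u, v in zip(grams, grams[1:]):  # consecutive unit pairs
            counts[u][v] += 1
    # Normalize each row so that the probabilities sum to 1.
    return {u: {v: c / sum(row.values()) for v, c in row.items()}
            for u, row in counts.items()}

P = transition_matrix(["cats"])
print(P["#ca"])  # {'cat': 1.0}
```

For the single word 'cats' every observed unit has exactly one successor, so all estimated probabilities are 1; over a full corpus each row of *P* becomes a distribution over many possible successors. If *P* is treated as the transition matrix of a stationary Markov chain with stationary distribution $\mu$, the textbook entropy rate $-\sum_i \mu_i \sum_j p_{ij} \log p_{ij}$ quantifies the kind of predictability discussed above, although the estimator actually used in this work relies on the neural model described below.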
