**3. Results**

We applied the measures to each language contained in the JW300 and Bibles corpora. We use the notation H1 and H3 for the entropy rates calculated with unigrams and trigrams, respectively; TTR denotes the type-token relationship.

To combine the different complexity dimensions, we ranked the languages according to each measure and then averaged the obtained ranks for each language. Since we ranked the languages from the most complex to the least complex, we took the inverse of the average rank so that the combined score is consistent with the individual complexity measures. The notation for these combined rankings is the following: TTR+H1 (TTR rank averaged with H1 rank); TTR+H3 (TTR rank averaged with H3 rank); TTR+H1+H3 (TTR rank averaged with H1 and H3 ranks). In all cases the scale goes from 0 to 1 (0 for the least complex and 1 for the most complex).
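As an illustration, this rank-averaging scheme can be sketched as follows. The measure values are toy placeholders, and the final normalization onto [0, 1] is one plausible reading of the inverse-average step, not necessarily the exact formula we used.

```python
from scipy.stats import rankdata

def combine(measure_values, langs):
    """measure_values: list of dicts mapping language -> raw measure value."""
    n = len(langs)
    avg_rank = {lang: 0.0 for lang in langs}
    for values in measure_values:
        # Rank 1 = most complex under this measure, n = least complex.
        ranks = rankdata([-values[lang] for lang in langs])
        for lang, r in zip(langs, ranks):
            avg_rank[lang] += r / len(measure_values)
    # Map average ranks onto [0, 1]: rank 1 -> 1 (most complex), rank n -> 0.
    return {lang: (n - avg_rank[lang]) / (n - 1) for lang in langs}

ttr = {"eng": 0.010, "fin": 0.062, "kor": 0.051}  # toy values
h3 = {"eng": 2.10, "fin": 1.70, "kor": 2.40}      # toy values
print(combine([ttr, h3], ["eng", "fin", "kor"]))  # a TTR+H3-style score
```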

Tables 3 and 4 contain the measures described above for each corpus. These tables show only the set of 25 languages shared between the two corpora. In Figures 2 and 3 we plot these different complexities and their combinations. The complete lists of languages and results are included in Appendices A and B.

**Table 3.** Complexity measures on the Bibles corpus (H1: unigrams entropy; H3: trigrams entropy; TTR: Type-token relationship); bold numbers indicate the highest and the lowest values for each measure, and ranks are given in brackets.



**Table 4.** Complexity measures on the JW300 corpus (H1: unigrams entropy; H3: trigrams entropy; TTR: Type-token relationship); bold numbers indicate the highest and the lowest values for each measure, and ranks are given in brackets.

**Figure 2.** Different complexity measures (**above**) and their combinations (**below**) from the Bibles corpus.

**Figure 3.** Different complexity measures (**above**) and their combinations (**below**) from the JW300 corpus.

We can see that a language can be complex under one measure but simpler under another. For instance, in Figures 2 and 3, we can easily notice that Korean is the most complex language if we only take into account the entropy rate using trigrams (H3). However, this entropy drops dramatically when using unigrams (H1); therefore, when we combine the different measures, Korean is no longer the most complex language.

There are cases such as English whose TTR is one of the lowest. This is expected, since English is a language with poor inflectional morphology. However, its entropy is high. This suggests that a language such as English, usually not considered morphologically complex, may have many irregular forms that are not easy for our model to predict.

We can also find the opposite case, where a language has a high TTR but low entropy, suggesting that it may produce many different word forms while the inner structure of those words is "easy" to predict. This trend can be observed in languages such as Finnish (high TTR, low H3), Korean (high TTR, low H1) or Swahili (high TTR, low H3).

The fact that a language has a low TTR value does not necessarily imply that its entropy rate is high (or vice versa). For instance, languages such as Vietnamese or Malagasy (Plateau) have some of the lowest entropy values (H1, H3); however, their TTR values are not among the highest in the shared subset. In this sense, these languages seem to have low complexity in both dimensions.

Burmese constitutes a peculiar case: it behaves differently in the two corpora. Its complexity seems very high in all dimensions (TTR and entropies), but only in the Bibles corpus. We conjecture that the TTR is oddly high due to tokenization issues [32]: Burmese is a language without explicit word-boundary delimiters, so if words are not well segmented, the text will contain many different long words with few repetitions (high TTR). The tokenization pre-processing of the Bibles was based only on whitespace and punctuation marks, while JW300 had a more sophisticated tokenization. In the latter, Burmese obtained a very low TTR and H1 entropy.
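A toy sketch makes this effect concrete. Under a naive whitespace-and-punctuation tokenizer (similar in spirit to the pre-processing applied to the Bibles), text without word delimiters collapses into long, mostly unique tokens, and the TTR approaches 1. The strings below are invented English stand-ins, not Burmese data:

```python
import re

def ttr(text):
    """Type-token relationship under a naive tokenizer that splits on
    whitespace and punctuation."""
    tokens = re.findall(r"\w+", text)
    return len(set(tokens)) / len(tokens)

segmented = "the cat sat on the mat . the cat slept ."
unsegmented = "thecatsatonthemat . thecatslept ."  # no word delimiters

print(ttr(segmented))    # repeated words lower the TTR (~0.67)
print(ttr(unsegmented))  # every "word" is unique -> TTR = 1.0
```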

Cases with high complexity in both dimensions were less common. Arabic was perhaps the language that tended to be highly complex under both criteria (TTR and entropy), and this behavior remained the same across the two corpora. We conjecture that this is related to the root-and-pattern morphology of the language, i.e., these types of patterns were difficult to predict for our sequential character n-gram language model. We discuss this further in Section 4.

#### *3.1. Correlation across Corpora*

Since our set of measures was applied to two different parallel corpora, we wanted to check whether the complexity measures were more or less independent of the corpus used, i.e., whether languages get similar complexity ranks in the two corpora.

We used Spearman's correlation [33] for the subset of shared languages across corpora. Table 5 shows the correlation coefficient for each complexity measure between the two corpora. Burmese was excluded from the correlations due to the tokenization problems described above.
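This check can be sketched with scipy's implementation of Spearman's correlation; the per-language values below are toy placeholders, not our actual results:

```python
from scipy.stats import spearmanr

# Toy TTR values for a few shared languages in each corpus.
ttr_bibles = {"eng": 0.010, "fin": 0.062, "ara": 0.055, "vie": 0.008}
ttr_jw300 = {"eng": 0.006, "fin": 0.048, "ara": 0.041, "vie": 0.005}

shared = sorted(set(ttr_bibles) & set(ttr_jw300))
rho, pval = spearmanr([ttr_bibles[l] for l in shared],
                      [ttr_jw300[l] for l in shared])
print(f"Spearman rho = {rho:.3f} (p = {pval:.3f})")
```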

**Table 5.** Correlation of complexities between the JW300 and Bibles corpora (H1: unigrams entropy; H3: trigrams entropy; TTR: Type-token relationship).


Although the Bibles and the JW300 corpora belong to the same domain (religion), they differ greatly in size and in the topics covered (they are also parallel at different levels). Despite this, all the measures were positively correlated. The weakest correlation was obtained with H1, while complexity measures such as TTR or TTR+H3 were strongly correlated across corpora.

The fact that the complexity measures are correlated between the two corpora suggests that they are not very dependent on corpus size, topics and other types of variation.

#### *3.2. Correlation between Complexity Measures*

In addition to the correlation across different corpora, we were interested in how the different complexity measures correlate with each other (within the same corpus). Tables 6 and 7 show the Spearman's correlations between measures in each corpus.
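Analogously to the previous sketch, matrices like those in Tables 6 and 7 amount to pairwise Spearman correlations between per-language measure columns, which can be computed with pandas (again with toy values):

```python
import pandas as pd

# Toy per-language values for three of the measures in one corpus.
df = pd.DataFrame({
    "H1": [4.1, 4.4, 3.9, 4.6],
    "H3": [2.3, 2.1, 2.6, 2.2],
    "TTR": [0.01, 0.06, 0.05, 0.03],
}, index=["eng", "fin", "kor", "tur"])

print(df.corr(method="spearman"))  # pairwise correlation matrix
```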

**Table 6.** Spearman's correlations between measures in the corpus JW300 (all languages considered) (H1: unigrams entropy; H3: trigrams entropy; TTR: Type-token relationship).


**Table 7.** Spearman's correlations between measures in the Bibles corpus (all languages considered) (H1: unigrams entropy; H3: trigrams entropy; TTR: Type-token relationship).


In both corpora, the entropy-based measures (especially H3) were poorly correlated (or not correlated) with the type-token relationship TTR. If these two types of measures are in fact capturing two different dimensions of morphological complexity, then it is to be expected that they are not correlated.

The combined measures (TTR+H1, TTR+H3 and TTR+H1+H3) tend to be strongly correlated with each other. It seems that all of them combine, to some extent, the two dimensions of complexity (productivity and predictability).

Surprisingly, the entropy-based measures (H1 and H3) are weakly correlated with each other, despite both trying to capture predictability. We conjecture that this could be related to the fact that for some languages it is more suitable to apply a trigram model, and for others a unigram model. For instance, in the case of Korean, one character is equivalent to a whole syllable (syllabic writing system). When we took combinations of three characters (trigrams), the model became very complex (high H3), which does not necessarily reflect the real complexity. On the other hand, languages such as Turkish, Finnish or Yaqui (see Appendix B) obtained a very high value of H1 (difficult to predict using only unigrams, very long words), but with trigrams the entropy H3 decreases; trigram models may be more appropriate for these types of languages.
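The unigram/trigram contrast can be made concrete with a small sketch. The estimator below is a generic add-one-smoothed character n-gram cross-entropy computed in-sample on a toy word list; it is not the exact estimator defined in Section 2, and the padding and smoothing choices are illustrative assumptions:

```python
import math
from collections import Counter

def entropy_rate(words, n):
    """Per-character cross-entropy (bits/char) of `words` under an
    add-one-smoothed character n-gram model trained on the same words."""
    pad = "#" * (n - 1)  # word-boundary padding symbols
    ngrams, contexts, alphabet = Counter(), Counter(), set()
    for w in words:
        s = pad + w
        alphabet.update(w)
        for i in range(n - 1, len(s)):
            ngrams[s[i - n + 1:i + 1]] += 1
            contexts[s[i - n + 1:i]] += 1
    V = len(alphabet) + 1  # alphabet size plus the padding symbol
    logprob, total = 0.0, 0
    for w in words:
        s = pad + w
        for i in range(n - 1, len(s)):
            p = (ngrams[s[i - n + 1:i + 1]] + 1) / (contexts[s[i - n + 1:i]] + V)
            logprob += math.log2(p)
            total += 1
    return -logprob / total

words = ["talo", "talossa", "taloissa", "talot"]  # toy word list
print(entropy_rate(words, 1), entropy_rate(words, 3))  # H1-like vs. H3-like
```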

#### *3.3. Correlation with Paradigm-Based Approaches*

Finally, we compared our corpus-based morphological complexity measures against two paradigm-based measures. First, we used the *CWALS* measure proposed by [11]; it is based on 28 morphological features/chapters extracted from the linguistic database WALS [16]. This measure maps each morphological feature to a numerical value, and the complexity of a language is the average of the values of its morphological features.
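As a toy sketch of this averaging idea (the actual 28 features and their numerical mapping are those defined in [11]; the feature names and values below are invented for illustration):

```python
# Hypothetical normalized values for a handful of WALS-style features.
wals_features = {
    "prefixing_vs_suffixing": 0.8,
    "case_syncretism": 0.5,
    "inflectional_synthesis": 1.0,
}

c_wals = sum(wals_features.values()) / len(wals_features)
print(c_wals)  # the language's CWALS-style complexity score
```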

The measure *CWALS* was originally applied to 34 typologically diverse languages. However, we only took 19 languages (the set of languages shared with our Bibles corpus). We calculated the correlation between our complexity measures and *CWALS* (Table 8).

In addition, we included the morphological counting complexity (*MCC*) as implemented by [34]. This metric counts the number of inflectional categories for each language; the categories are obtained from the annotated lexicon UniMorph [35].

This measure was originally applied to 21 languages (mainly Indo-European); we calculated the correlation between *MCC* and our complexity measures using the JW300 corpus, which contains all of those 21 languages (Table 8).
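A rough approximation of such a count, assuming the standard three-column UniMorph file format (lemma, inflected form, feature bundle) and treating each distinct feature bundle as one inflectional category, could look as follows; the reference implementation remains that of [34]:

```python
def mcc(unimorph_path):
    """Count distinct morphosyntactic feature bundles in a UniMorph file."""
    categories = set()
    with open(unimorph_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 3:            # lemma, form, feature bundle
                categories.add(cols[2])   # e.g., "N;ACC;PL"
    return len(categories)

# Example (hypothetical path to a downloaded UniMorph lexicon):
# print(mcc("fin"))
```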

Appendices C and D contain the list of languages used for each measure and the complexities.

**Table 8.** Correlation between our complexity measures and the paradigm-based measures *CWALS* and *MCC* (H1: unigrams entropy; H3: trigrams entropy; TTR: Type-token relationship).

|         | H1    | H3     | TTR   | TTR+H1 | TTR+H3 | TTR+H1+H3 |
|---------|-------|--------|-------|--------|--------|-----------|
| *CWALS* | 0.322 | −0.392 | 0.882 | 0.730  | 0.395  | 0.406     |
| *MCC*   | 0.024 | 0.064  | 0.851 | 0.442  | 0.585  | 0.366     |


*CWALS* and TTR are strongly correlated; this was already pointed out by [11]. However, our entropy-based measures are weakly correlated with *CWALS*, so it seems that they are capturing different things. The *MCC* metric shows a similar behavior: it is highly correlated with TTR but not with H1 (unigrams entropy) or H3 (trigrams entropy).

It has been suggested that databases such as WALS, which provide paradigmatic distinctions of languages, reflect mainly the e-complexity dimension [2]. This could explain the high correlation between *CWALS*, *MCC*, and measures such as TTR. However, the i-complexity may be better captured by other types of approaches, e.g., the entropy rate measure that we have proposed.

The weak correlation between our entropy-based measures and *CWALS* (even a negative correlation in the case of H3) could be a hint of a possible trade-off between i-complexity and e-complexity. However, further investigation is needed.
