#### 2.3.1. Sub-Word Units

Regarding sub-word units, an initial thought would be to use character sequences that correspond to the linguistic notion of morphemes/morphs. However, it would be difficult to perform morphological segmentation for all the languages in the corpora. There are unsupervised morphological segmentation approaches, e.g., Morfessor [25] and BPE [26], but they still require parameter tuning to control over-segmentation and under-segmentation, which makes them not completely language independent.

Instead, we focused on fixed-length sequences of characters (n-grams), which are more easily applicable to all the languages in the corpora. This decision is also supported by evidence that trigrams encode morphological properties of the word [27]. Moreover, in some tasks, such as language modeling, the use of character trigrams seems to lead to better word vector representations than unsupervised morphological segmentations [28].

Therefore, we trained the language models using character trigrams. We also took unigram (single-character) sequences into account, since the datasets include languages with syllabic writing systems, in which a single character can encode a whole syllable.
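The n-gram extraction itself is straightforward. The following is a minimal sketch (the function name and the toy word are illustrative and not taken from the released code):

```python
# Minimal sketch: splitting a word into fixed-length character n-grams.
def char_ngrams(word, n):
    """Return the sequence of character n-grams of `word` (n=1 gives unigrams)."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("palabra", 3))  # ['pal', 'ala', 'lab', 'abr', 'bra']
print(char_ngrams("palabra", 1))  # ['p', 'a', 'l', 'a', 'b', 'r', 'a']
```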

#### 2.3.2. Neural Language Model

Our model was estimated using a feedforward neural network; the network is trained with pairs of consecutive n-grams that appear within the same word. Once the network is trained, we can retrieve from the output layer the probability *pij* for any pair of n-grams. This architecture is based on [29]; however, we used character n-grams instead of words. The network comprises the following layers: (1) an input layer of one-hot vectors representing the n-grams; (2) an embedding layer; (3) a hyperbolic tangent hidden layer; and (4) an output layer that contains the conditional probabilities obtained by a softmax function, defined by Equation (3).

$$p\_{ij} = \frac{e^{a\_{ij}}}{\sum\_{k} e^{a\_{ik}}} \tag{3}$$

The factor *aij* in Equation (3) is the *j*th output of the network when the n-gram *wi* is the input. The architecture of the network is presented in Figure 1.

**Figure 1.** Neural probabilistic language model architecture; *wi* and *wj* are n-grams.
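The following PyTorch sketch illustrates the architecture described above (embedding layer, hyperbolic tangent hidden layer, softmax output) trained on pairs of consecutive n-grams. The layer sizes, optimizer, and toy data are assumptions for illustration, not the authors' exact configuration:

```python
# Sketch of the feedforward n-gram model; hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class NGramLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # embedding layer (equivalent to one-hot input times a weight matrix)
        self.hidden = nn.Linear(emb_dim, hidden_dim)    # hyperbolic tangent hidden layer
        self.output = nn.Linear(hidden_dim, vocab_size) # produces the activations a_ij

    def forward(self, ngram_ids):
        h = torch.tanh(self.hidden(self.embed(ngram_ids)))
        return self.output(h)  # logits; a softmax over them gives p_ij as in Equation (3)

# Training on pairs (w_i, w_j) of consecutive n-grams within the same word.
# `pairs` would be built from the corpus; here it is a toy placeholder.
vocab_size = 5
pairs = torch.tensor([[0, 1], [1, 2], [2, 3]])  # columns: input n-gram id, next n-gram id
model = NGramLM(vocab_size)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # cross-entropy over the softmax output

for _ in range(100):
    logits = model(pairs[:, 0])
    loss = loss_fn(logits, pairs[:, 1])
    optim.zero_grad()
    loss.backward()
    optim.step()

# After training, row i of the stochastic matrix P is softmax(model(i)):
P = torch.softmax(model(torch.arange(vocab_size)), dim=1)
```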

Once the neural network is trained, we can build the stochastic matrix *P* from the probabilities obtained for all the pairs of n-grams. We then determine the entropy rate of *P* using Equation (4) [30]:

$$H(P) = -\sum\_{i=1}^{N} \mu\_i \sum\_{j=1}^{N} p\_{ij} \log p\_{ij} \tag{4}$$

where *pij* are the entries of the matrix *P*, *N* is the size of the n-grams vocabulary, and *μ* represents the stationary distribution. This stationary distribution can be obtained using Equation (5), for each *i* = 1, . . . , *N*:

$$\mu\_i = \frac{1}{N} \sum\_{k=1}^{N} p\_{ik} \tag{5}$$

This equation defines a uniform distribution: since *P* is row stochastic, each row sums to one and Equation (5) yields *μi* = 1/*N*. We selected a uniform distribution because we observed that the stationary distribution, commonly defined by *μP* = *μ*, was uniform for several small test corpora. Since the softmax of the neural probabilistic model assigns strictly positive probabilities, the matrix *P* is guaranteed to be irreducible; we assume that this irreducibility is what determines the uniform stationary distribution (see [31]). To normalize the entropy, we use the logarithm base *N*; thus, *H*(*P*) can take values from 0 to 1. A value close to 1 represents higher uncertainty in the sequence of n-grams within the words of a language, i.e., less predictability in its word-formation processes.
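A short numerical sketch of the normalized entropy rate of Equations (4) and (5) is shown below; the matrix *P* is a toy example and the variable names are illustrative:

```python
# Sketch: normalized entropy rate of a row-stochastic matrix P (Equations (4)-(5)).
import numpy as np

def entropy_rate(P):
    """P: N x N row-stochastic matrix of n-gram transition probabilities (all entries > 0)."""
    N = P.shape[0]
    mu = np.full(N, 1.0 / N)                  # uniform stationary distribution (Equation (5))
    H = -np.sum(mu[:, None] * P * np.log(P))  # Equation (4), natural log
    return H / np.log(N)                      # change to base N so that 0 <= H(P) <= 1

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
print(entropy_rate(P))
```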

The overall procedure can be summarized in the following steps (the code is available at http://github.com/elotlmx/complexity-model):

1. For a given corpus, divide every word into its character n-grams. A vocabulary of size *N* (the number of n-grams) is obtained.

