*Article* **Innovatively Fused Deep Learning with Limited Noisy Data for Evaluating Translations from Poor into Rich Morphology**

**Despoina Mouratidis 1,\*, Katia Lida Kermanidis 1 and Vilelmini Sosoni 2**


**Abstract:** The evaluation of machine translation (MT) into morphologically rich languages has not been well studied despite its importance. This paper proposes a classifier, that is, a deep learning (DL) schema for MT evaluation, which combines different categories of information (linguistic features, natural language processing (NLP) metrics and embeddings) and learns from noisy and small datasets. The linguistic features are string-based for the language pairs English (EN)–Greek (EL) and EN–Italian (IT). The paper also explores the linguistic differences that affect evaluation accuracy across different kinds of corpora. A comparative study between using a simple (mathematically calculated) embedding layer and pre-trained embeddings is conducted. Moreover, an analysis of the impact of feature selection and dimensionality reduction on classification accuracy is carried out. Results show that using a neural network (NN) model with different input representations produces results that clearly outperform the state-of-the-art for MT evaluation for EN–EL and EN–IT, with an increase of almost 0.40 points in correlation with human judgments on pairwise MT evaluation. The proposed algorithm achieved better results on noisy and small datasets. In addition, to give a more complete picture of the accuracy results, a qualitative linguistic analysis was carried out in order to address complex linguistic phenomena.

**Citation:** Mouratidis, D.; Kermanidis, K.L.; Sosoni, V. Innovatively Fused Deep Learning with Limited Noisy Data for Evaluating Translations from Poor into Rich Morphology. *Appl. Sci.* **2021**, *11*, 639. https://doi.org/10.3390/app11020639

Received: 18 December 2020 Accepted: 4 January 2021 Published: 11 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Keywords:** machine learning; deep learning; machine translation; pairwise evaluation; educational data; small datasets; noisy datasets

### **1. Introduction**

Machine translation (MT) applications have nowadays infiltrated almost every aspect of everyday activities. For the development of efficient MT solutions, reliable automated evaluation schemata are required. Over the past few years, neural network (NN) models have improved the state-of-the-art of different natural language processing (NLP) applications [1], such as language modeling [2,3], answer ranking in community question answering [4], translation modeling [5–7], and the evaluation of machine translation output [4,8,9]. Embeddings are a powerful way of representing text, provided that they are able to capture the linguistic identity (morphosyntactic and semantic profile) of a sentence/word. In 2013, Mikolov et al. [3] released the word2vec library, which quickly became the dominant approach for vectorizing textual data. Traditional, well-studied NLP approaches, such as latent semantic indexing (LSI) and vector representations using term frequency–inverse document frequency (TF-IDF) weighting, have been tested against word embeddings and, in most cases, the embeddings have come out on top. Since then, the research focus has shifted towards embedding approaches.

The present study aims to find out how embeddings, obtained through various means and fused with different kinds of information, affect classification accuracy on a small and noisy dataset, when used to train a model to choose the best translation output. The target languages (in contrast to the source language) are morphologically rich, as the proposed schema is applied to the English–Greek (EN–EL) and English–Italian (EN–IT) language pairs. Greek and Italian have rich inflectional morphology: nouns carry different grammatical morphemes for the genders, and verbs carry different grammatical morphemes for the two numbers and for the first, second and third person. In particular, the proposed NN learning schema is set up to test:


Further innovative aspects of the present work include:


The rest of the paper is organized as follows: Section 2 presents the related work in the addressed scientific area. Section 3 describes the datasets (corpora), the feature set used, the learning framework and the network settings. Section 4 describes further experimental details and the results of the classification process. Finally, Section 5 presents the paper's conclusions and directions for future research.

### **2. Related Work**

Some of the most popular methods in automatic MT evaluation rely on score-based metrics. These metrics include (i) metrics based on n-gram counts, such as Bilingual Evaluation Understudy (BLEU) [11] and National Institute of Standards and Technology (NIST) [12], or on the edit distance, like Word Error Rate (WER) [13], (ii) metrics using external resources, like WordNet and paraphrase databases, such as METEOR [14] and Translation Error Rate (TER) [15], (iii) metrics based on lexical or syntactic similarity (involving higher-level information, such as part-of-speech (POS) tags) between the MT outputs and the reference, and (iv) neural metrics such as ReVal [8] and Regressor Using Sentence Embeddings (RUSE) [16], which directly learn embeddings for the entire translation and reference sentences using long short-term memory (LSTM) networks and pre-trained sentence representations.

Several research approaches on text classification, system ranking and selection techniques have been proposed using machine learning schemata. Guzmán et al. [4] focus on a ranking approach based on predicting BLEU scores. Duh [17] decomposes rankings into parallel decisions, predicting the best translation for each candidate pair using a ranking-specific feature set and BLEU score information. The framework involves a Support Vector Machine (SVM) classifier. A similar pairwise ranking approach was proposed by Mouratidis and Kermanidis [9], using a random forest (RF) classifier.

Neural networks are also used in frameworks in the literature. Recurrent neural networks (RNN) and long short-term memory (LSTM) networks [18], which are widely popular for learning sentence representations, have been taken up in a variety of NLP tasks [6,7]. Cho et al. [7] proposed a score-based scheme to learn the translation probability of a source phrase to a target phrase (MT output) with an RNN encoder-decoder, and showed that this learning scheme improves translation performance. The scheme proposed by Sutskever et al. [19] is similar to the work of Cho et al. [7], but Sutskever et al. [19] rescored the top 1000 candidate translations produced by a Statistical Machine Translation (SMT) system with a 4-layer LSTM sequence-to-sequence model. LSTM networks are also widely adopted in MT evaluation [8]. LSTM memory units incorporate gates to control the information flow, so they can preserve information for long periods of time. Wu et al. [20] trained a deep LSTM network to optimize BLEU scores when translating from English to German and English to French, but found that the improvement in BLEU scores did not reflect the human evaluation of translation quality. Mouratidis et al. [21] used LSTM layers in a learning framework for evaluating pairwise MT outputs using vector representations, in order to show that the linguistic features of the source text can affect MT evaluation.

Convolutional neural networks (CNN) are less common for sequence-to-sequence modeling, despite several advantages [22]. Compared to RNNs, CNNs create representations for fixed-size contexts and do not depend on the computations of the previous time step, because they do not maintain a hidden state. Gehring et al. [23] proposed an architecture for sequence-to-sequence modeling based on CNNs. The model is equipped with linear units [24] and residual connections [25]. They also used attention in every decoder layer and demonstrated that each attention layer adds only a very small amount of overhead. Vaswani et al. [26] proposed a self-attention-based model and dispensed with convolutions and recurrence entirely. Bradbury et al. [27] introduced recurrent pooling between a succession of convolutional layers, while Kalchbrenner et al. [28] studied neural translation without attention.

However, little attention has been paid to their direct applicability to languages with rich morphology. The present work focuses on the automatic evaluation of translation into morphologically rich languages (Greek and Italian). The aim of this work is to identify the input information that is most effective for feeding a learning schema. Input information is investigated according to certain criteria, that is, the different means of calculating embeddings, the features of varying levels of linguistic information, and the different dataset genres.

### **3. Materials and Methods**

This section describes the dataset, the linguistic features and the NN architecture used in the experiments.

### *3.1. Dataset*

In these experiments, two different types of parallel corpora in the two language pairs (EN-EL and EN-IT) are used. The first dataset (*C1*) consists of the test sets developed in the TraMOOC project [29]. It is a small and noisy dataset, as it comprises educational video lecture subtitles, lecture presentation slides and assignments, and it contains mathematical expressions, spoken language features, fillers, repetitions, and many special characters, such as /, @. The second, formal dataset (*C2*) consists of parallel corpora from European Union legal documents, found on EUR-Lex, the online gateway to European Union Law, under the category "Consolidated texts". The chosen sentences are from Directives, Decisions, Implementing Decisions, Regulations and Implementing Regulations of the European Council and the European Commission, on the following issues: general, financial and institutional matters, competition and development policy, energy, agriculture, economic, monetary and commercial policy, taxation, social policy and transport policy. As pointed out, *C1* is not a well-structured corpus, as it contains linguistic phenomena which are unorthodox and ungrammatical, like misspellings, repetitions, fillers, disfluencies, spoken language features and so forth. On the other hand, *C2* is formal language text. For the *C1* corpus it was necessary to perform data pre-processing, that is, removal of special symbols (@, /) and alignment corrections. For the *C2* corpus no pre-processing was required. Two MT outputs were used: one generated by SMT models, that is, the Moses toolkit [30] for *C1* and Google Translate [31] for *C2*, and the second generated by Neural Machine Translation (NMT) models, that is, the Nematus toolkit [32] for *C1* and Google Translate for *C2*. The Moses and Nematus prototypes are trained on both in-domain and out-of-domain data. Nematus is trained on additional in-domain data provided via crowdsourcing, and also includes layer normalization and improved domain adaptation. In-domain data included data from TED, Coursera and so forth [33]. Out-of-domain data included data from Europarl, OPUS, the WMT News corpora and so forth. The Google Translate prototype was trained on over 25 billion examples. More details about the corpora are presented in Table 1.

**Table 1.** Corpora details on the two machine translation (MT) outputs (*S1* for the Statistical Machine Translation (SMT) output and *S2* for the Neural Machine Translation (NMT) output), the source sentences (*SSE*) and the reference translations (*Sr*).


### *3.2. Features*

The employed feature set is divided into two categories: one consisting of handcrafted string-based features from the MT outputs, *SSE* and *Sr*, and the other consisting of commonly used NLP Metrics. The first category contains (i) simple features (e.g., distances like Levenshtein [34], longest word for *S1*, *S2*, *Sr*, *SSE*, features using the Length Factor (LF) [35]), (ii) features identifying the noise in the corpus (e.g., repeated words/characters, unusually long words in number of characters), and (iii) features providing linguistic information from the *SSE* in EN (e.g., the length of the *SSE* in number of words and number of characters). The feature set was inspired by the work of References [36,37]. The second category contains the NLP metrics, that is, the BLEU score, METEOR, TER and WER for (*S1*, *S2*), (*S1*, *Sr*), (*S2*, *Sr*). To calculate the BLEU score, an implementation of the BLEU score from the Python Natural Language Toolkit library [38] is adopted. For the calculation of the other three metrics, the code from GitHub [39] is used. The total number of features is 82. A detailed description of the feature set can be found in Reference [21].
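As an illustration of the second feature category, the snippet below scores one MT output against a reference with the NLTK BLEU implementation adopted in the text [38]; the example sentences and the smoothing choice are assumptions for demonstration, not values from the paper.

```python
# Sentence-level BLEU with NLTK, as adopted in the text [38].
# Smoothing (method1) is an assumption; it avoids zero scores on the
# short, noisy segments typical of the C1 corpus.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the department specifies the period".split()
hypothesis = "the department determines the time period".split()

score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```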

In the present work, the employed feature set is extended and two additional novel linguistic feature pairs, which belong to the first category, have been used (thereby increasing the feature dimensions from 82 to 86). These features are similarity-based. The first feature, *cmt*, gives the percentage of identical words between the MT outputs and *Sr*, without taking into account the word order. The second feature, *rmt*, gives the percentage of identical parts of the MT output included in *Sr*; more specifically, this feature shows whether the MT output is a contiguous subsequence of *Sr*. The features are defined in Equations (1) and (2), respectively:

$$c\_{mt} = \frac{|\mathcal{S}\_{mt} \cap \mathcal{S}\_r|}{|\mathcal{S}\_{mt} \cup \mathcal{S}\_r|} \tag{1}$$

$$r\_{mt} = \frac{|\mathcal{S}\_{mt} \cap \mathcal{S}\_r|}{|(\mathcal{S}\_{mt} \cap \mathcal{S}\_r)'|} \, \text{with} \, |(\mathcal{S}\_{mt} \cap \mathcal{S}\_r)'| \neq 0. \tag{2}$$

where *Smt* is one of *S1*, *S2*.

As an example, if

*Sr* = {η (*the*), υπηρεσία (*department*), προσδιορίζει (*specify*), το (*the*), διάστημα (*period*)},

*S1* = {το (*the*), χρονικό (*time*), διάστημα (*period*), η (*the*), υπηρεσία (*department*), καθορίζει (*determines*)},

*S2* = {η (*the*), υπηρεσία (*department*), προσδιορίζει (*specify*), την (*the*), περίοδο (*period*)}, then *cmt* = 0.57, *rmt* = 1.3 for *S1* MT output and *cmt* = 0.43, *rmt* = 0.75 for *S2* MT output.

All feature values were calculated using MATLAB and normalized to the range [0, 1].
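To make Equation (1) concrete, the sketch below recomputes *cmt* for the worked example above; this is a minimal Python illustration, not the authors' MATLAB code, and *rmt* is omitted because the complement operator in Equation (2) is not fully specified here.

```python
# Word-set overlap feature c_mt (Equation (1)): identical words between an
# MT output and the reference S_r, ignoring word order.

def cmt(mt_output: str, reference: str) -> float:
    mt_words, ref_words = set(mt_output.split()), set(reference.split())
    union = mt_words | ref_words
    return len(mt_words & ref_words) / len(union) if union else 0.0

s_r = "η υπηρεσία προσδιορίζει το διάστημα"
s_1 = "το χρονικό διάστημα η υπηρεσία καθορίζει"
s_2 = "η υπηρεσία προσδιορίζει την περίοδο"

print(round(cmt(s_1, s_r), 2))  # 0.57, matching the worked example for S1
print(round(cmt(s_2, s_r), 2))  # 0.43, matching the worked example for S2
```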

### *3.3. Embedding Layers*

Firstly, an embedding layer (mathematically calculated embeddings) is used for the two MT outputs and the *Sr*. The encoding function applied is the one-hot function. The embedding layer size, in number of nodes, is 16. The input dimension of the embedding layer matches the vocabulary of each language, taking into account the most frequent words (500 for EN-EL/700 for EN-IT). The embedding layer used is the one provided by Keras [40]. Secondly, for the pre-trained embeddings, the Greek word vectors of Outsios et al. [41] (evaluated on a Greek version of WordSim353) are adopted. This resource contains 300-dimensional Greek embeddings for 350 K words, trained on content from about 20 M URLs with Greek text and computed in 2018. More details about the number of unique sentences, unigrams, bigrams, trigrams and so forth can be found in Outsios et al. [41]. In this case, the embedding layer utilizes the embedding matrix produced by the embedding\_index dictionary and the word\_index. The embedding layer should be fed with padded sequences of integers; for this purpose, *keras.preprocessing.text.Tokenizer* and *keras.preprocessing.sequence.pad\_sequences* [40] were used. For the pre-trained Italian embeddings, the Wikipedia2Vec tool is used [42]. The size, in number of nodes, of the embedding layer is 300, equal to the dimension of the pre-trained embeddings for both datasets.
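The following sketch shows one way to wire such pre-trained vectors into a Keras embedding layer via a tokenizer and padded sequences; the sequence length, the `sentences` list and the `embedding_index` dictionary are placeholders, not the authors' exact configuration.

```python
# Hedged sketch: pre-trained vectors in a frozen Keras Embedding layer.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

MAX_WORDS, MAX_LEN, EMB_DIM = 500, 40, 300  # 500 most frequent words (EN-EL)
sentences = ["..."]                          # corpus sentences (placeholder)
embedding_index = {}                         # {word: 300-d vector}, e.g., from [41]

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(sentences)
padded = pad_sequences(tokenizer.texts_to_sequences(sentences), maxlen=MAX_LEN)

# Rows of the embedding matrix follow the tokenizer's word_index.
embedding_matrix = np.zeros((MAX_WORDS, EMB_DIM))
for word, i in tokenizer.word_index.items():
    if i < MAX_WORDS and word in embedding_index:
        embedding_matrix[i] = embedding_index[word]

emb_layer = Embedding(MAX_WORDS, EMB_DIM,
                      weights=[embedding_matrix], trainable=False)
```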

### *3.4. NN Architecture*

This study aims to identify the best MT output out of the two provided. Two linguists annotated the sentences with 1 if the NMT output is better than the SMT one and with 0 if the SMT output is better than the NMT one. A lower annotation percentage is observed for the SMT class (EL: 37% for *C1*, 48% for *C2*; IT: 43% for *C1*, 48% for *C2*) compared with the NMT class (EL: 63% for *C1*, 52% for *C2*; IT: 57% for *C1*, 52% for *C2*). A low annotation disagreement rate is observed (*C1*: 5% for EN-EL/6% for EN-IT, *C2*: 3% for EN-EL/5% for EN-IT); for the few diverging answers, the annotators discussed and finally agreed on one common label. The NN model takes as input the tuple (*S1*, *S2*, *Sr*). These sentences are passed to the embedding layer. Two ways of extracting embeddings are applied (described in Section 3.3), producing *EmbS1*, *EmbS2*, *EmbSr*. The *EmbS1*, *EmbS2*, *EmbSr* vectors are concatenated in a pairwise fashion as (*EmbS1*, *EmbS2*), (*EmbS1*, *EmbSr*), (*EmbS2*, *EmbSr*), and they form the input to the similarity-based hidden layers *h12*, *h1r*, *h2r*. As extra inputs, the hidden layers are fed with the matrices *H12[i,j]*, *H1r[i,j]*, *H2r[i,j]* (where *i* is the number of sentences and *j* the number of features), containing the second-category features (NLP set). The hidden layer outputs form the input to the output layer. Moreover, an extra input to the output layer is used: the matrix *A[i,j]*, containing the first-category features (described in Section 3.2). The DL NN schema is shown in Figure 1.
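A schematic sketch of this data flow is given below using the Keras functional API; the sequence length, the per-pair NLP-metric count and the hidden sizes are illustrative assumptions, and the exact layer wiring of Figure 1 may differ.

```python
# Assumption-laden sketch of Figure 1: pairwise-concatenated embeddings
# feed three hidden branches, each fused with its NLP-metric matrix; the
# output layer additionally receives the handcrafted feature matrix A.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Dense, Embedding, Concatenate

SEQ_LEN, VOCAB, EMB_DIM = 40, 500, 16  # assumed sequence length; EN-EL sizes
N_NLP, N_FEATS = 4, 86                 # metrics per pair / handcrafted features

inp_s1, inp_s2, inp_sr = (Input((SEQ_LEN,)) for _ in range(3))
nlp_12, nlp_1r, nlp_2r = (Input((N_NLP,)) for _ in range(3))
feats_a = Input((N_FEATS,))            # first-category feature matrix A

emb = Embedding(VOCAB, EMB_DIM)        # shared simple embedding layer
e1, e2, er = emb(inp_s1), emb(inp_s2), emb(inp_sr)

def hidden_branch(a, b, nlp):
    """Pairwise concatenation -> LSTM -> fusion with NLP metrics."""
    h = LSTM(400)(Concatenate()([a, b]))
    return Concatenate()([h, nlp])

h12, h1r, h2r = (hidden_branch(*t) for t in
                 [(e1, e2, nlp_12), (e1, er, nlp_1r), (e2, er, nlp_2r)])

out = Dense(1, activation="sigmoid")(Concatenate()([h12, h1r, h2r, feats_a]))
model = Model([inp_s1, inp_s2, inp_sr, nlp_12, nlp_1r, nlp_2r, feats_a], out)
```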

The binary classification problem is modeled as a Bernoulli distribution (Equation (3)):

$$Y \sim \text{Bernoulli}(b\_y),\tag{3}$$

where *b<sub>y</sub>* is the sigmoid output *σ*(*w<sup>T</sup>x* + *b*), and *w* and *b* are the network's parameters.

### *3.5. Network Settings*

The network model architecture for the experiments is a classic architecture of RNN networks (2 LSTM layers with 400 hidden units) and feedforward layers (4 Dense layers, that is, 3 layers with 50 hidden units and 1 layer with 400 hidden units). The network is trained using the Adam optimizer [43]. To avoid over-fitting, dropout is applied with a rate of 0.05, the loss function is binary cross-entropy, and the regularization parameter *λ* is set to 10<sup>−3</sup>. 10-fold cross validation (CV) and a 70% percentage split were employed for testing.
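Continuing the architecture sketch above, these settings translate to roughly the following Keras calls; the epoch count is an assumption, and the batch size of 512 anticipates the best-performing value reported in Section 4.1.

```python
# Training settings from Section 3.5 (λ = 1e-3, dropout 0.05, Adam,
# binary cross-entropy); `model`, `inputs`, `labels` come from the
# architecture sketch and the feature pipeline, and are placeholders here.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

reg = l2(1e-3)  # regularization parameter λ = 10^-3
# apply `reg` as kernel_regularizer and Dropout(0.05) inside the layers

model.compile(optimizer=Adam(), loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(inputs, labels, epochs=20, batch_size=512,
          validation_split=0.3)  # the 70% train / 30% test option
```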

### **4. Results**

### *4.1. Performance Evaluation*

In this experiment, (a) we investigate whether the predicted classifications have any correlation with human annotation, (b) we compare the proposed classification mechanism against the baseline classification models for small noisy and formal datasets respectively, (c) we compare two different ways of generating the embedding layer, and (d) we test two different validation methods. Table 2 presents the classification results (Precision and Recall) for the different MT outputs over the two different datasets. The *C1* corpus presents a classification increase for both language pairs (accuracy: 72% for EN-EL/70% for EN-IT), in contrast to the *C2* corpus (accuracy: 68% for EN-EL/65% for EN-IT), even though the *C1* corpus contains a lot of noise. This is probably due to the fact that the *C1* corpus contains more sentences, and also because the *C2* corpus has a richer vocabulary and more formal structure. It is more difficult for the classifier to choose the best MT output in *C2*, because the SMT output is more similar to the NMT output in this corpus. It is also observed that both evaluation metrics chose the NMT model over the SMT one, which is in accordance with the annotators' results. In addition, the aforementioned accuracy results are obtained when the NN uses the simple embedding layer. When the pre-trained embeddings are used, the model does not produce better results (average accuracy of *C1* and *C2*: 66% for EN-EL/65% for EN-IT), since the embeddings are trained on a general-purpose corpus, which is not representative of the input corpora used herein. At this point, it is worth mentioning that the pre-trained embeddings seem to be more effective for the EN-IT pair than for the EN-EL language pair. As far as the different types of corpora are concerned, pre-trained embeddings are more efficient for the *C2* corpus (average accuracy of EN-EL and EN-IT: 66%) than the *C1* corpus (average accuracy of EN-EL and EN-IT: 64%). This is probably due to the fact that the *C2* corpus has a richer vocabulary than the *C1* corpus.

An approach to improve the classification accuracy on a small and noisy dataset is to apply the SMOTE oversampling technique to the training data. Using SMOTE, the sentences of the minority class (SMT) doubled in number, and the total number of sentences reached 3024 for *C1* and 2276 for *C2*. It is important to compare the performance between the 82 and the 86 feature dimensions, with and without the SMOTE filter. When SMOTE is applied, a small accuracy increase is observed with the 82 features (average accuracy of *C1* and *C2*: 68% for EN-EL/67% for EN-IT), and an even higher increase with the 86 features (average accuracy of *C1* and *C2*: 70% for EN-EL/68% for EN-IT). It is interesting that the EN-EL corpora outperformed EN-IT in all the experiments. The results with the use of the two newly suggested features are generally better for both corpora and language pairs.
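A minimal sketch of this oversampling step, assuming the SMOTE implementation from the imbalanced-learn library (the paper does not name the implementation it used); the toy data stands in for the real feature matrix.

```python
# Oversample the minority (SMT) class on the training split only.
import numpy as np
from imblearn.over_sampling import SMOTE

# Toy stand-in for the normalized 86-dimensional feature matrix and labels.
X_train = np.random.rand(200, 86)
y_train = np.array([0] * 60 + [1] * 140)   # imbalanced: SMT (0) is the minority

X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(np.bincount(y_res))                  # classes balanced after SMOTE
```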

**Table 2.** Accuracy performance for the two embedding layer types for the two corpora, English–Greek (EN–EL)/English–Italian (EN–IT).


Firstly, k-fold *CV* was used, which is a reliable method for testing models, and a value of *k* = 10 is very common in the field of machine learning [44] (Table 2). Secondly, part of the data (70%) is kept for training, and the rest (30%) is used for testing (Table 3). Given that both classes are of interest, the symmetric Matthews correlation coefficient (*MCC*) metric [45] (a special case of the *φ* phi coefficient [46]) is used, as it constitutes a good way to describe the relation of *TP* (true positive), *TN* (true negative), *FP* (false positive) and *FN* (false negative) values by a single number. It is defined as follows:

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.\tag{4}$$
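A direct implementation of Equation (4), equivalent to `sklearn.metrics.matthews_corrcoef` on binary labels:

```python
# Matthews correlation coefficient from a binary confusion matrix.
from math import sqrt

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```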

When using 10-fold CV, *C1* outperforms *C2* for both language pairs. When the percentage split method (70% training–30% testing) is used, a small performance improvement is observed for the *C2* corpus. Moreover, *MCC* achieves a higher value for the *C2* corpus when pre-trained embeddings are used.


**Table 3.** Accuracy performance (MCC) under different validation options.

Figure 2 shows the accuracy performance according to training speed and batch size. Increasing the batch size can increase the model's accuracy. As seen above, the training speed decays more quickly for the simple embedding layer compared to the pre-trained embedding layer model. Moreover, the accuracy of the pre-trained embeddings is consistently higher for corpus *C2*. The best performance has been consistently obtained for batch size 512.

It is important to analyze the correlation with human-performed evaluations [47]. In this work, the correlation of the predicted scores with human judgments is reported using Kendall *τ*. Kendall *τ* is a coefficient that measures the agreement between rankings produced by human judgments and rankings produced by the classifier. The WMT'12 (Workshop on Machine Translation) definition of Kendall's *τ* is used, and it is calculated as follows:

$$\tau = \frac{\textit{concordant pairs} - \textit{discordant pairs}}{\textit{total pairs}} \tag{5}$$

where 'concordant pairs' is the number of times the human judgment and the predicted judgment agree in the ranking of any two translations that belong to the same *SSE*, and 'discordant pairs' is the opposite.
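In this pairwise setup, each test instance is already a ranking decision over two candidate translations of the same *SSE*, so Equation (5) reduces to a simple count of agreements and disagreements; a minimal sketch:

```python
# WMT'12-style Kendall τ over pairwise decisions: `human` and `predicted`
# hold one 0/1 preference per (S1, S2) pair of the same source sentence.
def kendall_tau(human: list, predicted: list) -> float:
    concordant = sum(h == p for h, p in zip(human, predicted))
    discordant = len(human) - concordant
    return (concordant - discordant) / len(human) if human else 0.0

print(kendall_tau([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.5
```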

### 4.1.1. Comparison to Related Work

As mentioned earlier, there is limited work on pairwise evaluation based on small and noisy datasets. In order to compare our results with other methods, additional experiments were run to imitate earlier work settings as closely as possible, namely settings (i) based on different classifiers, such as SVM [17] and RF [37], and (ii) based on other evaluation methods, that is, the use of the BLEU score [4,17].

Figure 3 shows the overall Kendall *τ* for the different approaches. The proposed DL schema achieves performance comparable to the models proposed in earlier works. The SVM classifier achieves a strong positive correlation between the two classes for C1\_EN-EL: 0.7, and a moderate positive correlation for C2\_EN-EL: 0.4, C1\_EN-IT: 0.4 and C2\_EN-IT: 0.6, while the RF classifier reaches a moderate positive correlation for the *C1* corpus (0.4 for EN-EL/0.6 for EN-IT) and for the *C2* corpus (0.4 for EN-EL/0.6 for EN-IT). When the BLEU score information is used, the model achieves a moderate positive correlation. Kendall *τ* reaches its highest value when the proposed schema uses the simple embedding layer, the feature set of 86 dimensions, and the NLP set, for both language pairs (EN-EL: 0.7 for *C1*/0.6 for *C2* and EN-IT: 0.6 for *C1*/0.5 for *C2*).

**Figure 2.** Human correlation: simple embedding layer vs. pre-trained embeddings.

**Figure 3.** Accuracy performance (Kendall *τ*) compared with related work.

### 4.1.2. Feature Selection and Dimensionality Reduction

There are many techniques for improving a classifier's performance. Feature selection (FS) and dimensionality reduction (DR) are two commonly used techniques that improve classification accuracy [48]. The main idea behind *FS* is to remove redundant or irrelevant features that are not useful to the classifier [49]. The advantage of *FS* is that no information about the importance of single features is lost. With *DR*, the size of the feature space is decreased without losing vital information [50].

*FS* methods are usually divided into two basic categories: wrappers and filters [51]. Wrapper *FS* methods evaluate multiple models with different subsets of input features and select those features that result in the best-performing model according to a performance metric; the number of candidate subsets grows rapidly as the number of features increases. Filter *FS* methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) the input variables that will be used in the model. Filters are either global or local. Global methods assign a single score to a feature regardless of the number of classes, while local methods assign several scores, as every feature in every class has a score [52]. Global methods typically calculate the score for every feature and then choose the top-*N* features as the feature set, where *N* is usually determined empirically. Local methods are similar but require converting a feature's scores to a single score before choosing the top-*N* features. Wrappers require much more computation time than filters and may work only with a specific classifier [51]. Filters are the most common *FS* method for text classification. Some commonly used *FS* methods are (a) Recursive Feature Elimination with Cross Validation (RFECV), which belongs to the wrapper methods, (b) information gain (IG) [53], which belongs to the global filter methods, and (c) Chi-square (CHI) [54], which belongs to the local filter methods. All these *FS* methods are language-independent and produce better accuracy.

In these experiments, *RFECV* is tested using a Support Vector Machine (SVM) with a linear kernel, and the number of cross validation folds is set to 10. Information gain is often applied to find out how well each single feature *A* separates the given data set *S*, and it is calculated as follows:

$$IG(\mathcal{S}, A) = I(\mathcal{S}) - \sum\_{n \in A} \frac{|\mathcal{S}\_n|}{|\mathcal{S}|} I(\mathcal{S}\_n) \tag{6}$$

where *n* is a value of feature *A* and *Sn* is the set of instances for which *A* has value *n*.

*CHI* is a supervised *FS* method that calculates the correlation of a feature value *n* with the class *m*, and it is calculated as follows:

$$\chi^2 = \sum\_{i=1}^{n} \sum\_{j=1}^{m} \frac{(O\_{ij} - E\_{ij})^2}{E\_{ij}},\tag{7}$$

where *Oij* is the observed frequency and *Eij* is the expected frequency.
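The three *FS* methods map naturally onto scikit-learn utilities; the sketch below is an assumption about tooling (the paper does not name its implementation), with `mutual_info_classif` standing in for IG and `k=20` as an arbitrary top-*N*.

```python
# Feature selection sketch: (a) RFECV wrapper with a linear-kernel SVM and
# 10-fold CV, (b) an IG-like filter, (c) the CHI filter of Equation (7).
import numpy as np
from sklearn.feature_selection import RFECV, SelectKBest, chi2, mutual_info_classif
from sklearn.svm import SVC

X = np.random.rand(200, 86)        # stand-in for the normalized features
y = np.random.randint(0, 2, 200)   # stand-in 0/1 labels

rfecv = RFECV(estimator=SVC(kernel="linear"), cv=10).fit(X, y)
X_ig = SelectKBest(mutual_info_classif, k=20).fit_transform(X, y)
X_chi = SelectKBest(chi2, k=20).fit_transform(X, y)  # needs non-negative X
```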

*DR* refers to algorithms and techniques that create new features as combinations of the old ones [54]. The most widely used *DR* technique is principal component analysis (PCA) [55]. *PCA* is an unsupervised dimensionality reduction technique. *PCA* produces new features from the original ones by converting the high-dimensional space of the original features to a low-dimensional space while preserving linear structure. Dimensionality reduction is accomplished by choosing enough eigenvectors to account for some percentage of the variance in the original data (a default value is 0.95). Attribute noise was filtered by transforming the original data into the *PC* space, eliminating some of the worst eigenvectors, and then transforming back to the original space. The maximum number of attributes to include in the transformed space was set to 5.
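As with the FS sketch, the scikit-learn calls below are an assumed implementation of the two PCA criteria mentioned in the text:

```python
# PCA: keep components explaining 95% of the variance, or cap the
# transformed space at 5 attributes, as described above.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 86)  # stand-in for the normalized feature matrix
X_var = PCA(n_components=0.95).fit_transform(X)  # variance criterion
X_cap = PCA(n_components=5).fit_transform(X)     # at most 5 attributes
```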

Better accuracy results are observed, in general, when a feature selection method is used, in contrast to the whole-feature-set model (Table 4). The accuracy performance increased by 4% for the *C1* corpus for EN-EL and by 3% for EN-IT. It seems that the application of these methods is more efficient for the SMT class in the informal *C1* corpus and for the NMT class in the formal (well-structured) *C2* corpus. More specifically, there is an increase of up to 4% for the SMT class for *C1* and 2% for *C2*, while, for the NMT class, the increase is 2% for both *C1* and *C2*. In addition, the feature selection methods work better for *C1* (an increase of up to 3.5% on average over both language pairs) than for *C2* (an increase of up to 2.5% on average over both language pairs). We conclude that feature selection methods help the noisy corpus more. This is in accordance with the accuracy results of the previous model.


**Table 4.** Feature selection accuracy performance for the two corpora, EN-EL/EN-IT.

Concerning the features, it is verified that, for the proposed model, the most effective features are those containing ratios, features identifying the presence of noise in a segment (for example, the occurrence of repeated characters), and features using linguistic information from the *SSE*. They all seem to be useful for prediction. Also, the new string-based features added in this paper appear to encode valuable information for the model, as they capture the similarity between the MT outputs and the reference translation; they were selected by almost every method. Regarding the *FS* methods, it seems that better accuracy results were produced with *CHI* square and *IG*. Additionally, it is observed that the feature space reduction method (PCA) does not help the accuracy performance regardless of the corpus structure type, since in all experiments the performance was less than or equal to the classifier performance using the whole feature set.

### *4.2. Linguistic Analysis*

In order to have a more comprehensive analysis of the accuracy results, we have carried out a qualitative linguistic analysis as well. In this context, problems have been identified regarding some complex linguistic phenomena for both language pairs (Table 5).

For the first sentence (ID1): (Both NN and the Annotator's choice was *S2*)


word (*the buzz of a bug*) and especially of its genitive case: ζουζουνιού, with some letters missing.


For the second sentence (ID2): (NN chose S1/Annotator's choice was S2)


For the third sentence (ID3): (NN chose *S1*/Annotator's choice was *S2*)

• *S1* incorrectly translated the phrase *will get us accustomed to*, treating the two verbs as independent of each other (θα δώσει *(will give)*, συνηθίσει *(will get used)*), without taking into account that the verb *get* has a metaphorical meaning, *cause something to happen*, and not the literal one, *take*. The verb *get*, in this sentence, forms a multi-word expression with the verb *accustomed* and the preposition *to*, which, as a past participle, depends on the first. *S2* correctly translated the phrase as: θα μας κάνει συνηθισμένους (*it will make us get used*), although it left one word untranslated.

• *S1* incorrectly translated the last part of the sentence (να τις ιδιαιτερότητες *(here the particularities)* (!)), translating the preposition *to* as if it came before an infinitive, without taking into account that this is the second part of *accustomed to . . . and to*. Related to the latter, *S1* incorrectly translated the word after *to*, that is, the possessive adjective *their*, as a definite article in the plural: τις (*the*).

For the fourth sentence (ID4): (Both NN and the Annotator's choice was *S2*)


In conclusion, the NN model chose *S2* in the first sentence, since *S1* faces difficulties with some linguistic phenomena, like homonymy (e.g., the homographs of *bug*), synonymy (e.g., the similar meanings of *fix*) and polysemy as well. In addition, *S1* often fails to address certain grammatical and syntactic phenomena: subject-verb agreement, phrase structure rules, phrasal verb schemata, and so forth. However, the NN model mainly chose *S1* in the second sentence, because *S1* "recognized" difficult grammatical morphemes (like "*kind of*"). *S2* effectively addresses the aforementioned linguistic phenomena and generally "recognizes" the rich morphology of the Greek and Italian languages (e.g., grammatical agreement, different grammatical genders, structure rules), yet in certain cases it misses multi-word expressions and phrasal meanings. Nevertheless, *S1* seems to employ a richer vocabulary (e.g., απαριθμούνται (*enumerate*), κρύπτη (*crypt*), πρόδηλο (*obvious*)) than *S2*. Indeed, *S1* supports different and not-so-common senses for each word and often chooses the one closer to the correct translation, whereas *S2*, without this extended vocabulary, sometimes fails to translate a less common word, or translates it with a nonexistent word (e.g., *cache* and ζουζιού, respectively).


**Table 5.** Linguistic Analysis for EN-EL and EN-IT.

### **5. Conclusions and Future Work**

This paper presented an innovative DL NN architecture for MT evaluation into morphologically rich languages. The architecture is tested on two different types of small corpora, one noisy and one formal, and two different language pairs (EN-EL and EN-IT). The proposed DL schema used linguistic information from the two MT outputs and the *SSE*, as well as the NLP set. Experiments revealed that the results are better when the DL schema utilizes the simple embedding layer rather than the pre-trained embeddings. In addition, the results using the two newly suggested features and the SMOTE filter are generally better. Based on the linguistic analysis, when an MT output "recognized" the grammatical morphemes, the proposed NN model chose it as the best translation. Regarding the validation method, the percentage split gave more balanced results for both corpora, but the 10-fold CV method gave higher accuracy results. The DL schema used many features, so it is important to thoroughly investigate the importance of these features in order to assign them proper weights during NN model training. In this paper, feature selection and dimensionality reduction methods were employed, and they showed that feature selection methods help the noisy corpus more. It is noticed that the proposed algorithm achieved better results on the noisy and small dataset. For further experimentation, it would be quite interesting to explore why all the classifiers led to worse results in terms of evaluation accuracy for EN-IT than for the EN-EL language pair, taking into account that the linguistic features employed are language-independent. Another idea to explore would be the utilization of pre-trained embeddings as an initialization for the embedding layer. Finally, we plan to investigate other morphological schemata that could improve classification performance.

**Author Contributions:** D.M. and K.L.K. conceived of the idea; D.M. designed and performed the experiments, analyzed the results and drafted the initial manuscript; K.L.K. and V.S. revised the final manuscript and supervised. All authors read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Some or all data generated or used during the study are available from the corresponding author by request.

**Acknowledgments:** The authors would like to thank the two Greek and Italian language experts for the annotation.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**

