In this section, we discuss the results based on the BLEU score. First, we present the DSA results, which serve as the basis for the ISA. We then discuss the ISA results and examine the generated text.
4.1. Direct System Approach (DSA)
Table 3 shows the results of the DSA for each language model. Of 30 experiments, 20 showed that translation systems using LM05 produced higher BLEU scores than those using LM03.
Table 3 also shows the BLEU scores for each symmetrization of word alignment. We found that the highest BLEU scores were not always produced by gdfand. For example, Kk–En LM05 tgttosrc obtained a higher BLEU score (3.56) than Kk–En LM05 gdfand, showing that non-standard symmetrization can be an alternative for improving the BLEU scores of the pivot approaches. Our results confirmed the language-specific [8,9,10] and dataset-specific [12] characteristics of the symmetrization of word alignment.
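To make the symmetrization options above concrete, the following minimal Python sketch implements the basic heuristics that combine the two directional GIZA++ alignments: intersection, union, and a simplified grow-diag step (gdfand additionally applies the final-and refinement, which is omitted here; srctotgt and tgttosrc simply keep one directional alignment unchanged). The alignment sets are toy values, not taken from our experiments.

```python
# Word-alignment symmetrization heuristics (simplified sketch).
# fwd holds src->tgt alignment points (i, j); rev holds the reverse
# direction, expressed in the same (src, tgt) coordinates.

def intersect(fwd, rev):
    return fwd & rev  # points both directions agree on

def union(fwd, rev):
    return fwd | rev  # points proposed by either direction

def grow_diag(fwd, rev):
    """Start from the intersection and repeatedly add union points that
    neighbor an existing point (incl. diagonals) and cover a source or
    target word that is still unaligned."""
    aligned = set(fwd & rev)
    candidates = (fwd | rev) - aligned
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]
    added = True
    while added:
        added = False
        for (i, j) in sorted(candidates - aligned):
            src_covered = any(i == a for (a, _) in aligned)
            tgt_covered = any(j == b for (_, b) in aligned)
            if src_covered and tgt_covered:
                continue
            if any((i + di, j + dj) in aligned for di, dj in neighbors):
                aligned.add((i, j))
                added = True
    return aligned

fwd = {(0, 0), (1, 2), (2, 1)}
rev = {(0, 0), (1, 2), (2, 2)}
print(sorted(grow_diag(fwd, rev)))  # [(0, 0), (1, 2), (2, 1)]
```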
We identified the reasons for the different BLEU scores under the same LM, despite using the same automatic word alignment and decoder weights, i.e., GIZA++ and moses.ini, respectively. We compared the phrase translation parameter scores between the two phrase tables with the highest and second-highest BLEU scores under the same LM. Phrase translation parameter scores were computed from the co-occurrence of aligned phrases in the training corpora and then stored in the phrase table along with the phrase pair. The scores consist of the inverse phrase translation probability $\phi(f|e)$, inverse lexical weighting $\mathrm{lex}(f|e)$, direct phrase translation probability $\phi(e|f)$, and direct lexical weighting $\mathrm{lex}(e|f)$. First, we collected 2000 phrase pairs and their phrase translation parameter scores from the phrase table with the highest BLEU score: phrase table 1 (PT1). Subsequently, we collected 12,000 phrase pairs and their phrase translation parameter scores from the phrase table with the second-highest BLEU score: phrase table 2 (PT2). Lastly, we examined which components of the phrase translation parameters of PT1 obtained higher scores than those of PT2 for the same phrase pair, as shown in Table 4. For two language pairs, Kk–Ru and Ru–En, marked by an asterisk (*) in Table 4, we changed PT2 from srctotgt to tgttosrc because we could not otherwise obtain the same phrase pairs. The algorithm for comparing phrase translation parameter scores between two phrase tables is available in our repository [37].
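The full comparison script is available in the repository [37]; as a rough sketch of the procedure, the following Python fragment counts, per parameter, how often PT1 scores higher than PT2 for phrase pairs present in both tables. It assumes Moses-style phrase table rows ("src ||| tgt ||| scores ...") with the four parameters in the order listed above; the file names are placeholders.

```python
# Sketch: count, per parameter, how often PT1's score beats PT2's
# for phrase pairs present in both tables.

PARAMS = ["phi(f|e)", "lex(f|e)", "phi(e|f)", "lex(e|f)"]

def load_scores(path, limit=None):
    table = {}
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f):
            if limit is not None and n >= limit:
                break
            fields = [x.strip() for x in line.split("|||")]
            scores = [float(s) for s in fields[2].split()[:4]]
            table[(fields[0], fields[1])] = scores
    return table

pt1 = load_scores("pt1.phrase-table", limit=2000)   # highest BLEU
pt2 = load_scores("pt2.phrase-table", limit=12000)  # second-highest BLEU

wins, shared = [0, 0, 0, 0], 0
for pair, s1 in pt1.items():
    s2 = pt2.get(pair)
    if s2 is None:
        continue
    shared += 1
    for k in range(4):
        if s1[k] > s2[k]:
            wins[k] += 1

for name, w in zip(PARAMS, wins):
    print(f"{name}: PT1 higher in {w}/{shared} shared phrase pairs")
```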
Table 4 shows that most language pairs obtained higher scores in $\phi(f|e)$ and $\mathrm{lex}(f|e)$, except for two language pairs: Kk–En and Ms–Id. The inverse phrase translation probability $\phi(f|e)$ and inverse lexical weighting $\mathrm{lex}(f|e)$ were obtained from the target–source (trg–src) parallel corpora. The results indicated that target–source (trg–src) parallel corpora more strongly influence the phrase translation parameter scores than source–target (src–trg) parallel corpora. Table 5 shows examples of phrase pairs and their phrase translation parameter scores from the two phrase tables.
4.2. Interpolation System Approach (ISA)
In the ISA, we constructed two subsystems: Std-ISA and H-ISA. Std-ISA is our interpolation system that uses gdfand, whereas H-ISA uses the symmetrization of word alignment that obtained the highest BLEU score. The choices of symmetrization of word alignment for H-ISA are shown in Table 6. Taking Kk–En LM05 H-ISA as an example, we employed tgttosrc in Kk–En as src–trg, whereas we used gdfand in Kk–Ru as src–pvt and in Ru–En as pvt–trg.
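The interpolation itself can be realized in several ways; as one plausible formulation (in the style of the tmcombine tool shipped with Moses, and not necessarily the exact weighting used here), the following sketch linearly interpolates the four parameter scores of a direct and a pivot-derived phrase table with a hypothetical weight lam.

```python
# Sketch: linear interpolation of a direct and a pivot phrase table.
# Each table maps (src_phrase, tgt_phrase) to the four Moses scores
# [phi(f|e), lex(f|e), phi(e|f), lex(e|f)]. lam is a hypothetical weight.

def interpolate(direct, pivot, lam=0.5):
    merged = {}
    for pair in set(direct) | set(pivot):
        s1 = direct.get(pair, [0.0] * 4)  # absent pairs contribute 0
        s2 = pivot.get(pair, [0.0] * 4)
        merged[pair] = [lam * a + (1 - lam) * b for a, b in zip(s1, s2)]
    return merged

direct = {("уақыт", "time"): [0.6, 0.5, 0.4, 0.3]}
pivot = {("уақыт", "time"): [0.4, 0.3, 0.5, 0.2],
         ("күн", "day"):    [0.2, 0.1, 0.3, 0.2]}
print(interpolate(direct, pivot, lam=0.6))
```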
Table 7 shows the ISA results. We included the direct translation src–trg of Kk–En and Ja–Id as baselines. We found that all the translation systems using LM05 obtained higher BLEU scores than those using LM03. For Kk–En, we found that H-ISA is a competitive approach because it provided absolute improvements of 0.35 and 0.22 BLEU points over the baseline and Std-ISA in LM03 and LM05, respectively.
Table 7 also shows a different effect of H-ISA on Ja–Id. H-ISA obtained an absolute improvement of 0.11 BLEU points over the baseline in LM03. However, H-ISA showed an absolute drop of 0.12 BLEU points compared with the baseline in LM05. We compared 2000 of the same phrase pairs from the two phrase tables, H-ISA and baseline; Table 8 provides an example of their phrase pairs and phrase translation parameter scores. We found that more than 1900 phrase pairs of H-ISA in LM05 obtained lower phrase translation parameter scores than the baseline in LM05. Therefore, lower phrase translation parameter scores could be a reason for the lower BLEU score of Ja–Id using the H-ISA.
Additionally, we investigated why Ja–Id LM03 using the Std-ISA and Ja–Id LM03 using the H-ISA obtained the same BLEU score of 12.07. We found that both systems used the same candidates of symmetrization of word alignment on the three sides of the pivot approach: gdfand in Ja–Id, gdfand in Ja–Ms, and gdfand in Ms–Id, as shown in Table 6. In contrast to Ja–Id LM03, Ja–Id LM05 with the Std-ISA and Ja–Id LM05 with the H-ISA obtained the same BLEU score while using different candidates of symmetrization of word alignment, as shown in Table 6: Ja–Id LM05 using the Std-ISA used gdfand in Ja–Id, gdfand in Ja–Ms, and gdfand in Ms–Id, whereas Ja–Id LM05 using the H-ISA used gdfand in Ja–Id, gdfand in Ja–Ms, and tgttosrc in Ms–Id. We compared 2000 of the same phrase pairs from the two phrase tables, Std-ISA and H-ISA. Table 8 presents an example of Std-ISA and H-ISA phrase pairs and phrase translation parameter scores. We found that more than 1700 phrase pairs of Std-ISA and H-ISA obtained relatively similar phrase translation parameter scores. This result demonstrated that relatively similar phrase translation parameter scores obtained under different symmetrizations of word alignment can yield the same BLEU score.
We investigated the relationship between the BLEU score and the phrase table size of each system, as shown in Table 7 and Table 9. We found that some systems with small phrase tables, i.e., Kk–En LM03 of H-ISA, Kk–En LM05 of H-ISA, and Ja–Id LM05 of the baseline, obtained higher BLEU scores: 3.43, 3.64, and 12.20, respectively, as shown in Table 7. However, we also found that systems with large phrase tables, i.e., Ja–Id LM03 of Std-ISA and Ja–Id LM03 of H-ISA, obtained higher BLEU scores, i.e., 12.07 each. Our first finding aligns with [38], who stated that higher BLEU scores can be obtained using small phrase tables. Our second finding aligns with [11], who obtained higher BLEU scores using large phrase tables. Our results demonstrated that higher BLEU scores can be obtained with either small or large phrase tables.
We identified why systems with small or large phrase tables could obtain higher BLEU scores when using the same LM order and decoder weights. We compared the phrase translation parameter scores between the two phrase tables with the highest and second-highest BLEU scores under the same LM. First, we collected 2000 phrase pairs and their phrase translation parameter scores from the phrase table with the highest BLEU score as phrase table 1 (PT1). Subsequently, we collected 12,000 phrase pairs and their phrase translation parameter scores from the phrase table with the second-highest BLEU score as phrase table 2 (PT2). Lastly, we examined whether the phrase translation parameters of PT1 obtained higher scores than those of PT2 for the same phrase pair. We found that the phrase translation parameters of PT1 obtained higher scores than those of PT2 in the system with a small phrase table, as shown in Table 10. Consider the example of Kk–En LM03 of H-ISA, which obtained higher scores in three of the four phrase translation parameters, for 514, 576, and 639 phrase pairs, respectively, compared with Kk–En LM03 using the Std-ISA. The result indicated that a system with a small phrase table could obtain a higher BLEU score because of the higher phrase translation parameter scores, particularly in $\phi(f|e)$ and $\mathrm{lex}(f|e)$. In contrast to the system with a small phrase table, we found that the phrase translation parameters of PT1 had lower scores than those of PT2 in the system with a large phrase table: Table 10 shows that Ja–Id LM03 had lower scores, with only 13, 18, 17, and 34 phrase pairs scoring higher in $\phi(f|e)$, $\mathrm{lex}(f|e)$, $\phi(e|f)$, and $\mathrm{lex}(e|f)$, respectively. The result demonstrated that a system with a large phrase table can obtain a higher BLEU score despite lower phrase translation parameter scores. Table 11 shows examples of phrase pairs and their phrase translation parameter scores from Ja–Id LM05.
We evaluated the perplexity score of each system, as shown in Table 12. We found that the longer LM order for Kk–En, i.e., LM05, obtained a lower perplexity score than the shorter one, i.e., LM03. In contrast, for Ja–Id, the longer LM order did not obtain a lower perplexity score than the shorter one.
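For reference, perplexity over a tokenized test set can be computed with KenLM's Python binding roughly as follows; the model and test file names are placeholders, and we assume the standard convention of counting the end-of-sentence token.

```python
# Sketch: corpus-level perplexity with KenLM's Python binding.
# "lm05.arpa" and "test.tok.txt" are hypothetical file names.
import kenlm

model = kenlm.Model("lm05.arpa")

total_log10, total_words = 0.0, 0
with open("test.tok.txt", encoding="utf-8") as f:
    for line in f:
        sentence = line.strip()
        if not sentence:
            continue
        # score() returns the log10 probability incl. </s> when eos=True.
        total_log10 += model.score(sentence, bos=True, eos=True)
        total_words += len(sentence.split()) + 1  # + 1 for </s>

print("perplexity:", 10 ** (-total_log10 / total_words))
```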
Table 12 shows that Ja–Id LM05 obtained a higher perplexity score than Ja–Id LM03, although Ja–Id LM05 obtained a higher BLEU score than Ja–Id LM03. We first attributed this higher perplexity score in Ja–Id to the size of the target monolingual corpus. The target monolingual corpus was trained with the LM toolkit, i.e., KenLM, which generated lists of n-gram probabilities stored in an ARPA file. We compared the English and Indonesian target monolingual corpora of Kk–En and Ja–Id. English, with a larger target monolingual corpus size, i.e., 532,560, and a longer LM order, i.e., LM05, had a larger list of n-gram probabilities, i.e., 29,181,816 entries. In contrast, Indonesian, with a smaller target monolingual corpus size, i.e., 8500, and a longer LM order, i.e., LM05, had a smaller list of n-gram probabilities, i.e., 740,766 entries. As a result, the choice of language model probabilities in the decoding process was smaller, which affected the perplexity score of Ja–Id.
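The n-gram totals above can be read directly from the header of the ARPA file that KenLM produces, which lists the number of entries per order in its \data\ section; a minimal sketch (the file name is a placeholder):

```python
# Sketch: read per-order n-gram counts from an ARPA file header.
# The header looks like:
#   \data\
#   ngram 1=240032
#   ngram 2=4531078
#   ...

def arpa_ngram_counts(path):
    counts, in_header = {}, False
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
            elif in_header and line.startswith("ngram "):
                order, count = line[len("ngram "):].split("=")
                counts[int(order)] = int(count)
            elif in_header and line:  # e.g., "\1-grams:" ends the header
                break
    return counts

counts = arpa_ngram_counts("lm05.arpa")
print(counts, "total entries:", sum(counts.values()))
```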
Additionally, we identified another parameter that could have influenced the increase in the perplexity score of Ja–Id LM05: the feature function weight of the LM. This weight was generated from the target monolingual corpus and then stored in the decoder configuration, moses.ini. A decoder is an SMT component that finds the best translation according to the product of the translation model and language model probabilities [39]. A good value for the feature function weight of the LM is 0.1–1. We found that the feature function weight of the LM for Kk–En LM05 was higher than that for Kk–En LM03 (0.10 and 0.06, respectively) when using the larger target monolingual corpus, i.e., 532,560. In contrast to Kk–En, the feature function weight of the LM for Ja–Id LM05 was lower than that for Ja–Id LM03 (0.09 and 0.11, respectively) when using the smaller target monolingual corpus, i.e., 8500. As a result, Kk–En LM05 could have obtained a lower perplexity score than Kk–En LM03, whereas Ja–Id LM05 obtained a higher perplexity score than Ja–Id LM03, as shown in Table 12. Figure 2 shows the feature function weights of the LM for Kk–En and Ja–Id.
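These weights can be inspected directly in the tuned configuration. Assuming the modern moses.ini layout, where MERT-tuned values appear in a [weight] section as lines such as "LM0= 0.0927", a minimal extraction sketch looks like this (the path is a placeholder):

```python
# Sketch: extract the tuned LM feature weight(s) from moses.ini.
# Assumes tuned weights sit in a [weight] section, one feature per line.

def lm_weights(ini_path):
    weights, in_section = {}, False
    with open(ini_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("["):
                in_section = (line == "[weight]")
                continue
            if in_section and line.startswith("LM"):
                name, values = line.split("=", 1)
                weights[name.strip()] = [float(v) for v in values.split()]
    return weights

print(lm_weights("ja-id.lm05/moses.ini"))  # e.g., {"LM0": [0.09]}
```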
We evaluated the generated text of the systems. Table 13 and Table 14 show two example sentences for each language pair, marked (1) and (2). Sentence (1) is a long sentence, whereas sentence (2) is a short one. To make the translation results easier to follow, we added English translations, marked in italics, to the generated text of Ja–Id. Table 13 shows that, for the short sentence, the H-ISA generated a more compact sentence than the other systems. A compact sentence means that the generated text contains the same keywords as the reference, i.e., without additional words. Consider the example of Kk–En LM03 H-ISA, which generated the compact keywords "ensuring that business investment, jobs", whereas the Kk–En LM05 baseline generated the additional words "federal government provides, is to reverse the lost", which were not in the reference. In contrast to Kk–En, all the systems of Ja–Id generated compact sentences, as shown in Table 14.
Table 13 and Table 14 also show that the word order of the generated text was incorrect. The generated text appeared to follow the source language's sentence pattern, i.e., subject-object-verb (SOV), whereas the sentence pattern of the target language is subject-verb-object (SVO). We used the default reordering model, i.e., msd-bidirectional-fe; however, our generated text still followed the source language's pattern. The msd-bidirectional-fe is the default reordering model in SMT that considers the orientation of the model, its directionality, and the languages. The incorrect word order may also explain the insignificant improvement in the BLEU scores of our systems, as shown in Table 7. BLEU is an evaluation metric that measures the similarity between two text strings and assigns too much weight to correct word order [40]. BLEU uses a reference file to evaluate the generated text. Our generated text followed the source languages' sentence pattern, i.e., SOV, whereas our reference files used the SVO pattern. Therefore, the evaluation could not obtain the maximum result because of the differing word order between the generated text and the reference file.
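The effect of word order on BLEU can be seen with a toy example using NLTK's sentence-level BLEU (the sentences below are illustrative, not from our test sets): a hypothesis with every correct word but SOV-like order loses most of its higher-order n-gram matches against an SVO reference.

```python
# Sketch: BLEU penalizes word-order differences via n-gram matching.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
reference = [["three", "people", "died", "in", "the", "accident"]]  # SVO

svo_hyp = ["three", "people", "died", "in", "the", "accident"]
sov_hyp = ["three", "people", "in", "the", "accident", "died"]  # SOV-like

print(sentence_bleu(reference, svo_hyp, smoothing_function=smooth))  # 1.0
print(sentence_bleu(reference, sov_hyp, smoothing_function=smooth))  # much lower
```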
Figure 3 illustrates an example of the generated text from the system of Ja–Id LM03 H-ISA, which translates a Japanese source sentence into an Indonesian target sentence. The result showed that the generated Indonesian text had the same word positions as the Japanese text. Additionally, Figure 3 compares the generated translation (t) and the reference (r), which use different phrases: t used the phrase "3 orang tewas/3 people died", whereas r used the phrase "tiga orang yang tewas/three people who were killed". Furthermore, t and r had different word positions, making it difficult to maximize the BLEU score.