4. Survey on the Data
The experiments were carried out on six European languages, ranked by morphological richness: English (en), French (fr), German (de), Czech (cs), Polish (pl) and Finnish (fi). We surveyed four different corpora that cover these languages. Most of these corpora are mainly used in machine translation tasks.
Table 2 shows the language availability for each corpus. The corpora are listed below in decreasing order of expected analogical density:
Tatoeba (available at: tatoeba.org, accessed on 20 September 2020) is a collection of sentences and their translations, contributed online through collaborative work (crowd-sourcing). It covers hundreds of languages. However, the amount of data is not balanced across languages, because it depends on the number of members who are native speakers of each language. Sentences in the Tatoeba corpus are usually short and are mostly about daily-life conversation.
Table 3 shows the statistics of the Tatoeba corpus used in the experiments.
Multi30K (available at: github.com/multi30k/dataset, accessed on 20 September 2020) [22,23,24] is a collection of image descriptions (captions) provided in several languages. This dataset is mainly used for multilingual image description and multimodal machine translation tasks. It is an extension of Flickr30K [25], and more data are added from time to time, for example, from the COCO dataset (available at: cocodataset.org, accessed on 20 September 2020).
Table 4 shows the statistics of the Multi30K corpus.
CommonCrawl (available at: commoncrawl.org, accessed on 20 September 2020) is a crawled web archive and dataset. Because it is a web archive, this corpus covers a wide range of topics. In this paper, we used the version provided as training data for the Shared Task: Machine Translation of WMT-2015 (available at: statmt.org/wmt15/translation-task.html, accessed in September 2020).
Table 5 shows the statistics of the CommonCrawl corpus.
Europarl (available at: statmt.org/europarl/, accessed in September 2020) [14] is a corpus that contains transcriptions of the proceedings of the European Parliament in 11 European languages. It was first introduced for Statistical Machine Translation and is still used as a basic corpus for machine translation tasks. In this paper, we use version 7.
Table 6 shows the statistics of the Europarl corpus.
Europarl has the highest number of lines and also the highest average number of tokens per line. In contrast, Tatoeba has the smallest average number of tokens per line, as expected. As an overview, sentences in Multi30K contain about twice as many tokens as those in Tatoeba, sentences in CommonCrawl about three times as many, and sentences in Europarl around four times as many. Our hypothesis is that tokens in shorter sentences have more chances to commute, so that corpora with shorter sentences contain more analogies.
These four corpora fall into two groups according to the diversity of the sentence contexts. Multi30K and CommonCrawl are corpora with diverse contexts. In comparison, the sentences contained in Tatoeba and Europarl are less diverse: Tatoeba is mostly about daily-life conversation, while Europarl consists of parliamentary discussions. We expect sentences in corpora with less diverse contexts to share words more often, so that such corpora have more analogies and a higher analogical density.
Let us now compare the statistics across languages. English has the lowest number of types. Finnish, Polish and Czech always have the highest number of types, around two times higher than English across the corpora, and even more than four times higher for Europarl. We can observe that languages with poor morphology have fewer types and hapaxes. Conversely, languages with high morphological richness have fewer tokens, due to a richer vocabulary. These languages also tend to have longer words (in characters): with richer morphological features, the vocabulary is larger, and as a consequence the words are longer. We also observe that a higher number of types means fewer repeated words (a higher Type–Token Ratio), and thus a lower number of tokens.
However, there are some interesting exceptions, namely French and German. French has a higher number of tokens than English despite having a larger vocabulary. The greater variety of function words (prepositions, articles, etc.) in French is probably one of the explanations for this phenomenon. As for German, its average type length is rather high in comparison with the other languages. This is perhaps because German words tend to be longer: German is known to glue several words into a single compound word.
Table 7 provides examples of sentences contained in the corpora.
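The quantities compared above (tokens, types, hapaxes, average type length and Type–Token Ratio) can be illustrated with a minimal Python sketch. It assumes plain whitespace tokenisation, and the function name `corpus_statistics` and the three sample sentences are hypothetical; it is not the tooling used to produce the tables.

```python
from collections import Counter


def corpus_statistics(lines):
    """Compute simple lexical statistics for a list of sentences.

    Assumption: tokens are obtained by whitespace splitting only.
    """
    # Frequency of each distinct word form (type) in the corpus.
    counts = Counter(token for line in lines for token in line.split())
    n_tokens = sum(counts.values())   # running words
    n_types = len(counts)             # distinct word forms
    n_hapaxes = sum(1 for c in counts.values() if c == 1)  # types seen once
    return {
        "lines": len(lines),
        "tokens": n_tokens,
        "types": n_types,
        "hapaxes": n_hapaxes,
        "avg_tokens_per_line": n_tokens / len(lines) if lines else 0.0,
        "avg_type_length": (
            sum(len(t) for t in counts) / n_types if n_types else 0.0
        ),
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
    }


# Hypothetical toy corpus, for illustration only.
sample = ["the cat sleeps", "the dog sleeps", "a cat runs"]
stats = corpus_statistics(sample)
for name, value in stats.items():
    print(name, value)
```

With such a function, the same statistics can be computed per language and per corpus, which is the comparison the paragraphs above rely on.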
Aligning Sentences across Languages
Europarl and Multi30K are available as parallel corpora. However, Tatoeba and CommonCrawl are not aligned, so we need to align the sentences contained in these corpora. Having parallel corpora allows us to make comparisons between languages.
For each corpus, we used English as the pivot language to align the sentences across the other languages. We added an English sentence to the collection of aligned sentences if it had translations in the other languages. If several reference translations were available in another language, one sentence was randomly picked to represent that language. Thus, at the end of the alignment process, each English sentence has exactly one corresponding sentence in every language.
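The pivot-based alignment described above can be sketched as follows. This is only an illustration under stated assumptions: the input layout (a dictionary from each English sentence to its candidate translations per language), the function name `align_with_pivot` and the seeded random generator are all hypothetical, and we assume an English sentence is kept only when every target language has at least one translation.

```python
import random


def align_with_pivot(translations, languages, seed=0):
    """Align sentences across languages using English as the pivot.

    translations: dict mapping an English sentence to a dict
        {language code: list of candidate translations}.
    languages: the non-English language codes that must all be covered.
    """
    rng = random.Random(seed)  # seeded so that the random pick is reproducible
    aligned = []
    for english, by_lang in translations.items():
        # Keep the English sentence only if every language has a translation.
        if all(by_lang.get(lang) for lang in languages):
            row = {"en": english}
            for lang in languages:
                # Several references may exist: pick one at random.
                row[lang] = rng.choice(by_lang[lang])
            aligned.append(row)
    return aligned


# Hypothetical toy data, for illustration only.
translations = {
    "Hello.": {"fr": ["Bonjour.", "Salut."], "de": ["Hallo."]},
    "Thank you.": {"fr": ["Merci."]},  # no German translation: dropped
}
aligned = align_with_pivot(translations, ["fr", "de"])
print(aligned)
```

Each resulting row contains one sentence per language, so the aligned collection has the same number of lines in every language.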
Author Contributions
Conceptualization, R.F. and Y.L.; methodology, R.F. and Y.L.; software, R.F.; validation, R.F. and Y.L.; formal analysis, R.F. and Y.L.; investigation, R.F. and Y.L.; resources, R.F.; data curation, R.F.; writing—original draft preparation, R.F.; writing—review and editing, R.F. and Y.L.; visualization, R.F.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. Both authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by a JSPS grant, number 18K11447 (Kakenhi Kiban C), entitled “Self-explainable and fast-to-train example-based machine translation”.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Hathout, N. Acquisition of morphological families and derivational series from a machine readable dictionary. arXiv 2009, arXiv:0905.1609.
- Lavallée, J.F.; Langlais, P. Morphological acquisition by formal analogy. In Morpho Challenge 2009; Knowledge 4 All Foundation Ltd.: Surrey, UK, 2009.
- Blevins, J.P.; Blevins, J. (Eds.) Analogy in Grammar: Form and Acquisition. Oxford Scholarship Online. 2009. Available online: https://oxford.universitypressscholarship.com/view/10.1093/acprof:oso/9780199547548.001.0001/acprof-9780199547548 (accessed on 25 July 2021).
- Fam, R.; Lepage, Y. A study of the saturation of analogical grids agnostically extracted from texts. In Proceedings of the Computational Analogy Workshop at the 25th International Conference on Case-Based Reasoning (ICCBR-CA-17), Trondheim, Norway, 26–28 June 2017; pp. 11–20. Available online: http://ceur-ws.org/Vol-2028/paper1.pdf (accessed on 25 July 2021).
- Wang, W.; Fam, R.; Bao, F.; Lepage, Y.; Gao, G. Neural Morphological Segmentation Model for Mongolian. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN-2019), Budapest, Hungary, 14–19 July 2019; pp. 1–7.
- Langlais, P.; Patry, A. Translating Unknown Words by Analogical Learning. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL-07), Prague, Czech Republic, 28–30 June 2007; pp. 877–886. Available online: https://aclanthology.org/D07-1092 (accessed on 25 July 2021).
- Lindén, K. Entry Generation by Analogy—Encoding New Words for Morphological Lexicons. North. Eur. J. Lang. Technol. 2009, 1, 1–25.
- Fam, R.; Purwarianti, A.; Lepage, Y. Plausibility of word forms generated from analogical grids in Indonesian. In Proceedings of the 16th International Conference on Computer Applications (ICCA-18), Beirut, Lebanon, 25–26 July 2018; pp. 179–184.
- Hathout, N.; Namer, F. Automatic Construction and Validation of French Large Lexical Resources: Reuse of Verb Theoretical Linguistic Descriptions. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.549.5396&rep=rep1&type=pdf (accessed on 25 July 2021).
- Hathout, N. Acquistion of the Morphological Structure of the Lexicon Based on Lexical Similarity and Formal Analogy. In Proceedings of the 3rd Textgraphs Workshop on Graph-Based Algorithms for Natural Language Processing, Manchester, UK, 24 August 2008; pp. 1–8. Available online: https://aclanthology.org/W08-2001 (accessed on 25 July 2021).
- Lepage, Y.; Denoual, E. Purest ever example-based machine translation: Detailed presentation and assessment. Mach. Transl. 2005, 19, 251–282.
- Takezawa, T.; Sumita, E.; Sugaya, F.; Yamamoto, H.; Yamamoto, S. Toward a Broad-coverage Bilingual Corpus for Speech Translation of Travel Conversations in the Real World. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Spain, 29–31 May 2002; Available online: http://www.lrec-conf.org/proceedings/lrec2002/pdf/305.pdf (accessed on 25 July 2021).
- Lepage, Y. Lower and Higher Estimates of the Number of “True Analogies” between Sentences Contained in a Large Multilingual Corpus. Available online: https://aclanthology.org/C04-1106.pdf (accessed on 25 July 2021).
- Koehn, P. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: The Tenth Machine Translation Summit; AAMT: Phuket, Thailand, 2005; pp. 79–86.
- Lepage, Y. Solving Analogies on Words: An Algorithm. In Proceedings of the 17th International Conference on Computational Linguistics (COLING 1998), Montreal, QC, Canada, 10–14 August 1998; Volume 1, pp. 728–734.
- Stroppa, N.; Yvon, F. An Analogical Learner for Morphological Analysis. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), Ann Arbor, MI, USA, 29–30 June 2005; pp. 120–127.
- Langlais, P.; Yvon, F. Scaling up Analogical Learning. In Proceedings of the Coling 2008: Companion Volume: Posters, Manchester, UK, 18–22 August 2008; pp. 51–54.
- Beesley, K.R. Consonant Spreading in Arabic Stems. COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics. Available online: https://aclanthology.org/C98-1018.pdf (accessed on 25 July 2021).
- Wintner, S. Chapter Morphological Processing of Semitic Languages. In Natural Language Processing of Semitic Languages; Springer: Berlin/Heidelberg, Germany, 2014; pp. 43–66.
- Gil, D. From Repetition to Reduplication in Riau Indonesian. In Studies on Reduplication; De Gruyter: Berlin, Germany, 2011; pp. 31–64.
- Lepage, Y. Analogies Between Binary Images: Application to Chinese Characters. In Computational Approaches to Analogical Reasoning: Current Trends; Prade, H., Richard, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 25–57.
- Elliott, D.; Frank, S.; Sima’an, K.; Specia, L. Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the 5th Workshop on Vision and Language, Berlin, Germany, 7–12 August 2016; pp. 70–74.
- Elliott, D.; Frank, S.; Barrault, L.; Bougares, F.; Specia, L. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, Copenhagen, Denmark, 7–8 September 2017; pp. 215–233.
- Barrault, L.; Bougares, F.; Specia, L.; Lala, C.; Elliott, D.; Frank, S. Findings of the Third Shared Task on Multimodal Machine Translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Brussels, Belgium, 31 October–1 November 2018; pp. 304–323.
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78.
- Kudo, T. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 66–75. Available online: https://aclanthology.org/P18-1007 (accessed on 25 July 2021).
- Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1715–1725. Available online: https://aclanthology.org/P16-1162 (accessed on 25 July 2021).
- Provilkov, I.; Emelianenko, D.; Voita, E. BPE-Dropout: Simple and Effective Subword Regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 5–10 July 2020; pp. 1882–1892. Available online: https://aclanthology.org/2020.acl-main.170 (accessed on 25 July 2021).
- Koehn, P.; Knight, K. Estimating Word Translation Probabilities from Unrelated Monolingual Corpora Using the EM Algorithm. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, Austin, TX, USA, 30 July–3 August 2000; pp. 711–715.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).