Cross-Lingual Short-Text Semantic Similarity for Kannada–English Language Pair
Abstract
1. Introduction
- We provide a method for computing the semantic textual similarity (STS) between sentences in Kannada and English that utilizes lexical decomposition, embedding space alignment, and convolutional neural networks. To the best of our knowledge, this is the first attempt to measure STS for the Kannada–English language pair.
- We assess the proposed method’s performance in terms of precision and correlation with the human-annotated scores, as well as word-level alignment in the embedding space.
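The embedding space alignment named in the contributions above can be illustrated with a generic orthogonal Procrustes mapping, a common basis for supervised alignment from a seed dictionary. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `procrustes_align`, the 8-dimensional toy vectors, and the synthetic rotation are all hypothetical, and the data are random stand-ins for Kannada and English word vectors.

```python
import numpy as np

def procrustes_align(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of each holds the
    vectors of one seed-dictionary translation pair (hypothetical data).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy setup (assumption): the "source" space is a rotated copy of the
# "target" space, so a perfect orthogonal mapping exists.
rng = np.random.default_rng(0)
tgt = rng.normal(size=(50, 8))
true_rot, _ = np.linalg.qr(rng.normal(size=(8, 8)))
src = tgt @ true_rot.T

w = procrustes_align(src, tgt)
aligned = src @ w  # source vectors mapped into the target space
```

Because the mapping is constrained to be orthogonal, distances and angles within the source space are preserved; only its orientation relative to the target space changes.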
2. A Review of Existing Works
3. Methodology
3.1. Word-Based Similarity vs. Vector-Based Similarity
3.2. Word Embedding
Monolingual Embedding and Alignment
3.3. Lexical Decomposition and Similarity Calculation
3.4. Word-Order Similarity
3.5. Score-Level Fusion
4. Experimentation and Results
- activation function = ReLU,
- epochs = 100,
- learning rate = 0.001,
- ,
- loss function = cross-entropy.
Type of Learning | Alignment Method | Source–Target Pair | P@1 | P@5 | P@10 |
---|---|---|---|---|---|
Unsupervised | VecMap | KAN - ENG | 42.3 | 72.5 | 76.2 |
Unsupervised | MUSE | KAN - ENG | 36.9 | 70.1 | 78.3 |
Supervised | VecMap | KAN - ENG | 49.6 | 73.7 | 80.3 |
Supervised | MUSE | KAN - ENG | 47.1 | 74.3 | 83.7 |
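To make the P@1, P@5, and P@10 columns above concrete, the following sketch shows one common way precision at k is computed for word-level alignment: each source word retrieves its k nearest target words, and a hit is counted if any gold translation is among them. It assumes plain cosine nearest-neighbour retrieval; the retrieval criterion actually used for the table is not stated here. The function `precision_at_k` and the toy vectors are hypothetical.

```python
import numpy as np

def precision_at_k(src_vecs, tgt_vecs, gold, k):
    """Percentage of source words with a gold translation in the top k.

    src_vecs: (n_src, dim) source vectors already mapped to the shared space.
    tgt_vecs: (n_tgt, dim) target vectors.
    gold:     list where gold[i] is the set of correct target indices
              for source word i (hypothetical test dictionary).
    """
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                          # cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]    # best k targets per source
    hits = sum(bool(gold[i] & set(row.tolist())) for i, row in enumerate(top_k))
    return 100.0 * hits / len(gold)

# Toy example with three source words and four target words.
tgt_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
src_vecs = np.array([[0.9, 0.1], [0.6, 0.8], [-1.0, 0.1]])
gold = [{0}, {1}, {3}]

p_at_1 = precision_at_k(src_vecs, tgt_vecs, gold, k=1)
p_at_2 = precision_at_k(src_vecs, tgt_vecs, gold, k=2)
```

In the toy data the second source word ranks its gold translation second, so it counts as a miss at k = 1 and a hit at k = 2. The same effect explains why the table's values rise from P@1 to P@10.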
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Model | MAP | MRR | Spearman Correlation Coefficient |
---|---|---|---|
OptAlign | 0.71 | 0.82 | 0.75 |
GrAssoc | 0.68 | 0.64 | 0.71 |
Aggreg | 0.59 | 0.62 | 0.62 |
Proposed Method | 0.81 | 0.85 | 0.83 |
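For readers unfamiliar with the three metrics in the table above, the sketch below gives their textbook definitions on toy data. It is an illustration only: how the queries and relevance judgments were constructed for the reported scores is not described here, the helper names are hypothetical, and the Spearman helper assumes there are no tied scores.

```python
import numpy as np

def reciprocal_rank(relevance):
    """1 / rank of the first relevant item in a ranked list, else 0."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(relevance):
    """Mean of precision values taken at each relevant position."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def spearman(x, y):
    """Spearman rank correlation (assumes no ties in x or y)."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Toy ranked relevance lists for two queries (True = relevant candidate).
queries = [[False, True, True], [True, False, False]]
mrr = float(np.mean([reciprocal_rank(q) for q in queries]))
mean_ap = float(np.mean([average_precision(q) for q in queries]))

# Toy human-annotated scores versus predicted similarity scores.
human = [0.5, 4.0, 2.5, 3.0]
predicted = [0.1, 0.9, 0.6, 0.5]
rho = spearman(human, predicted)
```

MAP and MRR reward placing relevant candidates near the top of a ranked list, while the Spearman coefficient only compares the ordering of predicted scores with the ordering of human scores, so it is insensitive to the scale of the similarity values.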
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
S N, M.; Holla, R.; N, H.; Ganiga, R. Cross-Lingual Short-Text Semantic Similarity for Kannada–English Language Pair. Computers 2024, 13, 236. https://doi.org/10.3390/computers13090236