Improving Sentence Retrieval Using Sequence Similarity
Abstract
1. Introduction
2. Previous Research
3. Research Objective
4. Partial Match of Terms Using Sequence Similarity
4.1. TF-ISF Method with Sentence Retrieval
- $sf_t$ is the number of sentences in which the term $t$ appears,
- $n$ is the number of sentences in the collection,
- $tf_{t,q}$ is the number of appearances of the term $t$ in a query $q$, and
- $tf_{t,s}$ is the number of appearances of the term $t$ in a sentence $s$.
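These quantities combine into the TF-ISF ranking function. For reference, the standard formulation from the sentence retrieval literature (see, e.g., [5]) is:

$$\mathrm{tfisf}(q,s)=\sum_{t\in q}\log\left(tf_{t,q}+1\right)\cdot\log\left(tf_{t,s}+1\right)\cdot\log\left(\frac{n+1}{0.5+sf_t}\right)$$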
4.2. BM25 Model with Sentence Retrieval
- $n$ is the number of sentences in the collection,
- $sf_t$ is the number of sentences in which the term $t$ appears,
- $tf_{t,s}$ is the number of appearances of the term $t$ in a sentence $s$,
- $tf_{t,q}$ is the number of appearances of the term $t$ in a query $q$,
- $|s|$ is the sentence length,
- $avgsl$ is the average sentence length, and
- $k_1$, $b$, and $k_3$ are the adjustment parameters.
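For reference, the sentence-level form of the Okapi BM25 ranking function built from these quantities (with documents replaced by sentences) is commonly written as:

$$\mathrm{BM25}(q,s)=\sum_{t\in q}\log\left(\frac{n-sf_t+0.5}{sf_t+0.5}\right)\cdot\frac{(k_1+1)\,tf_{t,s}}{k_1\left((1-b)+b\,\frac{|s|}{avgsl}\right)+tf_{t,s}}\cdot\frac{(k_3+1)\,tf_{t,q}}{k_3+tf_{t,q}}$$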
4.3. Sentence Retrieval with Language Model (LM)
- $tf_{t,s}$ is the number of appearances of the term $t$ in a sentence $s$, and
- $|s|$ is the sentence length.
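These two quantities give the usual maximum likelihood estimate of the sentence language model:

$$P_{ML}(t\mid s)=\frac{tf_{t,s}}{|s|}$$

The smoothed model additionally uses the following quantities: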
- $tf_{t,s}$ is the number of appearances of the term $t$ in a sentence $s$,
- $|s|$ is the sentence length,
- $\mu$ is the parameter that controls the amount of smoothing, and
- $P(t\mid C)$ can be calculated using the maximum likelihood estimator of the term $t$ in a large collection (where $C$ is the collection) [6].
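Assuming Dirichlet smoothing, a common choice in sentence retrieval (hence the smoothing parameter is written $\mu$ above), the smoothed sentence language model and the resulting query likelihood score take the form:

$$P(t\mid s)=\frac{tf_{t,s}+\mu\,P(t\mid C)}{|s|+\mu},\qquad\mathrm{LM}(q,s)=\prod_{t\in q}P(t\mid s)$$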
4.4. Sentence Retrieval Using Sequence Similarity
- presents a subsequence of a sequence.
- is the normalization parameter.
- presents a subsequence of a sequence (a term from the query treated as a subsequence of a term from the sentence, which is treated as a sequence).
- is the condition that only terms appearing both in the query and in the sentence are considered; in this case, there is at least one match between a term from the query and a term from the sentence. An illustrative sketch of this partial matching follows below.
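To make the idea of partial term matching concrete, the following minimal Python sketch shows one way such a scheme can be realized; it is not the paper's exact scoring function. A query term is counted as a partial match of a sentence term when it is a character subsequence of that term, the match is weighted by the length ratio of the two terms together with a normalization parameter, and the resulting soft term frequency is plugged into a TF-ISF-style score. The function names, the length-ratio weighting, and the default normalization are illustrative assumptions.

```python
import math
from collections import Counter


def is_subsequence(sub, seq):
    """Return True if the characters of `sub` appear in `seq` in order
    (not necessarily contiguously), e.g. 'retriev' inside 'retrieval'."""
    it = iter(seq)
    return all(ch in it for ch in sub)


def partial_tf(query_term, sentence_terms, norm=1.0):
    """Soft term frequency of `query_term` in a tokenized sentence:
    exact matches count 1, subsequence matches count
    norm * len(query_term) / len(sentence_term) (an illustrative weighting)."""
    score = 0.0
    for term in sentence_terms:
        if query_term == term:
            score += 1.0
        elif is_subsequence(query_term, term):
            score += norm * len(query_term) / len(term)
    return score


def tfisf_with_sequence_similarity(query_terms, sentences, norm=1.0):
    """Score each tokenized sentence with a TF-ISF-style function in which
    exact term counts are replaced by subsequence-based soft counts."""
    n = len(sentences)
    # Sentence frequency of each query term, based on exact matches as in plain TF-ISF.
    sf = {t: sum(1 for s in sentences if t in s) for t in set(query_terms)}
    qtf = Counter(query_terms)
    scores = []
    for s in sentences:
        score = 0.0
        for t in set(query_terms):
            tf_ts = partial_tf(t, s, norm)
            if tf_ts == 0.0:
                continue  # only terms with at least one (partial) match contribute
            score += (math.log(qtf[t] + 1)
                      * math.log(tf_ts + 1)
                      * math.log((n + 1) / (0.5 + sf[t])))
        scores.append(score)
    return scores


# Example: the partially matching terms "retriev" and "similar" still score the first sentence.
sentences = [["sentence", "retrieval", "with", "sequence", "similarity"],
             ["document", "ranking", "methods"]]
print(tfisf_with_sequence_similarity(["retriev", "similar"], sentences))
```

Only query terms with at least one exact or partial match in the sentence contribute to the score, mirroring the condition stated above.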
5. Experiments and Results
- |Rel| is the number of sentences relevant to the query, and
- r is the number of relevant sentences among the top |Rel| sentences of the result.
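From these definitions, R-precision is computed as:

$$\text{R-prec}=\frac{r}{|Rel|}$$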
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
1. Manning, C.D.; Raghavan, P.; Schutze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.
2. Murdock, V.G. Aspects of Sentence Retrieval. Ph.D. Thesis, University of Massachusetts, Amherst, MA, USA, 2006.
3. Doko, A.; Štula, M.; Seric, L. Using TF-ISF with Local Context to Generate an OWL Document Representation for Sentence Retrieval. Comput. Sci. Eng. Int. J. 2015, 5, 1–15.
4. Doko, A.; Štula, M.; Seric, L. Improved sentence retrieval using local context and sentence length. Inf. Process. Manag. 2013, 49, 1301–1312.
5. Allan, J.; Wade, C.; Bolivar, A. Retrieval and novelty detection at the sentence level. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, Toronto, ON, Canada, 28 July–1 August 2003.
6. Fernández, R.T.; Losada, D.E.; Azzopardi, L. Extending the language modeling framework for sentence retrieval to include local context. Inf. Retr. 2010, 14, 355–389.
7. Agarwal, B.; Ramampiaro, H.; Langseth, H.; Ruocco, M. A deep network model for paraphrase detection in short text messages. Inf. Process. Manag. 2018, 54, 922–937.
8. Kenter, T.; de Rijke, M. Short Text Similarity with Word Embeddings. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management—CIKM '15, Melbourne, Australia, 19–23 October 2015.
9. Harman, D. Overview of the TREC 2002 novelty track. In Proceedings of the Eleventh Text Retrieval Conference (TREC), Gaithersburg, MD, USA, 19–22 November 2002.
10. Soboroff, I.; Harman, D. Overview of the TREC 2003 novelty track. In Proceedings of the Twelfth Text Retrieval Conference (TREC), Gaithersburg, MD, USA, 18–21 November 2003.
11. Soboroff, I. Overview of the TREC 2004 novelty track. In Proceedings of the Thirteenth Text Retrieval Conference (TREC), Gaithersburg, MD, USA, 16–19 November 2004.
12. Chiranjeevi, H.; Manjula, K.S. An Text Document Retrieval System for University Support Service on a High Performance Distributed Information System. In Proceedings of the 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), Chengdu, China, 12–15 April 2019.
13. Yahav, I.; Shehory, O.; Schwartz, D.G. Comments Mining With TF-IDF: The Inherent Bias and Its Removal. IEEE Trans. Knowl. Data Eng. 2018, 31, 437–450.
14. Kadhim, A.I. Term Weighting for Feature Extraction on Twitter: A Comparison between BM25 and TF-IDF. In Proceedings of the 2019 International Conference on Advanced Science and Engineering (ICOASE), Zakho-Duhok, Iraq, 2–4 April 2019.
15. Fu, X.; Ch'Ng, E.; Aickelin, U.; Zhang, L. An Improved System for Sentence-level Novelty Detection in Textual Streams. SSRN Electron. J. 2015.
16. Niyigena, P.; Zuping, Z.; Khuhro, M.A.; Hanyurwimfura, D. Efficient Document Similarity Detection Using Weighted Phrase Indexing. Int. J. Multimedia Ubiquitous Eng. 2016, 11, 231–244.
17. Zhu, Z.; Liang, J.; Li, D.; Yu, H.; Liu, G. Hot Topic Detection Based on a Refined TF-IDF Algorithm. IEEE Access 2019, 7, 26996–27007.
18. Lei, L.; Qi, J.; Zheng, K. Patent Analytics Based on Feature Vector Space Model: A Case of IoT. IEEE Access 2019, 7, 45705–45715.
19. Xue, M. A Text Retrieval Algorithm Based on the Hybrid LDA and Word2Vec Model. In Proceedings of the 2019 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), Changsha, China, 12–13 January 2019.
20. Losada, D.E.; Fernández, R.T. Highly frequent terms and sentence retrieval. In International Symposium on String Processing and Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2007; pp. 217–228.
21. Sharaff, A.; Shrawgi, H.; Arora, P.; Verma, A. Document Summarization by Agglomerative nested clustering approach. In Proceedings of the 2016 IEEE International Conference on Advances in Electronics, Communication and Computer Technology (ICAECCT), Pune, India, 2–3 December 2016.
22. Tan, C.; Wei, F.; Zhou, Q.; Yang, N.; Du, B.; Lv, W.; Zhou, M. Context-Aware Answer Sentence Selection with Hierarchical Gated Recurrent Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 540–549.
23. Losada, D.E. Statistical query expansion for sentence retrieval and its effects on weak and strong queries. Inf. Retr. 2010, 13, 485–506.
24. Srividhya, V.; Anitha, R. Evaluating preprocessing techniques in text categorization. J. Comput. Sci. Appl. 2010, 47, 49–51.
25. Vijayarani, S.; Ilamathi, M.J.; Nithya, M. Preprocessing techniques for text mining: An overview. Int. J. Comput. Sci. Commun. Netw. 2015, 5, 7–16.
26. Behera, S. Implementation of a Finite State Automaton to Recognize and Remove Stop Words in English Text on its Retrieval. In Proceedings of the 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 11–12 May 2018.
27. Karic, I.; Vejzovic, Z. Contextual Similarity: Quasilinear-Time Search and Comparison for Sequential Data. In Proceedings of the 2017 International Conference on Control, Artificial Intelligence, Robotics & Optimization (ICCAIRO), Prague, Czech Republic, 20–22 May 2017.
28. Singh, J.; Singh, G.; Singh, R.; Singh, P. Morphological evaluation and sentiment analysis of Punjabi text using deep learning classification. J. King Saud Univ.—Comput. Inf. Sci. 2018.
29. Gupta, S.; Gupta, S.K. A Hybrid Approach to Single Document Extractive Summarization. Int. J. Comput. Sci. Mob. Comput. 2018, 7, 142–149.
30. Boban, I.; Doko, A.; Gotovac, S. Sentence Retrieval using Stemming and Lemmatization with Different Length of the Queries. Adv. Sci. Technol. Eng. Syst. J. 2020, 5, 349–354.
Method | Ranking Function |
---|---|
Name of the Collection | Number of Topics (Queries) | Number of Documents per Topic | Number of Sentences |
---|---|---|---|
TREC 2002 | 50 | 25 | 57,792 |
TREC 2003 | 50 | 25 | 39,820 |
TREC 2004 | 50 | 25 | 52,447 |
Data Collection | Measures | |
---|---|---|---
TREC 2002 | P@10 | 0.304 | 0.32
TREC 2002 | MAP | 0.196 | * 0.204
TREC 2002 | R-prec. | 0.245 | 0.250
TREC 2003 | P@10 | 0.692 | 0.714
TREC 2003 | MAP | 0.576 | * 0.591
TREC 2003 | R-prec. | 0.547 | * 0.560
TREC 2004 | P@10 | 0.434 | 0.468
TREC 2004 | MAP | 0.324 | * 0.335
TREC 2004 | R-prec. | 0.336 | * 0.355
Data Collection | Measures | |
---|---|---|---
TREC 2002 | P@10 | 0.142 | * 0.33
TREC 2002 | MAP | 0.105 | * 0.209
TREC 2002 | R-prec. | 0.097 | * 0.255
TREC 2003 | P@10 | 0.628 | * 0.75
TREC 2003 | MAP | 0.464 | * 0.601
TREC 2003 | R-prec. | 0.4281 | * 0.565
TREC 2004 | P@10 | 0.366 | * 0.472
TREC 2004 | MAP | 0.242 | * 0.342
TREC 2004 | R-prec. | 0.236 | * 0.363
Data Collection | Measures | |
---|---|---|---
TREC 2002 | P@10 | 0.268 | * 0.356
TREC 2002 | MAP | 0.170 | * 0.207
TREC 2002 | R-prec. | 0.215 | * 0.250
TREC 2003 | P@10 | 0.71 | 0.7
TREC 2003 | MAP | 0.528 | * 0.597
TREC 2003 | R-prec. | 0.501 | * 0.567
TREC 2004 | P@10 | 0.388 | * 0.458
TREC 2004 | MAP | 0.287 | * 0.334
TREC 2004 | R-prec. | 0.306 | * 0.355
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).