Keyword Pool Generation for Web Text Collecting: A Framework Integrating Sample and Semantic Information
Abstract
1. Introduction
- (1) A framework integrating sample and semantic information for keyword pool generation was proposed, which includes both a data phase and a model phase.
- (2) Two keyword generation methods, Recursive Feature Introduction (RFI) and Recursive Feature Introduction and Elimination (RFIE), were proposed.
- (3) For the first time, this paper used feature ranking algorithms based on word embedding to construct regression models of the topic vector on the word vectors, and generated keyword pools from the perspective of model performance (i.e., model average loss).
- (4) The experimental results show that, across the feature ranking algorithms, keyword generation methods, and regression models compared, the Light Gradient Boosting Machine (LGBM) using the RFI method with SHAP-ranked features (SHAP-based + RFI) achieves the best prediction performance, and its generated keyword pools achieve the best average and cumulative similarity scores.
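The contribution list names RFI but does not spell out its steps here. The sketch below shows one plausible reading, stated as an assumption rather than the authors' exact procedure: walk through the candidate keywords in ranked order and keep a keyword only if adding it lowers the model's average loss. The function name `recursive_feature_introduction`, the `loss_fn` callback, the `min_gain` threshold, and the toy loss are all hypothetical and chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def recursive_feature_introduction(
    ranked_features: Sequence[str],
    loss_fn: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedy forward pass over features that are already ranked.

    Assumption: a feature is introduced only if it lowers the loss
    returned by `loss_fn` by more than `min_gain`.
    """
    selected: List[str] = []
    best_loss = float("inf")
    for feature in ranked_features:
        candidate = selected + [feature]
        loss = loss_fn(candidate)
        if best_loss - loss > min_gain:
            # The feature improves the model: keep it in the pool.
            selected, best_loss = candidate, loss
    return selected, best_loss


# Toy stand-in for "fit a regressor on these keywords and return its
# average loss". The numbers are invented for illustration only.
TOY_GAIN = {"vaccine": 0.4, "virus": 0.3, "mask": 0.1}


def toy_loss(words: List[str]) -> float:
    explained = sum(TOY_GAIN.get(w, 0.0) for w in words)
    # Small per-feature penalty so uninformative words raise the loss.
    return 1.0 - explained + 0.01 * len(words)


ranked = ["vaccine", "virus", "the", "mask", "weather"]
pool, pool_loss = recursive_feature_introduction(ranked, toy_loss)
print(pool, round(pool_loss, 4))
```

In practice `loss_fn` would wrap the fitting and validation of a regression model (for example a cross-validated mean squared error), and the ranking would come from a feature ranking algorithm such as a SHAP-based or tree-based importance score.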
2. Related Works
3. Methods
3.1. Framework Overview
3.2. The Phase for Web Text Collecting and Preprocessing
3.3. The Phase for Keyword Pool Generation
3.4. Algorithms and Methods
3.4.1. Word Embedding Approaches
3.4.2. Ranking Algorithm and Feature Selection
3.4.3. Regression Model
4. Experiments
4.1. Data Sources and Preprocessing
4.2. Experimental Details and Evaluation Metrics
4.2.1. Experimental Details
4.2.2. Evaluation Metrics
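The contribution list evaluates keyword pools by average and cumulative similarity scores, but the definitions are not reproduced in this section. A minimal sketch follows, assuming that each keyword and the topic are represented by embedding vectors, that similarity is cosine similarity, and that the average and cumulative scores are the mean and the sum of the per-keyword similarities. The function names and the toy vectors are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pool_similarity_scores(
    topic_vec: Sequence[float],
    keyword_vecs: Dict[str, Sequence[float]],
) -> Tuple[float, float]:
    """Return (average, cumulative) similarity of a keyword pool to the topic.

    Assumption: cumulative = sum of per-keyword cosine similarities,
    average = cumulative divided by the pool size.
    """
    sims = [cosine_similarity(topic_vec, vec) for vec in keyword_vecs.values()]
    cumulative = sum(sims)
    average = cumulative / len(sims) if sims else 0.0
    return average, cumulative


# Toy two-dimensional embeddings, invented for illustration only.
topic = [1.0, 0.0]
keyword_pool = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
avg_score, cum_score = pool_similarity_scores(topic, keyword_pool)
print(round(avg_score, 4), round(cum_score, 4))
```

Under these assumptions the cumulative score grows with pool size, while the average score does not, which is why the two can rank keyword pools differently.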
5. Results Analysis
5.1. General Performance of the Models
5.2. Evaluation of the Keyword Pools
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Method | Model | Keyword Number | MSE | S
---|---|---|---|---
RFI | RF | 11184 | 0.0824 | 0.0357
RFI | GBDT | 6788 | 0.0815 | 0.0357
RFI | XGB | 5186 | 0.0932 | 0.0357
RFI | LGBM | 11942 | 0.0835 | 0.0357
SHAP-Based + RFI | RF | 119 | 0.0735 | 0.0719
SHAP-Based + RFI | GBDT | 210 | 0.0782 | 0.0744
SHAP-Based + RFI | XGB | 49 | 0.0809 | 0.1079
SHAP-Based + RFI | LGBM | 118 | 0.0590 | 0.1659
Tree-Based + RFI | RF | 78 | 0.0682 | 0.1107
Tree-Based + RFI | GBDT | 27 | 0.0776 | 0.0834
Tree-Based + RFI | XGB | 90 | 0.0662 | 0.0740
Tree-Based + RFI | LGBM | 58 | 0.0442 | 0.1746
Tree-Based + RFE | RF | 8162 | 0.0813 | 0.1346
Tree-Based + RFE | GBDT | 14459 | 0.0802 | 0.0585
Tree-Based + RFE | XGB | 317 | 0.1189 | 0.0740
Tree-Based + RFE | LGBM | 158 | 0.0886 | 0.1746
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wu, X.; Feng, C.; Li, Q.; Zhu, J. Keyword Pool Generation for Web Text Collecting: A Framework Integrating Sample and Semantic Information. Mathematics 2024, 12, 405. https://doi.org/10.3390/math12030405