Evaluation of the Optimal Topic Classification for Social Media Data Combined with Text Semantics: A Case Study of Public Opinion Analysis Related to COVID-19 with Microblogs
Abstract
1. Introduction
2. Related Works
3. Methods
3.1. LDA Topic Model
3.2. Evaluation Methods to Determine the Optimal Number of Topics
3.3. Evaluation Method for the Optimal Topics Combined with Text Semantics
- (1) Given the number of topics (denoted by K), run the LDA model to obtain the document–topic and topic–word probability distribution matrices.
- (2) Set N, the number of words used to characterize each topic as a series of words. Each topic is then represented by the N words with the largest probability values in the topic–word matrix.
- (3) Join the words of the ith (i = 1, 2, …, K) topic into a sentence, and use the BERT model to obtain the sentence vector of the ith topic.
- (4) Use the BERT model to obtain the word vector of each word in the ith topic, and use the cosine formula to calculate the text similarity between each pair of words. The mean of these values is the intra-cluster similarity of the ith topic.
- (5) Repeat steps (3)–(4) until i = K. Then, calculate the similarity between the words of each pair of topics and take the mean value as the inter-cluster similarity, and take the mean of the per-topic intra-cluster similarities as the intra-cluster similarity for all topics. Finally, obtain the RI value as the ratio of intra-cluster similarity to inter-cluster similarity.
- (6) Choose different values of N and repeat steps (2)–(5).
- (7) Choose different values of K, rerun the LDA model as in step (1) for each new K, and repeat steps (2)–(6).
Algorithm 1: How to get RI values under different numbers of topics K
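As an illustration only, steps (2), (4), and (5) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`top_n_words`, `intra_similarity`, `inter_similarity`, `ri_value`) and the toy two-dimensional vectors are hypothetical, and in the setting described above the vectors would come from the BERT model. The sketch uses word-level similarities for both the intra- and inter-cluster terms and assumes N ≥ 2 and K ≥ 2, so that every mean is taken over at least one pair.

```python
import math
from itertools import combinations, product


def top_n_words(topic_word_probs, vocabulary, n):
    """Step (2): the n words with the largest probability in one topic row."""
    ranked = sorted(zip(topic_word_probs, vocabulary),
                    key=lambda pair: pair[0], reverse=True)
    return [word for _, word in ranked[:n]]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean(values):
    """Arithmetic mean of a non-empty iterable."""
    values = list(values)
    return sum(values) / len(values)


def intra_similarity(words, embed):
    """Step (4): mean cosine similarity over all word pairs within one topic."""
    return mean(cosine(embed[a], embed[b]) for a, b in combinations(words, 2))


def inter_similarity(words_a, words_b, embed):
    """Step (5): mean cosine similarity over word pairs from two different topics."""
    return mean(cosine(embed[a], embed[b]) for a, b in product(words_a, words_b))


def ri_value(topics, embed):
    """Step (5): ratio of mean intra-topic to mean inter-topic similarity."""
    intra = mean(intra_similarity(words, embed) for words in topics)
    inter = mean(inter_similarity(a, b, embed)
                 for a, b in combinations(topics, 2))
    return intra / inter


# Toy data (hypothetical): 2-D vectors standing in for BERT word vectors.
toy_embed = {
    "virus": [1.0, 0.1], "case": [0.9, 0.2],
    "mom": [0.1, 1.0], "home": [0.2, 0.9],
}
toy_topics = [["virus", "case"], ["mom", "home"]]
toy_ri = ri_value(toy_topics, toy_embed)
```

With well-separated toy topics such as these, the ratio exceeds 1. Steps (6) and (7) would wrap `ri_value` in loops over N and K, rerunning the topic model for each K.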
4. Experiment and Results
4.1. Comparison Experiments Based on the Standard Corpus
4.2. Case Study of Public Opinion Analysis during COVID-19 Epidemic in Wuhan
4.2.1. Data Sets and Preprocessing
4.2.2. Time-Series Analysis
4.2.3. Spatial Distribution of the COVID-19 Related Microblogs
4.2.4. Evaluation Experiment on Determining the Optimal Number of Topics
4.3. Analysis and Discussion
4.3.1. Time-Series Analysis of Public Opinion Topics
4.3.2. Spatial Distribution of Topics of Public Opinion
5. Conclusions
- (1) The text-semantics-based method for finding the optimal number of topics is essentially an objective representation of subjective experience: by considering semantic similarity between and within topics, it can, to some extent, identify the number of topics that is easiest to interpret. The validity of the proposed method was verified using a standard Chinese news corpus. In practical application, the method also proved feasible for public opinion analysis during the COVID-19 epidemic period: the temporal and spatial distributions of the five topics generated by the LDA model were all closely related to the development trends of the epidemic. Because the results of LDA depend closely on the text content, the corpus must be cleaned and filtered carefully when selecting the number of topics with this method.
- (2) Considering the spatial and temporal distribution characteristics of check-in microblogs in Wuhan, there was a certain correlation between check-in microblogs and the number of confirmed COVID-19 cases. Therefore, it is still feasible to analyze public opinion using the text content of check-in microblogs instead of all microblogs, although the results may be biased. In addition, once the text has been classified into five topic categories using LDA, the temporal and spatial information carried by the check-in microblogs can be used to further analyze public opinion topics in Wuhan from temporal and spatial aspects.
- (3) Our research showed that the time series of the different topics all peaked on the day when Wuhan went into lockdown, and that their fluctuations were consistent with the text content of each topic. For example, “appreciation and praying” reached a peak on the National Day of Mourning. The trends of the topics over time were in sync with the development of the epidemic situation, indicating that the public was greatly influenced by the Internet and policies while the COVID-19 epidemic was gradually suppressed.
- (4) The spatial distributions of the public opinion topics were mainly clustered in the residential areas within the Fifth Ring Road of Wuhan. On the one hand, because most people stayed at home for epidemic prevention and control, the spatial distributions of the different topics differed only slightly; one exception was a hot-spot of the “appreciation and praying” topic around Leishenshan Hospital. On the other hand, check-in posts on Weibo are highly correlated with the population distribution: the population density in the center of Wuhan was higher than in the surrounding areas, so the urban center remained a hot-spot for check-in behavior. Other hot-spots can be identified from the spatial distributions of the individual public opinion topics. Based on these results, public opinion can be managed in a differentiated way, and its direction can be guided more accurately.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Category | Headline Text | Number of Instances in the Category |
---|---|---|
finance | Fund online hit a situation exposure, the proportion and retail investors have no difference. | 18,000 |
realty | The pain of high land price in Beijing property market: land costs account for more than 30% of the housing price. | 18,000 |
education | Microblogs are a popular way for college entrance exam candidates to relieve stress. | 18,000 |
science | Buyers will see: recommendations for mobile phones that are worth buying in November | 18,000 |
society | A woman has been looking for her husband for 14 years before she finds out he has a new family. | 18,000 |
politics | US President Barack Obama has postponed the move of the US embassy in Israel. | 18,000 |
sports | Maldini: AC Milan are far behind Real Madrid and Barcelona, Ibrobi is unlikely to start a new dynasty. | 18,000 |
game | Theme Day Activity of online game “King of Kings 3” in Tencent version. | 18,000 |
entertainment | Jay Chou responded to the poor ratings by saying that it is normal to lose sometimes | 18,000 |
Category | Words That Describe the Topic | Actual Number of Documents | The Number of Documents Classified Correctly |
---|---|---|---|
finance | fund, future, goods, market, company | 18,000 | 17,999 |
realty | price, opening, fine decoration, villa, Beijing | 18,000 | 17,986 |
education | college entrance examination, graduate school exam, enrollment, examinee, offer | 18,000 | 17,996 |
science | cellphone, Sony, Canon, internet, Nikon | 18,000 | 17,937 |
society | man, woman, driver, ten thousand yuan, dead | 18,000 | 17,989 |
politics | dead, president, happened, Obama, Iran | 18,000 | 17,987 |
sports | Rocket, Barcelona, Real Madrid, Milan, player | 18,000 | 17,996 |
game | game, online, online game, publish, player | 18,000 | 17,981 |
entertainment | deny, expose, pose, star, respond | 18,000 | 17,985 |
User_id | Created_at (GMT + 8) | Text Content | Check-in Location in GCJ-02 Coordinate System |
---|---|---|---|
cc44af5e7e03be03 | 31 December 2019 21:43:58 | This morning’s epidemic news didn’t dampen Wuhan people’s enthusiasm for New Year’s Eve in Jiangtan. [ha-ha] Wuhan⋅Jiangtan; show map. | “114.298421,30.57753” |
806ac40de73607d7 | 28 January 2020 15:40 | The sun is shining for the first time since the lockdown of the city. I just want to clean my house. I really hope the epidemic will pass soon. Wuhan will win! #Go Wuhan# Wuhan; show map. | “114.200958,30.600012” |
86a81fd176af326f | 26 April 2020 19:01 | For the better resumption of work and production in Wuhan, we will not rest [doge]. Wuhan⋅ Wuhan Sixth Hospital; show map. | “114.28953,30.60014” |
Datasets | Time Period (GMT + 8) | Space Range | Number of Records | Number of Contributing Users | Fields |
---|---|---|---|---|---|
THUCNews | None | None | 162,000 | None | Category/headline text |
All check-in microblogs | From 1 December 2019 0:00 to 30 April 2020 23:59 | Wuhan city | 617,032 | 124,281 | User_id/time/location/text content |
COVID-19-related check-in microblogs | From 1 December 2019 0:00 to 30 April 2020 23:59 | Wuhan city | 46,774 | 22,733 | User_id/time/location/text content |
ID | Topic Summary | Words That Describe the Topic | Number of Microblogs |
---|---|---|---|
1 | Family care | Mom, at home, friend, family, dad, go home, child, life, work, on duty | 11,912 |
2 | Home life | Mask, work resumption, go out, community, lockdown, supermarket, lift the lockdown, at home, on duty, express | 11,444 |
3 | Epidemic report | Virus, confirm, case, new, pneumonia, COVID-19, China, infect, country, coronavirus | 6,918 |
4 | Response status | Hospital, community, patient, quarantine, community, detect, nucleic acid, confirm, doctor, infect | 6,618 |
5 | Appreciation and praying | Frontline, appreciate, city, people, medical workers, anti-epidemic, China, hero, early, national | 9,882 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Liang, Q.; Hu, C.; Chen, S. Evaluation of the Optimal Topic Classification for Social Media Data Combined with Text Semantics: A Case Study of Public Opinion Analysis Related to COVID-19 with Microblogs. ISPRS Int. J. Geo-Inf. 2021, 10, 811. https://doi.org/10.3390/ijgi10120811