Num-Symbolic Homophonic Social Net-Words
Abstract
1. Introduction
2. Literature Review
2.1. Supervised Chinese Neologism Discovery
2.2. Unsupervised Chinese Neologism Discovery
3. Materials and Methods
3.1. N-Gram
3.2. Trie (Tree)
3.3. Information Entropy (IE)
3.4. Pointwise Mutual Information (PMI)
3.5. Bidirectional Encoder Representations from Transformers (BERT)
3.6. Cosine Similarity
4. Experiment
4.1. Algorithm Overview
- Set the initial dictionary Di.
- Parse the user-generated content UGCi from the social web pages, filtering out hyperlinks, quote replies, news sharing, and announcements.
- Using an unsupervised approach, compute the statistical features of the user-generated content as a data pipeline: the Trie-tree probability Pi, pointwise mutual information PMIi, information entropy IEi, and word frequency WFi. Retain the candidate words that score above the thresholds and collect them as the set SETi.
- Screen out the out-of-vocabulary words OOVi by comparing the candidate word set SETi with the initial dictionary Di. The steps above correspond to Algorithm 1; the steps below correspond to Algorithm 2.
- Compare the sentences containing Ni with other sentences by mean pooling.
- Calculate the cosine similarity between the sentences via the included angle.
- Update the vocabulary dictionary Di+1.
Algorithm 1 Neologisms.
Input:
Di: initial dictionary.
RAW: PTT post set with hyperlinks, quote replies, sharing, and announcements trimmed out.
SETi: candidate set of character combinations.
OOVi: the subset of SETi that does not exist in Di.
Output:
Ni: neologisms.
1. Set the initial dictionary Di.
2. Obtain clean UGCi by trimming the noise from RAW.
3. Build SETi by calculating the probability Pi of each character combination with a Trie tree.
4. For each candidate of SETi in UGCi:
5. – Pointwise mutual information PMIi >= predefined threshold.
6. – Information entropy IEi >= predefined threshold.
7. – Term frequency TFi >= predefined threshold.
8. Return UGCi.
9. Obtain Ni by mapping SETi against Di.
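As a rough illustration of steps 3–7, the sketch below counts character n-grams, scores them with PMI and with the entropy of their neighbouring characters, and keeps candidates above the thresholds. The n-gram range, threshold values, and helper names are illustrative assumptions, not the authors' exact implementation (which builds the counts over a Trie).

```python
import math
from collections import Counter, defaultdict

def extract_candidates(corpus, max_n=3, min_freq=2, pmi_min=1.0, ie_min=0.0):
    """Score character n-grams with PMI and boundary information entropy (IE)."""
    char_counts = Counter()
    ngram_counts = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    total = 0
    for sent in corpus:
        total += len(sent)
        char_counts.update(sent)
        for n in range(2, max_n + 1):
            for i in range(len(sent) - n + 1):
                g = sent[i:i + n]
                ngram_counts[g] += 1
                if i > 0:                      # record left neighbour
                    left[g][sent[i - 1]] += 1
                if i + n < len(sent):          # record right neighbour
                    right[g][sent[i + n]] += 1

    def entropy(counter):
        s = sum(counter.values())
        return -sum(c / s * math.log(c / s) for c in counter.values()) if s else 0.0

    candidates = {}
    for g, f in ngram_counts.items():
        if f < min_freq:
            continue
        # PMI: joint probability vs. product of single-character probabilities
        p_joint = f / total
        p_indep = 1.0
        for ch in g:
            p_indep *= char_counts[ch] / total
        pmi = math.log(p_joint / p_indep)
        # IE: min of left/right neighbour entropy (freedom at the word boundary)
        ie = min(entropy(left[g]), entropy(right[g]))
        if pmi >= pmi_min and ie >= ie_min:
            candidates[g] = (f, pmi, ie)
    return candidates

def neologisms(candidates, dictionary):
    """OOV words: candidates absent from the initial dictionary Di."""
    return {w: s for w, s in candidates.items() if w not in dictionary}
```

Step 9 then reduces to a set difference between the surviving candidates and the initial dictionary, as in `neologisms` above.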
Algorithm 2 Similarity verification.
Input:
Ni: neologisms.
SentNi: sentences containing Ni.
RSentNi: randomly extracted sentences for similarity comparison with SentNi.
Output:
Similarity.
1. Pad the sentences to the maximum length.
2. Compare the sentences containing Ni with the comparison sentences by mean pooling.
3. Calculate the cosine similarity between the sentence vectors via the included angle.
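The mean-pooling and cosine-similarity steps of Algorithm 2 can be sketched with plain NumPy. The toy token vectors below stand in for BERT hidden states; in the actual experiment these would come from the fine-tuned model, with the attention mask marking the padding positions introduced in step 1.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of a sentence, ignoring padded positions."""
    mask = np.asarray(attention_mask, dtype=float)[:, None]   # (seq_len, 1)
    emb = np.asarray(token_embeddings, dtype=float)           # (seq_len, dim)
    return (emb * mask).sum(axis=0) / np.clip(mask.sum(), 1e-9, None)

def cosine_similarity(u, v):
    """cos(theta) of the included angle between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Applying `mean_pool` to the sentence containing Ni and to the comparison sentence yields one vector each, and `cosine_similarity` between those two vectors is the value reported in the similarity tables.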
4.2. Algorithm Flowchart
4.3. Dataset
4.4. Evaluation Index
4.5. BNShCN Detection
4.6. Contextualized Evaluation
4.7. Comparative Experiment
4.8. Analysis of Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- PTT Web Forum. Available online: https://www.ptt.cc/bbs/index.html (accessed on 20 December 2021).
- Liu, T.-J.; Hsieh, S.-K.; Prévot, L. Observing features of PTT neologisms: A corpus-driven study with N-gram model. In Proceedings of the 25th Conference on Computational Linguistics and Speech Processing (ROCLING 2013), Kaohsiung, Taiwan, 4–5 October 2013; pp. 250–259.
- Huang, L.-F.; Liu, X.; Ng, V. Associating sentimental orientation of Chinese neologism in social media data. In Proceedings of the 2015 IEEE 19th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Calabria, Italy, 6–8 May 2015; IEEE: New York, NY, USA, 2015; pp. 240–246.
- Cole, J.R.; Ghafurian, M.; Reitter, D. Is word adoption a grassroots process? An analysis of Reddit communities. In Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, Washington, DC, USA, 5–8 July 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 236–241.
- Muravyev, N.; Panchenko, A.; Obiedkov, S. Neologisms on Facebook. arXiv 2018, arXiv:1804.05831.
- Qian, Y.; Du, Y.; Deng, X.; Ma, B.; Ye, Q.; Yuan, H. Detecting new Chinese words from massive domain texts with word embedding. J. Inf. Sci. 2019, 45, 196–211.
- Zalmout, N.; Thadani, K.; Pappu, A. Unsupervised neologism normalization using embedding space mapping. In Proceedings of the 5th Workshop on Noisy User-Generated Text (W-NUT 2019), Hong Kong, China, 4 November 2019; pp. 425–430.
- Chu, Y.; Ruthrof, H. The social semiotic of homophone phrase substitution in Chinese netizen discourse. Soc. Semiot. 2017, 27, 640–655.
- Xu, J. Interpretation of Metaphorical Neologisms in Cognitive Linguistics under “Internet Plus”. Front. Soc. Sci. Technol. 2019, 1, 67–74.
- Li, W.; Guo, K.; Shi, Y.; Zhu, L.; Zheng, Y. DWWP: Domain-specific new words detection and word propagation system for sentiment analysis in the tourism domain. Knowl.-Based Syst. 2018, 146, 203–214.
- Wang, K.; Wu, H. Research on neologism detection in entity attribute knowledge acquisition. In Proceedings of the 5th International Conference on Computer Science, Electronics Technology and Automation, Hangzhou, China, 15 April 2017.
- Ma, B.; Zhang, N.; Liu, G.; Li, L.; Yuan, H. Semantic search for public opinions on urban affairs: A probabilistic topic modeling-based approach. Inf. Process. Manag. 2016, 52, 430–445.
- Liu, Y.-C.; Lin, C.-W. A new method to compose long unknown Chinese keywords. J. Inf. Sci. 2012, 38, 366–382.
- Liang, Y.; Yang, M.; Zhu, J.; Yiu, S.-M. Out-domain Chinese new word detection with statistics-based character embedding. Nat. Lang. Eng. 2019, 25, 239–255.
- Roll, U.; Correia, R.A.; Berger-Tal, O. Conserv. Biol. 2018, 32, 716–724.
- McCrae, J.P. Identification of adjective–noun neologisms using pretrained language models. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), Florence, Italy, 2 August 2019; pp. 135–141.
- Wang, M.; Li, X.; Wei, Z.; Zhi, S.; Wang, H. Chinese word segmentation based on deep learning. In Proceedings of the 2018 10th International Conference on Machine Learning and Computing, Macau, China, 26–28 February 2018; pp. 16–20.
- Xie, T.; Wu, B.; Wang, B. New Word Detection in Ancient Chinese Literature. In Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data, Beijing, China, 7–9 July 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 260–275.
- Xiong, Y.; Wang, Z.; Jiang, D.; Wang, X.; Chen, Q.; Xu, H.; Yan, J.; Tang, B. A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text. BMC Med. Inform. Decis. Mak. 2019, 19, 66.
- Li, X.; Zhang, K.; Zhu, Q.; Wang, Y.; Ma, J. Hybrid Feature Fusion Learning Towards Chinese Chemical Literature Word Segmentation. IEEE Access 2021, 9, 7233–7242.
- Wang, X.; Wang, M.; Zhang, Q. Realization of Chinese Word Segmentation Based on Deep Learning Method. In AIP Conference Proceedings; AIP Publishing LLC: College Park, MD, USA, 2017; p. 020150.
- Qiu, Q.; Xie, Z.; Wu, L.; Li, W. DGeoSegmenter: A dictionary-based Chinese word segmenter for the geoscience domain. Comput. Geosci. 2018, 121, 1–11.
- Qiu, X.; Pei, H.; Yan, H.; Huang, X. A Concise Model for Multi-Criteria Chinese Word Segmentation with Transformer Encoder. arXiv 2019, arXiv:1906.12035.
- Qun, N.; Yan, H.; Qiu, X.-P.; Huang, X.-J. Chinese word segmentation via BiLSTM+ Semi-CRF with relay node. J. Comput. Sci. Technol. 2020, 35, 1115–1126.
- Li, Y.; Cheng, J.; Huang, C.; Chen, Z.; Niu, W. NEDetector: Automatically extracting cybersecurity neologisms from hacker forums. J. Inf. Secur. Appl. 2021, 58, 102784.
- Sarna, G.; Bhatia, M.P.S. A probabilistic approach to automatically extract new words from social media. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, CA, USA, 18–21 August 2016; IEEE: New York, NY, USA, 2016; pp. 719–725.
- Wang, X.; Sha, Y.; Tan, J.-L.; Guo, L. Research of New Words Identification in Social Network for Monitoring Public Opinion. In Proceedings of the International Conference on Trustworthy Computing and Services, Beijing, China, 28 May–2 June 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 598–603.
- Breen, J.; Baldwin, T.; Bond, F. The Company They Keep: Extracting Japanese Neologisms Using Language Patterns. In Proceedings of the 9th Global Wordnet Conference, Singapore, 8–12 January 2018; pp. 163–171.
- Cheng, K.; Wen, X.; Zhou, K. A Survey of Internet Public Opinion and Internet New Words. In DEStech Transactions on Social Science, Education and Human Science; DEStech Publishing Inc.: Lancaster, PA, USA, 2017.
- Zhou, Q.; Chen, Y. New words recognition algorithm and application based on micro-blog hot. In Proceedings of the 2015 Seventh International Conference on Measuring Technology and Mechatronics Automation, Nanchang, China, 13–14 June 2015; IEEE: New York, NY, USA, 2015; pp. 698–700.
- Zeng, H.-L.; Zhou, C.-L.; Zheng, X.-L. A New Word Detection Method for Chinese based on local context information. J. Donghua Univ. (Engl. Ed.) 2010. Available online: https://www.researchgate.net/publication/291707984_A_new_word_detection_method_for_chinese_based_on_local_context_information (accessed on 17 December 2021).
- Li, X.; Chen, X. New Word Discovery Algorithm Based on N-Gram for Multi-word Internal Solidification Degree and Frequency. In Proceedings of the 2020 5th International Conference on Control, Robotics and Cybernetics (CRC), Wuhan, China, 16–18 October 2020; IEEE: New York, NY, USA, 2020; pp. 51–55.
- Zhao, K.; Zhang, Y.; Xing, C.; Li, W.; Chen, H. Chinese underground market jargon analysis based on unsupervised learning. In Proceedings of the 2016 IEEE Conference on Intelligence and Security Informatics (ISI), Tucson, AZ, USA, 28–30 September 2016; IEEE: New York, NY, USA, 2016; pp. 97–102.
- Chen, Q.; Cheng, G.; Li, D.; Zhang, J. Closeness Based New Word Detection Method for Mechanical Design and Manufacturing Area. J. Comput. Comput. Soc. Repub. China (CSROC) 2017, 28, 210–219.
- Yang, C.; Zhu, J. New Word Identification Algorithm in Natural Language Processing. In Proceedings of the 2020 2nd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Taiyuan, China, 23–25 October 2020; IEEE: New York, NY, USA, 2020; pp. 199–203.
- Brown, P.F.; Della Pietra, V.J.; Desouza, P.V.; Lai, J.C.; Mercer, R.L. Class-based n-gram models of natural language. Comput. Linguist. 1992, 18, 467–480.
- Gao, Y.; Zhou, L.; Zhang, Y.; Xing, C.; Sun, Y.; Zhu, X. Sentiment classification for stock news. In Proceedings of the 5th International Conference on Pervasive Computing and Applications, Hualien, Taiwan, 10–13 May 2010.
- Liang, F.M. Word Hy-phen-a-tion by Com-put-er; Department of Computer Science, Stanford University: Palo Alto, CA, USA, 1983.
- Wang, J.; Ge, B.; He, C. Domain Neural Chinese Word Segmentation with Mutual Information and Entropy. In Proceedings of the 2019 7th International Conference on Information Technology: IoT and Smart City, Shanghai, China, 20–23 December 2019; pp. 75–79.
- Shang, G. Research on Chinese New Word Discovery Algorithm Based on Mutual Information. In Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 20–22 December 2019; pp. 580–584.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Lee, Y. Systematic Homonym Detection and Replacement Based on Contextual Word Embedding. Neural Process. Lett. 2021, 53, 17–36.
- Chen, W.; Cai, Y.; Lai, K.; Yao, L.; Zhang, J.; Li, J.; Jia, X. WeiboFinder: A topic-based Chinese word finding and learning system. In Proceedings of the International Conference on Web-Based Learning, Cape Town, South Africa, 20–22 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 33–42.
- Kerremans, D.; Prokić, J. Mining the web for new words: Semi-automatic neologism identification with the NeoCrawler. Anglia 2018, 136, 239–268.
- Wang, F. Statistic Chinese New Word Recognition by Combing Supervised and Unsupervised Learning. In Proceedings of the 2019 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Xiamen, China, 16–18 December 2019; IEEE: New York, NY, USA, 2019; pp. 1239–1243.
Example | BNShCN | Original Sentence | Sentence with BNShCN Removed
---|---|---|---
Sentence 1 | 484 (yes or no) | 你484還沒用餐? (Have you eaten yet?) | 你還沒用餐 (You have not eaten yet.)
Sentence 2 | 377 (angry) | 他在家裡377 (He is at home being angry.) | 他在家裡 (He is at home.)
Item | Setting
---|---
Environment | Intel i9-11900K; DDR4-3200 96 GB; NVIDIA GeForce RTX 3070 Ti
Tokenizer | bert-base-chinese
Model | ckiplab/albert-tiny-chinese
Hyperparameters | Max sequence length = 128; learning rate = 5 × 10⁻⁵; batch size = 16; epochs = 2
Years | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 |
---|---|---|---|---|---|---|---|
Posts | 2640 | 2380 | 1240 | 1620 | 448,600 | 314,280 | 189,340 |
Frequency = 5 | IE = 0.001 | IE = 0.01 | IE = 0.1 | IE = 1 | IE = 2 | IE = 3 |
---|---|---|---|---|---|---|
PMI = 6 | 7889 | 7888 | 7869 | 4316 | 900 | 179 |
PMI = 5 | 7969 | 7969 | 7936 | 5066 | 1611 | 327 |
PMI = 4 | 12,809 | 12,809 | 12,757 | 8097 | 2551 | 478 |
PMI = 3 | 19,275 | 19,351 | 19,284 | 12,090 | 3602 | 620 |
PMI = 2 | 28,436 | 28,436 | 28,356 | 17,636 | 4930 | 818 |
PMI = 1 | 41,527 | 41,527 | 41,431 | 25,868 | 6701 | 1038 |

Frequency = 4 | IE = 0.001 | IE = 0.01 | IE = 0.1 | IE = 1 | IE = 2 | IE = 3 |
---|---|---|---|---|---|---|
PMI = 6 | 5812 | 5812 | 5796 | 3588 | 900 | 179 |
PMI = 5 | 10,018 | 10,018 | 9986 | 6141 | 1611 | 327 |
PMI = 4 | 16,074 | 16,074 | 16,022 | 9800 | 2551 | 478 |
PMI = 3 | 24,690 | 24,690 | 24,623 | 14,890 | 3602 | 620 |
PMI = 2 | 36,657 | 36,657 | 36,577 | 22,144 | 4930 | 818 |
PMI = 1 | 53,769 | 53,769 | 53,673 | 32,870 | 6701 | 1038 |

Frequency = 3 | IE = 0.001 | IE = 0.01 | IE = 0.1 | IE = 1 | IE = 2 | IE = 3 |
---|---|---|---|---|---|---|
PMI = 6 | 4650 | 4650 | 4631 | 2953 | 901 | 179 |
PMI = 5 | 13,808 | 13,808 | 13,775 | 7438 | 1611 | 327 |
PMI = 4 | 22,353 | 22,353 | 22,300 | 12,002 | 2551 | 478 |
PMI = 3 | 34,732 | 34,732 | 34,665 | 18,428 | 3602 | 620 |
PMI = 2 | 52,061 | 52,061 | 51,982 | 27,789 | 4930 | 819 |
PMI = 1 | 76,230 | 76,230 | 76,134 | 41,446 | 6701 | 1038 |
Methodology | Precision | Recall | F-Score |
---|---|---|---|
Transformer Encoder (BERT) | 87.3 | 89.2 | 88.2 |
ELMO [42] | 81.6 | 83.8 | 82.7 |
Word2Vec [6] | 66.2 | 58.2 | 61.9 |
BNShCN | Sentence with BNShCN | Positive Sentence | Similarity
---|---|---|---
484 (yes or no) | 你484在生氣 (Are you angry?) | 你是在生氣嗎 (Are you angry?) | 0.85592383
377 (angry) | 你是不是在377 (Are you angry?) | 你是在生氣嗎 (Are you angry?) | 0.8032131
8+9 (good friend) | 我跟他是8+9 (I am good friends with him.) | 我跟他是好兄弟 (I am good friends with him.) | 0.84317076
8+9 (good friend) | 我們是8+9 (We are good friends.) | 我跟他是好兄弟 (I am good friends with him.) | 0.7407087
BNShCN | Sentence with BNShCN | Negative Sentence | Similarity
---|---|---|---
484 (yes or no) | 你484在生氣 (Are you angry?) | 今天天氣真好 (The weather is nice today.) | 0.5758477
377 (angry) | 你是不是在377 (Are you angry?) | 今天天氣真好 (The weather is nice today.) | 0.56190294
8+9 (good friend) | 我跟他是8+9 (I am good friends with him.) | 我不認識他 (I do not know him.) | 0.699402
8+9 (good friend) | 我們是8+9 (We are good friends.) | 我不認識他 (I do not know him.) | 0.45311478
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chung, Y.-L.; Hsu, P.-Y.; Huang, S.-H. Num-Symbolic Homophonic Social Net-Words. Information 2022, 13, 174. https://doi.org/10.3390/info13040174