An Empirical Study of Korean Sentence Representation with Various Tokenizations
Abstract
1. Introduction
2. Related Work
3. Sentence Representation with Tokenization
3.1. Tokenization
3.2. Sentence Representation Methods
4. Experiments
- Which tokenization is more robust to the OOV problem?
- Is the symbol that replaces whitespace meaningful for Korean sentiment analysis?
- What is the optimal vocabulary size for sentence embedding?
4.1. Dataset and Token Analysis
4.2. Experiments and Results
5. Analysis and Discussion
- Which tokenization is more robust to the OOV problem?
- Is the symbol that replaces whitespace meaningful for Korean sentiment analysis?
- What is the optimal vocabulary size for sentence embedding?
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Token statistics of the submorpheme tokenizations on the trainset and testset by vocabulary size.

| Tokenization | Vocab Size | N(Token) Trainset | N(Token) Testset | OOV Rate (%) | Avg. Length Trainset | Avg. Length Testset |
|---|---|---|---|---|---|---|
| Mecab_SMU | 2K | 4,143,841 | 1,385,508 | 0.029 | 1.701 | 1.702 |
| Mecab_SMU | 4K | 3,475,644 | 1,163,011 | 0.045 | 2.028 | 2.028 |
| Mecab_SMU | 8K | 3,338,314 | 1,118,941 | 0.057 | 2.111 | 2.108 |
| Mecab_SMU | 16K | 3,291,225 | 1,104,407 | 0.084 | 2.142 | 2.136 |
| Mecab_SMT | 2K | 4,101,705 | 1,371,610 | 0.029 | 1.048 | 1.048 |
| Mecab_SMT | 4K | 3,450,677 | 1,154,684 | 0.042 | 1.246 | 1.245 |
| Mecab_SMT | 8K | 3,315,186 | 1,111,229 | 0.053 | 1.296 | 1.294 |
| Mecab_SMT | 16K | 3,268,987 | 1,097,026 | 0.071 | 1.315 | 1.310 |
| Okt_SMU | 2K | 3,976,273 | 1,330,247 | 0.028 | 1.623 | 1.623 |
| Okt_SMU | 4K | 2,992,410 | 1,001,011 | 0.048 | 2.157 | 2.157 |
| Okt_SMU | 8K | 2,683,202 | 899,937 | 0.071 | 2.405 | 2.399 |
| Okt_SMU | 16K | 2,531,714 | 851,292 | 0.105 | 2.549 | 2.537 |
| Okt_SMU | 32K | 2,460,044 | 829,834 | 0.17 | 2.624 | 2.602 |
| Okt_SMT | 2K | 3,940,007 | 1,318,202 | 0.028 | 1.091 | 1.090 |
| Okt_SMT | 4K | 2,985,670 | 998,713 | 0.044 | 1.440 | 1.439 |
| Okt_SMT | 8K | 2,680,690 | 899,083 | 0.062 | 1.603 | 1.599 |
| Okt_SMT | 16K | 2,530,037 | 850,718 | 0.089 | 1.699 | 1.690 |
| Okt_SMT | 32K | 2,458,502 | 829,318 | 0.13 | 1.748 | 1.733 |
| KLT2000_SMU | 2K | 3,051,372 | 1,019,474 | 0.034 | 1.530 | 1.531 |
| KLT2000_SMU | 4K | 2,188,325 | 731,568 | 0.057 | 2.133 | 2.134 |
| KLT2000_SMU | 8K | 1,906,344 | 639,554 | 0.095 | 2.449 | 2.441 |
| KLT2000_SMU | 16K | 1,748,764 | 589,289 | 0.162 | 2.669 | 2.649 |
| KLT2000_SMU | 32K | 1,659,831 | 563,303 | 0.27 | 2.812 | 2.772 |
| KLT2000_SMT | 2K | 3,013,758 | 1,007,099 | 0.034 | 1.085 | 1.085 |
| KLT2000_SMT | 4K | 2,182,286 | 729,598 | 0.055 | 1.498 | 1.498 |
| KLT2000_SMT | 8K | 1,904,128 | 638,851 | 0.087 | 1.717 | 1.711 |
| KLT2000_SMT | 16K | 1,747,658 | 588,951 | 0.145 | 1.871 | 1.856 |
| KLT2000_SMT | 32K | 1,658,897 | 562,998 | 0.23 | 1.971 | 1.941 |
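The submorpheme vocabularies above are built at fixed sizes (2K to 32K). As a minimal sketch of that setup, the snippet below trains a SentencePiece model on morpheme-segmented training text at each vocabulary size; the input file name, the model prefix, and the unigram model type are illustrative assumptions, not details stated in the paper.

```python
# Minimal sketch of building submorpheme vocabularies: a SentencePiece model is
# trained on morpheme-segmented training text at each vocabulary size.
# "train_morphemes.txt" is a hypothetical file with one morpheme-segmented
# review per line; the unigram model type is assumed (BPE is the other option).
import sentencepiece as spm

for vocab_size in (2000, 4000, 8000, 16000, 32000):
    spm.SentencePieceTrainer.train(
        input="train_morphemes.txt",
        model_prefix=f"okt_submorpheme_{vocab_size // 1000}k",
        vocab_size=vocab_size,
        model_type="unigram",
    )
```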
| Token Unit | Tokenization Result |
|---|---|
| Raw Text | 너무나 감동적인 영화 (very touching movie) |
| Word | 너무나/감동적인/영화 (very/touching/movie) |
| Morpheme | 너무나/감동/적/인/영화 (very/touch/-/ing/movie) |
| Subword | _너무/나/_감동/적인/_영화 (_very/-/_touch/-ing/_movie) |
| Submorpheme | _너무/나/_감동/_적/_인/_영화 (_very/-/_touch/_-/_ing/_movie) |
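The table above contrasts the four token units on a single review. Below is a minimal sketch of how such tokenizations could be produced, assuming KoNLPy's Okt for morpheme analysis and SentencePiece for subword segmentation; the .model file names are hypothetical placeholders, and the experiments also use Mecab and KLT2000, which are not shown here.

```python
# Minimal sketch of producing the four token units (word, morpheme, subword,
# submorpheme). Assumes konlpy and sentencepiece are installed; the model files
# "subword.model" and "submorpheme.model" are hypothetical placeholders.
from konlpy.tag import Okt
import sentencepiece as spm

raw_text = "너무나 감동적인 영화"   # "very touching movie"

# 1) Word: split on whitespace.
word_tokens = raw_text.split()

# 2) Morpheme: morphological analysis of the raw text.
morpheme_tokens = Okt().morphs(raw_text)

# 3) Subword: a SentencePiece model trained on raw sentences.
sp_subword = spm.SentencePieceProcessor(model_file="subword.model")
subword_tokens = sp_subword.encode(raw_text, out_type=str)

# 4) Submorpheme: a SentencePiece model trained on morpheme-segmented sentences,
#    applied to the space-joined morpheme sequence.
sp_submorpheme = spm.SentencePieceProcessor(model_file="submorpheme.model")
submorpheme_tokens = sp_submorpheme.encode(" ".join(morpheme_tokens), out_type=str)

print(word_tokens, morpheme_tokens, subword_tokens, submorpheme_tokens, sep="\n")
```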
| Token Unit | Tokenization | Vocab Size | N(Token) Trainset | N(Token) Testset | OOV Rate (%) | Avg. Length Trainset | Avg. Length Testset |
|---|---|---|---|---|---|---|---|
| Word | - | - | 1,137,736 | 380,472 | 26.48 | 3.778 | 3.778 |
| Morpheme | Mecab | - | 2,750,801 | 921,106 | 1.032 | 1.562 | 1.560 |
| Morpheme | Okt | - | 2,306,158 | 771,970 | 2.56 | 1.993 | 1.991 |
| Morpheme | KLT2000 | - | 1,399,134 | 468,458 | 6.53 | 2.337 | 2.333 |
| Subword | SWU | 2K | 3,986,624 | 1,332,618 | 0.028 | 1.363 | 1.364 |
| Subword | SWU | 4K | 2,951,483 | 986,317 | 0.047 | 1.842 | 1.843 |
| Subword | SWU | 8K | 2,536,203 | 849,876 | 0.077 | 2.143 | 2.139 |
| Subword | SWU | 16K | 2,262,086 | 762,284 | 0.121 | 2.403 | 2.385 |
| Subword | SWU | 32K | 2,062,594 | 701,982 | 0.202 | 2.635 | 2.590 |
| Subword | SWT | 2K | 3,930,835 | 1,314,239 | 0.028 | 1.093 | 1.094 |
| Subword | SWT | 4K | 2,934,992 | 980,809 | 0.044 | 1.464 | 1.466 |
| Subword | SWT | 8K | 2,528,770 | 847,393 | 0.068 | 1.700 | 1.696 |
| Subword | SWT | 16K | 2,257,431 | 760,706 | 0.105 | 1.904 | 1.890 |
| Subword | SWT | 32K | 2,058,975 | 700,666 | 0.173 | 2.087 | 2.051 |
| Submorpheme | Mecab_SMU | 2K | 4,143,841 | 1,385,508 | 0.029 | 1.701 | 1.702 |
| Submorpheme | Mecab_SMU | 4K | 3,475,644 | 1,163,011 | 0.045 | 2.028 | 2.028 |
| Submorpheme | Mecab_SMU | 8K | 3,338,314 | 1,118,941 | 0.057 | 2.111 | 2.108 |
| Submorpheme | Mecab_SMU | 16K | 3,291,225 | 1,104,407 | 0.084 | 2.142 | 2.136 |
| Submorpheme | Mecab_SMT | 2K | 4,101,705 | 1,371,610 | 0.029 | 1.048 | 1.048 |
| Submorpheme | Mecab_SMT | 4K | 3,450,677 | 1,154,684 | 0.042 | 1.246 | 1.245 |
| Submorpheme | Mecab_SMT | 8K | 3,315,186 | 1,111,229 | 0.053 | 1.296 | 1.294 |
| Submorpheme | Mecab_SMT | 16K | 3,268,987 | 1,097,026 | 0.071 | 1.315 | 1.310 |
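The statistics in the table above (token counts, OOV rate, and average token length) can be derived once a tokenizer is fixed. The helper below is a minimal sketch under the assumption that the OOV rate is the percentage of test-set tokens absent from the train-set vocabulary and that token length is measured in characters; neither definition is confirmed by the excerpt.

```python
# Minimal sketch of the corpus statistics: token counts, OOV rate (%) of the
# test set against a train-set vocabulary, and average token length in
# characters. `tokenize` is any of the tokenizers discussed above; the OOV and
# length definitions are assumptions made for illustration.
def corpus_statistics(train_sentences, test_sentences, tokenize):
    train_tokens = [tok for sent in train_sentences for tok in tokenize(sent)]
    test_tokens = [tok for sent in test_sentences for tok in tokenize(sent)]

    vocab = set(train_tokens)
    oov_rate = 100.0 * sum(tok not in vocab for tok in test_tokens) / len(test_tokens)

    return {
        "n_token_train": len(train_tokens),
        "n_token_test": len(test_tokens),
        "oov_rate": oov_rate,
        "avg_length_train": sum(len(tok) for tok in train_tokens) / len(train_tokens),
        "avg_length_test": sum(len(tok) for tok in test_tokens) / len(test_tokens),
    }

# Example with simple whitespace (word) tokenization:
print(corpus_statistics(["너무나 감동적인 영화"], ["정말 감동적인 영화"], str.split))
```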
| Tokenization | SVM-200 | SVM-250 | SVM-300 | MLP-200 | MLP-250 | MLP-300 | LSTM-200 | LSTM-250 | LSTM-300 |
|---|---|---|---|---|---|---|---|---|---|
| Word | 71.84 | 77.79 | 66.25 | 76.31 | 77.53 | 68.76 | 81.27 | 80.8 | 80.81 |
| Morpheme (Mecab) | 81.08 | 82.06 | 82.07 | 81.06 | 82.08 | 81.95 | 85.32 | 85.39 | 84.96 |
| Morpheme (Okt) | 81.86 | 83.13 | 82.82 | 82.83 | 83.16 | 82.44 | 85.17 | 82.94 | 85.11 |
| Morpheme (KLT2000) | 82.07 | 81.05 | 82.36 | 81.81 | 82.4 | 82.24 | 83.86 | 84.07 | 83.91 |
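For the SVM and MLP columns, each review has to be mapped to a fixed-length vector before classification. The sketch below assumes the common recipe of averaging word2vec token embeddings and feeding the result to a linear SVM; gensim, scikit-learn, the toy data, and all hyperparameters are illustrative and not the authors' confirmed configuration.

```python
# Minimal sketch of a fixed-vector sentiment classifier: average the word2vec
# embeddings of a review's tokens and classify with a linear SVM. The averaging
# step and every hyperparameter below are assumptions for illustration.
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC

# Toy tokenized corpus (one token list per review) with binary sentiment labels.
tokenized = [
    ["너무나", "감동", "적", "인", "영화"],    # positive
    ["지루", "하", "고", "재미", "없", "다"],  # negative
]
labels = [1, 0]

# Train 200-dimensional token embeddings on the tokenized corpus.
w2v = Word2Vec(sentences=tokenized, vector_size=200, min_count=1, epochs=10)

def sentence_vector(tokens):
    """Average the embeddings of in-vocabulary tokens (zero vector if none)."""
    vecs = [w2v.wv[tok] for tok in tokens if tok in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.stack([sentence_vector(tokens) for tokens in tokenized])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```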
| Tokenization | Vocab Size | SVM-200 | SVM-250 | SVM-300 | MLP-200 | MLP-250 | MLP-300 | LSTM-200 | LSTM-250 | LSTM-300 |
|---|---|---|---|---|---|---|---|---|---|---|
| SWU | 2K | 78.77 | 79.02 | 79.26 | 78.91 | 79.04 | 79.32 | 81.57 | 81.56 | 81.44 |
| SWU | 4K | 81.52 | 81.6 | 81.78 | 81.89 | 81.93 | 82.17 | 84.49 | 84.13 | 83.98 |
| SWU | 8K | 82.87 | 82.88 | 82.96 | 83.06 | 83.16 | 83.32 | 85.23 | 84.99 | 84.98 |
| SWU | 16K | 83.09 | 83.11 | 83.23 | 83.38 | 83.49 | 83.57 | 85.42 | 85.33 | 85.35 |
| SWU | 32K | 83.19 | 83.22 | 83.33 | 83.48 | 83.6 | 83.69 | 85.34 | 84.96 | 84.69 |
| SWT | 2K | 63.67 | 74.75 | 76.09 | 65.67 | 78 | 76.33 | 81.59 | 81.57 | 81.58 |
| SWT | 4K | 78.99 | 78.72 | 78.77 | 79.5 | 79.21 | 78.84 | 84.12 | 83.8 | 83.26 |
| SWT | 8K | 82.77 | 82.75 | 83.03 | 83.13 | 82.96 | 83.12 | 85.05 | 85.18 | 84.81 |
| SWT | 16K | 83.13 | 83.06 | 83.33 | 83.43 | 83.33 | 83.68 | 85.67 | 84.93 | 85.27 |
| SWT | 32K | 83.27 | 83.43 | 83.56 | 83.58 | 83.57 | 83.94 | 85.52 | 85.24 | 84.71 |
| Tokenization | Vocab Size | SVM-200 | SVM-250 | SVM-300 | MLP-200 | MLP-250 | MLP-300 | LSTM-200 | LSTM-250 | LSTM-300 |
|---|---|---|---|---|---|---|---|---|---|---|
| Mecab_SMU | 2K | 79.11 | 79.4 | 79.57 | 79.25 | 79.4 | 79.67 | 82.74 | 82.87 | 82.44 |
| Mecab_SMU | 4K | 81.02 | 81.3 | 81.47 | 81.38 | 81.55 | 81.72 | 84.55 | 84.72 | 84.76 |
| Mecab_SMU | 8K | 81.05 | 81.27 | 81.34 | 81.41 | 81.51 | 81.6 | 84.56 | 84.79 | 84.75 |
| Mecab_SMU | 16K | 80.71 | 81.23 | 81.48 | 81.15 | 81.45 | 81.74 | 84.76 | 84.83 | 84.43 |
| Mecab_SMT | 2K | 74.3 | 76.68 | 78.19 | 74.97 | 76.85 | 79.27 | 82.12 | 82.05 | 81.29 |
| Mecab_SMT | 4K | 79.65 | 79.65 | 79.47 | 80.84 | 79.03 | 80.9 | 84.1 | 84.17 | 84.24 |
| Mecab_SMT | 8K | 81.03 | 79.32 | 77.92 | 80.87 | 78.16 | 79.15 | 84.35 | 84.34 | 84.39 |
| Mecab_SMT | 16K | 79.87 | 80.88 | 80.4 | 80.15 | 81.5 | 79.58 | 84.48 | 84.60 | 84.74 |
| Okt_SMU | 2K | 78.91 | 79.13 | 79.45 | 79.1 | 79.38 | 79.43 | 82.37 | 82.23 | 82.32 |
| Okt_SMU | 4K | 81.44 | 81.59 | 81.71 | 81.83 | 82 | 81.98 | 84.34 | 84.52 | 83.7 |
| Okt_SMU | 8K | 82.11 | 82.32 | 82.52 | 82.53 | 82.72 | 82.94 | 85.32 | 84.94 | 85.03 |
| Okt_SMU | 16K | 82.49 | 82.56 | 82.83 | 82.76 | 83.19 | 83.18 | 85.0 | 85.25 | 84.76 |
| Okt_SMU | 32K | 82.72 | 82.74 | 82.86 | 83.04 | 83.11 | 83.37 | 85.01 | 84.5 | 83.96 |
| Okt_SMT | 2K | 76.36 | 77.39 | 68.06 | 78.75 | 78.9 | 71.39 | 82.01 | 82.03 | 81.33 |
| Okt_SMT | 4K | 80.62 | 81.23 | 81.55 | 81.36 | 81.66 | 81.47 | 83.69 | 84.03 | 84.05 |
| Okt_SMT | 8K | 82.05 | 82.57 | 82.48 | 82.62 | 83.06 | 82.68 | 84.73 | 85.10 | 84.81 |
| Okt_SMT | 16K | 82.38 | 82.66 | 82.83 | 82.87 | 83.12 | 83.23 | 85.06 | 84.77 | 85.16 |
| Okt_SMT | 32K | 82.49 | 82.59 | 82.66 | 82.9 | 83.12 | 83.1 | 84.93 | 84.61 | 84.77 |
| KLT2000_SMU | 2K | 78.35 | 78.44 | 78.68 | 78.52 | 78.71 | 78.92 | 81.62 | 81.44 | 81.3 |
| KLT2000_SMU | 4K | 80.84 | 80.91 | 81.34 | 81.43 | 81.32 | 81.64 | 83.84 | 83.8 | 83.3 |
| KLT2000_SMU | 8K | 82.07 | 82.02 | 82.03 | 82.6 | 82.28 | 82.46 | 84.69 | 84.22 | 84.5 |
| KLT2000_SMU | 16K | 82.14 | 82.08 | 82.37 | 82.45 | 82.49 | 82.81 | 84.82 | 84.27 | 84.24 |
| KLT2000_SMU | 32K | 82.21 | 82.23 | 82.36 | 82.58 | 82.76 | 82.96 | 84.55 | 84.47 | 84.48 |
| KLT2000_SMT | 2K | 77.9 | 76.13 | 76.26 | 77.84 | 77.05 | 76.96 | 81.04 | 81.3 | 80.76 |
| KLT2000_SMT | 4K | 80.75 | 81.23 | 80.98 | 81.42 | 81.22 | 81.4 | 83.57 | 83.49 | 83.05 |
| KLT2000_SMT | 8K | 81.93 | 81.93 | 82.41 | 82.31 | 82.54 | 82.95 | 84.49 | 84.5 | 84.46 |
| KLT2000_SMT | 16K | 82.21 | 82.38 | 82.43 | 82.58 | 82.88 | 82.85 | 84.66 | 84.83 | 84.47 |
| KLT2000_SMT | 32K | 82.21 | 82.24 | 82.35 | 82.42 | 82.51 | 82.86 | 84.35 | 84.31 | 84.26 |
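The LSTM columns instead consume the token sequence directly rather than a single averaged vector. Below is a minimal PyTorch sketch of such a classifier; the hidden size, the single LSTM layer, and the binary output head are assumptions for illustration, not the configuration reported in the paper.

```python
# Minimal sketch of an LSTM sentiment classifier over token embeddings.
# PyTorch and all hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=200, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids from one of the tokenizers above
        embedded = self.embedding(token_ids)
        _, (last_hidden, _) = self.lstm(embedded)
        return self.classifier(last_hidden[-1])  # logits over {negative, positive}

# Example: a 16K vocabulary with 200-dimensional embeddings,
# applied to a batch of 8 sentences of 40 token ids each.
model = LSTMSentimentClassifier(vocab_size=16000, embed_dim=200)
logits = model(torch.randint(1, 16000, (8, 40)))
```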