Here, we empirically investigate the effective token unit for Korean sentence embedding. In detail, we focus on the following three research questions:
4.1. Dataset and Token Analysis
We were inspired by Cho (2020a), which carried out political-bias classification with a news article dataset, to find the optimal tokenization for Korean sentences [10]. In Korean, informal text includes many token variations, unlike formal text such as news articles. For example, “강력한 추천” (strong recommendation) is abbreviated as “강추”, “대박” (wow) is a coined word, and “멋있다” (nice) is often mistyped as “머싰다”. Because these properties of informal text are heavily influenced by the token unit that makes up a sentence, we used an informal-text corpus, the Naver Sentiment Movie Corpus (NSMC) (https://github.com/e9t/nsmc) (accessed on 2 December 2020), for the sentiment analysis task in Korean. The NSMC dataset has a trainset (150,000 reviews) and a testset (50,000 reviews) with binary sentiment labels, positive and negative [24,25]. A review in the NSMC dataset consists of no more than 140 characters, so we regarded a review as a sentence in our work. In other words, we carried out sentiment analysis on the NSMC dataset to evaluate sentence embeddings based on various token units (word, morpheme, subword, and submorpheme).
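For reference, the NSMC splits are distributed as tab-separated text files with id, document, and label columns; the following is a minimal loading sketch in Python, assuming the file names ratings_train.txt and ratings_test.txt from the repository above.

```python
import pandas as pd

# NSMC splits are tab-separated files with "id", "document", and "label" columns
# (label: 1 = positive, 0 = negative); adjust the paths to your local copies.
train = pd.read_csv("ratings_train.txt", sep="\t").dropna(subset=["document"])
test = pd.read_csv("ratings_test.txt", sep="\t").dropna(subset=["document"])

print(len(train), len(test))          # approximately 150,000 and 50,000 reviews
print(train["label"].value_counts())  # binary sentiment distribution
```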
For morpheme tokenization, we chose high-speed morphological analyzers, namely, Mecab, Okt (https://konlpy-ko.readthedocs.io/ko/v0.4.3/) (accessed on 2 December 2020), and KLT2000, which is run via index2018.exe (https://cafe.naver.com/nlpkang/3) (accessed on 2 December 2020) [26]. The tokenization results differ depending on which morphological analyzer is used. For example, as shown in Table 1, Mecab and Okt decompose the example sentence into “너무나” (very), “감동” (touch), “적” (-), “인” (ing), and “영화” (movie), whereas KLT2000 decomposes it into “너무나” (very), “감동적” (touch-), and “영화” (movie). Thus, we compared the sentiment analysis performance of sentence embeddings utilizing the three morphological analyzers.
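For illustration, the Mecab and Okt tokenizations in Table 1 can be reproduced through KoNLPy; the sketch below assumes both analyzers are installed locally, and KLT2000 appears only in a comment because, as described above, it is run through the separate index2018.exe tool.

```python
from konlpy.tag import Mecab, Okt

sentence = "너무나 감동적인 영화"  # the example sentence from Table 1

mecab = Mecab()  # requires the mecab-ko engine and dictionary to be installed
okt = Okt()

print(mecab.morphs(sentence))  # ['너무나', '감동', '적', '인', '영화'] per Table 1
print(okt.morphs(sentence))    # ['너무나', '감동', '적', '인', '영화'] per Table 1

# KLT2000 is not available through KoNLPy; it is run via index2018.exe and
# yields the coarser segmentation ['너무나', '감동적', '영화'].
```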
We used SentencePiece for subword tokenization and hypothesized that it is unnecessary to replace whitespace with a special symbol for the Korean sentiment analysis task. For subword tokenization, we therefore carried out the SWU and SWT methods, which differ in whether whitespace is replaced with a particular symbol, “_” (underbar). The SentencePiece algorithm creates the vocabulary according to the vocabulary size that is set. Cho (2020c) tested SentencePiece with vocabulary sizes of 50K, 75K, 100K, and 125K, and the results showed a performance improvement when the vocabulary size was the smallest, 50K [27]. To verify whether the trend that a smaller vocabulary size yields better performance also holds below 50K, we explored vocabulary sizes of 2K, 4K, 8K, 16K, and 32K. For submorpheme tokenization, we utilized the same morphological analyzers as for morpheme tokenization (Mecab, Okt, and KLT2000) together with SentencePiece. As we expected submorpheme tokenization to follow the same trend as subword tokenization, we explored sentence embedding with submorpheme tokenization in two cases, namely, SMU and SMT. As with subword tokenization, we explored vocabulary sizes of 2K, 4K, 8K, 16K, and 32K in the submorpheme experiments with Okt and KLT2000. For submorpheme tokenization with Mecab, we tested vocabulary sizes of 2K, 4K, 8K, and 16K because a vocabulary size of 32K does not work when a morpheme sequence produced by Mecab is applied to SentencePiece.
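The SWU/SWT and SMU/SMT vocabularies can be built with the SentencePiece Python API. The sketch below is only illustrative: nsmc_train.txt is a hypothetical input file containing one raw review (or, for the submorpheme case, one space-joined morpheme sequence) per line, and the final marker-stripping step is our assumption of how the whitespace symbol is removed for the SWT/SMT variants (SentencePiece itself marks former whitespace with the “▁” meta symbol).

```python
import sentencepiece as spm

# Train one SentencePiece model per vocabulary size on the review texts.
for vocab_size in (2000, 4000, 8000, 16000, 32000):
    spm.SentencePieceTrainer.Train(
        f"--input=nsmc_train.txt --model_prefix=sp_{vocab_size} --vocab_size={vocab_size}"
    )

sp = spm.SentencePieceProcessor()
sp.Load("sp_16000.model")

pieces = sp.EncodeAsPieces("너무나 감동적인 영화")
print(pieces)  # subword pieces, with "▁" marking former whitespace positions

# SWU keeps the whitespace marker as part of each piece; for SWT we assume the
# marker is simply stripped, so that e.g. "▁영화" and "영화" become one token.
swt_pieces = [p.lstrip("▁") for p in pieces if p.lstrip("▁")]
print(swt_pieces)
```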
We confirmed the data distribution of the NSMC dataset for each token unit (e.g., the number of tokens, OOV rate, and average token length). In Table 2, the number of tokens refers to the count of tokens in the trainset or testset of the NSMC dataset. The OOV rate is the ratio of the number of unknown tokens to the number of testset tokens, as in Equation (3), i.e., OOV rate (%) = (number of unknown testset tokens / number of testset tokens) × 100. Avg_length is the average length of tokens in the trainset or testset of the NSMC dataset. The average token length is the sum of the lengths of all tokens over the total number of tokens in the dataset, as in Equation (4), Avg_length = (1/n) Σ_{i=1}^{n} len(t_i), where n is the total number of tokens in the dataset and t_i is the i-th token. For example, “너무나/감동적인/영화” under word tokenization has an average token length of (3 + 4 + 2)/3 = 3, because this example has token lengths of 3 (너무나), 4 (감동적인), and 2 (영화), and n is 3.
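For concreteness, the statistics in Table 2 can be computed as in the following sketch; corpus_stats is a hypothetical helper, and the token lists are assumed to come from any of the tokenizers described above.

```python
def corpus_stats(train_tokens, test_tokens):
    """Compute the Table 2 statistics for one token unit.

    train_tokens / test_tokens: lists of token lists, one list per review.
    """
    train_flat = [t for review in train_tokens for t in review]
    test_flat = [t for review in test_tokens for t in review]

    vocab = set(train_flat)

    # Equation (3): OOV rate = unknown testset tokens / all testset tokens * 100.
    unknown = sum(1 for t in test_flat if t not in vocab)
    oov_rate = 100.0 * unknown / len(test_flat)

    # Equation (4): average token length = sum of token lengths / token count.
    avg_len_train = sum(len(t) for t in train_flat) / len(train_flat)
    avg_len_test = sum(len(t) for t in test_flat) / len(test_flat)

    return {
        "#tokens(train)": len(train_flat),
        "#tokens(test)": len(test_flat),
        "OOV rate (%)": oov_rate,
        "avg_length(train)": avg_len_train,
        "avg_length(test)": avg_len_test,
    }

# Worked example from the text: word tokenization of "너무나 감동적인 영화"
# gives token lengths 3, 4, and 2, so the average token length is (3 + 4 + 2) / 3 = 3.
print(corpus_stats([["너무나", "감동적인", "영화"]], [["너무나", "감동적인", "영화"]]))
```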
Table 2 shows that the larger the number of tokens in the trainset and testset, the lower the OOV rate. In addition, for subword and submorpheme tokenizations, we found only minor differences between the data distributions of SWU and SWT (or SMU and SMT), even across vocabulary sizes. In Table 2, word tokenization has the smallest number of tokens and a higher OOV rate of 26.48% compared to the other token units. This means that word tokenization is not robust to the OOV problem caused by unknown tokens. Among the morpheme tokenizations, KLT2000 showed the smallest number of tokens and the highest OOV rate of 6.53%, whereas Mecab and Okt had larger numbers of tokens and smaller OOV rates of 1.032% and 2.56%, respectively. For subword tokenization, SWU and SWT show similar data distributions, and SWT shows a slightly lower OOV rate than SWU. We expect that SWT eliminates some noise from duplicated tokens that differ only by the whitespace symbol, such as “_영화” (_movie) and “영화” (movie). We also found that, with small vocabularies, both SWU and SWT have larger numbers of tokens and lower OOV rates, but the difference between them is trivial. Submorpheme tokenization follows a similar trend to subword tokenization.
Table 2 shows a representative data distribution of the submorpheme unit with Mecab. The specific data distributions of the submorpheme unit, including Okt and KLT2000, are presented in
Table A1.
4.2. Experiments and Results
To evaluate the sentence embedding based on various token units, we carried out the Korean sentiment analysis task using SVM, MLP, and LSTM as classifiers.
Table 3 shows the sentiment analysis accuracy of sentence embedding based on word and morpheme units.
Table 4 and
Table 5 show the sentiment analysis accuracy of sentence embedding based on subword and submorpheme units, respectively. In
Table 3,
Table 4 and
Table 5, the results are reported for embedding vector sizes of 200, 250, and 300. Overall, the accuracy of LSTM was higher than that of SVM and MLP. This means that a sentence representation method that considers the order of tokens in a sentence is more effective than one that does not. Thus, we analyzed the experimental results of sentence embedding based on the LSTM performances.
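As a reference point for the classifier setup, the sketch below shows a minimal LSTM sentiment classifier in PyTorch; it is an illustration under our own simplifying assumptions (the hidden size and other hyperparameters are not taken from our experiments), not the exact experimental configuration.

```python
import torch
import torch.nn as nn

class LSTMSentimentClassifier(nn.Module):
    """Binary sentiment classifier over a sequence of token ids."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, pad_id=0):
        super().__init__()
        # embed_dim corresponds to the embedding vector size (e.g., 200, 250, or 300).
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # positive / negative

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, (last_hidden, _) = self.lstm(embedded)   # last_hidden: (1, batch, hidden_dim)
        return self.classifier(last_hidden[-1])     # (batch, 2) class logits

# Example: a batch of two already-indexed token sequences (hypothetical ids).
model = LSTMSentimentClassifier(vocab_size=16000)
logits = model(torch.tensor([[5, 42, 7, 0], [13, 2, 99, 8]]))
print(logits.shape)  # torch.Size([2, 2])
```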
As shown in
Table 3, morpheme-based sentence embedding outperforms word-based sentence embedding. Word-based sentence embedding showed an accuracy of 81.27%, which is significantly lower than that of morpheme-based sentence embedding. Among the morphological analyzers, Mecab yielded the best morpheme-based sentence embedding accuracy of 85.39%.
Table 4 shows the sentiment analysis performance of subword-based sentence embedding according to vocabulary size, comparing SWU and SWT. As shown in
Table 4, we found two key points. First, SWT outperformed SWU among the subword-based sentence embedding methods. Second, performance improved as the vocabulary size grew. Sentence embedding based on SWT achieved 85.67% accuracy, whereas sentence embedding based on SWU achieved 85.42% accuracy; the difference is trivial, but SWT indicates higher accuracy than SWU. In the comparison of vocabulary sizes, sentence embedding based on SWU and SWT achieved accuracies of 81.57% and 81.59% for a vocabulary size of 2K, respectively, whereas for a vocabulary size of 32K, they achieved accuracies of 85.34% and 85.52%. Although sentence embedding with SWU and SWT achieved their best accuracies of 85.42% and 85.67% for a vocabulary size of 16K, respectively, we found a general tendency that a larger vocabulary size improves performance in
Table 4.
We confirmed the performance of sentence embedding based on submorpheme tokenization in
Table 5. First of all, contrary to our expectation that submorpheme-based sentence embedding would outperform subword-based sentence embedding, it performed worse: the best submorpheme-based result, 85.32% accuracy with Okt_SMU, is below the best subword-based result of 85.67% with SWT. We further found three tendencies. First, among the morphological analyzers used for submorpheme tokenization, Okt showed better performance than the other analyzers. Second, SMU outperformed SMT among the submorpheme-based sentence embedding methods, unlike the results for subword-based sentence embedding. Lastly, performance improved with larger vocabulary sizes, similarly to subword-based sentence embedding. In
Table 5, sentence embedding based on SMU achieved accuracies of 84.83%, 85.32%, and 84.82% with Mecab, Okt, and KLT2000, respectively, whereas sentence embedding based on SMT achieved 84.74%, 85.16%, and 84.83%. The performance of submorpheme-based sentence embedding improved at a vocabulary size of 32K compared to 2K, just like subword-based sentence embedding.
As shown in
Table 3,
Table 4 and
Table 5, the best tokenization method for Korean sentence embedding is SWT subword tokenization with a vocabulary size of 16K, which achieves 85.67% accuracy. Multilingual BERT and KoBERT achieved 87.5% and 90.1% accuracy, respectively, on sentiment analysis with the NSMC dataset. However, our method remains competitive because multilingual BERT and KoBERT require a large amount of computation for pretraining on large corpora, whereas our method requires only a small amount of computation with simple classifiers. Our research is further valuable in suggesting the most efficient token units for sentence embedding.