Lightweight Pre-Trained Korean Language Model Based on Knowledge Distillation and Low-Rank Factorization
Abstract
1. Introduction
2. Methods
2.1. Teacher Model Training
2.2. Knowledge Distillation
2.3. Low-Rank Factorization
- KR-ELECTRA-Small-LF-V1 (35 MB): In this variant, low-rank factorization is applied exclusively to the FFN.
- KR-ELECTRA-Small-LF-V2 (30 MB): Here, low-rank factorization is applied to both the FFN and the multi-head attention mechanism.
- KR-ELECTRA-Small-LF-V3 (18 MB): This variant extends the application of low-rank factorization to the FFN, the multi-head attention mechanism, and the embedding layer.
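The factorization underlying all three variants can be sketched as follows: a dense weight matrix W (d_out × d_in) is approximated by the product of two thin matrices via truncated SVD, shrinking the parameter count from d_out·d_in to rank·(d_out + d_in). This is a generic illustration of the technique, not the authors' implementation; the function name and the choice of a 256→1024 FFN projection (the Small configuration) are ours.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (d_out x d_in) as A (d_out x rank) @ B (rank x d_in)
    using a truncated SVD, the standard low-rank factorization."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into A
    B = Vt[:rank, :]
    return A, B

# Example: one FFN projection of the Small model (hidden 256 -> inner 1024).
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 256))
A, B = low_rank_factorize(W, rank=64)

original_params = W.size             # 1024 * 256 = 262,144
factorized_params = A.size + B.size  # 64 * (1024 + 256) = 81,920
print(original_params, factorized_params)
```

At inference time the layer computes `(A @ (B @ x))` instead of `W @ x`, which also reduces the multiply count when the rank is small relative to both dimensions.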
3. Experiments and Discussion
3.1. Experimental Setup
3.2. Datasets
- Naver Sentiment Movie Corpus (NSMC) [23]: A binary sentiment classification task, where the goal is to predict whether a given movie review is positive or negative. The dataset consists of 200,000 movie reviews collected from the Naver movie review website. Each review is shorter than 140 characters. The numbers of positive and negative reviews are balanced. Examples are shown in Table 3.
- Korean Hate Speech Dataset (KOHATE) [24]: A binary classification task, where the goal is to identify hate speech in online comments. The dataset consists of 9381 comments from Korean entertainment news aggregation platforms, annotated for social bias and hate speech.
- Korean Natural Language Inference (KorNLI) [25]: A natural language inference (NLI) task, where the goal is to determine the relationship between a premise and a hypothesis (entailment, contradiction, or neutral). The dataset consists of 570,000 sentence pairs, translated from the English MultiNLI dataset.
- Korean Semantic Textual Similarity (KorSTS) [25]: A semantic textual similarity (STS) task, where the goal is to predict the similarity score between two sentences. The dataset consists of 8628 sentence pairs, translated from the English STS Benchmark dataset.
- Korean Question Answering Dataset (KorQuAD) 1.0 [27]: A machine reading comprehension (MRC) task, where the goal is to answer a question given a passage of text. The dataset consists of over 60,000 question–answer pairs based on Korean Wikipedia articles. Each example consists of a passage, a question, and the start and end positions of the answer span. It is structured in the same way as the Stanford Question Answering Dataset (SQuAD v1.0).
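For the classification datasets, preprocessing is straightforward; NSMC, for instance, is distributed as tab-separated files. The sketch below parses the public release's `id<TAB>document<TAB>label` layout (label 1 = positive), using the two Table 3 reviews as sample rows; the ids here are illustrative, not taken from the paper.

```python
import csv
import io

# Two rows in the NSMC release format: id, document, label (1 = positive).
sample = (
    "id\tdocument\tlabel\n"
    "1000001\t아 더빙.. 진짜 짜증나네요 목소리\t0\n"
    "1000002\t액션이 없는데도 재미 있는 몇 안되는 영화\t1\n"
)

def load_nsmc(fileobj):
    """Parse NSMC-style TSV into (text, label) pairs."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    return [(row["document"], int(row["label"])) for row in reader]

pairs = load_nsmc(io.StringIO(sample))
print(len(pairs), pairs[0][1], pairs[1][1])
```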
3.3. Results
3.3.1. Knowledge Distillation Results (KR-ELECTRA-Small-KD)
3.3.2. Low-Rank Factorization Results (KR-ELECTRA-Small-LF)
3.3.3. Comparison with Other Models
3.3.4. Inference Time
3.3.5. Ablation Study on Low-Rank Factorization
3.3.6. Impact of Distillation and Factorization Strategies
3.4. Discussion
3.5. Limitations
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2, Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA, 2014; pp. 3104–3112. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 31 March 2025).
- Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv 2020, arXiv:1909.08053. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
- Kim, B.; Kim, H.; Lee, S.-W.; Lee, G.; Kwak, D.; Dong Hyeon, J.; Park, S.; Kim, S.; Kim, S.; Seo, D.; et al. What Changes Can Large-Scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-Scale Korean Generative Pretrained Transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, 7–11 November 2021; Moens, M.-F., Huang, X., Specia, L., Yih, S.W., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3405–3424. [Google Scholar]
- Mnassri, K.; Farahbakhsh, R.; Crespi, N. Multilingual Hate Speech Detection: A Semi-Supervised Generative Adversarial Approach. Entropy 2024, 26, 344. [Google Scholar] [CrossRef] [PubMed]
- Takata, R.; Masumori, A.; Ikegami, T. Spontaneous Emergence of Agent Individuality Through Social Interactions in Large Language Model-Based Communities. Entropy 2024, 26, 1092. [Google Scholar] [CrossRef] [PubMed]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
- Zafrir, O.; Boudoukh, G.; Izsak, P.; Wasserblat, M. Q8BERT: Quantized 8Bit BERT. In Proceedings of the 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing—NeurIPS Edition (EMC2-NIPS), Vancouver, BC, Canada, 13 December 2019; IEEE Computer Society: Washington, DC, USA, 2019; pp. 36–39. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar] [CrossRef]
- Kinakh, V.; Drozdova, M.; Voloshynovskiy, S. MV–MR: Multi-Views and Multi-Representations for Self-Supervised Learning and Knowledge Distillation. Entropy 2024, 26, 466. [Google Scholar] [CrossRef] [PubMed]
- Gordon, M.; Duh, K.; Andrews, N. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. In Proceedings of the 5th Workshop on Representation Learning for NLP, Online, 9 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 143–155. [Google Scholar]
- Novikov, A.; Podoprikhin, D.; Osokin, A.; Vetrov, D. Tensorizing Neural Networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems—Volume 1, Montreal, QC, Canada, 7–12 December 2015; MIT Press: Cambridge, MA, USA, 2015; Volume 1, pp. 442–450. [Google Scholar]
- Clark, K.; Luong, M.-T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the 8th International Conference on Learning Representations ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020; Curran Associates, Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
- Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4163–4174. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. In Proceedings of the 8th International Conference on Learning Representations ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Zhang, Z.; Lu, Y.; Wang, T.; Wei, X.; Wei, Z. DDK: Dynamic Structure Pruning Based on Differentiable Search and Recursive Knowledge Distillation for BERT. Neural Netw. 2024, 173, 106164. [Google Scholar] [CrossRef] [PubMed]
- Hentschel, M.; Nishikawa, Y.; Komatsu, T.; Fujita, Y. Keep Decoding Parallel With Effective Knowledge Distillation from Language Models To End-To-End Speech Recognisers. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10876–10880. [Google Scholar]
- Park, L. Naver Sentiment Movie Corpus, v1.0. 2025. Available online: https://github.com/e9t/nsmc (accessed on 31 March 2025).
- Moon, J.; Cho, W.I.; Lee, J. BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection. In Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media, Online, 10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 25–31. [Google Scholar]
- Ham, J.; Choe, Y.J.; Park, K.; Choi, I.; Soh, H. KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 422–430. [Google Scholar]
- Naver NLP Challenge. 2025. Available online: https://github.com/naver/nlp-challenge (accessed on 31 March 2025).
- Lim, S.; Kim, M.; Lee, J. KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension. arXiv 2019, arXiv:1909.07005. [Google Scholar]
Structure of ELECTRA | Number of Parameters | Percentage of Total |
---|---|---|
Feed-Forward Network | 56,623,104 | 50% |
Embedding Layer | 24,576,000 | 22% |
Structure of Lightweight ELECTRA | Number of Parameters
---|---
Lightweight Feed-Forward Network | 14,155,776
Lightweight Embedding Layer | 6,144,000
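The counts in the two tables above follow from the base configuration (12 layers, hidden size 768, FFN inner size 3072); the vocabulary size of 32,000 below is inferred from the embedding count rather than stated in this section. A quick arithmetic check:

```python
layers, hidden, ffn_inner = 12, 768, 3072

# Each FFN has an up-projection (hidden x ffn_inner) and a
# down-projection (ffn_inner x hidden); biases are omitted.
ffn_params = layers * 2 * hidden * ffn_inner
print(ffn_params)  # 56623104 -- matches the table

# The embedding count implies a ~32,000-token vocabulary (inferred).
vocab = 32_000
emb_params = vocab * hidden
print(emb_params)  # 24576000

# The lightweight counts correspond to a 4x reduction of each block.
print(ffn_params // 4, emb_params // 4)  # 14155776 6144000
```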
Hyperparameter | Tiny | Small | Base
---|---|---|---
Number of Layers | 12 | 12 | 12
Hidden Size | 128 | 256 | 768
FFN Inner Hidden Size | 512 | 1024 | 3072
Attention Heads | 2 | 4 | 12
Attention Head Size | 64 | 64 | 64
Embedding Size | 64 | 128 | 768
Parameters | 4 M | 14 M | 110 M
Model Size | 17 MB | 53 MB | 432 MB
Task | Learning Rate | Batch Size | Epochs | Early Stopping
---|---|---|---|---
NSMC | 3 × 10⁻⁵ | 32 | 3 | No
KoHate | 3 × 10⁻⁵ | 32 | 3 | No
KorNLI | 3 × 10⁻⁵ | 32 | 3 | No
KorSTS | 3 × 10⁻⁵ | 32 | 3 | No
NER | 3 × 10⁻⁵ | 16 | 5 | Yes
KorQuAD | 3 × 10⁻⁵ | 8 | 2 | No
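The per-task settings in the table above can be collected into a single config map for reproducibility. The structure and names (`FINETUNE_CONFIG`, `config_for`) are illustrative, not the authors' code; the values are taken directly from the table.

```python
# Fine-tuning settings from the table above, as a config map.
FINETUNE_CONFIG = {
    "NSMC":    {"lr": 3e-5, "batch_size": 32, "epochs": 3, "early_stopping": False},
    "KoHate":  {"lr": 3e-5, "batch_size": 32, "epochs": 3, "early_stopping": False},
    "KorNLI":  {"lr": 3e-5, "batch_size": 32, "epochs": 3, "early_stopping": False},
    "KorSTS":  {"lr": 3e-5, "batch_size": 32, "epochs": 3, "early_stopping": False},
    "NER":     {"lr": 3e-5, "batch_size": 16, "epochs": 5, "early_stopping": True},
    "KorQuAD": {"lr": 3e-5, "batch_size": 8,  "epochs": 2, "early_stopping": False},
}

def config_for(task):
    """Look up the fine-tuning hyperparameters for a downstream task."""
    return FINETUNE_CONFIG[task]

print(config_for("KorQuAD")["batch_size"])  # 8
```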
 | Sentence and Meaning | Label
---|---|---
Korean | 아 더빙.. 진짜 짜증나네요 목소리 | Negative
Meaning | The dubbing... the voice is really annoying | 
Korean | 액션이 없는데도 재미 있는 몇 안되는 영화 | Positive
Meaning | One of the few movies that is fun even without action | 
 | Sentence and Meaning | Label
---|---|---
Korean | P: 저는, 그냥 알아내려고 거기 있었어요. H: 이해하려고 노력하고 있었어요. | Entailment
Meaning | P: I was there just to find out. H: I was trying to understand. | 
Korean | P: 저는, 그냥 알아내려고 거기 있었어요. H: 나는 처음부터 그것을 잘 이해했다. | Contradiction
Meaning | P: I was there just to find out. H: I understood it well from the beginning. | 
Korean | P: 저는, 그냥 알아내려고 거기 있었어요. H: 나는 돈이 어디로 갔는지 이해하려고 했어요. | Neutral
Meaning | P: I was there just to find out. H: I was trying to understand where the money went. | 
Model | Model Size (MB) | Hyperparameter | NSMC (ACC) | Naver NER (F1) | KorNLI (ACC) | KorSTS (Spearman) | KorQuAD (EM/F1) | Korean HateSpeech (F1) | Avg
---|---|---|---|---|---|---|---|---|---
KR-ELECTRA-Base * | 432 | 1.0 | 89.324 | 87.896 | 80.878 | 81.722 | 59.369 / 88.993 | 66.273 | 79.208
KR-ELECTRA-Small | 53 | 1.0 | 88.798 | 85.409 | 77.485 | 76.809 | 57.256 / 86.544 | 61.598 | 76.271
KR-ELECTRA-Small-KD | 53 | 0.3 | 89.262 | 85.432 | 77.764 | 77.367 | 57.066 / 86.716 | 64.236 | 76.835
KR-ELECTRA-Small-KD | 53 | 0.5 | 89.720 | 85.873 | 78.223 | 76.076 | 57.637 / 87.143 | 65.302 | 77.139
KR-ELECTRA-Small-KD | 53 | 0.7 | 89.428 | 85.603 | 77.365 | 75.424 | 57.516 / 86.865 | 64.301 | 76.643
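If the Hyperparameter column lists the distillation weight α in a standard combined objective L = α·L_soft + (1 − α)·L_hard, in the style of Hinton et al. [12] — an assumption on our part; the paper's exact loss may differ — the distillation step can be sketched as:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, true_label, alpha=0.5, T=2.0):
    """Generic knowledge-distillation objective: alpha weights the soft
    (teacher-matching) term against the hard cross-entropy term.
    A sketch in the Hinton et al. style, not the paper's exact loss."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Soft term: cross-entropy against teacher soft targets, scaled by
    # T^2 as recommended by Hinton et al.
    soft = -(p_teacher * np.log(p_student)).sum() * T**2
    # Hard term: ordinary cross-entropy against the gold label.
    hard = -np.log(softmax(student_logits)[true_label])
    return alpha * soft + (1 - alpha) * hard

loss = kd_loss([2.0, 0.5], [1.5, 0.2], true_label=0, alpha=0.5)
print(loss > 0)  # True
```

Under this reading, the rows above sweep α over {0.3, 0.5, 0.7}, with α = 0.5 giving the best average.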
Model | Model Size (MB) | Hyperparameter | NSMC (ACC) | Naver NER (F1) | KorNLI (ACC) | KorSTS (Spearman) | KorQuAD (EM/F1) | Korean HateSpeech (F1) | Avg
---|---|---|---|---|---|---|---|---|---
KR-ELECTRA-Base * | 432 | 1.0 | 89.324 | 87.896 | 80.878 | 81.722 | 59.369 / 88.993 | 66.273 | 79.208
KR-ELECTRA-Small-LF-V1 | 35 | 1.0 | 88.062 | 82.965 | 74.730 | 73.460 | 53.983 / 83.125 | 61.165 | 73.927
KR-ELECTRA-Small-LF-V1-KD | 35 | 0.3 | 88.614 | 83.769 | 75.089 | 74.463 | 51.974 / 81.110 | 60.783 | 73.686
KR-ELECTRA-Small-LF-V1-KD | 35 | 0.5 | 89.012 | 84.983 | 75.828 | 73.889 | 53.134 / 82.477 | 63.389 | 74.673
KR-ELECTRA-Small-LF-V1-KD | 35 | 0.7 | 88.726 | 83.990 | 74.750 | 73.728 | 53.186 / 82.256 | 60.877 | 73.930
Model | Model Size (MB) | Hyperparameter | NSMC (ACC) | Naver NER (F1) | KorNLI (ACC) | KorSTS (Spearman) | KorQuAD (EM/F1) | Korean HateSpeech (F1) | Avg
---|---|---|---|---|---|---|---|---|---
KR-ELECTRA-Base * | 432 | 1.0 | 89.324 | 87.896 | 80.878 | 81.722 | 59.369 / 88.993 | 66.273 | 79.208
KR-ELECTRA-Small-LF-V2 | 30 | 1.0 | 87.898 | 81.517 | 74.151 | 72.091 | 51.437 / 80.142 | 60.734 | 72.567
KR-ELECTRA-Small-LF-V2-KD | 30 | 0.3 | 88.248 | 84.152 | 75.069 | 74.845 | 55.143 / 84.685 | 62.048 | 74.884
KR-ELECTRA-Small-LF-V2-KD | 30 | 0.5 | 88.884 | 84.922 | 75.608 | 73.899 | 56.096 / 85.810 | 63.612 | 75.547
KR-ELECTRA-Small-LF-V2-KD | 30 | 0.7 | 88.346 | 84.614 | 76.147 | 73.514 | 55.438 / 84.713 | 64.132 | 75.272
Model | Model Size (MB) | Hyperparameter | NSMC (ACC) | Naver NER (F1) | KorNLI (ACC) | KorSTS (Spearman) | KorQuAD (EM/F1) | Korean HateSpeech (F1) | Avg
---|---|---|---|---|---|---|---|---|---
KR-ELECTRA-Base * | 432 | 1.0 | 89.324 | 87.896 | 80.878 | 81.722 | 59.369 / 88.993 | 66.273 | 79.208
KR-ELECTRA-Tiny | 17 | 1.0 | 87.454 | 78.878 | 72.355 | 71.019 | 51.627 / 80.791 | 59.120 | 71.606
KR-ELECTRA-Small-LF-V3 | 18 | 1.0 | 87.506 | 81.906 | 73.912 | 71.302 | 51.783 / 80.626 | 61.862 | 72.700
KR-ELECTRA-Small-LF-V3-KD | 18 | 0.3 | 88.378 | 83.439 | 75.369 | 72.930 | 54.624 / 84.094 | 62.129 | 74.423
KR-ELECTRA-Small-LF-V3-KD | 18 | 0.5 | 88.138 | 83.537 | 74.011 | 74.220 | 54.121 / 83.528 | 63.571 | 74.447
KR-ELECTRA-Small-LF-V3-KD | 18 | 0.7 | 88.374 | 84.237 | 73.333 | 73.004 | 54.104 / 83.443 | 61.362 | 73.980
Model | Inference Time (ms/Sentence) |
---|---|
KR-ELECTRA-Base | 12.5 |
KR-ELECTRA-Small-KD | 3.2 |
KR-ELECTRA-Small-LF-V1 | 2.1 |
KR-ELECTRA-Small-LF-V2 | 1.8 |
KR-ELECTRA-Small-LF-V3 | 1.1 |
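As a quick sanity check on the timings above, the implied speedups over the Base model work out as follows (simple division of the per-sentence latencies; variant labels are shortened for brevity):

```python
# Speedups implied by the inference-time table (ms per sentence).
base = 12.5  # KR-ELECTRA-Base
variants = {"Small-KD": 3.2, "LF-V1": 2.1, "LF-V2": 1.8, "LF-V3": 1.1}
speedups = {name: round(base / t, 1) for name, t in variants.items()}
print(speedups)  # LF-V3 is about 11.4x faster than Base
```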
Model Configuration | NSMC (ACC) | NER (F1) | KorNLI (ACC) | KorSTS (Spearman) | KorQuAD (EM/F1) | Korean HateSpeech (F1) | Avg | Model Size
---|---|---|---|---|---|---|---|---|
Without LF (KD only) | 89.720 | 85.873 | 78.223 | 76.076 | 57.637/87.143 | 65.302 | 77.139 | 52 MB |
With LF (KD + LF-FFN) | 89.012 | 84.983 | 75.828 | 73.889 | 53.134/82.477 | 63.389 | 74.673 | 35 MB |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kim, J.-H.; Choi, Y.-S. Lightweight Pre-Trained Korean Language Model Based on Knowledge Distillation and Low-Rank Factorization. Entropy 2025, 27, 379. https://doi.org/10.3390/e27040379