Optimizing Large Language Models on Multi-Core CPUs: A Case Study of the BERT Model
Abstract
1. Introduction
- Research problems. By taking the BERT (Bidirectional Encoder Representations from Transformers) model as an example, we aim to answer the following three research questions. (1) How can we port the BERT model onto an ARM-based multi-core CPU? (2) How can we parallelize the BERT model, and how well does it run on the multi-core CPU? (3) How do the hyper-parameters impact the performance of BERT?
- Research contributions. In this article, we port the BERT model to an ARM multi-core processor and investigate how to accelerate model fine-tuning by exploiting multi-core parallelism and by tuning hyper-parameters. First, we port the BERT model to the multi-core processor and parallelize the training of the BERT model for downstream tasks, aiming to reduce training time by utilizing the available processing cores. Second, we fine-tune the BERT model on several typical downstream natural language processing tasks, tuning the batch size, learning rate, and number of training epochs to improve its performance on these tasks. To the best of our knowledge, this is the first systematic work on the performance evaluation and optimization of the BERT model on multi-core CPUs that leaves the model accuracy unchanged. To summarize, the contributions of this work are twofold.
- We port and parallelize the BERT model on an ARM multi-core processor. By training downstream tasks on multiple cores, we use the available multi-core processor to reduce the training time of the BERT model for downstream tasks and to fully exploit the parallelism of model training (Section 3).
- We evaluate the impact of BERT hyperparameters on model performance in terms of prediction accuracy and training time. By properly optimizing these parameters, the performance of the BERT model in downstream tasks can be improved (Section 4).
2. Background and Related Work
2.1. Transformer Model
- Encoder. The encoder has multiple identical layers, each of which has two sub-layers: a multi-head self-attention layer (SL1) and a fully connected feed-forward neural network layer (SL2). Both sub-layers use residual connections and layer normalization (a minimal code sketch of one encoder layer follows after this list).
- Decoder. The decoder has a similar structure to the encoder but includes an additional masked multi-head self-attention layer. This layer masks the future positions of the input to prevent the model from seeing the subsequent words during training. The input to the decoder consists of the output from the encoder and the previous outputs of the decoder. The decoder outputs a probability distribution over the vocabulary for each position. During training, the decoder uses the ground truth from the previous positions to predict the current position. During inference, it uses its own previous predictions for the next step.
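To make the encoder structure concrete, the following is a minimal sketch of a single encoder layer. It assumes a PyTorch-style implementation; the class name, arguments, and default sizes (which match BERT-base: hidden size 768, 12 heads, feed-forward size 3072) are illustrative choices, not the implementation used in this work.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention (SL1) + feed-forward (SL2),
    each wrapped with a residual connection and layer normalization."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # SL1: multi-head self-attention, then residual connection + layer norm.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # SL2: position-wise feed-forward network, then residual connection + layer norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

# Usage example: a batch of 2 sequences of length 128 with hidden size 768.
hidden = EncoderLayer()(torch.randn(2, 128, 768))
```

BERT stacks only such encoder layers (12 in BERT-base), whereas the full Transformer additionally uses the decoder described above.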
2.2. BERT Model
2.3. BERT Variants
- Optimizing training methods. BERT models face the challenges of insufficient training data and inadequate training. To address these issues, prior work trains on larger datasets with improved pre-training procedures, e.g., RoBERTa [13].
- Using lightweight models. While BERT-based models have achieved remarkable results in various tasks, the large size of the models poses a challenge when porting them to memory-constrained mobile platforms. Thus, prior work has focused on reducing model size, e.g., DistilBERT [20] and TinyBERT [21].
3. Fine-Tuning BERT Parallelization
3.1. Porting BERT to a Multi-Core Processor
- Phytium 2000+ Architecture. We use the Phytium 2000+ multi-core processor, which integrates 64 ARM-compatible processor cores [22]. By combining efficient processor cores, a data-affine, large-scale cache-coherent architecture, and a hierarchical 2D mesh interconnection network, Phytium 2000+ reduces memory access latency and provides high computing performance, memory bandwidth, and I/O expansion capability. It is primarily employed in high-performance, high-throughput server domains, such as enterprise server systems with demanding processing and throughput requirements and large-scale internet data centers. Figure 3 shows the architectural details of the Phytium 2000+ multi-core processor.
- BERT Porting. Porting BERT requires many adaptations to the Phytium 2000+ platform. Our local desktop software environment is Windows-based, whereas Phytium 2000+ runs Linux, so the software environments and packages differ significantly. In addition, the Phytium 2000+ cluster we use has no access to the public network, so we have to manually transfer and install the required software packages when porting the BERT model to Phytium 2000+ (a minimal environment check is sketched below).
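As an illustration of the kind of sanity check that is useful during porting, the short Python snippet below verifies that the code is running on an ARM64 Linux host and reports the number of available cores. It is a hypothetical sketch using only the standard library, not the authors' porting scripts; the function name and printed messages are our own.

```python
import os
import platform

def check_port_environment():
    """Report the basic platform facts that matter when porting BERT
    to an ARM-based Linux system such as Phytium 2000+."""
    system = platform.system()     # expected: 'Linux' on Phytium 2000+
    machine = platform.machine()   # expected: 'aarch64' on Phytium 2000+
    n_cores = os.cpu_count()       # Phytium 2000+ integrates 64 cores
    print(f"system={system}, machine={machine}, cores={n_cores}")
    if machine != "aarch64":
        print("Warning: not an ARM64 host; prebuilt x86 wheels will not apply.")
    return system, machine, n_cores

if __name__ == "__main__":
    check_port_environment()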
3.2. Parallelizing the BERT Model
3.2.1. Model Parallelism and Data Parallelism
3.2.2. Strategies for Accelerating BERT Fine-Tuning
3.2.3. BERT Parallelization Implementation
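As one way to realize the data-parallel fine-tuning strategy on a multi-core CPU, the sketch below combines process-level data parallelism (PyTorch DistributedDataParallel over the CPU-friendly gloo backend) with per-process intra-op threading, so that all cores are occupied. The framework choice, worker/thread counts, toy model, and all identifiers (run_worker, WORLD_SIZE, the synthetic TensorDataset) are illustrative assumptions, not the authors' implementation.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

WORLD_SIZE = 4            # number of data-parallel worker processes
THREADS_PER_WORKER = 16   # e.g., 4 workers x 16 threads = 64 cores

def run_worker(rank, world_size):
    # Each worker joins a CPU-only process group via the gloo backend.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    torch.set_num_threads(THREADS_PER_WORKER)  # intra-op parallelism per worker

    # Toy stand-in for a BERT classification head; replace with the real model.
    model = DDP(torch.nn.Linear(768, 2))
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Synthetic data; DistributedSampler gives each worker a distinct shard.
    dataset = TensorDataset(torch.randn(512, 768), torch.randint(0, 2, (512,)))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # DDP all-reduces gradients across workers
            optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run_worker, args=(WORLD_SIZE,), nprocs=WORLD_SIZE)
```

The split between worker processes and intra-op threads is a tuning knob in itself: fewer workers with more threads favor large matrix multiplications, while more workers reduce per-step batch size per core.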
3.3. Performance Results
4. Fine-Tuning the BERT Model on Multi-Core CPUs
4.1. Parameter Tuning and Optimization of the BERT Model on Downstream Tasks
4.1.1. Introduction to the Test Sets
- ① CoLA (The Corpus of Linguistic Acceptability) Dataset [24]. This dataset is used for a single-sentence classification task, where the goal is to determine the grammatical acceptability of a given sentence. It has about 8500 training samples and 1000 test samples. Examples from the training set are shown in Table 1, where the first column indicates the source of the sentence, the second column gives the grammatical acceptability label (0 = unacceptable, 1 = acceptable), and the third column contains the sentence itself.
- ② SST-2 (The Stanford Sentiment Treebank) [25]. SST-2 is a single-sentence classification task used to determine whether a movie review expresses a positive or negative sentiment. The dataset has around 67,000 instances in the training set and 1800 instances in the test set. Examples from SST-2 are shown in Table 2. The first column contains the movie reviews (sentences), and the second column indicates whether the sentiment of the sentence is positive (1) or negative (0). The prediction accuracy is calculated as the number of correct predictions for both positive and negative examples divided by the total number of samples.
- ③ MRPC (Microsoft Research Paraphrase Corpus [26]). The MRPC dataset is a sentence pair classification task where, given a pair of sentences, the goal is to determine whether they are semantically equivalent. The dataset has around 3700 samples in the training set and 1700 samples in the test set.
- ④ RTE (Recognizing Textual Entailment) [27]. Given two text snippets, the task of RTE is to determine whether the meaning of one text can be inferred from the other. The dataset has around 2500 training samples and 3000 testing samples. The task is relevant to various NLP applications, such as question answering, information retrieval, information extraction, and text summarization.
- ⑤ WNLI (Winograd NLI) [28]. This task involves determining whether the meanings of two sentences are the same. It consists of 634 training examples and 146 testing examples. As with SST-2, the prediction accuracy is the number of correct predictions for both positive and negative examples divided by the total number of examples (see the sketch after this list).
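Both SST-2 and WNLI above are scored by plain accuracy, and the following self-contained snippet makes that computation explicit. It is an illustration only; the toy labels follow the 0/1 convention of Tables 1 and 2.

```python
def accuracy(predictions, labels):
    """Prediction accuracy: correct predictions (both positive and negative)
    divided by the total number of examples."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: 4 binary predictions, 3 of them correct -> accuracy 0.75
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```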
4.1.2. Fine-Tuning BERT on Downstream Tasks
4.2. Performance Results
4.2.1. Batch Size
4.2.2. Learning Rate
4.2.3. Training Epochs
4.2.4. BERT Optimizer
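Sections 4.2.1–4.2.4 vary the batch size, learning rate, number of training epochs, and optimizer. A compact way to organize such a study is a grid sweep, sketched below. The candidate values follow common BERT fine-tuning practice (batch sizes 16/32, learning rates 2e-5–5e-5, 2–4 epochs) rather than the exact grid evaluated in this article, the optimizer names mirror the optimizers cited in the reference list (Adam, LazyAdam, Nadam), and run_finetuning is a hypothetical callable standing in for the parallel training routine of Section 3.

```python
import itertools
import time

# Illustrative candidate values; assumptions, not the article's exact grid.
BATCH_SIZES = [16, 32]
LEARNING_RATES = [2e-5, 3e-5, 5e-5]
EPOCHS = [2, 3, 4]
OPTIMIZERS = ["adam", "lazyadam", "nadam"]

def grid_search(run_finetuning, task):
    """Sweep the hyper-parameter grid for one downstream task.

    `run_finetuning(task, batch_size=..., lr=..., epochs=..., optimizer=...)`
    is a hypothetical user-supplied callable that fine-tunes BERT and returns
    the dev-set accuracy.
    """
    results = []
    for bs, lr, ep, opt in itertools.product(BATCH_SIZES, LEARNING_RATES,
                                             EPOCHS, OPTIMIZERS):
        start = time.time()
        acc = run_finetuning(task, batch_size=bs, lr=lr, epochs=ep, optimizer=opt)
        results.append({"task": task, "batch_size": bs, "lr": lr, "epochs": ep,
                        "optimizer": opt, "accuracy": acc,
                        "train_time_s": time.time() - start})
    # Rank configurations by dev-set accuracy, best first.
    return sorted(results, key=lambda r: r["accuracy"], reverse=True)
```

Recording the wall-clock training time alongside the accuracy of each configuration is what allows the accuracy/time trade-off discussed in this section to be compared directly.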
4.3. Summary
5. Conclusions and Future Outlook
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 195.
- Le, Q.V.; Mikolov, T. Distributed Representations of Sentences and Documents. Proc. Mach. Learn. Res. 2014, 32, 1188–1196.
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Dai, A.M.; Le, Q.V. Semi-supervised Sequence Learning. In Proceedings of the 28th Conference and Workshop on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 3079–3087.
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
- Melamud, O.; Goldberger, J.; Dagan, I. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL 2016), Berlin, Germany, 11–12 August 2016; pp. 51–61.
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), New Orleans, LA, USA, 1–6 June 2018; Volume 1, pp. 2227–2237.
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf (accessed on 1 January 2024).
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186.
- Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; Wu, H. ERNIE: Enhanced Representation through Knowledge Integration. arXiv 2019, arXiv:1904.09223.
- Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 5754–5764.
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020.
- Clark, K.; Luong, M.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020.
- Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, S.K.S.; Lopes, R.G.; Ayan, B.K.; Salimans, T.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
- Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020; pp. 4163–4174.
- Fang, J.; Liao, X.; Huang, C.; Dong, D. Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+. J. Comput. Sci. Technol. 2021, 36, 33–43.
- Ben-Nun, T.; Hoefler, T. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Comput. Surv. 2019, 52, 65.
- Warstadt, A.; Singh, A.; Bowman, S.R. Neural Network Acceptability Judgments. Trans. Assoc. Comput. Linguist. 2019, 7, 625–641.
- Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642.
- Dolan, W.B.; Brockett, C. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP@IJCNLP 2005), Jeju Island, Republic of Korea, 11–13 October 2005.
- Dagan, I.; Roth, D.; Zanzotto, F.; Sammons, M. Recognizing Textual Entailment: Models and Applications. Comput. Linguist. 2015, 41, 157–159.
- He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, 3–7 May 2021.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015.
- TensorFlow. LazyAdamOptimizer—TensorFlow 1.15. 2020. Available online: https://tensorflow.google.cn/versions/r1.15/api_docs/python/tf/contrib/opt/LazyAdamOptimizer (accessed on 18 June 2022).
- Dozat, T. Incorporating Nesterov Momentum into Adam. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico, 2–4 May 2016.
Table 1. Examples from the CoLA training set (Label: 0 = unacceptable, 1 = acceptable).

| Source | Label | Sentence |
|---|---|---|
| gj04 | 1 | The weights made the rope stretch over the pulley. |
| gj04 | 1 | The mechanical doll wriggled itself loose. |
| cj99 | 1 | If you had eaten more, you would want less. |
| cj99 | 0 | The more you would want, the less you would eat. |
Table 2. Examples from the SST-2 training set (Label: 1 = positive, 0 = negative).

| Sentence | Label |
|---|---|
| it’s a charming and often affecting journey. | 1 |
| unflinchingly bleak and desperate | 0 |
| allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker. | 1 |
An example sentence pair from the MRPC training set (Quality: 1 = semantically equivalent):

| Quality | #1 ID | #2 ID | #1 String | #2 String |
|---|---|---|---|---|
| 1 | 702876 | 702977 | Amrozi accused his brother, whom he called “the witness”, of deliberately distorting his evidence. | Referring to him as only “the witness”, Amrozi accused his brother of deliberately distorting his evidence. |
An example sentence pair from the RTE training set (Label: entailment or not_entailment):

| Index | Sentence1 | Sentence2 | Label |
|---|---|---|---|
| 0 | Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation. | Christopher Reeve had an accident. | not_entailment |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).