Article

Optimizing Large Language Models on Multi-Core CPUs: A Case Study of the BERT Model

Lanxin Zhao, Wanrong Gao and Jianbin Fang
1 School of International Business, Hunan University of Information Technology, Changsha 410151, China
2 School of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(6), 2364; https://doi.org/10.3390/app14062364
Submission received: 14 January 2024 / Revised: 28 February 2024 / Accepted: 7 March 2024 / Published: 11 March 2024
(This article belongs to the Special Issue Design and Application of High-Performance Computing Systems)

Abstract

The BERT model is regarded as the cornerstone of various pre-trained large language models that have achieved promising results in recent years. This article investigates how to optimize the BERT model in terms of fine-tuning speed and prediction accuracy, aiming to accelerate the execution of the model on a multi-core processor and to improve its prediction accuracy on typical downstream natural language processing tasks. Our contributions are two-fold. First, we port and parallelize the fine-tuning training of the BERT model on a multi-core shared-memory processor, accelerating the fine-tuning process of the model for downstream tasks. Second, we improve the prediction performance of typical downstream natural language processing tasks by tuning the model hyperparameters. We select five typical downstream natural language processing tasks (CoLA, SST-2, MRPC, RTE, and WNLI) and perform optimization on the multi-core platform, taking the hyperparameters of batch size, learning rate, and training epochs into account. Our experimental results show that, by increasing the number of CPUs and the number of threads, the model training time can be significantly reduced, and that the time savings are primarily concentrated in the self-attention computation. Our further experimental results show that setting reasonable hyperparameters can improve the accuracy of the BERT model when applied to downstream tasks and that appropriately increasing the batch size under conditions of sufficient computing resources can significantly reduce training time.

1. Introduction

Pre-trained large language models (LLMs) are trained on large amounts of text data drawn from everyday use, allowing the models to learn the occurrence probabilities of words or characters within this text [1]. As a result, LLMs can model the distributions present in these text data. The training data for LLMs consist of textual contexts, and these models can be trained with virtually unlimited amounts of text, giving LLMs powerful text processing abilities and excellent performance in various downstream tasks. Early LLMs mainly focused on learning word embeddings, but these embeddings only vectorize words and have limited practical effect in downstream tasks. Therefore, training models with word embeddings often results in poor prediction accuracy, as observed in models such as Skip-gram [2] and GloVe [3]. Moreover, these models fail to capture long-range contextual relationships.
Starting in 2016, the main development trends for LLMs have been modeling relationships between long-distance texts and pre-training models on large-scale corpora. Dai et al. used language modeling and sequence autoencoders to improve the sequence learning of recurrent neural networks (RNNs), marking the beginning of modern LLMs [4]. They introduced the idea of using pre-trained models for downstream tasks and demonstrated its effectiveness on several classification tasks. Since then, pre-trained models and their fine-tuning for downstream tasks have gradually gained attention.
With the increase in computing capability, deep learning has continuously evolved. Hochreiter et al. proposed the unidirectional LSTM [5], which can only capture unidirectional semantic information from the context [3]. Melamud et al. made a pioneering contribution by introducing the bidirectional LSTM, which can learn information from both directions of the context [6]. They showed for the first time that semantic information can be embedded in word embeddings trained on a large-scale unlabeled corpus, where the vectors representing words contain the semantic information of those words. Thus, this approach achieves significantly better performance than traditional word embedding models. In 2018, ELMo trained a model to predict the current word using contextual information, with a two-layer LSTM to extract syntactic and semantic information [7]. This model shows outstanding performance in multiple NLP tasks and can tackle the problem of word ambiguity. Subsequently, pre-trained models such as GPT [8,9] and BERT [10] came into being, leading to remarkable advancements in natural language processing tasks.
Following the success of GPT [8] and BERT [10], many pre-trained models have been proposed, showcasing the exceptional performance of pre-trained models in downstream tasks. Examples include ERNIE [11], SpanBERT [12], RoBERTa [13], XLNet [14], ALBERT [15], and ELECTRA [16]. More recently, the generic large language models, pre-trained on text-only corpora, have become surprisingly effective at encoding text for image synthesis, i.e., the text-to-image diffusion model [17,18]. As a result, these pre-trained models have achieved impressive results in various NLP downstream tasks. At the same time, these pre-trained large language models require a lot of computing and memory resources for both model pre-training and fine-tuning.
  • Research problems. By taking the BERT (Bidirectional Encoder Representations from Transformers) model as an example, we aim to answer the following three research questions. (1) How can we port the BERT model onto an ARM-based multi-core CPU? (2) How can we parallelize the BERT model, and how well does it run on the multi-core CPU? (3) How do the hyper-parameters impact the performance of BERT?
  • Research contributions. In this article, we port the BERT model to an ARM multi-core processor and investigate how to accelerate model fine-tuning by using multi-core parallelism and by tuning hyper-parameters. First, we port the BERT model to the multi-core processor, and then we parallelize the training of the BERT model for downstream tasks, aiming to reduce training time by utilizing the available processing cores. Second, we fine-tune several typical downstream natural language processing tasks in terms of batch size, learning rate, and training epochs, in order to improve the performance of the BERT model in downstream tasks. To the best of our knowledge, this is the first systematic work on the performance evaluation and optimization of the BERT model on multi-core CPUs, without changing the model accuracy. To summarize, the contributions of this work are twofold.
    • We port and parallelize the BERT model to an ARM multi-core processor. By using multiple CPUs to train downstream tasks based on the BERT model, the available multi-core processor can be used to reduce the training time of the model for downstream tasks and fully exploit the parallelism of the model training (Section 3).
    • We evaluate the impact of BERT hyperparameters on model performance in terms of prediction accuracy and training time. By properly optimizing these parameters, the performance of the BERT model in downstream tasks can be improved (Section 4).

2. Background and Related Work

This section first introduces the Transformer model and then describes the BERT model and its related work.

2.1. Transformer Model

The Transformer model [19], introduced by Google in 2017, is a neural network architecture built upon attention mechanisms instead of the recurrent neural networks (RNNs) used in LSTM-based models. This design allows for better parallelization during training and thus can take advantage of large-scale computing clusters. The model has achieved promising results in machine translation tasks. Figure 1 shows that the Transformer has an encoder and a decoder.
  • Encoder. The encoder has multiple identical layers, each of which has two sub-layers: a multi-head self-attention layer (SL1), and a fully connected feed-forward neural network layer (SL2). Both sub-layers use residual connections and layer normalization.
An attention function can be described as mapping a query and a set of key–value pairs to an output, where the query, keys, values, and output are all vectors [19]. The attention mechanism in the multi-head self-attention layer can be represented in Equation (1):
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V, \qquad (1)
where Q, K, and V are obtained by multiplying the input vector with the matrices W_q, W_k, and W_v, respectively. Note that W_q, W_k, and W_v are learned during model training. The multi-head attention mechanism trains multiple sets of W_q, W_k, and W_v matrices to project the input vector differently. The outputs of all attention heads are concatenated.
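To make Equation (1) concrete, the following NumPy sketch (our own illustration, not code from the BERT implementation) computes single-head scaled dot-product attention; the input X and the projection matrices are random placeholders rather than learned parameters:

    import numpy as np

    def scaled_dot_product_attention(X, W_q, W_k, W_v):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v           # project the input tokens
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)               # similarity between queries and keys
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                            # weighted sum of the values

    seq_len, d_model, d_k = 8, 16, 16
    X = np.random.rand(seq_len, d_model)
    W_q, W_k, W_v = (np.random.rand(d_model, d_k) for _ in range(3))
    output = scaled_dot_product_attention(X, W_q, W_k, W_v)   # shape: (8, 16)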
The feed-forward neural network layer (SL2) mainly provides a non-linear transformation, using two fully connected layers. The first layer uses ReLU as the activation function, while the second layer does not use an activation function.
  • Decoder. The decoder has a similar structure to the encoder but includes an additional masked multi-head self-attention layer. This layer masks the future positions of the input to prevent the model from seeing the subsequent words during training. The input to the decoder consists of the output from the encoder and the previous outputs of the decoder. The decoder outputs a probability distribution over the vocabulary for each position. During training, the decoder uses the ground truth from the previous positions to predict the current position. During inference, it uses its own previous predictions for the next step.
After the decoder, the Transformer model also includes a linear layer and a Softmax layer. These layers project the decoder outputs to the dimension of the corresponding output word vocabulary. Each dimension represents the predicted probability of a word in the vocabulary, and the word with the largest probability is selected as the output.

2.2. BERT Model

BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional language model based on the encoder part of the Transformer model (i.e., the left part of Figure 1). BERT alleviates the unidirectionality constraint by using a “masked language model” (MLM) pre-training objective [10]. Specifically, the BERT model is pre-trained with two unsupervised tasks: masked language model (MLM) and next sentence prediction (NSP). As for MLM, approximately 15% of the tokens are selected, with 80% of them replaced with [MASK] tokens, 10% left unchanged, and 10% replaced with other random tokens. The model utilizes contextual semantic information to predict the masked words at these positions. As for NSP, given two sentences (A and B), the model needs to determine whether B logically follows A.
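As an illustration of this masking rule, the following Python sketch (a toy example of our own, not the actual BERT preprocessing code) applies the 15%/80%/10%/10% scheme to a list of tokens:

    import random

    def mlm_mask(tokens, vocab, mask_rate=0.15):
        # Apply BERT's MLM corruption rule: of the selected tokens,
        # 80% become [MASK], 10% stay unchanged, 10% become a random token.
        masked, labels = list(tokens), {}
        for i, tok in enumerate(tokens):
            if random.random() < mask_rate:
                labels[i] = tok                       # the model must predict the original token
                r = random.random()
                if r < 0.8:
                    masked[i] = "[MASK]"
                elif r < 0.9:
                    masked[i] = tok                   # keep the original token
                else:
                    masked[i] = random.choice(vocab)  # replace with a random token
        return masked, labels

    tokens = "the quick brown fox jumps over the lazy dog".split()
    print(mlm_mask(tokens, vocab=tokens))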
Figure 2 shows the main structure of the BERT model, with the stages of pre-training and fine-tuning. In the pre-training stage, the model is trained on a large corpus of unlabeled text to learn the relationships between words. The learned model is then fine-tuned on downstream tasks.
The internal structure of the BERT model utilizes the encoder part of the Transformer model, with multiple layers stacked together. The output of the BERT model is a vector for each input token. In the fine-tuning stage, the BERT model can be applied to various downstream tasks by utilizing its one-to-one input–output features. For example, in a sentence pair classification task, the output corresponding to the input’s first token can be used for classification. In a named entity recognition (NER) task, the output for each word can represent its position (beginning, middle, or end). In the SQuAD task, the model can output the probabilities for the start and end positions of an answer. Thus, the BERT model can be applied to a wide range of downstream tasks [10].

2.3. BERT Variants

Numerous variations based on BERT have emerged in recent years.
  • Optimizing pre-training tasks. The effectiveness of masking random individual tokens in BERT may not always be as good as masking words or phrases. Thus, optimization can be achieved by modifying the pre-training tasks, e.g., specific models include ERNIE [11] and SpanBERT [12].
  • Optimizing training methods. BERT models face challenges of insufficient training data and inadequate training. To address these issues, optimization can be done by increasing the size of the dataset or using more complicated models, e.g., RoBERTa [13].
  • Optimizing model structures. Optimizing model structures often leads to better performance. Therefore, optimization can be pursued from the perspective of model structures. Specific optimized models include XLNet [14], ALBERT [15], and ELECTRA [16].
  • Using lightweight models. While BERT-based models have achieved remarkable results in various tasks, the large size of the models poses a challenge when porting them to memory-constrained mobile platforms. Thus, prior work has focused on reducing model size, e.g., DistilBERT [20] and TinyBERT [21].

3. Fine-Tuning BERT Parallelization

This section focuses on porting the BERT model to a multi-core processor and accelerating its fine-tuning process by using the multiple hardware cores.

3.1. Porting BERT to a Multi-Core Processor

  • Phytium 2000+ Architecture. We use the Phytium 2000+ multi-core processor, which integrates 64 ARM-compatible processor cores [22]. By integrating efficient processor cores, a data-affine large-scale cache-coherent architecture, and a hierarchical 2D mesh interconnection network, the Phytium 2000+ processor optimizes memory access latency and provides industry-leading computing performance, memory bandwidth, and I/O expansion capabilities. Phytium 2000+ is primarily employed in high-performance, high-throughput server domains, such as enterprise server systems with demanding processing and throughput requirements and large-scale internet data centers. Figure 3 shows the architectural details of the Phytium 2000+ multi-core processor.
  • BERT Porting. Porting BERT involves many adaptations to the Phytium 2000+ platform. Our local desktop software environment is based on Windows, which differs from the Linux platform of Phytium 2000+. Thus, there are significant differences in the software environment and packages used. Additionally, the Phytium 2000+ multi-core processor cluster used has no access to the public network. We have to manually install the required software packages when porting the BERT model on Phytium 2000+.
The porting of the BERT model onto Phytium 2000+ has the following four steps.
Step ①: Obtain the dependent Python libraries and their versions from the local Desktop. This can be done by running the command “pip freeze > requirements.txt”, which generates a file containing the names of all the Python libraries. The required Python libraries and their versions for the BERT model are shown in Figure 4.
Step ②: Check the supported formats of the .whl files by pip on Phytium 2000+. The wheel file formats are shown in Figure 5. Note that <package>, <version>, <python_tag>, <abi_tag>, and <platform_tag> will be replaced with the actual package name, version, Python tag, ABI tag, and platform tag, respectively, for each specific package. The supported formats on Phytium 2000+ are shown in Figure 6.
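One way to check these supported tags programmatically is sketched below (this uses the third-party packaging library rather than pip itself, and the example tag in the comment is only an illustration, not the full list for Phytium 2000+):

    # List the wheel tags accepted by the local Python interpreter; pip accepts
    # .whl files whose <python_tag>-<abi_tag>-<platform_tag> matches one of these.
    from packaging import tags

    for t in list(tags.sys_tags())[:10]:     # print the first (most specific) tags
        print(t)                             # e.g., cp36-cp36m-linux_aarch64 on Phytium 2000+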
Step ③: Use pip to download all the wheel files of the Python packages mentioned in Step ① on the local Desktop. Many downloaded .whl files may not meet the format requirements specified in Step ② for the Phytium 2000+ platform. For example, the downloaded pandas library may be named “pandas-0.25.3-cp36-cp36m-win_amd64.whl”, which does not match the installation requirements of the Phytium 2000+ platform. In this case, it has to be replaced with “pandas-1.1.3-cp36-cp36m-linux_aarch64.whl”.
Step ④: Upload the downloaded wheel files to Phytium 2000+ and use the “pip install ***.whl” command to install all the .whl files. By completing these steps, the local BERT model will be successfully ported to the Phytium 2000+ processor.

3.2. Parallelizing the BERT Model

3.2.1. Model Parallelism and Data Parallelism

Model parallelism and data parallelism are two typical approaches for accelerating model training [23]. Model parallelism refers to splitting a large model and deploying different parts of it on distinct devices for training. When neural network models are too large to be trained on a single processor, model parallelism becomes necessary. Implementing model parallelism involves assigning different layers of the deep learning model to different devices. In the forward pass, the later layers require the output from earlier layers as input, while in the backward pass, the earlier layers require the computed results from the later layers. As a result, there are dependencies between the neighboring layers, which limits the efficacy of model parallelism. Consequently, model parallelism is not commonly used for neural network training unless the model is extremely large.
In contrast, data parallelism is more commonly used in deep learning. This is because the long training time of a model often results from the large number of training samples. Data parallelism involves placing identical models on distinct devices but training them with different subsets of the training samples. When the number of devices is N and the mini-batch size per device is b, we have an equivalent batch size of N · b.
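To make the data-parallel idea concrete, here is a minimal NumPy sketch of one synchronous update (a toy linear-regression example of our own, not the TensorFlow mechanism used later): each of the N shards plays the role of one device's mini-batch of size b, and averaging the per-shard gradients is equivalent to a single step with batch size N · b.

    import numpy as np

    # Toy linear-regression gradient for one shard (mini-batch) of data.
    def grad_fn(w, X, y):
        return 2 * X.T @ (X @ w - y) / len(y)

    def data_parallel_step(w, shards, lr=0.1):
        # Each shard stands in for one device's mini-batch of size b; the averaged
        # gradient is equivalent to a single step with batch size N * b.
        grads = [grad_fn(w, X, y) for (X, y) in shards]   # computed independently on N devices
        return w - lr * np.mean(grads, axis=0)            # identical update on every replica

    N, b, d = 4, 8, 3
    w = np.zeros(d)
    shards = [(np.random.rand(b, d), np.random.rand(b)) for _ in range(N)]
    w = data_parallel_step(w, shards)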

3.2.2. Strategies for Accelerating BERT Fine-Tuning

Given that Phytium 2000+ has 64 CPU cores, investigating how to fully exploit the computational resources on this processor to accelerate the training process becomes necessary. For the TensorFlow-based BERT, we configure the number of CPU cores used for model training, thereby meeting the computational resource requirements. TensorFlow provides a configuration function called tf.ConfigProto(), which facilitates resource utilization and expedites the training process of the model. This function allows for various parameter settings, which are shown in Figure 7.
The device_count parameter sets an upper limit on the number of CPU cores to be used, while inter_op_parallelism_threads and intra_op_parallelism_threads indicate the level of thread parallelism for operations in a session. By default, these values are set to 0, allowing the system to automatically determine the correct values. When log_device_placement is set to True, messages about which operations and tensors are assigned to specific devices will be printed in the terminal, displaying their device assignments during model training.
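Since Figure 7 is only available as an image, the following TensorFlow 1.x sketch illustrates how these parameters are typically set (the concrete values are placeholders for illustration, not the exact settings used in our experiments):

    import tensorflow as tf   # TensorFlow 1.x API

    # Limit the session to a fixed number of CPU devices and control both
    # levels of thread parallelism; log_device_placement prints where each
    # operation is placed during training.
    session_config = tf.ConfigProto(
        device_count={"CPU": 8},             # upper limit on the number of CPUs
        inter_op_parallelism_threads=8,      # parallelism across independent operations
        intra_op_parallelism_threads=8,      # parallelism inside one operation (e.g., matmul)
        log_device_placement=True)
    # session_config is later passed to tf.estimator.RunConfig(session_config=...).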
On parallelism, inter_op_parallelism_threads and intra_op_parallelism_threads are the two important parameters of parallelizing the BERT model. The parameter intra_op_parallelism_threads determines the degree of parallelism for computations within each operator (e.g., matrix multiplication). On the other hand, inter_op_parallelism_threads determines the level of parallelism for unrelated operations within the TensorFlow computation graph.
For the BERT model, intra_op_parallelism_threads impacts the parallelism of matrix multiplications during training. Figure 8 shows that one of the main components that can be parallelized in the BERT model is the self-attention mechanism. First, the matrices Q, K, and V are obtained by multiplying the input matrix with the matrices W_q, W_k, and W_v, respectively; this operation can be parallelized. Second, the attention matrix A, computed by multiplying the K and Q matrices, can also be computed in parallel. Finally, the output matrix O, obtained by multiplying the value matrix V with the attention matrix A, can be parallelized as well. Thus, the parallelization of these three sets of matrix multiplications is controlled by intra_op_parallelism_threads. Additionally, the feed-forward neural network in the encoder is also parallelizable, as it involves matrix multiplication operations.
On the other hand, inter_op_parallelism_threads determines the level of parallelism between unrelated operations; in the BERT model, this primarily relates to the parallelism across the multiple attention heads. Each attention head in the multi-head attention mechanism has its own set of matrices (W_q, W_k, W_v), allowing each head to obtain different q, k, and v vectors for the same input x. Since the operations of different attention heads are independent, their parallelism can be controlled by changing inter_op_parallelism_threads.

3.2.3. BERT Parallelization Implementation

The official BERT source code provided by Google does not support multi-CPU/GPU training and cannot meet the requirements for large-scale parallelization, so it is necessary to modify the BERT source code. The BERT model's source code can be obtained from https://github.com/google-research/bert (accessed on 18 June 2022). Among the source code files, the main functionality of the "create_pretraining_data.py" file is to preprocess raw text data and convert it into the TFRecord files required for training. The "tokenization.py" file handles the tokenization of the text data. The "modeling.py" file defines the main structure of the BERT model. The "run_classifier.py" and "run_squad.py" files contain the code for fine-tuning the model.
Our focus is on the fine-tuning of the BERT model, specifically the "run_classifier.py" file, where the model training and evaluation are performed using tf.estimator. Figure 9 shows the model construction, where the previously defined model_fn_builder function is called and returns another function, model_fn. This model_fn function defines the computational graph of the model and specifies the operations to be performed during training, evaluation, and prediction. The model_fn is then used to create an estimator instance. The original code uses the tf.contrib.tpu.TPUEstimator function, but since we are using multi-core CPUs, we need to replace it with tf.estimator.Estimator. The file_based_input_fn_builder function is then called, which returns train_input_fn, an input function that produces a tf.data.Dataset. Finally, we call estimator.train to train the model.
To enable the training on multi-cores, we have to make code modifications. First, we need to modify the function tf.distribute.MirroredStrategy() and run_config as shown in Lines 1–5 of Figure 10. As mentioned earlier, the estimator needs to be modified to use the tf.estimator.Estimator function, as shown in Line 7 of Figure 10.
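The following sketch outlines the shape of these modifications (a simplified illustration based on the steps described above; model_fn, train_input_fn, and num_train_steps come from the surrounding BERT code, and session_config is the ConfigProto sketched earlier):

    import tensorflow as tf   # TensorFlow 1.x API

    # Distribute the training step across the available devices (Lines 1-5 of Figure 10).
    strategy = tf.distribute.MirroredStrategy()
    run_config = tf.estimator.RunConfig(
        train_distribute=strategy,
        session_config=session_config)       # the ConfigProto sketched earlier

    # Replace tf.contrib.tpu.TPUEstimator with the generic estimator (Line 7 of Figure 10).
    estimator = tf.estimator.Estimator(
        model_fn=model_fn,                   # returned by model_fn_builder(...)
        config=run_config)

    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)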
In the original BERT code, the batch size in the input_fn function is directly read from the params parameter, as shown in Figure 11. This is because the original code passes the batch_size as a parameter to the estimator. However, our modified estimator in Figure 10 does not need this parameter. Therefore, we need to make batch_size a parameter of the file_based_input_fn_builder function. The specific modifications are shown in Figure 11. When calling this function, we also need to pass the batch_size as a parameter. When performing evaluation and prediction, we have to pass the corresponding batch_size as well.
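A simplified sketch of the modified input function builder is shown below (the feature-parsing helpers _decode_record and name_to_features come from the original run_classifier.py, and the dataset pipeline is abbreviated for illustration):

    def file_based_input_fn_builder(input_file, seq_length, is_training,
                                    drop_remainder, batch_size):
        # batch_size is now an explicit argument instead of being read from params.
        def input_fn():
            d = tf.data.TFRecordDataset(input_file)
            if is_training:
                d = d.repeat().shuffle(buffer_size=100)
            # _decode_record (from the original code) parses one serialized example
            # into the BERT input features described by name_to_features.
            d = d.map(lambda record: _decode_record(record, name_to_features))
            return d.batch(batch_size, drop_remainder=drop_remainder)
        return input_fn

    train_input_fn = file_based_input_fn_builder(
        input_file=train_file, seq_length=FLAGS.max_seq_length,
        is_training=True, drop_remainder=True,
        batch_size=FLAGS.train_batch_size)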
To further modify the code, we have to update the model_fn function. We change the return types for the different modes from tf.contrib.tpu.TPUEstimatorSpec to tf.estimator.EstimatorSpec. Additionally, for the eval mode, we modify eval_metric and call the metric_fn function within it. The original code uses a custom optimizer called AdamWeightDecayOptimizer, which does not consider the case of multi-CPU training; therefore, we change it to tf.train.AdamOptimizer.
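The optimizer change can be sketched as follows (a minimal illustration; total_loss is the loss built by the BERT model code, and the learning-rate schedule and gradient clipping of the original optimization.py are omitted):

    # Swap the custom AdamWeightDecayOptimizer for TensorFlow's built-in Adam optimizer.
    optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)
    train_op = optimizer.minimize(total_loss, global_step=tf.train.get_global_step())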

3.3. Performance Results

This section shows the performance obtained by varying the inter- or intra-thread number while fixing the other. The experiments are run with the RTE dataset, and the maximum number of CPU cores used is set to 10. The performance results in terms of training time are shown in Figure 12. We see that appropriately increasing the intra- or inter-thread number while fixing the other leads to a shorter training duration.
Based on these findings, we further evaluate the performance impact of the number of CPUs on training time, where the number of CPUs equals the inter/intra-thread number. In these experiments, we set the inter/intra-thread numbers to be equal to the number of CPU cores, i.e., each thread is bound to a CPU core. We also conduct experiments that run multiple threads on a CPU core but fail to achieve any performance improvement. We vary the number of CPU cores from 1 to 10, and the performance results are shown in Figure 13. Additionally, we also measure the training time of the original BERT model, which is represented as "default" in the figure. We see that, when the number of CPU cores increases, the training time shows a nonlinear reduction. This observation aligns with our experimental experience. In the original BERT model, inter_op_parallelism_threads and intra_op_parallelism_threads are not explicitly set to specify the granularity of CPU parallelism. Under these circumstances, TensorFlow attempts to automatically utilize multi-core CPUs to accelerate computation. However, as our experimental results demonstrate, TensorFlow's automatic settings do not achieve optimal performance. To minimize training time, it is advisable to manually set these parameters to optimize TensorFlow's parallelism according to requirements. When using more CPU cores, the overhead of inter-core communication gradually increases, leading to a diminishing reduction in training time. Therefore, selecting the right number of threads is the key to reducing training time.
To further investigate the performance impact of using more CPU cores on training time, we run additional experiments on the CoLA dataset. Here, we increase the number of CPU cores and keep the number of CPU cores equal to the number of inter and intra threads. The performance results are shown in Figure 14. We see that using more CPU cores can lead to a further reduction in training time, but the performance gain will become minor.

4. Fine-Tuning the BERT Model on Multi-Core CPUs

This section focuses on the fine-tuning process of the BERT model for several typical downstream tasks in natural language processing, aiming to improve the prediction performance of the BERT model on these tasks.

4.1. Parameter Tuning and Optimization of the BERT Model on Downstream Tasks

We investigate the fine-tuning of the BERT model using three hyperparameters, targeting the test sets of several GLUE tasks. The five test sets used are CoLA [24], SST-2 [25], MRPC [26], RTE [27], and WNLI [28]. The performance data obtained are the results averaged over five training runs.

4.1.1. Introduction to the Test Sets

  • ① CoLA (The Corpus of Linguistic Acceptability) Dataset [24]. This dataset is used for a single-sentence classification task, where the goal is to determine the grammatical acceptability of a given sentence. It has about 8500 training samples and 1000 test samples. Examples from the training set are shown in Table 1, where the first column indicates the source of the sentence, the second column represents the grammatical acceptability (0 = unacceptable, 1 = acceptable), and the third column contains the sentences themselves.
The MCC (Matthews correlation coefficient) is an evaluation metric used to assess the quality of binary classification in machine learning. It measures the correlation between the observed and predicted classifications, taking into account TPs (true positives), TNs (true negatives), FPs (false positives), and FNs (false negatives). The MCC ranges from −1 to 1, where 1 indicates perfect predictions, and −1 indicates complete disagreement between the predictions and observations. We can calculate the MCC as follows:
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.
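For reference, a direct Python implementation of this metric looks as follows (our own sketch for illustration; the experiments rely on the metric computed by the evaluation code):

    import math

    def matthews_corrcoef(tp, tn, fp, fn):
        # Returns 0 when any factor in the denominator is zero, following common practice.
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return (tp * tn - fp * fn) / denom if denom else 0.0

    print(matthews_corrcoef(tp=90, tn=85, fp=10, fn=15))   # about 0.75 on a balanced toy example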
  • ② SST-2 (The Stanford Sentiment Treebank [25]). SST-2 is a single-sentence classification task used to determine whether a movie review expresses a positive or negative sentiment. The dataset has around 67,000 instances in the training set and 1800 instances in the test set. Examples from SST-2 are shown in Table 2. The first column represents the movie reviews (sentences), and the second column indicates whether the sentiment of the sentence is positive (1) or negative (0). The prediction accuracy is calculated as the number of correct predictions for both positive and negative examples divided by the total number of samples.
  • ③ MRPC (Microsoft Research Paraphrase Corpus [26]). The MRPC dataset is a sentence pair classification task where, given a pair of sentences, the goal is to determine whether they are semantically equivalent. The dataset has around 3700 samples in the training set and 1700 samples in the test set.
Table 3 shows an example from the MRPC dataset. The first column represents the label, where 0 indicates that the sentence pair is not semantically equivalent, while 1 indicates that the sentence pair is semantically equivalent. The second and third columns denote the IDs of the first and second sentences, respectively, while the fourth and fifth columns contain the text of the first and second sentences.
  • ④ RTE (Recognizing Textual Entailment [27]). Given two text snippets, the task of RTE is to determine whether the meaning of one text can be inferred from the other. This dataset has around 2500 training samples and 3000 testing samples. This task is applicable to various NLP applications, such as question answering, information retrieval, information extraction, and text summarization.
An example of the RTE dataset is shown in Table 4. The first column represents the sentence index, the second and third columns represent the texts of the first and second sentences, and the fourth column indicates the label, where “entailment” is for inferable and “not entailment” is for not inferable.
  • ⑤ WNLI (Winograd NLI) [28]. This task involves determining whether the meanings of two sentences are the same. It consists of 634 training examples and 146 testing examples. The prediction accuracy is calculated as the number of correct predictions for both positive and negative examples divided by the total number of examples.

4.1.2. Fine-Tuning BERT on Downstream Tasks

To achieve fine-tuning of the pretrained BERT model on downstream tasks, we have to modify the code file run_classifier.py. In this context, we will use the RTE dataset as an example to illustrate the basic steps of fine-tuning BERT.
As shown in Figure 15, the first step is to define the RTEProcessor class, which inherits from the original DataProcessor class in the code. Within this class, the functions get_train_examples, get_dev_examples, get_test_examples, and get_labels are implemented to retrieve the corresponding training set, validation set, test set, and labels for the downstream task. The create_examples function is used to create instances based on the meanings of different columns in the train.csv, dev.csv, and test.csv files. Specifically, text_a and text_b correspond to the two text segments (with text_b being None if unavailable), and label corresponds to the label value.
In the main function shown in Figure 15, the necessary configurations for training and testing are set. An instance of the RTEProcessor class is created, and the get_labels, get_train_examples, get_dev_examples, and get_test_examples functions are called to obtain the labels, training examples, validation examples, and test examples, respectively.
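Since Figure 15 is only available as an image, the sketch below shows what such a processor class typically looks like (a simplified illustration: DataProcessor, InputExample, and _read_tsv come from the original run_classifier.py, the column indices follow the RTE layout in Table 4, and the files are shown with the .tsv extension used by the GLUE distribution):

    import os

    class RTEProcessor(DataProcessor):
        """Processor for the RTE dataset, following the pattern of the built-in processors."""

        def get_train_examples(self, data_dir):
            return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

        def get_dev_examples(self, data_dir):
            return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

        def get_test_examples(self, data_dir):
            return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

        def get_labels(self):
            return ["entailment", "not_entailment"]

        def _create_examples(self, lines, set_type):
            examples = []
            for i, line in enumerate(lines):
                if i == 0:                              # skip the header row
                    continue
                guid = "%s-%s" % (set_type, line[0])
                text_a, text_b = line[1], line[2]       # the two text segments
                label = "entailment" if set_type == "test" else line[3]   # test labels are hidden
                examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
            return examples

    # The processor is then registered so it can be selected with --task_name=rte, e.g.:
    # processors = {..., "rte": RTEProcessor}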
By running the script, fine-tuning the BERT model on the downstream task of RTE can be achieved. The fine-tuning process is executed while tuning the hyperparameters, including batch size, learning rate, and training epochs.

4.2. Performance Results

In this section, we evaluate the performance impact of fine-tuning parameters in terms of batch size, learning rate, and training epochs.

4.2.1. Batch Size

Batch size refers to the number of samples included in each gradient calculation during training. When using small batch sizes, parallel computation cannot be easily achieved. Using a larger batch size allows for more effective parallel training by splitting training examples across different processor cores, significantly accelerating model training.
The choice of batch size primarily affects the direction of gradient descent during model training. When the dataset is relatively small, we can perform full-batch learning, where the entire dataset is treated as a single batch. In such a case, the gradient descent direction is determined by the entire dataset. When the dataset size is sufficiently large, loading the entire dataset into memory at once becomes impractical due to memory and computational limitations. If the selected samples are sufficiently representative, using half of the dataset (or even less) will yield a gradient descent direction comparable to that obtained with the full dataset. Therefore, selecting an appropriate batch size can accelerate model training without sacrificing prediction accuracy.
Based on the above analysis, we run the parameter tuning experiments on different datasets by setting batch sizes to 16, 32, 64, 128, 256, and 512. We evaluate the performance impact of using various batch sizes on training time and prediction accuracy for the downstream tasks. The performance results in terms of training time and prediction accuracy are shown in Figure 16.
In Figure 16a, we see that, as the batch size increases, the training time significantly decreases. This particularly holds for tasks with large datasets, such as CoLA and SST-2. In terms of prediction accuracy, we see that the impact of batch size on prediction accuracy is uncertain, as shown in Figure 16b. For the CoLA dataset, increasing the batch size results in a slight decrease in accuracy, which aligns with our previous analysis that using a larger batch size requires more epochs to achieve the same level of accuracy. Subsequent experiments also confirm that increasing the number of epochs can compensate for the accuracy loss caused by larger batch sizes. On the other hand, for datasets like RTE, WNLI, and MRPC, increasing the batch size leads to a slight improvement in accuracy. This is due to the fact that using larger batch sizes can better represent the direction of gradient descent for the dataset, increasing the likelihood of finding the optimal point.
To summarize, we conclude that, for small datasets like RTE, WNLI, and MRPC, increasing the batch size brings the downstream task training closer to training on the entire dataset, leading to better identification of the correct gradient descent direction, whereas for large datasets like CoLA, the impact of increasing the batch size on the overall gradient descent direction is limited. Although increasing the batch size can reduce the model training time, it requires more epochs to achieve the same level of accuracy. More epochs mean more training time; thus, there exists a trade-off between batch size and the number of epochs. In resource-rich environments, increasing the batch size can reduce training time for downstream tasks and have a positive impact on accuracy for small datasets, but for large datasets, a balance needs to be struck between batch size and the number of epochs.
In a nutshell, using a larger batch size within a reasonable range has the following advantages. First, it can improve memory utilization and allow for more parallelism. Second, it can reduce the number of training iterations required to complete one epoch, leading to shorter training time for the same amount of data. Third, within a certain range, using a larger batch size provides a more accurate direction for gradient descent, leading to smaller training oscillations. On the other hand, blindly increasing the batch size can have the following side effects. First, it may exceed the memory capacity, consuming substantial memory resources. Second, using a larger batch size can result in a reduced number of iterations required to complete one epoch, but achieving the same level of accuracy may require more epochs.

4.2.2. Learning Rate

The learning rate is an important parameter in supervised learning that determines whether the model can reach the optimal solution, i.e., whether it can find the local minimum. An appropriate learning rate can enable the model to converge to the local minimum within a suitable time range. The learning rate affects the training effectiveness of the model through gradient descent. The equation for gradient descent is as follows:
\theta_j = \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j},
where  α  represents the learning rate, which determines the step size of the model parameters’ updates in the direction of the gradient. Choosing a small learning rate can result in slow convergence of the model and may lead to getting stuck in a local minimum. On the other hand, selecting a large learning rate can cause the model to oscillate near the minimum value and may even prevent convergence. Only when we have chosen a suitable learning rate can we find the optimal point within a reasonable number of iterations.
The learning rate is not fixed; it typically starts with a relatively large value to explore the correct direction of gradient descent. As the model approaches the optimal point, the learning rate can be reduced to prevent oscillations or overshooting the optimum. When training the BERT model, the Adam optimizer is used to adapt the learning rate. The update equations for Adam are shown below. The first two lines compute exponential moving averages of the gradient (its direction, or first moment) and of its squared value (second moment), taking only the historical values into account. The third and fourth lines correct the bias of the initial moving averages. The last line is the parameter update equation, where the effective step size changes in each round of training and differs across parameters. This dynamic update ensures the adaptability of the learning rate.
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon} \hat{m}_t
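A minimal NumPy sketch of one Adam update following these equations is given below (β₁, β₂, η, and ε are set to common default values; this is an illustration, not the TensorFlow optimizer used in training):

    import numpy as np

    def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g            # moving average of the gradient
        v = beta2 * v + (1 - beta2) * g**2         # moving average of the squared gradient
        m_hat = m / (1 - beta1**t)                 # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # parameter update
        return theta, m, v

    theta = np.zeros(3)
    m, v = np.zeros(3), np.zeros(3)
    for t in range(1, 101):                        # minimize f(theta) = ||theta - 1||^2
        g = 2 * (theta - 1.0)
        theta, m, v = adam_step(theta, g, m, v, t)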
Based on the aforementioned analysis, we evaluate the performance impact of learning rates on different datasets. We set the initial learning rates to 1 × 10⁻⁶, 5 × 10⁻⁶, 1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵, 4 × 10⁻⁵, 5 × 10⁻⁵, and 1 × 10⁻⁴. Here, we use the Adam optimizer to update the learning rate dynamically. We then evaluate the performance impact of the learning rate on model accuracy for downstream tasks. The results are shown in Figure 17. We see that, for most of the downstream tasks (i.e., CoLA, RTE, WNLI, and MRPC), setting the initial learning rate around 2 × 10⁻⁵ leads to better prediction accuracy.

4.2.3. Training Epochs

Each training epoch refers to training the neural network on the entire dataset once. Training a neural network for only one epoch is often insufficient and may result in underfitting, leading to poor performance on the training set. As the number of epochs increases, the model’s training becomes more sufficient. Note that excessively large numbers of epochs can lead to overfitting, i.e., the model becomes too specialized to the training set and performs poorly on the test set.
Based on this analysis, we evaluate the performance impact of the number of epochs for different downstream tasks of the BERT model. We measure the performance of various epoch values (e.g., 2, 3, 4, 5, and 6) to assess their impact on the accuracy of the downstream tasks. The performance results are shown in Figure 18. We see that, for the vast majority of datasets, the prediction accuracy continuously improves as the number of epochs increases. We also note that a few datasets (such as RTE) achieve the best prediction accuracy at 3 epochs. As the number of epochs increases beyond 3, the accuracy on the training set continues to improve, but the prediction accuracy on the test set decreases, indicating model overfitting. Therefore, using the right number of training epochs can help us obtain a better model.

4.2.4. BERT Optimizer

When training the BERT model on multi-core CPUs, we chose the Adam optimizer. Since the introduction of Adam, researchers have proposed various improved versions that aim to address certain limitations of the original algorithm, such as increasing stability, accelerating convergence, or improving the final model's performance. We selected three variants of Adam for experimentation, namely AdaMax [29], LazyAdam [30], and Nadam [31], to identify the most suitable optimizer for training the BERT model on multi-core CPUs. The experiments are run on the WNLI dataset, applying the default configuration parameters for each Adam variant, and the results are compared with Adam's optimal accuracy. As shown in Figure 19, the models trained with the AdaMax and LazyAdam optimizers achieve better accuracy than Adam, whereas Nadam appears to be unsuitable for BERT model training, yielding the worst accuracy.

4.3. Summary

In this section, we first provided a brief introduction to the five datasets (CoLA, SST-2, MRPC, RTE, and WNLI) used in our experiments. We then performed fine-tuning on these datasets by adjusting three parameters—batch size, learning rate, and training epochs—aiming to improve the model's accuracy on these tasks. The performance results showed that increasing the batch size significantly reduced the training time without compromising accuracy, and that, for smaller datasets, using a larger batch size may improve the model's performance on downstream tasks. Regarding the learning rate, our results indicated that setting it around 2 × 10⁻⁵ generally yielded the best performance for almost all datasets. As for the training epochs, choosing an appropriate number of epochs led to better training results, but blindly increasing the number of epochs could cause overfitting and degrade the model's performance on the test set.

5. Conclusions and Future Outlook

With the advancements in pre-trained models and the development of attention mechanisms in natural language processing, the BERT model, built on the Transformer encoder layer, has shown outstanding performance in various downstream tasks. It has become the mainstream approach to combine pre-training with downstream task fine-tuning. Thus, the focus of this work is on the popular BERT model.
In this work, we deployed the BERT model onto a 64-core processor and improved its prediction accuracy on typical NLP downstream tasks. Leveraging the abundant resources of the multi-core processor, we can parallelize the fine-tuning process to reduce the required training time. First, we successfully deployed the BERT model onto an ARM-based multi-core processor by migrating the necessary environments and installing the required libraries. Second, to accelerate the fine-tuning training process, we reduced training time by configuring the number of CPUs and the inter- and intra-thread counts. The experimental results have shown that setting the number of CPUs equal to the number of inter and intra threads achieves the shortest training time. Furthermore, increasing the number of CPUs can reduce the training time when sufficient resources are available. By analyzing the BERT model's source code, we found that the main acceleration opportunity lies in the self-attention computation, which involves matrix multiplications; through a multi-threading configuration, these matrix multiplications can be accelerated. Finally, we have fine-tuned the BERT model on several typical NLP downstream tasks by setting three hyperparameters: batch size, learning rate, and number of epochs. By optimizing these hyperparameters, we can improve the model's prediction accuracy and reduce the training time. The experimental results have shown that increasing the batch size can significantly reduce the training time for downstream tasks, and that, for larger datasets, achieving the same training accuracy requires increasing the number of epochs. Additionally, choosing the right learning rate is key. Our empirical results have shown that we can achieve the best performance when the initial learning rate is set around 1 × 10⁻⁵. Different downstream tasks require different optimal learning rates, however, and there is no one-size-fits-all learning rate setting that maximizes accuracy for all tasks. Regarding the number of epochs, increasing it within a certain range can result in better performance on downstream tasks, but exceeding that range can lead to overfitting and a decline in performance on the test set.
Our work lays the foundation for future research on deploying the BERT model on the ARM-based multi-core processor. The BERT model can be applied to various tasks in natural language processing, such as information extraction, intelligent question answering, document retrieval, and machine translation. In the field of software engineering, training a BERT model can be employed for tasks such as code clone detection, code generation, and code correction.

Author Contributions

Conceptualization, L.Z. and J.F.; methodology, L.Z., J.F. and W.G.; validation, W.G. and J.F.; writing—original draft preparation, L.Z. and J.F.; writing—review and editing, L.Z. and J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially funded by the Social Science Fund of Hunan Province, China (Grant No. 22YBA305) and the National Natural Science Foundation of China (Grant No. 61972408).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 195. [Google Scholar] [CrossRef]
  2. Le, Q.V.; Mikolov, T. Distributed Representations of Sentences and Documents. Proc. Mach. Learn. Res. 2014, 32, 1188–1196. [Google Scholar]
  3. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2014, Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
  4. Dai, A.M.; Le, Q.V. Semi-supervised Sequence Learning. In Proceedings of the 28th Conference and Workshop on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 3079–3087. [Google Scholar]
  5. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  6. Melamud, O.; Goldberger, J.; Dagan, I. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, 11–12 August 2016; pp. 51–61. [Google Scholar] [CrossRef]
  7. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), New Orleans, LA, USA, 1–6 June 2018; Volume 1, pp. 2227–2237. [Google Scholar] [CrossRef]
  8. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training 2018. Available online: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf (accessed on 1 January 2024).
  9. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  10. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar] [CrossRef]
  11. Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; Wu, H. ERNIE: Enhanced Representation through Knowledge Integration. arXiv 2023, arXiv:1904.09223. [Google Scholar]
  12. Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
  13. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2023, arXiv:1907.11692. [Google Scholar]
  14. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 5754–5764. [Google Scholar]
  15. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  16. Clark, K.; Luong, M.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  17. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, S.K.S.; Lopes, R.G.; Ayan, B.K.; Salimans, T.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022 (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  18. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685. [Google Scholar] [CrossRef]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  20. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2023, arXiv:1910.01108. [Google Scholar]
  21. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020; pp. 4163–4174. [Google Scholar] [CrossRef]
  22. Fang, J.; Liao, X.; Huang, C.; Dong, D. Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+. J. Comput. Sci. Technol. 2021, 36, 33–43. [Google Scholar] [CrossRef]
  23. Ben-Nun, T.; Hoefler, T. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Comput. Surv. 2019, 52, 65. [Google Scholar] [CrossRef]
  24. Warstadt, A.; Singh, A.; Bowman, S.R. Neural Network Acceptability Judgments. Trans. Assoc. Comput. Linguist. 2019, 7, 625–641. [Google Scholar] [CrossRef]
  25. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
  26. Dolan, W.B.; Brockett, C. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP@IJCNLP 2005), Jeju Island, Republic of Korea, 11–13 October 2005. [Google Scholar]
  27. Dagan, I.; Roth, D.; Zanzotto, F.; Sammons, M. Recognizing Textual Entailment: Models and Applications. Comput. Linguist. 2015, 41, 157–159. [Google Scholar] [CrossRef]
  28. He, P.; Liu, X.; Gao, J.; Chen, W. Deberta: Decoding-Enhanced Bert with Disentangled Attention. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, 3–7 May 2021. [Google Scholar]
  29. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  30. TensorFlow. LazyAdamOptimizer—TensorFlow 1.15. 2020. Available online: https://tensorflow.google.cn/versions/r1.15/api_docs/python/tf/contrib/opt/LazyAdamOptimizer (accessed on 18 June 2022).
  31. Dozat, T. Incorporating nesterov momentum into adam. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
Figure 1. The architecture of the Transformer model [19].
Figure 2. The architecture of the BERT model [10].
Figure 3. The architecture of the Phytium 2000+ processor.
Figure 4. Dependent Python packages.
Figure 5. Wheel file formats.
Figure 6. Wheel file formats supported on Phytium 2000+.
Figure 7. Usage of tf.ConfigProto().
Figure 8. The process of calculating the attention matrix A.
Figure 9. BERT model construction.
Figure 10. Incorporating the MirroredStrategy and modifying the estimator.
Figure 11. The original and the modified input_fn function.
Figure 12. The training time when varying inter_op_parallelism_threads or intra_op_parallelism_threads.
Figure 13. The training time when varying the number of tasks and CPU cores used.
Figure 14. The training time when using more CPU cores.
Figure 15. The class of processing the RTE data.
Figure 16. The training time and accuracy of using different datasets and batch sizes.
Figure 17. Model prediction accuracy when using different datasets and learning rates.
Figure 18. Model prediction accuracy when using different datasets and training epochs.
Figure 19. Model prediction accuracy using different optimizers.
Table 1. Examples from the CoLA dataset.
Source | Label | Sentence
gj04 | 1 | The weights made the rope stretch over the pulley.
gj04 | 1 | The mechanical doll wriggled itself loose.
cj99 | 1 | If you had eaten more, you would want less.
cj99 | 0 | The more you would want, the less you would eat.
Table 2. Examples from the SST-2 dataset.
Sentence | Label
it's a charming and often affecting journey. | 1
unflinchingly bleak and desperate | 0
allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker. | 1
Table 3. Examples from the MRPC dataset.
Quality | #1 ID | #2 ID | #1 String | #2 String
1 | 702876 | 702977 | Amrozi accused his brother, whom he called “the witness”, of deliberately distorting his evidence. | Referring to him as only “the witness”, Amrozi accused his brother of deliberately distorting his evidence.
Table 4. The examples of the RTE dataset.
Index | Sentence1 | Sentence2 | Label
0 | Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation. | Christopher Reeve had an accident. | not_entailment
