4.1. Experiment Setup
Hardware environment: The server we used is equipped with an Intel(R) Xeon(R) Silver 4210R CPU at 2.40 GHz and NVIDIA Tesla T4 GPUs with 16 GB of memory each, interconnected via PCIe 3.0. The server runs a 64-bit Ubuntu 20.04 system with CUDA toolkit 10.2 and PyTorch 1.10.2.
Hyperparameter setting: LightChatGLM used Wikipedia pages containing computer-specific English and Chinese nouns as the training corpus, with the English and Chinese corpora randomly mixed together. The hyperparameters were set as follows: a batch size of 256, a maximum sequence length of 128, a dropout rate of 0.1, a parameter decay of 0.05, and 400,000 steps for model parameter updates. The learning rate was initially set to 0.9 and decayed after the first 10% of the update steps. Based on these initial parameters, we performed a hyperparameter search to further optimize the model’s performance. Hyperparameter search is usually performed using grid search or random search, combined with cross-validation, to select the combination of parameters that best enhances model performance for a specific task [49,63,64].
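As a hedged illustration of this search step, the sketch below runs a random search with k-fold cross-validation over an illustrative search space; the parameter ranges and the train_and_eval stub are assumptions for demonstration rather than the exact procedure used for LightChatGLM.

```python
import random

# Hypothetical search space; the actual ranges explored for LightChatGLM are not listed in the paper.
SEARCH_SPACE = {
    "learning_rate": [0.9, 0.1, 0.01, 0.001],
    "batch_size": [128, 256, 512],
    "dropout": [0.1, 0.2, 0.3],
    "parameter_decay": [0.01, 0.05, 0.1],
}

def train_and_eval(config, fold):
    """Stub for demonstration: fine-tune on k-1 folds and return a validation score.
    Replace with the actual training/evaluation routine."""
    return random.random()  # dummy score so the sketch runs end to end

def cross_validate(config, num_folds=5):
    """Average the validation score over the k folds."""
    scores = [train_and_eval(config, fold) for fold in range(num_folds)]
    return sum(scores) / num_folds

def random_search(space, num_trials=20):
    """Sample random combinations and keep the one with the best cross-validated score."""
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = {name: random.choice(values) for name, values in space.items()}
        score = cross_validate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(SEARCH_SPACE)
```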
For the downstream multilingual task, the datasets we used are shown in Table 1. The WMT 2020 Chinese–English dataset comprises Chinese and English segments from the Chinese–English translation tasks of the WMT competition and serves as a benchmark for bilingual machine translation. The UM-Corpus, developed by the University of Macau, provides a high-quality Chinese–English parallel corpus suitable for machine translation and semantic understanding tasks. The Ai Challenger dataset encompasses Chinese–English parallel corpora for various tasks and is an essential resource for research in natural language processing and machine translation. IWSLT 17 is mainly used for multilingual spoken-language translation tasks, especially Chinese–English translation. XGLUE is a cross-lingual natural language understanding benchmark released by Microsoft that covers multiple tasks, including translation, question answering, and text classification.
During fine-tuning or the distillation of the student model for this task, the hyperparameters were adjusted accordingly. Specifically, the maximum sequence length was set to 128, the batch size to 32, and the two balance factors were both set to 0.65. Additionally, the smoothed logit temperature was set to 1.
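The sketch below shows one common way such balance factors and a logit temperature can enter a distillation objective; the names alpha and temperature are illustrative and are not claimed to match the paper’s notation or exact loss formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.65, temperature=1.0):
    """Combine the hard-label loss with a soft-label knowledge distillation term.

    alpha:       illustrative balance factor (the paper sets its two balance factors to 0.65)
    temperature: logit-smoothing temperature (set to 1 in the experiments)
    """
    # Cross-entropy against the ground-truth labels (hard targets).
    hard_loss = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-smoothed teacher and student distributions (soft targets).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Usage on a dummy batch (vocabulary size 8, batch of 4):
student_logits = torch.randn(4, 8)
teacher_logits = torch.randn(4, 8)
labels = torch.randint(0, 8, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

With the temperature at 1, as in the experiments, the logits are left unsmoothed, while alpha controls how much weight the soft teacher targets receive relative to the hard labels.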
Comparison models: we evaluate the performance of LightChatGLM by comparing it with the following typical schemes:
mBERT_drop [70]: mBERT is a multilingual BERT model pretrained for 104 languages, featuring the same structure as the original BERT model. Meanwhile, mBERT_drop represents a compression technique specifically designed for mBERT, involving the direct pruning of the top transformer layer of the mBERT model.
DistilmBERT [71]: The multilingual version of DistilBERT is a pretrained model that uses knowledge distillation to reduce the size and increase the speed of the BERT model. The design concept is straightforward: construct a smaller model, referred to as DistilBERT, as the student and use the original BERT model as the teacher, training the student to imitate the teacher as closely as possible so that it retains most of the teacher’s reasoning capability.
ChatGLM-6B [72,73]: ChatGLM-6B is an open-source, bilingual conversational language model built on the GLM architecture. By leveraging model quantization, it requires as little as 6 GB of GPU memory when operating at the INT4 quantization level.
4.2. Experiment Analysis
Figure 3 illustrates the variation in training time per epoch for the transformer base model across different sample block sizes, along with a comparison of various parallelization strategies. As the sample block size increases, the training speed also increases, with the best speed achieved at a sample block size of 256. In addition, the different parallelization strategies significantly reduce the training time per epoch. LightChatGLM leverages pipeline parallelism to perform efficient distributed training across multiple GPUs, achieving a minimum training time of 180 s per epoch. In contrast, the higher communication overhead of data and model parallelism slightly extends the training duration. It is also important to note that memory usage grows substantially as the sample block size increases, so the sample block size must be chosen carefully to achieve optimal training acceleration. This speedup is observed on several training sets, including WMT 2020, UM-Corpus, and Ai Challenger, confirming that the acceleration results from the combined effect of pipeline parallelism and appropriate sample block partitioning.
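As a minimal sketch of pipeline-parallel training of a transformer stack in PyTorch 1.10, the snippet below uses torch.distributed.pipeline.sync.Pipe; the two-stage layer split, micro-batch count, and device placement are illustrative assumptions and not LightChatGLM’s actual partitioning.

```python
import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe relies on the RPC framework even in a single process.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# Illustrative two-stage split of a small transformer-like stack across two GPUs.
stage1 = nn.Sequential(
    nn.Embedding(30000, 512),
    nn.TransformerEncoderLayer(d_model=512, nhead=8),
).to("cuda:0")
stage2 = nn.Sequential(
    nn.TransformerEncoderLayer(d_model=512, nhead=8),
    nn.Linear(512, 30000),
).to("cuda:1")

# `chunks` is the number of micro-batches each sample block is split into;
# like the sample block size, it trades memory usage against pipeline utilization.
model = Pipe(nn.Sequential(stage1, stage2), chunks=8)

tokens = torch.randint(0, 30000, (256, 128), device="cuda:0")  # (sample block, sequence length)
output_rref = model(tokens)            # forward returns an RRef to the last stage's output
logits = output_rref.local_value()     # materialized on cuda:1
```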
Figure 4 illustrates the validation loss of the mixed compression method compared with random pruning over an equal number of training rounds. We trained the base LightChatGLM model on the UM-Corpus dataset using the full dataset and parameter set to establish the initial validation loss. The model was then subjected to mixed compression and to random pruning to derive the corresponding student models, and the validation loss was computed on the validation set. As shown in Figure 4, the mixed compression approach, which integrates structured pruning and knowledge distillation, achieves higher compression efficiency while preserving model performance. Its validation loss curves typically converge faster and stabilize at lower values. In contrast, random pruning, which compresses the model by randomly removing weights, can cause larger fluctuations in model performance, so its validation loss curves converge more slowly and stabilize at relatively higher values. The experimental data demonstrate that the mixed compression method outperforms random pruning in both convergence speed and final validation loss, effectively maintaining model performance.
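The snippet below contrasts, in hedged form, structured pruning of the kind used inside the mixed compression pipeline with the random pruning baseline, using torch.nn.utils.prune on a single linear layer; the layer size and 30% pruning ratio are illustrative, and the actual pipeline additionally combines pruning with knowledge distillation.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Structured pruning: remove 30% of whole output rows (neurons) by their L2 norm,
# which keeps the surviving weights dense and hardware-friendly.
structured_layer = nn.Linear(512, 512)
prune.ln_structured(structured_layer, name="weight", amount=0.3, n=2, dim=0)

# Random pruning baseline: zero out 30% of individual weights at random,
# producing irregular sparsity that tends to hurt accuracy more.
random_layer = nn.Linear(512, 512)
prune.random_unstructured(random_layer, name="weight", amount=0.3)

# Fold the pruning masks into the weights to make the compression permanent.
prune.remove(structured_layer, "weight")
prune.remove(random_layer, "weight")
```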
Table 2 offers a comprehensive comparison of the accuracy achieved in computerized English–Chinese translation without applying target-language data for fine-tuning and knowledge distillation. A noteworthy observation is that, without rich source-language annotated data for cross-lingual knowledge transfer, the generalization ability of the compressed model is diminished. Among the evaluated methods, DistilmBERT yields the most favorable outcome, with 46.3% accuracy in Chinese–English bilingual translation. Nevertheless, the student model obtained through LightChatGLM’s mixed compression remains within 9.4% of the teacher model’s performance. LightChatGLM effectively transfers bilingual knowledge from the teacher model to the student model, which proves particularly effective in task-independent knowledge distillation during the pretraining stage. This successful knowledge transfer highlights the efficacy of LightChatGLM in preserving the essential knowledge and capabilities of the teacher model while achieving significant compression, demonstrating its potential for practical deployment in real-world applications.
Table 3 presents a comprehensive comparison of the accuracy achieved in computerized English-to-Chinese translation when fine-tuning and knowledge distillation use both target-language annotated data and source-language data. The table shows that the Chinese–English translation accuracy of all models improves significantly after hybrid-data fine-tuning. This enhancement can be attributed to the use of labeled target data together with further knowledge distillation, which jointly improve performance on Chinese–English bilingual translation. It is noteworthy that LightChatGLM exhibits slightly lower performance than the other models, with 57.4% in English comprehension. This disparity arises from the relatively small English corpus in the dataset used, compared with that of the teacher model. However, by augmenting the training data with a more extensive Chinese corpus, LightChatGLM has the potential to achieve superior results, with 33.3% in English-to-Chinese comprehension. This underscores the importance of dataset composition and the need for mixed data to train models effectively for bilingual translation tasks. The dataset used in Table 2 and Table 3 is Ai Challenger.
According to Table 2 and Table 3, in the mixed-data scenario, the LightChatGLM model achieves its highest Chinese translation accuracy, suggesting that the mixed dataset may enhance the model’s generalization ability to some extent. The hybrid dataset incorporates characteristics of both the source and target languages, allowing the model to capture complex linguistic relationships, thereby improving translation quality. Specifically, the model benefits from mixed-data training by learning contextual and semantic relationships, particularly in computer-related terminology. However, in non-hybrid data scenarios, the model’s translation accuracy is significantly lower. This may be due to the disproportionate presence of single-language data, which leads the model to focus excessively on the source language, thus neglecting the target language during optimization.
According to Equation (12), which combines the loss functions of the different languages (or tasks) through their weighting coefficients, the model adapts better to various linguistic features when these weights are balanced. The strength of hybrid data lies in its ability to help the model learn from both the source and target languages by optimizing this weighted loss, which explains why LightChatGLM can more effectively capture the correlation between these languages in a mixed-data setting, ultimately leading to better translation performance. Conversely, in non-hybrid scenarios, the model may only optimize a single loss term, leading to insufficient learning. Furthermore, the model’s poor performance in single-language scenarios could be attributed to overfitting, where it learns the idiosyncrasies of a specific dataset but lacks the generalization capability to handle other contexts.
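The following minimal sketch shows a weighted combination of per-language losses of the kind Equation (12) describes; the function and variable names are illustrative and do not reproduce the paper’s exact notation.

```python
import torch

def weighted_multilingual_loss(losses, weights):
    """Combine per-language (or per-task) losses using their weighting coefficients."""
    return sum(weights[name] * loss for name, loss in losses.items())

# Balanced weights encourage the model to attend to both languages during hybrid-data training.
losses = {"en": torch.tensor(1.2), "zh": torch.tensor(0.8)}   # dummy per-language losses
weights = {"en": 0.5, "zh": 0.5}
total_loss = weighted_multilingual_loss(losses, weights)
```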
To quantify this, let translation accuracy be defined as $\mathrm{Acc}(D, \theta)$, where $D$ represents the dataset and $\theta$ denotes the model parameters. The accuracy rate $\mathrm{Acc}_{\mathrm{mix}}$ for the mixed-data scenario can be expressed as
$$\mathrm{Acc}_{\mathrm{mix}} = \frac{1}{N_{\mathrm{mix}}} \sum_{i=1}^{N_{\mathrm{mix}}} P(y_i \mid x_i, \theta),$$
where $y_i$ is the target language sentence, $x_i$ is the source language sentence, $N_{\mathrm{mix}}$ is the number of samples in the mixed dataset, and $P(y_i \mid x_i, \theta)$ represents the probability of a correct translation given the model parameters $\theta$. Due to the diversity of the mixed data, $\mathrm{Acc}_{\mathrm{mix}}$ tends to be higher.
In contrast, the translation accuracy $\mathrm{Acc}_{\mathrm{single}}$ in single-language scenarios may exhibit greater variability:
$$\mathrm{Acc}_{\mathrm{single}} = \frac{1}{N_{\mathrm{single}}} \sum_{i=1}^{N_{\mathrm{single}}} P(y_i \mid x_i, \theta),$$
where $N_{\mathrm{single}}$ represents the number of samples in a single-language dataset. In such cases, the model may struggle to capture the diversity present in a mixed dataset, leading to fluctuating performance.
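As a small illustration of these two estimates, the sketch below averages hypothetical per-sample probabilities of a correct translation; the numbers are invented purely for demonstration.

```python
def average_translation_accuracy(probabilities):
    """Mean probability of a correct translation, i.e. (1/N) * sum_i P(y_i | x_i, theta)."""
    return sum(probabilities) / len(probabilities)

# Hypothetical per-sample probabilities for a mixed and a single-language dataset.
acc_mix = average_translation_accuracy([0.71, 0.64, 0.69, 0.73])
acc_single = average_translation_accuracy([0.52, 0.81, 0.47, 0.78])  # wider spread, more variable
```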
In order to enhance the cross-linguistic capabilities of teacher models, LightChatGLM employs a hybrid data-based fine-tuning approach. This method involves randomly mixing annotated data from both the source and target languages, followed by fine-tuning the teacher model on the hybrid dataset while continuing knowledge distillation on downstream tasks. This process aims to yield student models with superior performance.
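A minimal sketch of the hybrid-data construction step described above is shown below, using torch.utils.data; the toy datasets and field layout are assumptions for illustration only.

```python
import random
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset, TensorDataset

def build_hybrid_dataset(source_dataset, target_dataset, seed=42):
    """Randomly interleave annotated samples from the source and target languages."""
    mixed = ConcatDataset([source_dataset, target_dataset])
    indices = list(range(len(mixed)))
    random.Random(seed).shuffle(indices)
    return Subset(mixed, indices)

# Toy stand-ins for the annotated source- and target-language corpora.
source_dataset = TensorDataset(torch.arange(10), torch.zeros(10, dtype=torch.long))  # label 0 = source
target_dataset = TensorDataset(torch.arange(10), torch.ones(10, dtype=torch.long))   # label 1 = target

loader = DataLoader(build_hybrid_dataset(source_dataset, target_dataset), batch_size=8, shuffle=True)
# The teacher model would then be fine-tuned on `loader` before continuing knowledge distillation.
```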
Figure 5 illustrates the average results of Chinese–English bilingual migration experiments conducted using three different fine-tuning methods on WMT 2020, UM-Corpus, and Ai Challenger. Here, the English training set serves as the source-language-labeled corpus, and the Chinese validation set acts as the target-language-labeled corpus. From Figure 5, it is evident that fine-tuning solely on the source language yields the poorest results. This can be attributed to the absence of target-language annotations, which diminishes the generalization ability of the fine-tuned model. Conversely, the fine-tuning methods that incorporate the target annotation language and the hybrid annotation language achieve better performance.
To evaluate the effectiveness of the hybrid data fine-tuning method in more complex tasks, we selected two high-resource datasets, WMT 2020 and UM-Corpus, and two low-resource datasets, IWSLT 17 and XGLUE 20. We applied three fine-tuning strategies: (1) fine-tuning using only high-resource language data (high fine-tuning), (2) fine-tuning using only low-resource language data (low fine-tuning), and (3) hybrid data fine-tuning, which incorporates data from both high- and low-resource languages.
In Figure 6, the experimental results demonstrate that the highest translation accuracy is achieved with the high fine-tuning strategy, as the abundant data from high-resource languages enables the model to capture more context and complex linguistic structures, thereby enhancing its translation performance. In contrast, on low-resource datasets such as IWSLT 17 and XGLUE 20, the low fine-tuning strategy resulted in significantly lower accuracy, with an average decrease of 3.9%. The hybrid fine-tuning approach, although showing performance improvement over low fine-tuning, did not match the effectiveness of high fine-tuning. This suggests that while hybrid fine-tuning benefits from the richer linguistic features learned from high-resource languages, it is not yet fully optimized to transfer this knowledge effectively to low-resource language tasks. Nonetheless, the improvements observed indicate that hybrid fine-tuning aids in enhancing the model’s ability to understand complex language phenomena, such as domain-specific language comprehension in technical fields like computerized English.
To analyze the performance of the LightChatGLM model at different compression levels in depth, we designed a series of compression experiments measuring core indicators such as model size, inference time, and translation accuracy, as shown in Table 4.
The experimental results demonstrate that as the compression level increases, the model’s resource consumption decreases significantly, but this is accompanied by a corresponding loss in performance. While mixed compression techniques effectively reduce resource usage and inference time, higher compression levels can lead to a marked degradation in model performance, particularly in tasks requiring complex language understanding. The performance trade-offs observed at different compression levels highlight the importance of selecting an appropriate compression strategy based on the specific application scenario. This ensures that model accuracy is preserved as much as possible while optimizing resource efficiency. Furthermore, these findings introduce a critical challenge for future research: how to maintain the model’s robust generalization capabilities for language features under extreme compression.
To further validate the effectiveness of the combination of techniques in LightChatGLM, we designed an ablation comparison experiment to systematically analyze the impact of its three core techniques: parallel training, mixed compression, and hybrid data fine-tuning. Each technique was applied individually and in combination to assess its respective contribution. We decomposed the core components of LightChatGLM and structured the ablation experiments around the following configurations (a sketch of these configurations as feature flags is given after the list):
Baseline Model: The baseline model is ChatGLM-6B.
Parallel Training Algorithm: Parallel training was applied to assess its impact on training speed and model performance.
Mixed Compression: Only mixed compression techniques were used to evaluate their effect on model size, inference time, and performance.
Hybrid Data Fine-Tuning: The impact of hybrid data fine-tuning on translation accuracy was assessed.
Parallel Training + Mixed Compression: A combination of parallel training and mixed compression, excluding hybrid fine-tuning, was tested.
Parallel Training + Hybrid Fine-Tuning: Parallel training was combined with hybrid fine-tuning, without applying mixed compression.
Mixed Compression + Hybrid Fine-Tuning: Mixed compression and fine-tuning were applied together, without parallel training.
LightChatGLM: The complete model was tested with all three core combined techniques.
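As a hedged sketch only, the configurations above can be expressed as feature flags; the flag names below are hypothetical and simply mirror the ablation settings listed in the text.

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Hypothetical feature flags mirroring the ablation settings."""
    parallel_training: bool = False
    mixed_compression: bool = False
    hybrid_finetuning: bool = False

ABLATIONS = {
    "Baseline (ChatGLM-6B)": AblationConfig(),
    "Parallel Training": AblationConfig(parallel_training=True),
    "Mixed Compression": AblationConfig(mixed_compression=True),
    "Hybrid Data Fine-Tuning": AblationConfig(hybrid_finetuning=True),
    "Parallel Training + Mixed Compression": AblationConfig(parallel_training=True, mixed_compression=True),
    "Parallel Training + Hybrid Fine-Tuning": AblationConfig(parallel_training=True, hybrid_finetuning=True),
    "Mixed Compression + Hybrid Fine-Tuning": AblationConfig(mixed_compression=True, hybrid_finetuning=True),
    "LightChatGLM (all three techniques)": AblationConfig(True, True, True),
}
```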
Table 5 shows the results of the above ablation experiments.
The parallel training algorithm significantly reduced the training time from 10 h to 6 h, without compromising the translation accuracy compared with the Baseline model. This demonstrates its effectiveness in accelerating training while maintaining model performance. The mixed compression technique provided notable benefits by reducing the model size by 10% and decreasing inference time by approximately 43%. However, the application of mixed compression alone resulted in a slight degradation of model performance, indicating that while compression is effective in optimizing resource efficiency, it may adversely impact the model’s accuracy. Hybrid data fine-tuning, on the other hand, improved the model’s translation accuracy from 45.7% to 46.3%, highlighting its ability to enhance the model’s understanding of complex linguistic structures, particularly in tasks related to computerized English comprehension. When combining parallel training, mixed compression, and hybrid data fine-tuning, LightChatGLM demonstrated superior performance across various metrics. This integrated approach yielded the best overall outcomes, particularly in balancing translation accuracy, reducing model size, accelerating inference time, and shortening training duration, thereby offering significant practical advantages.