4.1. Experiment Setup
Hardware environment: The server we used is equipped with an Intel(R) Xeon(R) Silver 4210R CPU at 2.40 GHz and NVIDIA Tesla T4 GPUs with 16 GB of memory each, interconnected via PCIe 3.0. The server runs a 64-bit Ubuntu 20.04 system with CUDA toolkit 10.2 and PyTorch 1.10.2.
Hyperparameter setting: LightChatGLM used Wikipedia pages containing computer-specific English and Chinese nouns as the training corpus, with the English and Chinese corpora randomly mixed together. The hyperparameters were set as follows: a batch size of 256, a maximum sequence length of 128, a dropout rate of 0.1, a parameter decay of 0.05, and 400,000 steps for model parameter updates. The learning rate was initially set to 0.9 and decayed after the first 10% of the update steps. Based on these initial parameters, we performed a hyperparameter search to further optimize the model’s performance. Hyperparameter search is usually performed using grid search or random search, combined with cross-validation, to select the combination of parameters that best enhances model performance for a specific task [49,63,64].
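As a hedged illustration of this search step, the sketch below runs a random search with k-fold cross-validation over an illustrative search space; the parameter ranges and the train_and_eval stub are assumptions for demonstration rather than the exact procedure used for LightChatGLM.

```python
import random

# Hypothetical search space; the actual ranges explored for LightChatGLM are not listed in the paper.
SEARCH_SPACE = {
    "learning_rate": [0.9, 0.1, 0.01, 0.001],
    "batch_size": [128, 256, 512],
    "dropout": [0.1, 0.2, 0.3],
    "parameter_decay": [0.01, 0.05, 0.1],
}

def train_and_eval(config, fold):
    """Stub for demonstration: fine-tune on k-1 folds and return a validation score.
    Replace with the actual training/evaluation routine."""
    return random.random()  # dummy score so the sketch runs end to end

def cross_validate(config, num_folds=5):
    """Average the validation score over the k folds."""
    scores = [train_and_eval(config, fold) for fold in range(num_folds)]
    return sum(scores) / num_folds

def random_search(space, num_trials=20):
    """Sample random combinations and keep the one with the best cross-validated score."""
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = {name: random.choice(values) for name, values in space.items()}
        score = cross_validate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(SEARCH_SPACE)
```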
For the downstream multilingual task, the datasets we used are shown in Table 1. The WMT 2020 Chinese–English dataset comprises Chinese and English segments from the Chinese–English translation tasks of the WMT competition and serves as a benchmark for bilingual machine translation. The UM-Corpus, developed by the University of Macau, provides a high-quality Chinese–English parallel corpus suitable for machine translation and semantic understanding tasks. The Ai Challenger dataset encompasses Chinese–English parallel corpora for various tasks and is an essential resource for research in natural language processing and machine translation. IWSLT 17 is mainly used for multilingual spoken-language translation tasks, especially Chinese–English translation. XGLUE is a cross-lingual natural language understanding benchmark released by Microsoft that covers multiple tasks, including translation, question answering, and text classification.
During fine-tuning or the distillation of the student model for this task, the hyperparameters were adjusted accordingly. Specifically, the maximum sequence length was set to 128, the batch size to 32, and the two balance factors were both set to 0.65. Additionally, the smoothed logit temperature was set to 1.
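The sketch below shows one common way such balance factors and a logit temperature can enter a distillation objective; the names alpha and temperature are illustrative and are not claimed to match the paper’s notation or exact loss formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.65, temperature=1.0):
    """Combine the hard-label loss with a soft-label knowledge distillation term.

    alpha:       illustrative balance factor (the paper sets its two balance factors to 0.65)
    temperature: logit-smoothing temperature (set to 1 in the experiments)
    """
    # Cross-entropy against the ground-truth labels (hard targets).
    hard_loss = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-smoothed teacher and student distributions (soft targets).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Usage on a dummy batch (vocabulary size 8, batch of 4):
student_logits = torch.randn(4, 8)
teacher_logits = torch.randn(4, 8)
labels = torch.randint(0, 8, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

With the temperature at 1, as in the experiments, the logits are left unsmoothed, while alpha controls how much weight the soft teacher targets receive relative to the hard labels.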
Comparison models: we evaluate the performance of LightChatGLM by comparing it with the following typical schemes:
mBERT_drop [70]: mBERT is a multilingual BERT model pretrained for 104 languages, featuring the same structure as the original BERT model. Meanwhile, mBERT_drop represents a compression technique specifically designed for mBERT, involving the direct pruning of the top transformer layer of the mBERT model.
DistilmBERT [71]: The multilingual version of DistilBERT is a pretrained model that uses knowledge distillation to reduce the size and increase the speed of the BERT model. The design concept is straightforward: construct a smaller model, referred to as DistilBERT, as the student and use the original BERT model as the teacher, training the student to imitate the teacher as closely as possible so that it retains most of the teacher’s reasoning capability.
ChatGLM-6B [72,73]: ChatGLM-6B is an open-source, bilingual conversational language model built on the GLM architecture. By leveraging model quantization, it requires as little as 6 GB of GPU memory when operating at the INT4 quantization level.
4.2. Experiment Analysis
Figure 3 illustrates the variation in training time per epoch for the transformer base model across different sample block sizes, along with a comparison of various parallelization strategies. As the sample block size increases, the training speed also increases, with the best speed achieved at a sample block size of 256. In addition, the different parallelization strategies significantly reduce the training time per epoch. LightChatGLM leverages pipeline parallelism to perform efficient distributed training across multiple GPUs, achieving a minimum training time of 180 s per epoch. In contrast, the higher communication overhead of data and model parallelism slightly extends the training duration. It is also important to note that memory usage grows substantially as the sample block size increases, so the sample block size must be chosen carefully to achieve optimal training acceleration. This speedup is observed on several training sets, including WMT 2020, UM-Corpus, and Ai Challenger, confirming that the acceleration results from the combined effect of pipeline parallelism and appropriate sample block partitioning.
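As a minimal sketch of pipeline-parallel training of a transformer stack in PyTorch 1.10, the snippet below uses torch.distributed.pipeline.sync.Pipe; the two-stage layer split, micro-batch count, and device placement are illustrative assumptions and not LightChatGLM’s actual partitioning.

```python
import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe relies on the RPC framework even in a single process.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# Illustrative two-stage split of a small transformer-like stack across two GPUs.
stage1 = nn.Sequential(
    nn.Embedding(30000, 512),
    nn.TransformerEncoderLayer(d_model=512, nhead=8),
).to("cuda:0")
stage2 = nn.Sequential(
    nn.TransformerEncoderLayer(d_model=512, nhead=8),
    nn.Linear(512, 30000),
).to("cuda:1")

# `chunks` is the number of micro-batches each sample block is split into;
# like the sample block size, it trades memory usage against pipeline utilization.
model = Pipe(nn.Sequential(stage1, stage2), chunks=8)

tokens = torch.randint(0, 30000, (256, 128), device="cuda:0")  # (sample block, sequence length)
output_rref = model(tokens)            # forward returns an RRef to the last stage's output
logits = output_rref.local_value()     # materialized on cuda:1
```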
Figure 4 illustrates the validation loss of the mixed compression method compared with random pruning over an equal number of training rounds. We trained the base LightChatGLM model on the UM-Corpus dataset using the full dataset and parameter set to establish the initial validation loss. The model was then subjected to mixed compression and to random pruning to derive the corresponding student models, and the validation loss was computed on the validation set. As shown in Figure 4, the mixed compression approach, which integrates structured pruning and knowledge distillation, achieves higher compression efficiency while preserving model performance. Its validation loss curves typically converge faster and stabilize at lower values. In contrast, random pruning, which compresses the model by randomly removing weights, can cause larger fluctuations in model performance, so its validation loss curves converge more slowly and stabilize at relatively higher values. The experimental data demonstrate that the mixed compression method outperforms random pruning in both convergence speed and final validation loss, effectively maintaining model performance.
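The snippet below contrasts, in hedged form, structured pruning of the kind used inside the mixed compression pipeline with the random pruning baseline, using torch.nn.utils.prune on a single linear layer; the layer size and 30% pruning ratio are illustrative, and the actual pipeline additionally combines pruning with knowledge distillation.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Structured pruning: remove 30% of whole output rows (neurons) by their L2 norm,
# which keeps the surviving weights dense and hardware-friendly.
structured_layer = nn.Linear(512, 512)
prune.ln_structured(structured_layer, name="weight", amount=0.3, n=2, dim=0)

# Random pruning baseline: zero out 30% of individual weights at random,
# producing irregular sparsity that tends to hurt accuracy more.
random_layer = nn.Linear(512, 512)
prune.random_unstructured(random_layer, name="weight", amount=0.3)

# Fold the pruning masks into the weights to make the compression permanent.
prune.remove(structured_layer, "weight")
prune.remove(random_layer, "weight")
```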
Table 2 offers a comprehensive comparison of the accuracy achieved in computerized English–Chinese translation without applying target-language data for fine-tuning and knowledge distillation. A noteworthy observation is that, without rich source-language annotated data for cross-lingual knowledge transfer, the generalization ability of the compressed model is diminished. Among the evaluated methods, DistilmBERT yields the most favorable outcome, with 46.3% accuracy in Chinese–English bilingual translation. Nevertheless, the student model obtained through LightChatGLM’s mixed compression remains within 9.4% of the teacher model’s performance. LightChatGLM effectively transfers bilingual knowledge from the teacher model to the student model, which proves particularly effective in task-independent knowledge distillation during the pretraining stage. This successful knowledge transfer highlights the efficacy of LightChatGLM in preserving the essential knowledge and capabilities of the teacher model while achieving significant compression, demonstrating its potential for practical deployment in real-world applications.
Table 3 presents a comprehensive comparison of the accuracy achieved in computerized English-to-Chinese translation when fine-tuning and knowledge distillation use both target-language annotated data and source-language data. The table shows that the Chinese–English translation accuracy of all models improves significantly after hybrid-data fine-tuning. This enhancement can be attributed to the use of labeled target data together with further knowledge distillation, which jointly improve performance on Chinese–English bilingual translation. It is noteworthy that LightChatGLM exhibits slightly lower performance than the other models, with 57.4% in English comprehension. This disparity arises from the relatively small English corpus in the dataset used, compared with that of the teacher model. However, by augmenting the training data with a more extensive Chinese corpus, LightChatGLM has the potential to achieve superior results, with 33.3% in English-to-Chinese comprehension. This underscores the importance of dataset composition and the need for mixed data to train models effectively for bilingual translation tasks. The dataset used in Table 2 and Table 3 is Ai Challenger.
According to Table 2 and Table 3, in the mixed-data scenario, the LightChatGLM model achieves its highest Chinese translation accuracy, suggesting that the mixed dataset may enhance the model’s generalization ability to some extent. The hybrid dataset incorporates characteristics of both the source and target languages, allowing the model to capture complex linguistic relationships, thereby improving translation quality. Specifically, the model benefits from mixed-data training by learning contextual and semantic relationships, particularly in computer-related terminology. However, in non-hybrid data scenarios, the model’s translation accuracy is significantly lower. This may be due to the disproportionate presence of single-language data, which leads the model to focus excessively on the source language, thus neglecting the target language during optimization.
According to Equation (12), which combines the loss functions of the different languages (or tasks) through their weighting coefficients, the model adapts better to various linguistic features when these weights are balanced. The strength of hybrid data lies in its ability to help the model learn from both the source and target languages by optimizing this weighted loss, which explains why LightChatGLM can more effectively capture the correlation between these languages in a mixed-data setting, ultimately leading to better translation performance. Conversely, in non-hybrid scenarios, the model may only optimize a single loss term, leading to insufficient learning. Furthermore, the model’s poor performance in single-language scenarios could be attributed to overfitting, where it learns the idiosyncrasies of a specific dataset but lacks the generalization capability to handle other contexts.
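The following minimal sketch shows a weighted combination of per-language losses of the kind Equation (12) describes; the function and variable names are illustrative and do not reproduce the paper’s exact notation.

```python
import torch

def weighted_multilingual_loss(losses, weights):
    """Combine per-language (or per-task) losses using their weighting coefficients."""
    return sum(weights[name] * loss for name, loss in losses.items())

# Balanced weights encourage the model to attend to both languages during hybrid-data training.
losses = {"en": torch.tensor(1.2), "zh": torch.tensor(0.8)}   # dummy per-language losses
weights = {"en": 0.5, "zh": 0.5}
total_loss = weighted_multilingual_loss(losses, weights)
```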
To quantify this, let translation accuracy be defined as $\mathrm{Acc}(D, \theta)$, where $D$ represents the dataset and $\theta$ denotes the model parameters. The accuracy rate $\mathrm{Acc}_{\mathrm{mix}}$ for the mixed-data scenario can be expressed as
$$\mathrm{Acc}_{\mathrm{mix}} = \frac{1}{N_{\mathrm{mix}}} \sum_{i=1}^{N_{\mathrm{mix}}} P(y_i \mid x_i, \theta),$$
where $y_i$ is the target language sentence, $x_i$ is the source language sentence, $N_{\mathrm{mix}}$ is the number of samples in the mixed dataset, and $P(y_i \mid x_i, \theta)$ represents the probability of a correct translation given the model parameters $\theta$. Due to the diversity of the mixed data, $\mathrm{Acc}_{\mathrm{mix}}$ tends to be higher.
In contrast, the translation accuracy $\mathrm{Acc}_{\mathrm{single}}$ in single-language scenarios may exhibit greater variability:
$$\mathrm{Acc}_{\mathrm{single}} = \frac{1}{N_{\mathrm{single}}} \sum_{i=1}^{N_{\mathrm{single}}} P(y_i \mid x_i, \theta),$$
where $N_{\mathrm{single}}$ represents the number of samples in a single-language dataset. In such cases, the model may struggle to capture the diversity present in a mixed dataset, leading to fluctuating performance.
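As a small illustration of these two estimates, the sketch below averages hypothetical per-sample probabilities of a correct translation; the numbers are invented purely for demonstration.

```python
def average_translation_accuracy(probabilities):
    """Mean probability of a correct translation, i.e. (1/N) * sum_i P(y_i | x_i, theta)."""
    return sum(probabilities) / len(probabilities)

# Hypothetical per-sample probabilities for a mixed and a single-language dataset.
acc_mix = average_translation_accuracy([0.71, 0.64, 0.69, 0.73])
acc_single = average_translation_accuracy([0.52, 0.81, 0.47, 0.78])  # wider spread, more variable
```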
In order to enhance the cross-linguistic capabilities of teacher models, LightChatGLM employs a hybrid data-based fine-tuning approach. This method involves randomly mixing annotated data from both the source and target languages, followed by fine-tuning the teacher model on the hybrid dataset while continuing knowledge distillation on downstream tasks. This process aims to yield student models with superior performance.
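A minimal sketch of the hybrid-data construction step described above is shown below, using torch.utils.data; the toy datasets and field layout are assumptions for illustration only.

```python
import random
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset, TensorDataset

def build_hybrid_dataset(source_dataset, target_dataset, seed=42):
    """Randomly interleave annotated samples from the source and target languages."""
    mixed = ConcatDataset([source_dataset, target_dataset])
    indices = list(range(len(mixed)))
    random.Random(seed).shuffle(indices)
    return Subset(mixed, indices)

# Toy stand-ins for the annotated source- and target-language corpora.
source_dataset = TensorDataset(torch.arange(10), torch.zeros(10, dtype=torch.long))  # label 0 = source
target_dataset = TensorDataset(torch.arange(10), torch.ones(10, dtype=torch.long))   # label 1 = target

loader = DataLoader(build_hybrid_dataset(source_dataset, target_dataset), batch_size=8, shuffle=True)
# The teacher model would then be fine-tuned on `loader` before continuing knowledge distillation.
```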
Figure 5 illustrates the average results of Chinese–English bilingual migration experiments conducted using three different fine-tuning methods on WMT 2020, UM-Corpus, and Ai Challenger. Here, the English training set serves as the source-language-labeled corpus, and the Chinese validation set acts as the target-language-labeled corpus. From Figure 5, it is evident that fine-tuning solely on the source language yields the poorest results. This can be attributed to the absence of target-language annotations, which diminishes the generalization ability of the fine-tuned model. Conversely, the fine-tuning methods that incorporate the target annotation language and the hybrid annotation language achieve better performance.
To evaluate the effectiveness of the hybrid data fine-tuning method in more complex tasks, we selected two high-resource datasets, WMT 2020 and UM-Corpus, and two low-resource datasets, IWSLT 17 and XGLUE 20. We applied three fine-tuning strategies: (1) fine-tuning using only high-resource language data (high fine-tuning), (2) fine-tuning using only low-resource language data (low fine-tuning), and (3) hybrid data fine-tuning, which incorporates data from both high- and low-resource languages.
In Figure 6, the experimental results demonstrate that the highest translation accuracy is achieved with the high fine-tuning strategy, as the abundant data from high-resource languages enables the model to capture more context and complex linguistic structures, thereby enhancing its translation performance. In contrast, on low-resource datasets such as IWSLT 17 and XGLUE 20, the low fine-tuning strategy resulted in significantly lower accuracy, with an average decrease of 3.9%. The hybrid fine-tuning approach, although showing performance improvement over low fine-tuning, did not match the effectiveness of high fine-tuning. This suggests that while hybrid fine-tuning benefits from the richer linguistic features learned from high-resource languages, it is not yet fully optimized to transfer this knowledge effectively to low-resource language tasks. Nonetheless, the improvements observed indicate that hybrid fine-tuning aids in enhancing the model’s ability to understand complex language phenomena, such as domain-specific language comprehension in technical fields like computerized English.
To analyze the performance of the LightChatGLM model at different compression levels in depth, we designed a series of compression experiments measuring core indicators such as model size, inference time, and translation accuracy, as shown in Table 4.
The experimental results demonstrate that as the compression level increases, the model’s resource consumption decreases significantly, but this is accompanied by a corresponding loss in performance. While mixed compression techniques effectively reduce resource usage and inference time, higher compression levels can lead to a marked degradation in model performance, particularly in tasks requiring complex language understanding. The performance trade-offs observed at different compression levels highlight the importance of selecting an appropriate compression strategy based on the specific application scenario. This ensures that model accuracy is preserved as much as possible while optimizing resource efficiency. Furthermore, these findings introduce a critical challenge for future research: how to maintain the model’s robust generalization capabilities for language features under extreme compression.
To further validate the effectiveness of the combination of techniques in LightChatGLM, we designed an ablation comparison experiment to systematically analyze the impact of its three core techniques: parallel training, mixed compression, and hybrid data fine-tuning. Each technique was applied individually and in combination to assess its respective contribution. We decomposed the core components of LightChatGLM and structured the ablation experiments around the following configurations (a sketch of these configurations as feature flags is given after the list):
Baseline Model: The baseline model is ChatGLM-6B.
Parallel Training Algorithm: Parallel training was applied to assess its impact on training speed and model performance.
Mixed Compression: Only mixed compression techniques were used to evaluate their effect on model size, inference time, and performance.
Hybrid Data Fine-Tuning: The impact of hybrid data fine-tuning on translation accuracy was assessed.
Parallel Training + Mixed Compression: A combination of parallel training and mixed compression, excluding hybrid fine-tuning, was tested.
Parallel Training + Hybrid Fine-Tuning: Parallel training was combined with hybrid fine-tuning, without applying mixed compression.
Mixed Compression + Hybrid Fine-Tuning: Mixed compression and fine-tuning were applied together, without parallel training.
LightChatGLM: The complete model was tested with all three core combined techniques.
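As a hedged sketch only, the configurations above can be expressed as feature flags; the flag names below are hypothetical and simply mirror the ablation settings listed in the text.

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Hypothetical feature flags mirroring the ablation settings."""
    parallel_training: bool = False
    mixed_compression: bool = False
    hybrid_finetuning: bool = False

ABLATIONS = {
    "Baseline (ChatGLM-6B)": AblationConfig(),
    "Parallel Training": AblationConfig(parallel_training=True),
    "Mixed Compression": AblationConfig(mixed_compression=True),
    "Hybrid Data Fine-Tuning": AblationConfig(hybrid_finetuning=True),
    "Parallel Training + Mixed Compression": AblationConfig(parallel_training=True, mixed_compression=True),
    "Parallel Training + Hybrid Fine-Tuning": AblationConfig(parallel_training=True, hybrid_finetuning=True),
    "Mixed Compression + Hybrid Fine-Tuning": AblationConfig(mixed_compression=True, hybrid_finetuning=True),
    "LightChatGLM (all three techniques)": AblationConfig(True, True, True),
}
```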
Table 5 shows the results of the above ablation experiments.
The parallel training algorithm significantly reduced the training time from 10 h to 6 h, without compromising the translation accuracy compared with the Baseline model. This demonstrates its effectiveness in accelerating training while maintaining model performance. The mixed compression technique provided notable benefits by reducing the model size by 10% and decreasing inference time by approximately 43%. However, the application of mixed compression alone resulted in a slight degradation of model performance, indicating that while compression is effective in optimizing resource efficiency, it may adversely impact the model’s accuracy. Hybrid data fine-tuning, on the other hand, improved the model’s translation accuracy from 45.7% to 46.3%, highlighting its ability to enhance the model’s understanding of complex linguistic structures, particularly in tasks related to computerized English comprehension. When combining parallel training, mixed compression, and hybrid data fine-tuning, LightChatGLM demonstrated superior performance across various metrics. This integrated approach yielded the best overall outcomes, particularly in balancing translation accuracy, reducing model size, accelerating inference time, and shortening training duration, thereby offering significant practical advantages.