Article

Efficient Fine-Tuning of Large Language Models via a Low-Rank Gradient Estimator

by Luoming Zhang 1,2, Zhenyu Lou 1,2, Yangwei Ying 1,2, Cheng Yang 1,2,* and Hong Zhou 1,2,*
1 The College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou 310027, China
2 Hangzhou Yihe Hui Sheng, Room 1006-2, 10th Floor, Building 1, No. 180 Kecheng Street, Hangzhou 310000, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(1), 82; https://doi.org/10.3390/app15010082
Submission received: 17 November 2024 / Revised: 13 December 2024 / Accepted: 24 December 2024 / Published: 26 December 2024

Abstract

In this paper, we present a Low-Rank Gradient Estimator (LoGE) to accelerate the fine-tuning computation of transformers, especially large language models (LLMs). Unlike Parameter-Efficient Fine-Tuning (PEFT) methods, which primarily aim to minimize the number of fine-tuning parameters, LoGE also significantly reduces the computational load of activation gradient calculations by decomposing pre-trained weights and utilizing low-rank matrices during the backward pass. Our approach includes an effective solution for identifying sensitive and important latent subspaces in large models before training on downstream datasets. As LoGE does not alter the network structure, it can be conveniently integrated into existing models. We validated LoGE's efficacy through comprehensive experiments across various models and tasks. For the widely used LLaMA model equipped with LoRA, LoGE achieves up to a 1.3× speedup while maintaining accuracy.

1. Introduction

Recently, large language models (LLMs) [1,2,3,4,5,6,7] have demonstrated exceptional capabilities in various language understanding tasks [8], becoming the backbone of leading chat systems [7]. To further exploit these capabilities, fine-tuning LLMs for specific downstream tasks has become a popular approach. Although the dataset size for fine-tuning a large language model is significantly smaller than that required for pre-training, the computational overhead of fine-tuning becomes comparable to pre-training due to the growing number of downstream tasks and their increasing complexity. Using LLaMA-65B [5] as an example, fine-tuning the model on a 10B-token dataset for 300 downstream tasks would generate approximately 24 tons of CO2, posing significant challenges in terms of computational demands and environmental impact.
To enhance memory efficiency and reduce computational overhead during fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) [9,10,11,12,13,14,15,16,17,18,19] significantly reduces the number of trainable parameters, thereby decreasing the memory required for gradients and optimizer states. Specifically, the Low-Rank Adapter (LoRA) [13] employs a low-rank approximation to represent weight updates and merges these updates into the weights for inference, introducing no additional inference overhead. PEFT methods reduce the number of trainable parameters and thus the computation of weight gradients. However, they still require full-size matrix computations for the forward pass and the activation gradients, which remain computationally expensive.
To address the above challenge, we propose the Low-Rank Gradient Estimator (LoGE), which uses low-rank computation during the backward pass while maintaining full-rank computation during the forward pass. Our key idea is to leverage the low-rank structure of the weight W itself rather than approximating the weight updates as low rank. In LoGE, we decompose the full-rank weights and utilize the resulting low-rank matrices to calculate the input gradient, as illustrated in Figure 1. By decomposing the full-rank weight into low-rank matrices (denoted as B and A), we can interpret weight updates as updates of these low-rank matrices. We fix one low-rank matrix (typically denoted as B) to avoid full-rank updates and maintain stability during training. To reduce the computational load during the forward pass while preserving the adapter's functionality, we introduce a trainable parameter C as the update to A and restrict updates solely to C. At the start of training, C is initialized to 0 to align with the pre-trained weights. To identify task-specific subspaces, we utilize data from the downstream task, which allows us to select the optimal subspace for optimization. By combining gradient similarity and gradient increment values, we introduce a novel importance metric that aids in selecting the most relevant subspace for the fine-tuning process.
Overall, our contributions are as follows:
  • We propose a new efficient fine-tuning method for LLMs by decomposing the pre-trained weight into a low-rank structure in the backward pass while maintaining the full-rank weight in the forward pass.
  • We propose a metric to choose the subspace via task-specific gradient scores and weight salience.
  • We evaluate our method on different models and different tasks to demonstrate the effectiveness of our proposed method and provide detailed analyses based on the empirical results. Results show that LoGE achieves better performance with less training time cost.

2. Related Work

Large Language Models (LLMs) [1,2,3,4,5,6,7] have become a leading paradigm in natural language processing, achieving top results in numerous tasks [20,21] and forming the basis of chat systems [22]. To maximize downstream performance, fine-tuning LLMs is essential. However, the size of LLMs has dramatically increased to enhance performance, exemplified by GPT-3, which boasts up to 175 billion parameters and requires approximately 350 GB of memory in FP16 format. This growth poses significant challenges for fine-tuning these models in a memory- and compute-efficient manner.
Parameter-Efficient Fine-Tuning (PEFT) [9,10,11,12,13] methods address this problem by reducing the number of trainable parameters through various techniques while maintaining performance. PEFT typically involves adding adapter layers between the existing layers of a neural network; the pre-trained weights are frozen, and only the adapter layers are trained [9]. Prefix tuning [14] adds prefix parameters to the hidden states in every layer of the model. Prompt tuning [15] uses templates to reconstruct prompts and only updates the parameters related to prompt understanding. Low-Rank Adaptation (LoRA) [13] is a popular method that decomposes adapter weights into two low-rank matrices, which are then fused with the pre-trained weights during inference. LoRA has demonstrated performance comparable to full-parameter fine-tuning while using far fewer parameters. AdaLoRA [16] utilizes SVD decomposition and prunes less significant singular values for more efficient updates. Another line of work uses orthogonal factorization to fine-tune diffusion models. LoRA-FA [17] keeps matrix A fixed with random initialization and only trains matrix B. Child Tuning [18] and SPT [19] only update a subset of parameters by strategically masking the gradients of the parameters outside the subset. Our method uses low-rank matrices to represent the subset of parameters in latent space, which accelerates the backward computation without masking.
Quantization-Based PEFT. Although PEFT reduces the trainable parameters in models, the pre-trained weights still require a large amount of memory. To address this challenge, QLoRA [23] compresses the pre-trained weights into 4-bit floats and integrates LoRA. Meanwhile, QA-LoRA [24] and EfficientDM [25] address the quantization challenges in LoRA-fused models. However, QLoRA and QA-LoRA introduce extra dequantization overhead into the calculation. In contrast, our approach eliminates the need for dequantization in the backward pass, thereby achieving significant speed improvements over these memory-efficient LoRA variants.
Low-Rank Structures in Deep Learning. Low rank is widely used in machine learning, as many machine learning problems [26,27,28,29] have certain intrinsic low-rank structures. Particularly in over-parameterized neural networks, it is observed that the networks tend to exhibit low-rank properties post-training [30]. Furthermore, several studies have explored low-rank updates to frozen models. For instance, ReLoRA [31] extends low-rank updates to encompass the entire training process of the original networks, and GaLore [32] employs low-rank decompositions of weight gradients to reduce memory demands during full training. To our knowledge, no existing work applies low-rank decomposition directly to the weights during training to reduce computational load. In our research, we decompose pre-trained weights to expedite gradient calculations.

3. Background

Naive Full-Rank Training. For a given weight W_0 ∈ ℝ^{m×n}, the conventional update can be expressed as follows:

$$W_T = W_0 + \eta \sum_{t=0}^{T-1} G_t$$

where G_t represents the gradient added to the weight matrix at step t. In training, the backward pass is both memory-intensive and computationally demanding. For example, with Adam, gradients G_t and additional optimizer states M, V ∈ ℝ^{m×n} are used to regularize the gradient, increasing memory requirements by 3mn. Gradient calculations involve two full-rank operations:

$$\partial O / \partial x = \delta_O W, \qquad \partial O / \partial W = \delta_O^{\top} x$$

Here, x ∈ ℝ^{b×n} is the input, and δ_O ∈ ℝ^{b×m} represents the output error. This naive approach incurs 4bmn FLOPs during the backward pass.
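To make these two full-rank products concrete, the short PyTorch sketch below checks the closed-form expressions against autograd for a plain linear map (illustrative only; the dimensions and the name delta_O are ours, not the paper's code):

```python
import torch

b, m, n = 4, 8, 16                       # batch size, output dim, input dim
x = torch.randn(b, n, requires_grad=True)
W = torch.randn(m, n, requires_grad=True)

out = x @ W.T                            # forward pass, shape (b, m)
delta_O = torch.randn(b, m)              # stand-in for the upstream output error
out.backward(delta_O)

# The two full-rank backward products from the text:
assert torch.allclose(x.grad, delta_O @ W, atol=1e-5)      # dL/dx = delta_O W
assert torch.allclose(W.grad, delta_O.T @ x, atol=1e-5)    # dL/dW = delta_O^T x
```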
Low-Rank Adaptation. For a linear layer W ∈ ℝ^{m×n}, LoRA and its variants model the incremental update of the pre-trained weight with the product of two low-rank matrices B and A:

$$W = W_0 + BA,$$

where W, W_0 ∈ ℝ^{m×n}, A ∈ ℝ^{r×n}, and B ∈ ℝ^{m×r}, with r ≪ min(n, m). With Adam, LoRA needs 3mr + 3nr memory to store the optimizer states and gradients. The gradient calculation for the weights involves two low-rank operations:

$$\partial O / \partial A = (\delta_O B)^{\top} x, \qquad \partial O / \partial B = \delta_O^{\top} (x A^{\top})$$

During backpropagation, LoRA reduces the FLOPs to 2bmn + 6bmr + 6bnr, thereby reducing the calculation.
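For reference, a minimal LoRA-style layer corresponding to W = W_0 + BA might look as follows (a sketch; the class name LoRALinear and the initialization constants are ours, not LoRA's reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # freeze the pre-trained weight
        m, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, n) * 0.01)   # r x n, small random init
        self.B = nn.Parameter(torch.zeros(m, rank))          # m x r, zero init so W = W0 at start

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T      # (W0 + B A) x

layer = LoRALinear(nn.Linear(16, 8, bias=False), rank=4)
y = layer(torch.randn(2, 16))
```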
Quantization. Quantization is the process of mapping continuous or high bit-width data types into discrete and often lower bit-width values. For instance, converting a 16-bit floating-point (FP16) tensor into a 4-bit integer (INT4) tensor with a range of [−7, 8] can be expressed as follows:

$$X_{\mathrm{INT4}} = \mathrm{round}\!\left(\frac{15}{\mathrm{absmax}(X_{\mathrm{FP16}})}\, X_{\mathrm{FP16}}\right) = \mathrm{round}\!\left(\frac{X_{\mathrm{FP16}}}{s_{\mathrm{FP16}}}\right)$$

where s is the quantization scale. Dequantization is the inverse:

$$\mathrm{dequant}(s_{\mathrm{FP16}}, X_{\mathrm{INT4}}) = s_{\mathrm{FP16}}\, X_{\mathrm{INT4}} = \hat{X}_{\mathrm{FP16}}$$
Here, X̂ is an estimate of the original value, as quantization introduces quantization error. Quantization is introduced to reduce the weight memory usage of LLMs, complementing PEFT, which reduces the number of learnable parameters and thereby the memory usage for weight gradients.
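As a sketch of this round trip (using the common absmax convention on a signed 4-bit grid; the helper names quantize/dequantize are ours, and production systems such as QLoRA use the NF4 data type rather than this uniform grid):

```python
import torch

def quantize(x_fp16: torch.Tensor):
    """Absmax quantization of an FP16 tensor onto a signed 4-bit integer grid."""
    scale = x_fp16.abs().max() / 7.0                                   # map [-absmax, absmax] onto [-7, 7]
    x_int4 = torch.round(x_fp16 / scale).clamp(-8, 7).to(torch.int8)   # int8 used here as INT4 storage stand-in
    return scale, x_int4

def dequantize(scale: torch.Tensor, x_int4: torch.Tensor) -> torch.Tensor:
    """Inverse mapping: an estimate of the original tensor, carrying quantization error."""
    return scale * x_int4.to(torch.float16)

w = torch.randn(64, 64, dtype=torch.float16)
s, w_q = quantize(w)
w_hat = dequantize(s, w_q)   # w_hat approximates w
```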

4. Method

In many cases, the optimal weight may not be low-rank. Therefore, prior works [31,32] choose to maintain the full-rank weight and update it with a dynamic low-rank approximation to enhance training efficiency. If we can estimate the full-rank weight with a low-rank approximation during the backward pass while maintaining the full-rank weight in the forward pass, we can further enhance training efficiency. This leads to our proposed LoGE method.

4.1. Low-Rank Gradient Estimator

The core idea of LoGE is to replace the full-rank weight in the backward pass with a low-rank representation and update one low-rank matrix. In the backward pass, we substitute the full-rank matrix with its low-rank approximation to compute the gradient of the activation. However, updating both components of the low-rank approximation prevents the weight update from being directly represented by the updates of the components. Given the low-rank approximation BA, we can formulate the update of the weight δW as follows:

$$\delta W = (B + \delta B)(A + \delta A) - BA = \delta B\, A + B\, \delta A + \delta B\, \delta A$$
The update δW requires full-rank storage, making this approximation memory-inefficient. Inspired by [17], we fix one low-rank matrix and only update the other. This approach keeps the update in a low-rank format, thereby reducing the memory required for optimizer states and gradients. For simplicity in computation, we frame the low-rank update in the adapter format, following the design principles of LoRA [13]. Given a full-rank weight W_0 ∈ ℝ^{m×n} with n ≥ m, the calculation for LoGE is defined as follows:

$$O = (W_0 + BC)\, x, \qquad \partial O / \partial x = \delta_O\, B\,(C + A)$$

Here, B ∈ ℝ^{m×r} and A ∈ ℝ^{r×n} represent the low-rank approximation of W_0 used during the backward pass, with both fixed during training. The matrix C ∈ ℝ^{r×n} contains the trainable parameters and represents the update to A, so the backward pass approximates W_0 + BC ≈ B(A + C). For cases where n < m, we instead let C ∈ ℝ^{m×r} represent the update to B and reformulate the calculations as follows:

$$O = (W_0 + CA)\, x, \qquad \partial O / \partial x = \delta_O\,(B + C)\, A$$
The subsequent sections concentrate on the case n ≥ m to explore LoGE further. Orthogonal decomposition matrices simplify the calculation of importance for structured sparsity. Therefore, we utilize Singular Value Decomposition (SVD) to decompose the matrix. Given the pre-trained weight W_0, this process can be formulated as follows:

$$W_0 = U \Sigma V^{\top}, \qquad B = U[idx], \qquad A = \Sigma[idx]\, V[idx]^{\top}$$

Here, idx refers to the selected index vector; the selection strategy is discussed in the next subsection.
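A minimal PyTorch sketch of this asymmetric forward/backward rule is shown below, assuming the formulation above (full-rank forward O = (W_0 + BC)x, low-rank activation gradient δ_O B(C + A)). The class names LoGEFunction and LoGELinear, and the placeholder top-r index selection, are ours and not the authors' released code; the importance-based selection of idx is described in the next subsection.

```python
import torch

class LoGEFunction(torch.autograd.Function):
    """Full-rank forward pass, low-rank gradient estimation in the backward pass."""

    @staticmethod
    def forward(ctx, x, W0, B, A, C):
        ctx.save_for_backward(x, B, A, C)
        return x @ (W0 + B @ C).T               # exact forward: O = (W0 + B C) x

    @staticmethod
    def backward(ctx, delta_O):
        x, B, A, C = ctx.saved_tensors
        dOB = delta_O @ B                        # (b, r): shared low-rank intermediate
        grad_x = dOB @ (C + A)                   # estimate dL/dx using W0 ~ B A
        grad_C = dOB.T @ x                       # dL/dC = (delta_O B)^T x
        return grad_x, None, None, None, grad_C

class LoGELinear(torch.nn.Module):
    def __init__(self, W0: torch.Tensor, rank: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(W0, full_matrices=False)
        idx = torch.arange(rank)                 # placeholder: importance-based selection (Section 4.2)
        self.register_buffer("W0", W0)
        self.register_buffer("B", U[:, idx].contiguous())                       # m x r
        self.register_buffer("A", (S[idx].unsqueeze(1) * Vh[idx]).contiguous()) # r x n = Sigma[idx] V[idx]^T
        self.C = torch.nn.Parameter(torch.zeros_like(self.A))                   # trainable update to A, zero-init

    def forward(self, x):
        return LoGEFunction.apply(x, self.W0, self.B, self.A, self.C)

layer = LoGELinear(torch.randn(512, 512), rank=32)
loss = layer(torch.randn(4, 512)).sum()
loss.backward()                                  # only layer.C receives a gradient
```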

4.2. Salient Subspace Selection Strategy

For fine-tuning, the objective is to integrate new knowledge into the pre-trained weights. Therefore, identifying weights that are sensitive to the fine-tuning dataset is crucial. Following the core principle from Child Tuning [18], our approach considers the singular values and the importance of gradients. This dual consideration helps us design a metric to select the most relevant subspaces for updates.
Obtaining the gradient importance of each subspace is challenging, as it requires training models to converge, which is time-consuming. Following prior work [18], we approximate C using the gradient of A after a single step of gradient descent.
Since each latent direction corresponds to a pair of vectors from the weight decomposition, using the full-rank decomposition for the gradient-descent calculation would incur large computation and memory demands. To address this, we sparsify U, Σ, and V^⊤ by keeping only the top-2r singular values and then compute the gradient-based importance metric.
While Child Tuning [18] selects subspaces based on element-wise importance, our method focuses on identifying important channels from the SVD decomposition. We use an importance score for the entire channel rather than for each individual element. Inspired by structural pruning [33], we utilize the second-order Taylor expansion to determine the importance of each channel. Given the channel gradient g_A = {g_{a_i}}, the gradient importance score can be formulated as follows:
$$s_{a_i} = g_{a_i}^{\top} g_{a_i}$$
As low-rank matrices estimate the gradients of activations, we also consider the magnitude of singular values. Therefore, we use the energy of the singular values (the square of the singular values) to represent their importance. We then normalize these two metrics and combine them using a temperature parameter:
$$S_i = \frac{\lambda_i^2}{\max(\lambda_i^2)} + \alpha\,\frac{s_{a_i}}{\max(s_{a_i})}$$
Here, α serves as a temperature parameter that helps balance differing normalization values. We present a detailed ablation study to compare the performance of different importance metrics and different α .
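The per-channel selection could be sketched as follows (illustrative; select_subspace and grad_A are our names, grad_A is assumed to be the gradient collected after the single probing step described above, and α weights the gradient term as in the combined score):

```python
import torch

def select_subspace(W0: torch.Tensor, grad_A: torch.Tensor, rank: int, alpha: float = 0.3):
    """Select `rank` SVD channels by combining singular-value energy with the
    per-channel gradient score s_{a_i} = g_{a_i}^T g_{a_i}."""
    U, S, Vh = torch.linalg.svd(W0, full_matrices=False)   # S is sorted in descending order
    cand = torch.arange(min(2 * rank, S.numel()))          # keep only the top-2r singular values
    energy = S[cand] ** 2                                   # squared singular values
    grad_score = (grad_A[cand] ** 2).sum(dim=1)             # channel-wise gradient importance
    score = energy / energy.max() + alpha * grad_score / grad_score.max()
    order = torch.argsort(score, descending=True)
    return cand[order[:rank]]                                # indices `idx` used by LoGE
```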
The entire fine-tuning process with LoGE is outlined in Algorithm 1. In practice, the batch size b is set to 32. The time cost for calculating the importance score is approximately 20 s for the OPT-1.3B model on the Alpaca dataset.
Algorithm 1 LoGE for fine-tuning. θ represents the model parameters, and D_k is the dataset for task k.
Require: θ, D_k
1: for layer in model layers do
2:    if layer is linear then
3:        U, Σ, V^⊤ ← SVD(W_0)
4:        layer ← LoGE(W_0, A, U, ΣV^⊤)
5:    end if
6: end for
7: Sample b samples d_k from D_k.
8: Calculate the gradient of A with d_k.
9: Calculate the importance score S_i and select idx.
10: for layer in model layers do
11:     if layer is linear then
12:        layer ← LoGE(W_0, A, U[idx], Σ[idx]V[idx]^⊤)
13:     end if
14: end for
15: for s in training steps do
16:     Update A  {Training step with LoGE}
17: end for
18: return θ
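In code, lines 1-6 and 10-14 of Algorithm 1 amount to a module-replacement loop; a sketch (reusing the hypothetical LoGELinear class from Section 4.1 and targeting the q-proj/v-proj layers mentioned in Section 5.1) is given below.

```python
import torch.nn as nn

def apply_loge(model: nn.Module, rank: int, target=("q_proj", "v_proj")) -> nn.Module:
    """Replace the targeted nn.Linear layers with LoGELinear modules (Algorithm 1, lines 1-6)."""
    replacements = []
    for module in model.modules():
        for child_name, child in module.named_children():
            if isinstance(child, nn.Linear) and child_name in target:
                replacements.append((module, child_name, child))
    for module, child_name, child in replacements:
        setattr(module, child_name, LoGELinear(child.weight.data.clone(), rank))
    return model

# After replacement: sample a batch from D_k, run one backward pass to obtain the probing
# gradient, re-select idx with the importance score from Section 4.2, then train only C.
```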

4.3. Memory Usage and Calculation

Since LoGE limits the trainable parameters to a low-rank matrix rather than a full-rank one, the memory usage of the optimizer states and gradient is reduced to 3nr. Combined with the parameters, the total memory usage, excluding activations, for a linear layer is mn + mr + 5nr. Furthermore, the smaller number of trainable parameters also reduces the communication cost of gradient broadcasts in distributed training. In the backward pass, the calculation of the input gradient is reduced from bmn to bmr + bnr + nr. For better comparison, we list the exact memory usage and FLOPs in Table 1. In the case of fine-tuning an LLaMA-7B model with a rank of 128, the FLOPs for a linear layer under traditional full training are approximately 100.67 M. With the application of LoRA, these FLOPs decrease to 73.4 M. Notably, when employing LoGE and fine-tuning the layers with a larger rank of 384, the FLOPs are further reduced to 49.28 M, which is significantly lower than those achieved with LoRA.
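As a quick cross-check of these per-layer figures, the Table 1 expressions can be evaluated directly (a sketch assuming b = 1 token and m = n = 4096 for LLaMA-7B; the small difference from the 49.28 M quoted above comes from whether the minor n·r term is counted):

```python
def linear_layer_flops(m: int, n: int, r: int, method: str, b: int = 1) -> int:
    """Per-token forward + backward FLOPs of one linear layer, per Table 1."""
    if method == "naive":
        return 6 * b * m * n
    if method == "lora":
        return 4 * b * m * n + 6 * b * m * r + 6 * b * n * r
    if method == "loge":
        return 2 * b * m * n + 4 * b * m * r + 6 * b * n * r + n * r
    raise ValueError(f"unknown method: {method}")

d = 4096  # LLaMA-7B hidden size
print(linear_layer_flops(d, d, 128, "naive") / 1e6)   # ~100.7 MFLOPs
print(linear_layer_flops(d, d, 128, "lora") / 1e6)    # ~73.4 MFLOPs
print(linear_layer_flops(d, d, 384, "loge") / 1e6)    # ~50.9 MFLOPs (~49.3 M without the n*r term)
```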

4.4. QLoGE: LoGE with Quantization

To enhance memory efficiency, we implement quantization for the frozen weights. Although quantization training may increase overall training time [23], the trade-off between memory usage and training speed makes it particularly suitable for fine-tuning. Consequently, we apply QLoGE within the fine-tuning methods presented in this paper.
For quantization-based memory-efficient fine-tuning, a significant drawback is the additional computational overhead introduced by dequantizing weights back to the calculation data type. Since fine-tuning often involves long sequence lengths for a single linear layer, this process becomes compute-bound [23], consequently slowing down training speed. The extra dequantization overhead occurs in two areas: during forward calculations and when calculating activation gradients, which can be expressed as follows:
$$O_{\mathrm{FP16}} = \left(\mathrm{dequant}(s_{\mathrm{FP16}}, W_{\mathrm{NF4}}) + B_{\mathrm{FP16}} A_{\mathrm{FP16}}\right) x_{\mathrm{FP16}}, \qquad \partial O_{\mathrm{FP16}} / \partial x_{\mathrm{FP16}} = \delta_{O,\mathrm{FP16}} \left(\mathrm{dequant}(s_{\mathrm{FP16}}, W_{\mathrm{NF4}}) + B_{\mathrm{FP16}} A_{\mathrm{FP16}}\right)$$
By replacing the quantized weight in the activation gradient calculation, our method eliminates one source of this overhead. Since the task loss is calculated with the quantized weight, we decompose the simulated (dequantized) weight Ŵ into a low-rank form. The strategy for QLoGE can be formulated as follows:

$$\hat{W} = U S V^{\top}, \qquad B = U[idx], \qquad A = S[idx]\, V[idx]^{\top}, \qquad \partial O_{\mathrm{FP16}} / \partial x_{\mathrm{FP16}} = \delta_{O,\mathrm{FP16}}\, B_{\mathrm{FP16}} (C_{\mathrm{FP16}} + A_{\mathrm{FP16}})$$
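A sketch of how this could be assembled from the earlier pieces is shown below, using the simulated absmax quantizer from the Background sketch in place of real NF4 kernels (the class name QLoGELinear is ours; the point is that the backward pass touches only B, A, and C, never the quantized weight):

```python
import torch

class QLoGELinear(torch.nn.Module):
    """LoGE on a quantized base weight: dequantization happens only in the forward pass."""
    def __init__(self, W0: torch.Tensor, rank: int):
        super().__init__()
        scale, W_q = quantize(W0.half())                  # helpers from the Background sketch
        self.register_buffer("scale", scale)
        self.register_buffer("W_q", W_q)
        W_hat = dequantize(scale, W_q).float()            # simulated weight used for the SVD
        U, S, Vh = torch.linalg.svd(W_hat, full_matrices=False)
        self.register_buffer("B", U[:, :rank].contiguous())
        self.register_buffer("A", (S[:rank].unsqueeze(1) * Vh[:rank]).contiguous())
        self.C = torch.nn.Parameter(torch.zeros_like(self.A))

    def forward(self, x):
        W_hat = dequantize(self.scale, self.W_q).float()  # single dequantization, forward only
        return LoGEFunction.apply(x, W_hat, self.B, self.A, self.C)
```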

5. Experiment

5.1. Implementation Details

We conducted experiments to evaluate the effectiveness of our proposed method and compared it with representative methods.
Datasets and Models: We initially conducted experiments on small models, including OPT-1.3B and OPT-2.7B [34], fine-tuning them on complete conversations. We used two distinct datasets: Self-Instruct [35] and Alpaca [36]. Following the benchmark in QLoRA [23], we also report the five-shot performance of the fine-tuned models on the Massive Multitask Language Understanding (MMLU) task [37]. To evaluate the efficiency and effectiveness of QLoGE, we scaled up our experiments to include larger models such as LLaMA2-7B and LLaMA2-13B [5], using two distinct datasets: OASST1 [38] and Alpaca [36]. The Self-Instruct dataset [35], generated by GPT-3, contains approximately 52K instructions. The OASST1 dataset [38] comprises human-created and annotated dialogues from ChatGPT-like assistants, featuring 161,443 messages in 35 languages and 461,292 quality assessments. The Alpaca dataset [36] includes 52,000 instructions and demonstrations generated using OpenAI's text-davinci-003 engine.
To evaluate LoGE on more complex tasks, we fine-tuned LLaMA2-7B [6] on the MetaMathQA dataset [39] to assess mathematical problem-solving capabilities on the GSM8K [40] validation set. Additionally, the models were fine-tuned on the CodeFeedback dataset [41] and evaluated for coding proficiency using the Human-eval dataset [42]. Furthermore, training was extended to the WizardLM-Evol-Instruct dataset [43], with subsequent testing for conversational abilities on the MT-Bench dataset [44]. The GSM8K dataset [40] consists of 8.5K high-quality elementary school math problems, all authored by human writers. The MT-Bench dataset [44] contains 3.3K expert-level pairwise human preferences for model responses generated by six models in response to 80 MT-Bench questions. The Human-eval dataset [42] comprises 164 original programming problems evaluating language comprehension, algorithms, and basic mathematics, some of which resemble simple software interview questions. These experiments were conducted using subsets containing 100K data points and were limited to one epoch of training to minimize overhead.
To evaluate LoGE alongside other structured transformer-based large language models, we fine-tuned RoBERTa-large [2] on a subset of the GLUE benchmark [45], including MNLI, QNLI, SST-2, and MRPC.
Training Details: For the OPT models trained on the Alpaca and Self-Instruct datasets, we mostly adhered to the hyperparameters used in QLoRA: the Adam optimizer with a learning rate of 8 × 10⁻⁵, moving-average betas of [0.9, 0.999], no weight decay, and a constant learning rate scheduler. For the LLaMA models trained on the Alpaca and Self-Instruct datasets, we also mostly adhered to the hyperparameters used in QLoRA: the Adam optimizer with a learning rate of 2 × 10⁻⁴, moving-average betas of [0.9, 0.999], no weight decay, and a constant learning rate scheduler with a warmup ratio of 0.3. For fine-tuning LLMs on complex tasks, we used the Adam optimizer with a learning rate of 2 × 10⁻⁵, moving-average betas of [0.9, 0.999], no weight decay, and a cosine learning rate scheduler with a warmup ratio of 0.3. The hyperparameters for RoBERTa-large were almost the same as those used in LoRA [13] and are detailed in Table 2.
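These settings translate directly into a few lines of PyTorch; the sketch below is generic (the stand-in model and step count are placeholders, not the authors' training script) and mirrors the LLaMA configuration of Adam at 2 × 10⁻⁴ with a constant schedule after a 0.3 warmup ratio:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)        # placeholder for a LoGE-wrapped model
total_steps = 1000               # placeholder; normally len(dataloader) * epochs

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad],   # only the trainable low-rank matrices
    lr=2e-4, betas=(0.9, 0.999), weight_decay=0.0,
)
warmup_steps = int(0.3 * total_steps)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: min(1.0, (step + 1) / max(1, warmup_steps)),  # linear warmup, then constant
)
```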
For smaller models like RoBERTa-large [2] and OPT [34], we applied LoGE only to the q-proj and v-proj layers to preserve model performance. For larger models like LLaMA-7B, we extended the use of LoGE to all linear layers within the attention module, as larger models exhibit greater robustness to model compression.
Implementation: All experiments were conducted on RTX 3090 GPUs (NVIDIA, Santa Clara, CA, USA) using PyTorch (v2.2.0). To ensure reproducibility, we provide prototype code in the Supplementary Materials.

5.2. Main Results

5.2.1. Results on OPT Models

We first applied LoGE to fine-tune the OPT models. Given that the MMLU results for OPT-1.3B and OPT-2.7B were only about 25%, we used the perplexity on the fine-tuning datasets for comparison. Table 3 summarizes the results with respect to different model sizes and fine-tuning datasets. For models fine-tuned on Alpaca, the performance was almost identical for LoRA and LoGE, with OPT-1.3B performing slightly better with LoGE. Consequently, we included the Self-Instruct dataset for further evaluation. The memory usage of LoGE was slightly higher than that of LoRA, as LoGE needs to store three low-rank matrices A, B, and C. Since only one low-rank matrix is trainable, the optimizer state for LoGE is not significantly larger than that of LoRA. Additionally, we applied LoGE to only a few layers to preserve model performance; therefore, LoGE did not reach the ideal speedup ratio.

5.2.2. Results on Complex Tasks

We then fine-tuned LLaMA2-7B on more complex tasks, with results detailed in Table 4. According to these results, our LoGE method achieved superior performance with reduced training times compared to LoRA. The performance improvements were consistent across various models, demonstrating the versatility of our approach. The training loss plot in Figure 2 shows that LoGE converged faster than LoRA in the initial steps and maintained a smaller training loss throughout. This accelerated convergence can be attributed to the fact that while the parameters of LoRA were randomly initialized and required several steps to establish convergence directions, LoGE leveraged the gradients of the low-rank approximation of weights, which had already been aligned with the correct convergence direction. We also compared our method with two LoRA variants [46,47] that focus on enhancing model performance. While our approach lagged slightly in performance, it demonstrated a significant reduction in training time. This is primarily because our method prioritizes training efficiency by compressing pre-trained weights, which inevitably introduces some gradient errors.

5.2.3. Results on GLUE Tasks

We then fine-tuned RoBERTa-large on a subset of GLUE, with results detailed in Table 5. Because smaller models are harder to compress [48], and RoBERTa-large has about 350 M parameters, far fewer than the models above, we kept the rank at 256 to maintain a sufficient compression ratio. Compared to LoRA, LoGE achieved a slight speedup with a marginal increase in memory usage, because a rank of 256 is relatively large for RoBERTa-large given its hidden state size of 1024. The slight increase in training time for SST-2 was due to gradient accumulation, as SST-2 used eight accumulation steps compared to four for the other tasks. Longer inputs in tasks like SST-2 and MRPC also reduced the speedup ratio, as attention calculations became more time-intensive. Finally, LoGE achieved better performance on smaller subsets like MRPC, because LoGE is initialized with the SVD of the pre-trained weight, providing an advantageous starting point and the correct convergence direction.

5.2.4. QLoGE Performance Comparison

We applied our QLoGE method to fine-tune LLaMA models and evaluated their performance using the MMLU benchmark. As summarized in Table 6, QLoGE achieves comparable accuracy to QLoRA, with up to a 1.3× speedup. Additionally, there is an increase in memory usage—about 20% for LLaMA-7B and 10% for LLaMA-13B. This increase is due to the higher rank used in QLoGE compared to QLoRA. A higher rank is necessary in LoGE to avoid performance degradation, as a smaller rank can affect gradient accuracy. Combining quantization with low-rank updates further amplifies the gradient gap, leaving this challenge as an open question for future research. Furthermore, our method reduces the number of trainable parameters and eliminates the significant overhead associated with dequantizing weights for computation, offering another considerable advantage.

5.3. Ablation Studies

The performance reported in this subsection is assessed using perplexity (PPL), where lower values indicate better model performance. For efficient evaluation, we fine-tuned the models with one epoch for comparison in this section.

5.3.1. The Impact of Different α and Different Datasets on Model Performance

We employed a temperature parameter, α, to combine two normalized importance scores: the magnitude of the singular values and the contribution of the gradient. To explore how different values of α (ranging from 0.1 to 10) affect model performance, we conducted tests across various datasets using the OPT-1.3B model; the results are reported in Table 7. Our findings reveal that models perform optimally when initialized with the top-K largest singular values. This supports the findings of Hu et al. [13], which noted a significant overlap with the top singular vectors in LoRA. Interestingly, relying solely on the gradient contribution for subspace selection resulted in suboptimal performance, diverging from the results observed in Child Tuning [18] and SPT [19]. This could be because our method relies on low-rank matrices to estimate input gradients, and omitting the subspaces with the largest singular values introduces significant estimation errors. Remarkably, we found α = 0.3 to be the most effective setting for both datasets tested, which suggests that the fusion method and this particular hyperparameter setting are generally applicable rather than dataset-specific.

5.3.2. The Time Cost of SVD

The time cost of SVD is detailed in Table 8. Relative to the total training duration, the time added by SVD is minimal, constituting approximately 1.53% and 3.11% of the overall fine-tuning times. Including the time for SVD, LoGE still achieves approximately a 20% speedup compared to LoRA.

5.3.3. Different Selection for LoGE

To further evaluate the optimal rank r for LoGE and determine the best insertion points, we conducted two different sets of experiments. The first set involved varying the rank r to observe its influence on model performance and acceleration ratio. The second set applied LoGE at different points within the model to assess the limitations of the low-rank estimator in maintaining performance.
Rank r choice for LoGE. We varied the rank r from 16 to 512 for the OPT-1.3B model on the Self-Instruct dataset; the results are shown in Table 9. Unlike typical Parameter-Efficient Fine-Tuning (PEFT) methods, in our approach, the rank r directly influences the accuracy of the gradient estimation. Our results show that using the same low rank as LoRA, such as r = 16, can lead to significant gradient estimation errors, although the training runtime remains comparable. We applied LoGE specifically to W_q and W_v. As r increased, the computational demands of LoGE surpassed those of LoRA, especially when r reached 512, at which point the computational complexity approached O(d²). To strike an optimal balance between trainable parameters, training time, and performance, we selected r = d/16 for our studies.
Linear layer choice for LoGE. We varied the weights within the model to which LoGE was applied. Specifically, we avoided applying LoGE to the weights in the Feed-Forward Network (FFN) due to their dimensional scaling. The number of trainable parameters changed depending on which weights LoGE was applied to; applying LoGE to more layers resulted in more compressed gradients, but we did not adjust the rank r as the number of layers increased. The results are listed in Table 10. Our findings align closely with existing PEFT methods [10,13]. Applying LoGE only to W_q and W_k significantly reduced performance. While the performance of W_v and W_o was closer to that of {W_q, W_v}, the increase in trainable parameters from using LoGE still resulted in a larger performance gap than reported in [13], which may be due to the difference in the number of trainable parameters between the methods. Additionally, making W_v trainable increased the training time by approximately 1.1×, likely because its gradients enter the attention calculation. Our results suggest that although increasing the number of trainable parameters can enhance model performance, excessive compression of the gradients is detrimental. Fine-tuning with just two trainable weight types, W_q and W_v, appears sufficient, indicating that a balance is crucial for optimal fine-tuning efficiency.

6. Limitations and Future Work

One major limitation of our method is its reduced effectiveness on small models (with fewer than 1 billion parameters), as these models are more significantly affected by low-rank gradient compression. Future work could explore several directions:
  • Application to training from scratch: Investigating how this method can be integrated into training from scratch, including addressing changes in the subspace during training and fusing the low-rank update branch into the model weights.
  • Improving performance on small models: Developing techniques to enhance the accuracy and effectiveness of the method when applied to smaller models.
  • Adapting to edge devices: Exploring how to implement this approach on edge devices to enable lifelong learning, including designing systems for efficient fine-tuning tailored to resource-constrained environments.

7. Conclusions

In this paper, we propose the Low-Rank Gradient Estimator (LoGE), an efficient method for fine-tuning LLMs. LoGE incorporates model compression into low-rank weight update approximations, performing low-rank approximations of pre-trained weights during backward computation while preserving the full-rank pre-trained weights during forward computation. This approach reduces the computational cost of training a linear layer by approximately 60%. Experiments across various models and tasks demonstrated that LoGE not only maintained performance but also significantly accelerated the fine-tuning process by about 20%. Additionally, LoGE can be combined with quantization methods, reducing the training time for quantization-based efficient fine-tuning by approximately 30% without any performance degradation. LoGE advances the field of fine-tuning on edge devices and holds significant practical implications for reducing the carbon footprint of deep learning.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app15010082/s1.

Author Contributions

Conceptualization, L.Z.; methodology, L.Z.; software, L.Z.; validation, L.Z. and Z.L.; formal analysis, L.Z.; investigation, L.Z.; resources, H.Z.; data curation, L.Z.; writing—original draft preparation, L.Z.; writing—review and editing, C.Y.; visualization, Y.Y.; supervision, Z.L.; project administration, H.Z.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (2022YFC3602601) and by the Experimental Technology Research Project of Zhejiang University (SYBJS202314).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request. A prototype project of LoGE is provided in the Supplementary Materials.

Conflicts of Interest

All authors were employed by the company Hangzhou Yihe Hui Sheng. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LoGE   Low-Rank Gradient Estimator
LoRA   Low-Rank Adapter
PEFT   Parameter-Efficient Fine-Tuning
LLM    Large Language Model

References

  1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar] [CrossRef]
  2. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  3. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen technical report. arXiv 2023, arXiv:2309.16609. [Google Scholar]
  4. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
  5. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  6. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  7. Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv 2023, arXiv:2303.12712. [Google Scholar]
  8. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. [Google Scholar]
  9. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799. [Google Scholar]
  10. Guo, D.; Rush, A.M.; Kim, Y. Parameter-efficient transfer learning with diff pruning. arXiv 2020, arXiv:2012.07463. [Google Scholar]
  11. Pfeiffer, J.; Kamath, A.; Rücklé, A.; Cho, K.; Gurevych, I. AdapterFusion: Non-destructive task composition for transfer learning. arXiv 2020, arXiv:2005.00247. [Google Scholar]
  12. Zaken, E.B.; Ravfogel, S.; Goldberg, Y. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv 2021, arXiv:2106.10199. [Google Scholar]
  13. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022. [Google Scholar]
  14. Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4582–4597. [Google Scholar] [CrossRef]
  15. Qin, G.; Eisner, J. Learning how to ask: Querying LMs with mixtures of soft prompts. arXiv 2021, arXiv:2104.06599. [Google Scholar]
  16. Zhang, Q.; Chen, M.; Bukharin, A.; He, P.; Cheng, Y.; Chen, W.; Zhao, T. Adaptive budget allocation for parameter-efficient fine-tuning. In Proceedings of the The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  17. Zhang, L.; Zhang, L.; Shi, S.; Chu, X.; Li, B. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv 2023, arXiv:2308.03303. [Google Scholar]
  18. Xu, R.; Luo, F.; Zhang, Z.; Tan, C.; Chang, B.; Huang, S.; Huang, F. Raise a child in large language model: Towards effective and generalizable fine-tuning. arXiv 2021, arXiv:2109.05687. [Google Scholar]
  19. He, H.; Cai, J.; Zhang, J.; Tao, D.; Zhuang, B. Sensitivity-aware visual parameter-efficient fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 11825–11835. [Google Scholar]
  20. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  21. Zhou, A.; Wang, K.; Lu, Z.; Shi, W.; Luo, S.; Qin, Z.; Lu, S.; Jia, A.; Song, L.; Zhan, M.; et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv 2023, arXiv:2308.07921. [Google Scholar]
  22. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  23. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2024, 36, 10088–10101. [Google Scholar]
  24. Xu, Y.; Xie, L.; Gu, X.; Chen, X.; Chang, H.; Zhang, H.; Chen, Z.; Zhang, X.; Tian, Q. Qa-lora: Quantization-aware low-rank adaptation of large language models. arXiv 2023, arXiv:2309.14717. [Google Scholar]
  25. He, Y.; Liu, J.; Wu, W.; Zhou, H.; Zhuang, B. Efficientdm: Efficient quantization-aware fine-tuning of low-bit diffusion models. arXiv 2023, arXiv:2310.03270. [Google Scholar]
  26. Li, Y.; Liang, Y.; Risteski, A. Recovery guarantee of weighted low-rank approximation via alternating minimization. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 2358–2367. [Google Scholar]
  27. Cai, J.F.; Candès, E.J.; Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 2010, 20, 1956–1982. [Google Scholar] [CrossRef]
  28. Li, Y.; Ma, T.; Zhang, H. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Proceedings of the Conference On Learning Theory, PMLR, Stockholm, Sweden, 6–9 July 2018; pp. 2–47. [Google Scholar]
  29. Grasedyck, L.; Kressner, D.; Tobler, C. A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen 2013, 36, 53–78. [Google Scholar] [CrossRef]
  30. Oymak, S.; Fabian, Z.; Li, M.; Soltanolkotabi, M. Generalization guarantees for neural networks via harnessing the low-rank structure of the jacobian. arXiv 2019, arXiv:1906.05392. [Google Scholar]
  31. Lialin, V.; Muckatira, S.; Shivagunde, N.; Rumshisky, A. ReLoRA: High-Rank Training Through Low-Rank Updates. In Proceedings of the Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  32. Zhao, J.; Zhang, Z.; Chen, B.; Wang, Z.; Anandkumar, A.; Tian, Y. Galore: Memory-efficient llm training by gradient low-rank projection. arXiv 2024, arXiv:2403.03507. [Google Scholar]
  33. Frantar, E.; Alistarh, D. Sparsegpt: Massive language models can be accurately pruned in one-shot. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 10323–10337. [Google Scholar]
  34. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. Opt: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
  35. Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-instruct: Aligning language model with self generated instructions. arXiv 2022, arXiv:2212.10560. [Google Scholar]
  36. Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Stanford Alpaca: An Instruction-Following LLaMA Model. 2023. Available online: https://github.com/tatsu-lab/stanford_alpaca (accessed on 13 March 2023).
  37. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  38. Köpf, A.; Kilcher, Y.; von Rütte, D.; Anagnostidis, S.; Tam, Z.R.; Stevens, K.; Barhoum, A.; Nguyen, D.; Stanley, O.; Nagyfi, R.; et al. Openassistant conversations-democratizing large language model alignment. Adv. Neural Inf. Process. Syst. 2024, 36, 47669–47681. [Google Scholar]
  39. Yu, L.; Jiang, W.; Shi, H.; Yu, J.; Liu, Z.; Zhang, Y.; Kwok, J.T.; Li, Z.; Weller, A.; Liu, W. Metamath: Bootstrap your own mathematical questions for large language models. arXiv 2023, arXiv:2309.12284. [Google Scholar]
  40. Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training Verifiers to Solve Math Word Problems. arXiv 2021, arXiv:2110.14168. [Google Scholar]
  41. Zheng, T.; Zhang, G.; Shen, T.; Liu, X.; Lin, B.Y.; Fu, J.; Chen, W.; Yue, X. OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. arXiv 2024, arXiv:2402.14658. [Google Scholar]
  42. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. Available online: http://arxiv.org/abs/2107.03374 (accessed on 23 December 2024).
  43. Xu, C.; Sun, Q.; Zheng, K.; Geng, X.; Zhao, P.; Feng, J.; Tao, C.; Lin, Q.; Jiang, D. WizardLM: Empowering large pre-trained language models to follow complex instructions. In Proceedings of the The Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  44. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Adv. Neural Inf. Process. Syst. 2024, 36, 46595–46623. [Google Scholar]
  45. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  46. Kalajdzievski, D. A rank stabilization scaling factor for fine-tuning with lora. arXiv 2023, arXiv:2312.03732. [Google Scholar]
  47. Meng, F.; Wang, Z.; Zhang, M. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv 2024, arXiv:2404.02948. [Google Scholar]
  48. McCarley, J.; Chakravarti, R.; Sil, A. Structured pruning of a bert-based question answering model. arXiv 2019, arXiv:1910.06360. [Google Scholar]
Figure 1. Comparison between different fine-tuning strategies. (a) Full-Parameter Fine-Tuning, (b) LoRA, and (c) LoGE. By comparison, our architecture is more efficient by decomposing the weight during backpropagation.
Figure 2. Training loss of the LLaMA2-7B model fine-tuned on the MetaMathQA dataset over one epoch.
Table 1. Comparative analysis of training techniques: Naive, LoRA, GaLore, and LoGE. Assume W R m × n , x R b × n , and rank r.
| | Naive | GaLore | LoRA | LoGE |
|---|---|---|---|---|
| Weights | mn | mn | mn + mr + nr | mn + mr + 2nr |
| Gradients | mn | mn | mr + nr | nr |
| Optim States | 2mn | mr + 2nr | 2mr + 2nr | 2nr |
| Memory Usage | 4mn | 2mn + mr + 2nr | mn + 4mr + 4nr | mn + mr + 5nr |
| FLOPs (forward) | 2bmn | 2bmn | 2bmn + 2bmr + 2bnr | 2bmn + 2bmr + 2bnr |
| FLOPs (backward) | 4bmn | 4bmn | 2bmn + 4bmr + 4bnr | 2bmr + 4bnr + nr |
| FLOPs (total) | 6bmn | 6bmn | 4bmn + 6bmr + 6bnr | 2bmn + 4bmr + 6bnr + nr |
Table 2. Hyperparameters of our methods on GLUE. BS means batch size and GA means gradient accumulation.
| Dataset | Epoch | BS | GA | LR |
|---|---|---|---|---|
| MNLI | 10 | 8 | 4 | 3 × 10⁻⁴ |
| SST-2 | 10 | 8 | 8 | 4 × 10⁻⁴ |
| QNLI | 10 | 8 | 4 | 2 × 10⁻⁴ |
| MRPC | 20 | 8 | 4 | 3 × 10⁻⁴ |
Table 3. PPL on evaluation datasets across OPT models: comparing sizes and datasets. Experiments were run with one RTX3090. The symbol “↓” indicates that lower values are better.
| Model | Dataset | Method | Rank | Memory | Metric (PPL ↓) | Runtime |
|---|---|---|---|---|---|---|
| OPT-1.3B | Self-Instruct | LoRA | 128 | 3697 MiB | 2.73 | 1.98 h |
| OPT-1.3B | Self-Instruct | GaLore | 128 | 3893 MiB | 2.99 | 1.59 h |
| OPT-1.3B | Self-Instruct | LoGE | 256 | 3769 MiB | 2.71 | 1.53 h |
| OPT-1.3B | Alpaca | LoRA | 128 | 3715 MiB | 4.82 | 2.05 h |
| OPT-1.3B | Alpaca | GaLore | 128 | 3976 MiB | 5.00 | 1.71 h |
| OPT-1.3B | Alpaca | LoGE | 256 | 3713 MiB | 4.81 | 1.63 h |
| OPT-2.7B | Self-Instruct | LoRA | 128 | 6834 MiB | 2.44 | 2.71 h |
| OPT-2.7B | Self-Instruct | GaLore | 128 | 7100 MiB | 2.67 | 2.32 h |
| OPT-2.7B | Self-Instruct | LoGE | 256 | 6629 MiB | 2.43 | 2.09 h |
| OPT-2.7B | Alpaca | LoRA | 128 | 6808 MiB | 4.36 | 2.90 h |
| OPT-2.7B | Alpaca | GaLore | 128 | 7209 MiB | 4.55 | 2.54 h |
| OPT-2.7B | Alpaca | LoGE | 256 | 6969 MiB | 4.37 | 2.19 h |
Table 4. Results of fine-tuning Llama 2-7B using LoRA, LoRArs, PiSSA, and LoGE, tested on MT-Bench, GSM8K, and Human-eval. Experiments were run with two RTX3090.
| Dataset | Method | Rank | Memory | Metric | Runtime |
|---|---|---|---|---|---|
| MT-Bench | LoRA | 128 | 26,894 MiB | 5.61 | 5.70 h |
| MT-Bench | rsLoRA | 128 | 26,928 MiB | 5.25 | 5.58 h |
| MT-Bench | PiSSA | 128 | 26,986 MiB | 5.30 | 5.95 h |
| MT-Bench | LoGE | 256 | 26,680 MiB | 5.47 | 4.18 h |
| GSM8K | LoRA | 128 | 21,648 MiB | 42.08 | 7.18 h |
| GSM8K | rsLoRA | 128 | 22,744 MiB | 51.55 | 7.96 h |
| GSM8K | PiSSA | 128 | 21,904 MiB | 52.31 | 7.73 h |
| GSM8K | LoGE | 256 | 22,454 MiB | 48.74 | 5.20 h |
| Human-Eval | LoRA | 128 | 26,634 MiB | 14.76 | 10.95 h |
| Human-Eval | rsLoRA | 128 | 26,618 MiB | 18.29 | 11.72 h |
| Human-Eval | PiSSA | 128 | 25,786 MiB | 18.76 | 11.61 h |
| Human-Eval | LoGE | 256 | 26,476 MiB | 16.46 | 9.17 h |
Table 5. Results of fine-tuning RoBerta-large using Full-FT, LoRA, and LoGE on a subset of GLUE. Experiments were run with one RTX3090. The symbol “↑” indicates that upper values are better.
| Dataset | Method | Rank | Memory | Acc. ↑ | Runtime |
|---|---|---|---|---|---|
| MNLI | FT | - | 10,710 MiB | 90.2% | 12.25 h |
| MNLI | LoRA | 8 | 4636 MiB | 90.6% | 8.54 h |
| MNLI | LoGE | 256 | 5052 MiB | 89.9% | 7.96 h |
| SST-2 | FT | - | 9444 MiB | 96.4% | 1.25 h |
| SST-2 | LoRA | 8 | 4118 MiB | 96.2% | 1.22 h |
| SST-2 | LoGE | 256 | 4486 MiB | 95.7% | 1.37 h |
| QNLI | FT | - | 16,488 MiB | 94.7% | 3.25 h |
| QNLI | LoRA | 8 | 7564 MiB | 94.9% | 2.48 h |
| QNLI | LoGE | 256 | 8566 MiB | 94.3% | 2.21 h |
| MRPC | FT | - | 10,194 MiB | 90.9% | 0.22 h |
| MRPC | LoRA | 8 | 5686 MiB | 90.1% | 0.16 h |
| MRPC | LoGE | 256 | 6032 MiB | 91.1% | 0.17 h |
Table 6. Five-shot MMLU benchmark accuracy across LLaMA models: comparing sizes and datasets. The symbol “↑” indicates that upper values are better.
| Model | Dataset | Method | Rank | Memory | Metric (Acc. ↑) | Runtime |
|---|---|---|---|---|---|---|
| LLaMA-7B | OASST1 | QLoRA | 64 | 7836 MiB | 33.9% | 4.33 h |
| LLaMA-7B | OASST1 | QLoGE | 256 | 9390 MiB | 34.0% | 3.41 h |
| LLaMA-7B | Alpaca | QLoRA | 64 | 7836 MiB | 38.4% | 22.72 h |
| LLaMA-7B | Alpaca | QLoGE | 256 | 9462 MiB | 38.6% | 17.67 h |
| LLaMA-13B | OASST1 | QLoRA | 64 | 13,258 MiB | 45.5% | 7.06 h |
| LLaMA-13B | OASST1 | QLoGE | 320 | 14,542 MiB | 45.1% | 5.87 h |
| LLaMA-13B | Alpaca | QLoRA | 64 | 13,986 MiB | 47.8% | 37.65 h |
| LLaMA-13B | Alpaca | QLoGE | 320 | 14,604 MiB | 47.6% | 31.34 h |
Table 7. Perplexity for different α on different datasets for OPT-1.3B. A value of 0 means using only the magnitude of the singular values, and ∞ means using only the contribution to the gradient. The symbol "↓" indicates that lower values are better.
| PPL ↓ | α = 0 | α = 0.1 | α = 0.3 | α = 1 | α = 3 | α = 10 | α = ∞ |
|---|---|---|---|---|---|---|---|
| Self-Instruct | 4.18 | 4.15 | 4.15 | 4.18 | 4.16 | 4.19 | 4.69 |
| Alpaca | 7.53 | 7.55 | 7.52 | 7.56 | 7.56 | 7.54 | 7.63 |
Table 8. Time cost of SVD and the total fine-tuning process.
| Model | SVD (s) | Fine-Tuning (s) | Total (s) | Ratio |
|---|---|---|---|---|
| LLaMA-7B | 612 | 39,420 | 40,032 | 1.53% |
| LLaMA-13B | 1420 | 44,100 | 45,520 | 3.11% |
Table 9. PPL on Self-Instruct for OPT-1.3B models after applying different r values of LoGE. The symbol “↓” indicates that lower values are better.
| r | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|
| PPL ↓ | 6.09 | 5.07 | 4.57 | 4.15 | 3.82 | 3.71 |
| Runtime (h) | 1.59 | 1.57 | 1.60 | 1.60 | 1.62 | 1.72 |
Table 10. PPL on Self-Instruct for OPT-1.3B models after applying LoGE to different linear layers. The symbol “↓” indicates that lower values are better.
| Weight Type | W_q | W_k | W_v | W_o | W_q, W_k | W_q, W_v | W_q, W_k, W_v, W_o |
|---|---|---|---|---|---|---|---|
| PPL ↓ | 5.44 | 5.54 | 4.36 | 4.44 | 4.83 | 4.15 | 4.21 |
| Runtime (h) | 1.42 | 1.42 | 1.52 | 1.44 | 1.52 | 1.60 | 1.95 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
