Article

Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models

Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(4), 185; https://doi.org/10.3390/fi17040185
Submission received: 12 March 2025 / Revised: 10 April 2025 / Accepted: 18 April 2025 / Published: 21 April 2025
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

Abstract

Modern large language models (LLMs) achieve state-of-the-art performance through architectural advancements but require high computational costs for inference. Post-training quantization is a widely adopted approach to reduce these costs by quantizing weights and activations to lower precision, such as INT8. However, we identify a critical challenge in activation quantization for GLU (Gated Linear Unit) variants, which are commonly used in the feed-forward networks of modern LLMs such as the LLaMA family. Specifically, severe local quantization errors arise due to excessively large activation magnitudes, which we refer to as activation spikes, leading to significant degradation in model performance. Our analysis reveals a systematic pattern of these spikes: they predominantly occur in the FFN (feed-forward network) layers at the early and late layers of the model and are concentrated on a small subset of tokens rather than being uniformly distributed across a token sequence. To mitigate this issue, we propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), which isolate activation spikes during quantization. Extensive experiments demonstrate that our methods effectively improve activation quantization, particularly in coarse-grained quantization schemes, enhancing the performance of LLMs with GLU variants and addressing the limitations of existing quantization techniques. The code for implementing our methods and reproducing the experiments is publicly available in our GitHub repository.

1. Introduction

Large language models (LLMs) have become a key paradigm in natural language processing, accelerating the release of model variants within the community [1,2]. Furthermore, the latest LLMs establish state-of-the-art performance by training at increased scale, as well as by adopting architectural improvements such as GLU [3], RoPE [4], GQA [5], and MoE [6]. In particular, GLU (Gated Linear Unit) variants (e.g., SwiGLU, GeGLU), which modulate hidden representations through element-wise gating between parallel linear transformations, have been adopted in most modern LLM architectures (e.g., the LLaMA family [7]) due to their training efficiency [3,8]. Although LLMs broaden the foundational capabilities of natural language tasks and their potential for various applications, the billions of parameters in these large models impose considerable computational costs on end users in practice. To reduce GPU memory requirements and accelerate inference, post-training quantization (PTQ), which reduces the bit-widths of weights and activations after training without updating parameters, offers an affordable solution by quantizing them to a lower precision (e.g., INT8) without expensive retraining [9,10,11]. However, recent studies have revealed that the activations of LLMs contain values of large magnitude at certain coordinates, often called outliers, which pose a key challenge in activation quantization [12,13,14,15]. Another line of work attempts to explain the role of outlier values in the attention mechanism [16,17].
In this paper, we present our discovery that the GLU architecture in the feed-forward network (FFN) generates excessively large activation values, which are responsible for significant local quantization errors. Specifically, we observe that these problematic activation values occur in specific linear modules and are concentrated on a small number of tokens, which will be discussed in Section 3. To distinguish these excessive GLU activations from the outliers, we refer to them as activation spikes. In light of our observations, we propose two empirical methods to mitigate the impact of activation spikes on quantization: Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP). QFeM partially excludes quantization for the linear layers (or modules) where large quantization errors occur, instead of quantizing every linear module in the LLM. By scoring the extent of scale disparity, QFeM selects the linear modules to exclude. On the other hand, QFeP identifies the prefix that triggers activation spikes and stores its context using a key–value (KV) cache. This cache mechanism retains previously computed attention keys and values, allowing the model to bypass redundant computations during decoding. Notably, both QFeM and QFeP rely on calibration results to capture activation spikes in advance, without any modifications to the target LLM. Thus, our methods can be integrated into any existing quantization method, such as those of Xiao et al. [13] and Wei et al. [15].
In our comprehensive experiments, we demonstrate that recently released LLMs incorporating GLU variants struggle with activation spikes when activation quantization is applied. Consequently, the proposed methods, QFeM and QFeP, substantially enhance the performance of the primitive round-to-nearest (RTN) quantization method, which quantizes activations by rounding each value to the closest discrete level based on fixed scaling. Furthermore, we observe that current outlier alleviation methods [13,15] and state-of-the-art quantization methods [18,19] remain exposed to the activation spikes and benefit from our proposed methods. Compared to the strong baseline of fine-grained activation quantization [20], our methods show competitive performance while achieving reduced latency and memory footprint. In summary, the contributions of our work are as follows:
  • We find that the GLU architecture in modern LLMs systematically generates excessive activation values, which are responsible for significant performance degradation when the entire model and input tokens are quantized.
  • Based on our observations, we propose two empirical methods, QFeM and QFeP, which effectively exclude the activation spikes during quantization, with negligible computational overhead and compatibility with any existing quantization techniques.
  • Our experiments validate the detrimental impact of the activation spikes on activation quantization, while the proposed methods consistently enhance the quantization performance.

2. Related Works

2.1. Outlier Values in LLMs

Outlier values have previously been observed in transformer-based language models such as BERT [21] and early GPT [22] models through numerous studies [23,24,25,26,27]. Since the advent of LLMs [28,29] rooted in the GPT architecture [22], recent studies [12,13,14] have tackled the existence of outlier values in LLMs. These outliers exhibit large magnitudes at the dimensions of hidden states shared across tokens. More recently, Bondarenko et al. [16] and Sun et al. [17] explained that the outliers are attributed to the vertical pattern in the attention mechanism [30,31], which influences the performance of LLMs. In particular, Sun et al. [17] identify a different type of outlier existing in the hidden states of specific tokens. However, prior studies have focused only on the hidden states between the decoder layers. Our work provides a module-level investigation at the points where quantization is applied in practice, focusing on different LLM architectures.

2.2. Post-Training Quantization for LLMs

Post-training quantization (PTQ) refers to the quantization of a neural network to low precision, such as INT8, without additional parameter updates [9,10]. Especially for LLMs, this approach cost-effectively achieves inference with low memory usage and reduced latency by quantizing the weights and activations used in matrix multiplication (e.g., linear layers). However, because of the challenges in the activation quantization of LLMs, many recent works have mainly focused on weight-only quantization [19,32,33,34,35,36,37]. Activation quantization, in contrast, must contend with the inherent outliers, which hinder accurate quantization by reducing the representation resolution. To address this challenge, Dettmers et al. [12] proposed a mixed-precision quantization method in which the outlier dimensions are computed in high precision. Xiao et al. [13], Wei et al. [15], and Shao et al. [19] migrate the scale from activations to weights to alleviate the scale of outlier activations. Most similar to our work, Liu et al. [38] and Son et al. [39] utilized a prompt to improve quantization accuracy. While they focused on preserving the attention sink [30] pattern, our work uncovers how the prefix acts on GLU activations and proposes an efficient prefix length that is sufficient for handling activation spikes.

3. Activation Spikes: Excessive Magnitude of GLU Activations

For clarity, in the remainder of this paper, “hidden states” refer to the output tensor of a transformer layer (or block), while “input activations” or “activations” denote the input tensor of a linear layer (or module). Recent work [17] has investigated a novel type of outlier existing in the hidden states of modern LLMs. Although these hidden-state outliers play a crucial role in the attention mechanism [16,17,30], their relationship with the input activations relevant to quantization has not been fully explored. Importantly, because recent LLMs adopt Pre-LN [40,41], which normalizes hidden states before the self-attention and feed-forward network (FFN) blocks, the scale of the hidden states does not reflect the scale of the input activations within the transformer block. Therefore, we focus on the input activations fed into each linear module within the transformer block, as these are what activation quantization operates on. Specifically, we examine the four linear (projection) layers: query (parallel to key and value), out, up (parallel to gate), and down modules. For details of the Pre-LN transformer, please see Appendix A.

3.1. Existence of Activation Spikes in GLU Variants

To analyze the input activations, we employed a calibration method, which estimates the quantization factors such as the scale and zero-point. For the calibration data, we used 512 samples randomly collected from the C4 [42] training dataset. We then fed each sample into the LLM and monitored each hidden state and input activation through the decoder layers. To estimate the scale factor, we used the absolute maximum value. The tested LLMs are listed in Table A1.
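A minimal sketch of this calibration step is given below, assuming a Hugging Face LLaMA-style checkpoint; the module-name suffixes (q_proj, o_proj, up_proj, down_proj) follow the transformers naming of the query, out, up, and down projections, and the model name and C4 sampling are placeholders rather than the exact setup used in the paper.

```python
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"          # placeholder: any GLU-based LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Token-wise absolute-maximum input scales, keyed by linear-module name.
token_scales = defaultdict(list)

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0]                            # (batch, seq_len, hidden)
        s = x.abs().amax(dim=-1)                 # per-token abs-max scale
        token_scales[name].append(s.detach().float().cpu())
    return hook

handles = []
for name, module in model.named_modules():
    # The query/out/up/down projections examined in Section 3.
    if name.endswith(("q_proj", "o_proj", "up_proj", "down_proj")):
        handles.append(module.register_forward_hook(make_hook(name)))

calib_texts = ["..."]                            # placeholder: 512 samples from C4
with torch.no_grad():
    for text in calib_texts:
        ids = tok(text, return_tensors="pt").input_ids.to(model.device)
        model(ids)

for h in handles:
    h.remove()
```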

3.1.1. GLU-Implemented LLMs Exhibit Activation Spikes at Specific Layers

In Figure 1a, we display the calibrated scale factors for the LLMs that implement GLU variants (e.g., SwiGLU, GeGLU). Across models, we observe a shared scale pattern. Within the early and late layers, the down modules in the FFN show noticeably large input activation magnitudes. Note that these input activations are derived from the Hadamard product within the GLU [3]. Thus, the GLU variants generate activation spikes at these specific layers. Interestingly, we notice a high correlation between the emergence of activation spikes and large-scale intermediate hidden states. This indicates that the FFN contributes to amplifying the hidden states via the addition operation in the residual connection [43]. Once the magnitude of the hidden states explodes, it persists through subsequent layers until it encounters the activation spikes at the late layers.
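To make the location of these activations concrete, the following sketch reproduces a SwiGLU feed-forward block in the style used by LLaMA-family models (the dimensions are illustrative); the Hadamard product of the gated and linear branches is exactly the input activation of the down module where the spikes are observed.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """LLaMA-style gated feed-forward block (illustrative dimensions)."""
    def __init__(self, hidden_size=4096, intermediate_size=11008):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # Hadamard product of the gated and linear branches; this tensor is the
        # *input activation* of down_proj, where the spikes appear (Section 3.1).
        glu_out = F.silu(self.gate_proj(x)) * self.up_proj(x)
        return self.down_proj(glu_out)
```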

3.1.2. Non GLU-Implemented LLMs Show Modest Scale Distribution

Figure 1b illustrates the calibration results for LLMs with the original feed-forward implementation of the transformer [44]. We observe that these LLMs also generate large-scale hidden states regardless of the GLU implementation, which corresponds to the observations in Sun et al. [17]. More importantly, our module-level results show that the scale of the hidden states does not transfer to the input activations of the inner linear modules. Instead, we reveal that the GLU variants are associated with the large hidden states and generate the activation spikes. This clarifies that the quantization challenge of GLU-implemented LLMs is concentrated in the early and late layers. Because the excessive scales of activation spikes can hinder accurate quantization, we conduct an in-depth analysis to better understand them in the following sections.

3.2. Token-Level Scale Analysis Within Activation Spikes

In the previous section, we observed the excessive scale of the input activations derived from the GLU activation. When quantizing the input activations, the variance of the input activation scales across tokens affects the quantization performance [20]. To delve into the disparity between token-wise scales in the activation spikes, we unroll them over the token sequence. Figure 2 illustrates the individual input activation scales where an activation spike appears. Given a token sequence, large input activation magnitudes are observed for only a few tokens, such as the BOS token, newline (\n), and apostrophe (′). These specific tokens coincide with the observations of Sun et al. [17], which suggest that such tokens exhibit massive values in the hidden states. Thus, the activation spike is associated with the process of assigning a special role to these tokens in later transformer layers. However, the excessive scale of a specific token hinders the estimation of the scale factor for the other tokens, for example in per-tensor quantization. Additionally, the largest scale occurs at the first instance of the token, while subsequent occurrences exhibit modest scales. This phenomenon makes quantization more complicated, as the activation spikes occur dynamically depending on the current input sequence.

3.3. Effect of Quantization on Activation Spikes

We explore the impact of local quantization errors caused by activation spikes on LLM outputs. To identify the layers where activation spikes occur, we utilize the ratio between the maximum and median values of the token-wise input activation scales instead of the maximum scale value alone. The max–median ratio for linear module m can be formulated as r(m) = max(S(m)) / median(S(m)), where S(m) denotes the token-wise input activation scales of module m ∈ M. This ratio captures the extent to which the maximum scale dominates the scales of the other tokens. For comparison, we chose the activation quantization targets as the top-4, middle-4, and bottom-4 modules ranked by the max–median ratio in descending order. Then, we evaluated the perplexity and mean-squared error (MSE) using the calibration dataset, where the MSE was computed on the last hidden states between the original (FP16) and partially quantized LLM. As shown in Table 1, quantizing only the top-4 modules degrades the LLM performance by a significant margin, while the other cases exhibit negligible performance changes. We consider these quantization-sensitive input activations (inter alia the activation spikes) to be the quantization bottleneck, which in this paper refers to the quantization error caused by outliers.
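Reusing the token-wise scales gathered by the hypothetical calibration sketch in Section 3.1 (the `token_scales` dictionary), the max–median ratio can be computed per module as follows.

```python
import torch

def max_median_ratio(token_scales):
    """Compute r(m) = max(S(m)) / median(S(m)) from token-wise input scales."""
    ratios = {}
    for name, chunks in token_scales.items():
        s = torch.cat([c.flatten() for c in chunks])   # all token-wise scales S(m)
        ratios[name] = (s.max() / s.median()).item()
    return ratios

# Modules with the largest ratios are the quantization bottlenecks (Table 1), e.g.:
# ratios = max_median_ratio(token_scales)
# top4 = sorted(ratios, key=ratios.get, reverse=True)[:4]
```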
Furthermore, the activation spikes are conditioned on the specific context of the input sequence as discussed in Section 3.2. Altogether, to address the influence of activation spikes—which vary across layers due to the GLU architecture—we propose in the following sections a layer-wise selective quantization method and a prefix-based approach for mitigating quantization errors.

4. Mitigating Quantization Quality Degradation Based on the Observation

To address the quantization bottleneck, our approach is based on the deterministic occurrence patterns of activation spikes. First, we utilize the observation that bottlenecks occur at a few specific layers, which implies that naive full quantization of LLMs is affected by these bottlenecks. Second, we exploit the phenomenon that the activation spike is triggered by the first occurrence of specific tokens. Thus, deliberately triggering the spike once in advance prevents it from recurring in subsequent tokens.

4.1. Overview of Adapting Our Techniques for Quantization

In this section, we introduce two practical methods—Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP)—that can be applied independently or jointly to mitigate activation spike-induced quantization errors. An overview of these methods is illustrated in Figure 3, which summarizes the key ideas behind QFeM and QFeP. QFeM is applied in a layer-wise manner: after selecting an LLM with a GLU-based architecture and a layer-wise quantization method, QFeM evaluates the quantization efficiency for each linear layer based on activation spike severity. Layers with strong spikes are excluded from activation quantization to reduce degradation. The scoring and selection process is detailed in Section 4.2. QFeP is a token-level method applicable regardless of GLU usage or layer-wise quantization. It identifies tokens that trigger strong spikes and preemptively inserts them as a prefix to be stored in the KV cache, effectively suppressing spikes during inference. For tested LLMs listed in Table 2, preselected prefixes can be used directly; otherwise, a prefix can be searched and applied according to the technique described in Section 4.3.

4.2. Quantization-Free Module (QFeM)

In the full quantization of an LLM, all linear layers are quantized. Among these linear layers, we propose omitting the quantization of input activations for linear layers where significant quantization errors are caused by activation spikes. Note that increasing the number of unquantized modules exhibits a trade-off between inference latency and model performance. Thus, determining which modules should be quantized (or left unquantized) is crucial to retaining the efficacy of quantization. Here, we use the max–median ratio r(m) and define a set of unquantized modules, denoted as M_unq, which contains the linear layers whose ratio r(m) is larger than the threshold α. For instance, all linear layers in M are quantized if α = ∞. For clarity, we treat sibling linear layers, such as query-key-value, as a single linear layer. To control for the impact of activation quantization only, we keep the weight parameters of unquantized linear layers in INT8 and dequantize them into FP16 during matrix multiplication with the incoming activations, operating as weight-only quantization.
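A minimal sketch of the QFeM selection rule, assuming the max–median ratios r(m) have already been computed during calibration (e.g., with the earlier `max_median_ratio` sketch).

```python
def select_unquantized_modules(ratios, alpha):
    """QFeM selection: exclude modules whose max-median ratio exceeds alpha.

    `ratios` maps module names to r(m); the returned set M_unq keeps its input
    activations in FP16 (weights stay INT8 and are dequantized on the fly).
    """
    return {name for name, r in ratios.items() if r > alpha}

# Example: alpha = float("inf") leaves M_unq empty, i.e., every module is quantized.
# m_unq = select_unquantized_modules(ratios, alpha=16.0)
```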

Optimizing the Threshold α

To calculate the activation scale ratio for each linear layer, we first gather token-wise input activation scales from the calibration examples discussed in Section 3.1. Exceptionally, for FFN experts in mixture of experts (MoE) architectures like the Mixtral model [6], calibration is performed separately. After determining these ratios, we use binary search to set the threshold value α , balancing inference latency and performance degradation. As a metric, we assess performance through perplexity measured on the same calibration examples. For example, the relationship between the threshold value α and its impact on performance is depicted in Figure 4, demonstrating how full quantization can degrade performance. Rather than fully quantizing, we identify an optimal threshold by finding the intersection of two performance curves; in Figure 4, this threshold is approximately 16. Details on the QFeM implementation are provided in Table 2.
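The sketch below illustrates the threshold search under a simplifying assumption: instead of intersecting two performance curves as in Figure 4, it accepts the largest α whose calibration perplexity stays within a budget. `calib_perplexity` and `ppl_budget` are hypothetical placeholders, not part of the released implementation.

```python
def search_alpha(ratios, calib_perplexity, ppl_budget, lo=1.0, hi=None, iters=20):
    """Binary-search the threshold alpha (Section 4.2), simplified sketch.

    Smaller alpha excludes more modules (better perplexity, slower inference);
    larger alpha quantizes more modules. `calib_perplexity(m_unq)` is a
    hypothetical helper that evaluates the partially quantized model on the
    calibration set, and `ppl_budget` is the acceptable perplexity.
    """
    hi = hi or max(ratios.values())
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        m_unq = {name for name, r in ratios.items() if r > mid}
        if calib_perplexity(m_unq) <= ppl_budget:
            lo = mid          # acceptable: try quantizing more modules
        else:
            hi = mid          # too much degradation: exclude more modules
    return lo                 # largest alpha found to satisfy the budget
```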

4.3. Quantization-Free Prefix (QFeP)

Orthogonal to QFeM, we propose the Quantization-free Prefix (QFeP), which mitigates quantization errors by precomputing the prefix (or short prompt) corresponding to activation spikes. This method is based on the observations presented in Section 3.2, which indicate that significant quantization errors result from the overestimated scale factor of the first instance within the restricted token set. Inspired by this occurrence pattern of activation spikes, we aim to construct a prefix that stabilizes the quantization scale factor of the tokens that come after it. In other words, once the prefix is fixed at the beginning, the activation spikes consistently occur within the prefix. We then employ the key–value (KV) caching mechanism to process the activation spikes in advance. In practice, the KV cache is used to optimize the decoding speed of causal language models by storing the precomputed key and value states of previous tokens [45,46]. This approach bypasses the quantization of the activation spikes while preserving the context of the prefix through the KV cache. The KV cache for the prefix is precomputed once through offline inference of the LLM without quantization. This cache is then exploited during the quantization phases, such as calibration, and at quantized inference time. The process of QFeP is illustrated in Figure 3. We verified that our prefix design is effective in mitigating quantization errors through a prefix ablation study (Section 5.4).
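A minimal sketch of the QFeP inference path is shown below, assuming the Hugging Face transformers API and the legacy tuple format of the key–value cache; `prefix_ids` is a placeholder for the searched prefix token IDs listed in Table 2.

```python
import torch

@torch.no_grad()
def precompute_prefix_cache(model, prefix_ids):
    """Run the FP16 model once over the searched prefix and keep its KV cache.

    `prefix_ids` is the [BOS, context, spike-trigger] sequence (placeholder).
    The returned cache is reused for every quantized inference, so the
    activation spikes never pass through quantized layers.
    """
    out = model(prefix_ids, use_cache=True)
    return out.past_key_values

@torch.no_grad()
def forward_with_prefix(model, input_ids, prefix_cache):
    # New tokens attend to the cached prefix; the attention mask must cover
    # prefix + inputs. Assumes legacy tuples of shape (batch, heads, len, dim).
    prefix_len = prefix_cache[0][0].shape[2]
    attn = torch.ones(input_ids.shape[0], prefix_len + input_ids.shape[1],
                      dtype=torch.long, device=input_ids.device)
    return model(input_ids, attention_mask=attn,
                 past_key_values=prefix_cache, use_cache=True)
```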

4.3.1. Prefix Search

To form a prefix with an explicit activation spike, we first identify a candidate token that triggers the activation spike at the linear module with the highest max–median ratio r(m). For instance, the candidate token can be the apostrophe (′) token for the LLaMA-2-70B model, as highlighted in red in Figure 2. Once the candidate token is identified, we search for the middle context token placed between the BOS token and the candidate token in the prefix. This middle context provides a dummy context, which is required to activate the candidate token. To find the middle context, we design a template [B, T1, C1, T2, C2], where B, Ti, and Ci denote the BOS token, a context token, and the candidate token in the vocabulary V, respectively. We then select the context token T for which the first candidate instance C1 triggers an activation spike, whereas the second instance C2 does not. When multiple context tokens trigger activation spikes, we choose the one that maximizes the activation scale ratio between C1 and C2. Finally, we prepare the KV cache for the searched prefix [B, T, C]. Note that the latter part of the template can be replaced with sequences from the dataset instead of a repetition.
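The following sketch outlines the template search, assuming a hypothetical helper `get_token_scales` that runs the model on a short sequence and returns the token-wise input scale of the spike module (e.g., via a forward hook as in the calibration sketch).

```python
import torch

@torch.no_grad()
def search_prefix(model, tok, candidate_id, context_ids, spike_module):
    """Search the middle context token for QFeP (Section 4.3.1), simplified.

    Builds the template [BOS, T, C, T, C] for each context token T, reads the
    token-wise input scales of the spike module via `get_token_scales`
    (a hypothetical helper), and keeps the T that maximizes
    scale(C_1) / scale(C_2).
    """
    bos = tok.bos_token_id
    best_t, best_ratio = None, 0.0
    for t in context_ids:                       # e.g., top-200 frequent tokens
        ids = torch.tensor([[bos, t, candidate_id, t, candidate_id]],
                           device=model.device)
        scales = get_token_scales(model, ids, spike_module)   # shape: (5,)
        ratio = (scales[2] / scales[4]).item()  # first vs. second candidate
        if ratio > best_ratio:
            best_t, best_ratio = t, ratio
    return [bos, best_t, candidate_id]          # searched prefix [B, T, C]
```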

4.3.2. Implementation Details

During the prefix search phase, we exploit the calibration dataset used in Section 3.1. For the candidate tokens, we consider the tokens with the top three largest input activation magnitudes. Then, we search for the middle context token among the top 200 most frequent tokens in the calibration dataset, which is the subset of the vocabulary V. Finally, with the search result, we prepare the KV cache for the target model in FP16 precision. Exceptionally, for the Mixtral [6] model, we use the scale of output hidden states instead of input activations, as the tokens are divided sparsely in a mixture of experts architecture. Table 2 presents the searched prefix.

5. Experiments

5.1. Experimental Setup

5.1.1. Models and Environment

Our proposed methods, QFeM and QFeP, aim to mitigate the quantization bottleneck discussed in Section 3.3, which is caused by the activation spikes, especially in the GLU variants. To validate the proposed methods, we tested publicly released LLMs that implement GLU according to their papers and source code. Recent LLMs including LLaMA-2-{7B, 13B, 70B} [47], LLaMA-3-{8B, 70B} [48], Mistral-7B [49], Mixtral-8x7B [6], SOLAR-10.7B [50], and Gemma-7B [51] utilize the GLU architecture. LLMs with the original FFN are not covered, as they suffer from the existing outliers rather than activation spikes. All models were sourced from the Hugging Face model repository (available at: https://huggingface.co/models, accessed on 17 April 2025). We used Python 3.10, PyTorch 2.2, and Transformers 4.38.1 as our primary software environment and employed the lm-eval-harness library for LLM evaluation. Experiments were conducted on a single NVIDIA RTX 4090 GPU for the smaller models and a single NVIDIA A100 GPU for the larger models.

5.1.2. Quantization

In the experiments, we quantized both the input activations and the weights of linear layers for INT8 matrix multiplication. Note that in Table 2, |M| denotes the total number of linear modules targeted for quantization. For these linear layers, we opted for dynamic per-tensor quantization of the input activations and per-channel quantization of the weights. For both input activations and weights, we quantized the range symmetrically, using the absolute maximum value as the scale estimation function. For comparison, we used FP16 and per-token activation quantization [20] as baselines. We refer the reader to Appendix E for Batch Matrix-Multiplication (BMM) quantization, which involves quantizing the tensors in the self-attention.
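For reference, the sketch below simulates this W8A8 scheme in floating point: symmetric dynamic per-tensor quantization for activations, symmetric per-channel quantization for weights, and abs-max scale estimation. A deployed kernel would perform the accumulation in INT8/INT32 instead; this is an illustrative simulation, not the paper's implementation.

```python
import torch

def quantize_per_tensor(x):
    """Symmetric dynamic per-tensor INT8 quantization (activations)."""
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def quantize_per_channel(w):
    """Symmetric per-channel INT8 quantization (weights, one scale per output row)."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def int8_linear(x, w):
    # x: (tokens, in_features), w: (out_features, in_features).
    # Simulated INT8 matmul: quantize, multiply, then dequantize with both scales.
    qx, sx = quantize_per_tensor(x)
    qw, sw = quantize_per_channel(w)
    acc = qx.float() @ qw.float().t()            # accumulation in higher precision
    return acc * sx * sw.t()
```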

5.1.3. Evaluations

We evaluated the quantized LLMs with two metrics: zero-shot evaluation accuracy and perplexity. For zero-shot evaluation, we used four datasets: PIQA [52], LAMBADA [53], HellaSwag [54], and WinoGrande [55]. We utilized the lm-evaluation-harness library [56] to evaluate zero-shot tasks. To measure perplexity, we used the WikiText-2 [57] dataset. In all cases, we used the [BOS] token as the starting token for each input sequence by default.

5.2. Main Results

5.2.1. LLaMA-2 Models

We report the evaluation results of quantization on the LLaMA-2 models in Table 3. Compared to FP16 precision, quantizing both weights and activations (W8A8) degrades the overall performance. The results demonstrate that our proposed methods resolve the activation spikes and, surprisingly, restore the performance of W8A8 close to that of FP16. For example, the LLaMA-2 7B model shows less than a 1% performance drop from FP16. It is worth noting that QFeM and QFeP yield improvements of comparable magnitude, which indicates that the activation spikes are a direct cause of the significant decrease in quantization performance. Because the proposed methods are orthogonal, the performance increases slightly further when QFeM and QFeP are combined compared to applying either individually.

5.2.2. Other GLU-Implemented LLMs

For the other LLMs that incorporate GLU, we investigated the effectiveness of our methods in mitigating the quantization bottleneck. As can be seen in Figure 5, our methods consistently remedy the performance drop caused by activation spikes. Notably, the Mixtral model is relatively robust to the performance degradation, which indicates that the mixture of experts architecture, which routes tokens across the MLP experts, helps to alleviate the impact of the activation spikes. Meanwhile, addressing the activation spikes recovers less of the performance for the Gemma model than for the other models. We attribute this to the choice of activation function among GLU variants; specifically, Gemma uses GeGLU, while the other models employ SwiGLU.

5.3. Combining Outlier Alleviation Methods

While our methods focus on the activation spikes, the inherent outlier values in the input activations remain. Here, we combine prior outlier alleviation methods, such as SmoothQuant (SQ) [13] and OutlierSuppressionPlus (OSP) [15], with our methods to further reduce the quantization error. Table 4 reports the evaluation results of applying the outlier alleviation methods alone and of combining them with our methods. We find cases where the alleviation methods fail to recover the performance when the activations are quantized with a per-tensor scheme. In the original papers [13,15], the activations of the LLaMA models were quantized using only a dynamic per-token scheme. This indicates that alleviating the outlier scales, including the activation spikes, is challenging. Under 4-bit quantization, we observe that QFeM and QFeP are effective even when per-token quantization is applied. We confirm a similar improvement in our analysis of state-of-the-art quantization methods in Appendix D.

5.4. Prefix Ablation Study

For QFeP, we designed a length-three prefix for the KV cache, consisting of the BOS token, a context token, and an extra token for the activation spike. Because the KV cache consumes part of the pretrained sequence length, the choice of prefix length matters. Therefore, we conducted an ablation study over different prefixes for the KV cache: a random prefix, a BOS-only prefix, and the QFeP prefix without and with the context token. Notably, the BOS-only cache corresponds to IntactKV[B] [38]. We illustrate the results of the ablation study in Figure 6. In all cases, the random prefix shows the lowest performance. While the KV cache with only the BOS token demonstrates inconsistent performance, our QFeP consistently shows significant improvements. Importantly, the results imply that the sufficient prefix differs across models. However, we emphasize that our KV cache design for QFeP yields improvements by large margins across all models. Beyond our QFeP, IntactKV[P] [38] extends the BOS token to the system prompt for supervised fine-tuned models, and CushionCache [39] updates prefixes through quantization-aware prefix tuning.

5.5. Computational Cost Analysis

The proposed methods require additional resources to exclude the activation spikes from quantization. Therefore, we analyzed their computational costs and compared them under various schemes. For comparison, we evaluated different activation quantization schemes: dynamic per-token, dynamic per-tensor, and static per-tensor, denoted as AQ1, AQ2, and AQ3, respectively. This distinction establishes strong baselines and demonstrates the potential of our methods. To calibrate the static scales, we estimated the absolute maximum value using the calibration dataset from Section 3.1.
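The three schemes differ only in how the activation scale factor is obtained, as the short sketch below illustrates; the static scale for AQ3 is assumed to come from the calibration pass.

```python
def act_scale(x, scheme, static_scale=None):
    """Scale factors for the activation-quantization schemes in Section 5.5.

    x has shape (batch, seq_len, hidden). AQ1 keeps one scale per token,
    AQ2 one scale per tensor computed on the fly, and AQ3 a precomputed
    (calibrated) per-tensor scale reused at inference time.
    """
    if scheme == "AQ1":                          # dynamic per-token
        return x.abs().amax(dim=-1, keepdim=True) / 127.0
    if scheme == "AQ2":                          # dynamic per-tensor
        return x.abs().amax() / 127.0
    if scheme == "AQ3":                          # static per-tensor (calibrated)
        return static_scale
    raise ValueError(scheme)
```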

5.5.1. Inference Latency

For each setting, we present the accuracy on the zero-shot tasks and the inference latency for a fixed token sequence, as shown in Figure 7. While the fine-grained scheme (AQ1) shows a negligible accuracy drop, the coarser-grained counterparts (AQ2, AQ3) degrade due to the quantization bottleneck. However, by applying our methods, the coarse-grained schemes achieve competitive performance. For example, the combination of AQ2 and QFeM approaches the performance of AQ1 with lower latency. These results signify that addressing the quantization bottleneck is important for accelerating inference with coarser quantization granularity. In particular, naive static quantization (AQ3), the fastest scheme, exhibits a significant decline. We hope that our work contributes to future efforts that address the remaining challenges in static quantization.

5.5.2. Memory Footprint

In Table 5, we record the maximum memory footprint of our methods. For QFeP, additional memory is consistently required for the preserved KV cache. However, this memory overhead is much smaller than that of fine-grained quantization (AQ1), as QFeP utilizes only three tokens for the cache. In contrast to QFeP, QFeM shows inconsistent memory utilization. For example, the 7B model with QFeM exhibits memory usage similar to AQ2, while the 70B model with QFeM incurs additional consumption for a sequence length of 1K. This is attributed to the use of W8A16 for the unquantized modules in QFeM. To tailor the memory usage or inference speed, an alternative strategy can be employed for QFeM, such as applying fine-grained activation quantization to the unquantized modules instead of using W8A16.

6. Conclusions

We explored the quantization challenge of GLU activations in modern LLMs. We found that GLU variants generate excessive activation scales, which cause significant quantization bottlenecks at specific layers. Based on the systematic generation pattern of the activation spikes, we proposed methods that address the spikes in a layer-wise (QFeM) and a token-wise (QFeP) manner. Our experimental results demonstrate significant improvements, showing approximately 7–16% higher accuracy in quantized LLaMA-2 models compared to traditional quantization methods. Additionally, our methods achieved a substantial reduction in perplexity (up to 85.4% in LLaMA-2-13B and over 50% in LLaMA-2-7B), closely approaching full-precision (FP16) performance. Furthermore, our analysis indicates that QFeM optimizes computational efficiency, reducing latency by up to 10.4% in LLaMA-2-13B by selectively avoiding quantization in the layers most severely impacted by activation spikes, effectively balancing the latency–performance trade-off. Similarly, QFeP minimizes memory overhead, increasing memory usage by less than 0.05% in LLaMA-2-7B for 2 K tokens, while efficiently managing problematic activations through the KV caching mechanism. We expect that our work sheds light on the potential challenges in future studies on quantization and facilitates the development of efficient LLM systems.

Author Contributions

Conceptualization, J.Y.; methodology, J.Y.; software, J.Y. and H.K.; validation, J.Y. and H.K.; formal analysis, J.Y.; investigation, H.K.; resources, H.K.; data curation, H.K.; writing—original draft preparation, J.Y.; writing—review and editing, H.K., J.J. and Y.K.; visualization, H.K.; supervision, Y.K.; project administration, Y.K.; funding acquisition, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Ministry of Trade, Industry and Energy (MOTIE) and Korea Institute for Advancement of Technology (KIAT) through the International Cooperative R&D program (Project No. P0025661). This work was also supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155885, Artificial Intelligence Convergence Innovation Human Resources Development (Hanyang University ERICA)).

Data Availability Statement

The dataset and source code used in this study are publicly available in the GitHub repository at https://github.com/onnoo/activation-spikes (last accessed on 17 April 2025).

Acknowledgments

We sincerely thank the anonymous reviewers for their insightful and constructive comments, which greatly contributed to enhancing the content and quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Transformer Architecture

In Figure A1, we illustrate the Pre-LN transformer architecture and each sub-module. We highlight with the same color the linear modules that accept identical input activations. Note that the hidden states are normalized before forwarding into the query and up linear modules.
Figure A1. An illustration of the Pre-LN transformer block and its sub-modules. Two feed-forward implementations, GLU and non-GLU, are visualized in (c) and (d), respectively. In the feed-forward network, σ denotes a non-linear activation function, such as GeLU. We highlight the linear modules where input activations are quantized.

Appendix B. Additional Calibration Results

In this section, we provide details of the LLMs used for calibration, which is the step during quantization in which the FP16 ranges are computed (Appendix B.1), and additional calibration results (Appendices B.2 and B.3).

Appendix B.1. Detailed Specification of LLMs

In Section 3.1, we describe the application of the calibration method to various LLMs. We categorize the calibration results based on the presence of GLU in the LLMs. Table A1 shows the detailed structures of the LLMs; we follow the notation for feed-forward implementations from [3]. Among the GLU-implemented LLMs (LLaMA-2, LLaMA-3, Mistral, Mixtral, SOLAR, StableLM-2, and Gemma), most use SwiGLU as the FFN activation, and only Gemma uses GeGLU. Among the non GLU-implemented LLMs, most use GeLU for the FFN activation, with the exception of OPT, which uses ReLU.
Table A1. Architecture specification of LLMs. We categorize them into two groups depending on whether GLU is implemented in the FFN. All LLMs in the table use Pre-LN for the LayerNorm position.

| Model | Size | FFN Activation | Normalization | PE | Vocabulary Size |
|---|---|---|---|---|---|
| GLU-implemented LLMs: | | | | | |
| LLaMA-2 [47] | 7 B, 13 B, 70 B | SwiGLU | RMSNorm | RoPE | 32,000 |
| LLaMA-3 | 8 B, 70 B | SwiGLU | RMSNorm | RoPE | 128,256 |
| Mistral [49] | 7 B | SwiGLU | RMSNorm | RoPE | 32,000 |
| Mixtral [6] | 8 × 7 B | SwiGLU | RMSNorm | RoPE | 32,000 |
| SOLAR [50] | 10.7 B | SwiGLU | RMSNorm | RoPE | 32,000 |
| StableLM-2 [58] | 12 B | SwiGLU | LayerNorm | RoPE | 100,352 |
| Gemma [51] | 7 B | GeGLU | RMSNorm | RoPE | 256,000 |
| Non GLU-implemented LLMs: | | | | | |
| OPT [29] | 6.7 B, 13 B, 30 B, 66 B | ReLU | LayerNorm | Learned | 50,272 |
| MPT [59] | 7 B, 30 B | GeLU | LayerNorm | ALiBi | 50,432 |
| Pythia [60] | 6.9 B, 12 B | GeLU | LayerNorm | RoPE | 50,432, 50,688 |
| Falcon [61] | 7 B, 40 B | GeLU | LayerNorm | RoPE | 65,024 |
| Phi-2 [62] | 2.7 B | GeLU | LayerNorm | RoPE | 51,200 |

Appendix B.2. Additional Calibration Results on GLU-Implementation

Figure A2 and Figure A3 show calibration results for various GLU-implemented LLMs that are not included in Figure 1a. In most GLU-implemented LLMs, we observe that the input activations have large values near the first and last layers. Unlike the typical GLU-implemented LLM architecture, Mixtral is composed of eight feed-forward blocks in a single FFN, containing multiple gated linear units [6]. Accordingly, we can observe in Figure A2 that one of the gates spikes in value.
Figure A2. Calibration results on GLU-implemented LLMs (Mixtral-8 × 7B).

Appendix B.3. Additional Calibration Results on Non-GLU Implementation

Figure A4 shows calibration results for various non-GLU-implemented LLMs that are not included in Figure 1b. No activation spikes are observed in non-GLU-implemented LLMs.
Figure A3. Calibration results on GLU-implemented LLMs.
Figure A4. Calibration results on non-GLU-implemented LLMs.

Appendix C. Additional Results for Token-Level Scale Analysis

We provide additional results for the token-level scale analysis (Section 3.2). As shown in Figure A5 and Figure A6, the spike-triggering token placed directly after the BOS token does not exhibit an excessive activation scale.
Figure A5. Token-wise scale analysis for LLaMA-2-7B. The newline token behind the BOS token does not exhibit the activation spikes.
Figure A6. Token-wise scales from the unrolled activation spike of LLaMA-2-70B. The apostrophe token behind the BOS token does not exhibit the activation spikes.

Appendix D. Low-Bit Activation Quantization

In Section 5.3, we discuss the efficacy of QFeM and QFeP in combination with outlier alleviation methods. To further demonstrate the compatibility of our proposed methods, we provide additional experimental results for recent state-of-the-art (SOTA) quantization techniques, Atom [18] and OmniQuant [19]. Note that we follow their original implementations and parameter settings (e.g., fine-grained group quantization in Atom) during evaluation. Table A2 reports the perplexity of the SOTA methods on the WikiText-2 [57] and C4 [42] datasets. As expected, our methods are compatible with SOTA quantization techniques and enhance their performance, especially for OmniQuant. For Atom, we observe that fine-grained group quantization has the potential to address activation spikes. With this level of granularity, one could quantize the linear modules that QFeM identifies as problematic rather than leaving them unquantized.
Table A2. Evaluation of state-of-the-art activation quantization methods with QFeM and QFeP. Bold values indicate the best performance for each method across the columns.

| Method | LLaMA-2-7B (Wiki2) | LLaMA-2-7B (C4) | LLaMA-2-13B (Wiki2) | LLaMA-2-13B (C4) |
|---|---|---|---|---|
| Atom (W4A4) | 5.710 | 7.601 | 5.081 | 6.878 |
| +QFeM | 5.634 | 7.538 | 5.071 | 6.854 |
| +QFeP | 5.685 | 7.493 | 5.089 | 6.872 |
| +QFeM+QFeP | 5.607 | 7.442 | 5.073 | 6.855 |
| OmniQuant (W4A4) | 14.208 | 19.005 | 10.416 | 14.103 |
| +QFeM | 9.483 | 13.569 | 9.499 | 12.348 |
| +QFeP | 11.926 | 15.796 | 10.808 | 13.584 |
| +QFeM+QFeP | 8.818 | 12.244 | 9.901 | 12.294 |

Appendix E. BMM Quantization

To achieve faster inference latency, the BMM operations in the self-attention can also be computed as INT8 operations [13]. This requires quantizing the query, key, and value states, including the cached context. Because activation spikes produce latent values of large magnitude, it is important to confirm the extent of the quantization errors introduced by KV quantization before adopting BMM quantization. In Table A3, we examine the impact of BMM quantization on W8A8 and QFeM. Regardless of the BMM quantization, the QFeM method consistently improves the quantization bottleneck. For example, the 13B and 70B models maintain their performance, while the 7B model shows a slight decrease. However, this decrease appears to be due to inherent quantization errors rather than the quantization bottleneck from activation spikes. As a result, we confirm that our QFeM method effectively improves the overall performance even in the BMM quantization scenario.
Table A3. BMM quantization results.

| Model | Method | BMM Quantization: No | BMM Quantization: Yes |
|---|---|---|---|
| 7B | W8A8 | 62.08% | 61.66% |
| 7B | +QFeP | 68.69% | 68.30% |
| 13B | W8A8 | 55.29% | 55.43% |
| 13B | +QFeP | 69.91% | 69.77% |
| 70B | W8A8 | 66.87% | 66.75% |
| 70B | +QFeP | 72.62% | 72.69% |
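A simplified sketch of the INT8 BMM for the query–key scores is shown below; quantization is simulated in floating point with symmetric per-tensor scales, which is a simplification of the actual INT8 kernels.

```python
import torch

def int8_bmm_scores(query, key):
    """Simulated INT8 BMM for attention scores (Appendix E sketch).

    query/key have shape (batch, heads, seq_len, head_dim); each tensor is
    quantized symmetrically per tensor, multiplied, and the result is
    dequantized with the product of the two scales.
    """
    sq = query.abs().amax().clamp(min=1e-8) / 127.0
    sk = key.abs().amax().clamp(min=1e-8) / 127.0
    q_int = torch.clamp(torch.round(query / sq), -128, 127)
    k_int = torch.clamp(torch.round(key / sk), -128, 127)
    scores = torch.matmul(q_int, k_int.transpose(-1, -2)) * (sq * sk)
    return scores / query.shape[-1] ** 0.5       # usual 1/sqrt(d) scaling
```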

References

  1. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  2. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. [Google Scholar]
  3. Shazeer, N. Glu variants improve transformer. arXiv 2020, arXiv:2002.05202. [Google Scholar]
  4. Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
  5. Ainslie, J.; Lee-Thorp, J.; de Jong, M.; Zemlyanskiy, Y.; Lebrón, F.; Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv 2023, arXiv:2305.13245. [Google Scholar]
  6. Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Hanna, E.B.; Bressand, F.; et al. Mixtral of experts. arXiv 2024, arXiv:2401.04088. [Google Scholar]
  7. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  8. Narang, S.; Chung, H.W.; Tay, Y.; Fedus, L.; Fevry, T.; Matena, M.; Malkan, K.; Fiedel, N.; Shazeer, N.; Lan, Z.; et al. Do Transformer Modifications Transfer Across Implementations and Applications? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 5758–5773. [Google Scholar] [CrossRef]
  9. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713. [Google Scholar]
  10. Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision; Chapman and Hall/CRC: Boca Raton, FL, USA, 2022; pp. 291–326. [Google Scholar]
  11. Nagel, M.; Fournarakis, M.; Amjad, R.A.; Bondarenko, Y.; Van Baalen, M.; Blankevoort, T. A white paper on neural network quantization. arXiv 2021, arXiv:2106.08295. [Google Scholar]
  12. Dettmers, T.; Lewis, M.; Belkada, Y.; Zettlemoyer, L. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. Adv. Neural Inf. Process. Syst. 2022, 35, 30318–30332. [Google Scholar]
  13. Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; Han, S. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  14. Ahmadian, A.; Dash, S.; Chen, H.; Venkitesh, B.; Gou, Z.S.; Blunsom, P.; Üstün, A.; Hooker, S. Intriguing properties of quantization at scale. Adv. Neural Inf. Process. Syst. 2023, 36, 34278–34294. [Google Scholar]
  15. Wei, X.; Zhang, Y.; Li, Y.; Zhang, X.; Gong, R.; Guo, J.; Liu, X. Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 1648–1665. [Google Scholar] [CrossRef]
  16. Bondarenko, Y.; Nagel, M.; Blankevoort, T. Quantizable transformers: Removing outliers by helping attention heads do nothing. Adv. Neural Inf. Process. Syst. 2023, 36, 75067–75096. [Google Scholar]
  17. Sun, M.; Chen, X.; Kolter, J.Z.; Liu, Z. Massive Activations in Large Language Models. arXiv 2024, arXiv:2402.17762. [Google Scholar]
  18. Zhao, Y.; Lin, C.Y.; Zhu, K.; Ye, Z.; Chen, L.; Zheng, S.; Ceze, L.; Krishnamurthy, A.; Chen, T.; Kasikci, B. Atom: Low-bit quantization for efficient and accurate llm serving. Proc. Mach. Learn. Syst. 2024, 6, 196–209. [Google Scholar]
  19. Shao, W.; Chen, M.; Zhang, Z.; Xu, P.; Zhao, L.; Li, Z.; Zhang, K.; Gao, P.; Qiao, Y.; Luo, P. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv 2023, arXiv:2308.13137. [Google Scholar]
  20. Yao, Z.; Yazdani Aminabadi, R.; Zhang, M.; Wu, X.; Li, C.; He, Y. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Adv. Neural Inf. Process. Syst. 2022, 35, 27168–27183. [Google Scholar]
  21. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  22. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  23. Kovaleva, O.; Kulshreshtha, S.; Rogers, A.; Rumshisky, A. BERT Busters: Outlier Dimensions that Disrupt Transformers. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event, 1–6 August 2021; pp. 3392–3405. [Google Scholar] [CrossRef]
  24. Bondarenko, Y.; Nagel, M.; Blankevoort, T. Understanding and Overcoming the Challenges of Efficient Transformer Quantization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7947–7969. [Google Scholar] [CrossRef]
  25. Luo, Z.; Kulmizev, A.; Mao, X. Positional Artefacts Propagate Through Masked Language Model Embeddings. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 5312–5327. [Google Scholar] [CrossRef]
  26. Timkey, W.; van Schijndel, M. All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual Event, Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 4527–4546. [Google Scholar] [CrossRef]
  27. Puccetti, G.; Rogers, A.; Drozd, A.; Dell’Orletta, F. Outlier Dimensions that Disrupt Transformers are Driven by Frequency. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 1286–1304. [Google Scholar] [CrossRef]
  28. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  29. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. Opt: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
  30. Xiao, G.; Tian, Y.; Chen, B.; Han, S.; Lewis, M. Efficient streaming language models with attention sinks. arXiv 2023, arXiv:2309.17453. [Google Scholar]
  31. Kovaleva, O.; Romanov, A.; Rogers, A.; Rumshisky, A. Revealing the Dark Secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 4365–4374. [Google Scholar] [CrossRef]
  32. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv 2022, arXiv:2210.17323. [Google Scholar]
  33. Kim, S.; Hooper, C.; Gholami, A.; Dong, Z.; Li, X.; Shen, S.; Mahoney, M.W.; Keutzer, K. Squeezellm: Dense-and-sparse quantization. arXiv 2023, arXiv:2306.07629. [Google Scholar]
  34. Lin, J.; Tang, J.; Tang, H.; Yang, S.; Dang, X.; Han, S. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv 2023, arXiv:2306.00978. [Google Scholar] [CrossRef]
  35. Chee, J.; Cai, Y.; Kuleshov, V.; De Sa, C.M. Quip: 2-bit quantization of large language models with guarantees. Adv. Neural Inf. Process. Syst. 2023, 196, 4396–4429. [Google Scholar]
  36. Yao, Z.; Li, C.; Wu, X.; Youn, S.; He, Y. A comprehensive study on post-training quantization for large language models. arXiv 2023, arXiv:2303.08302. [Google Scholar]
  37. Dettmers, T.; Svirschevski, R.; Egiazarian, V.; Kuznedelev, D.; Frantar, E.; Ashkboos, S.; Borzunov, A.; Hoefler, T.; Alistarh, D. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv 2023, arXiv:2306.03078. [Google Scholar]
  38. Liu, R.; Bai, H.; Lin, H.; Li, Y.; Gao, H.; Xu, Z.; Hou, L.; Yao, J.; Yuan, C. IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact. arXiv 2024, arXiv:2403.01241. [Google Scholar]
  39. Son, S.; Park, W.; Han, W.; Kim, K.; Lee, J. Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization. arXiv 2024, arXiv:2406.12016. [Google Scholar]
  40. Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T. On layer normalization in the transformer architecture. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event, 13–18 July 2020; pp. 10524–10533. [Google Scholar]
Figure 1. Calibration results on GLU-implemented and non-GLU-implemented LLMs. We present the maximum magnitudes of the input activations for each linear module and of the layer-wise hidden states. For more results on different LLMs, see Appendix B.2 and Appendix B.3.
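The calibration behind Figure 1 amounts to recording, for every linear module, the largest input-activation magnitude observed on calibration data. Below is a minimal sketch of that bookkeeping with PyTorch forward hooks; it is not the authors' released code, and the checkpoint name and calibration sentence are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # small placeholder model; any causal LM with nn.Linear works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

max_abs = {}  # module name -> running max |input activation| over calibration data

def make_hook(name):
    def hook(module, inputs, output):
        val = inputs[0].detach().abs().max().item()
        max_abs[name] = max(max_abs.get(name, 0.0), val)
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

with torch.no_grad():
    model(**tok("An example calibration sentence.", return_tensors="pt"))

for h in handles:
    h.remove()

# The largest per-module input scales hint at where activation spikes occur.
print(sorted(max_abs.items(), key=lambda kv: -kv[1])[:5])
```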
Figure 2. Token-wise scales in a specific layer with an activation spike. When the input activations are quantized with a per-tensor scale, the scale of the spike token dominates the scales of all other tokens. The y-axis represents the activation magnitude, and the x-axis corresponds to the input sequence tokens. The background color of each token indicates its role: yellow for general tokens, red for the first occurrence of a token that causes an activation spike, and green for a repeated occurrence of the same token that does not trigger a spike. For more examples, see Appendix C.
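As a complement to Figure 2, the toy example below illustrates why a single spike is so damaging under per-tensor quantization: the tensor-wide INT8 scale is set by the spike, so ordinary tokens are rounded to almost nothing, whereas per-token scales are unaffected. All numbers are illustrative.

```python
import torch

# Toy input activations for 6 tokens with hidden size 4. Token 2 carries an
# "activation spike" that is orders of magnitude larger than the rest.
x = torch.tensor([
    [ 0.8, -0.5,  0.3,  0.1],
    [ 0.4,  0.9, -0.2,  0.6],
    [900.0, -2.0,  1.5,  0.7],   # spike token
    [ 0.3, -0.7,  0.2,  0.5],
    [ 0.6,  0.1, -0.9,  0.4],
    [-0.2,  0.8,  0.5, -0.3],
])

def fake_quant_int8(t, scale):
    # Symmetric INT8 fake quantization: quantize, clamp, and dequantize.
    return torch.clamp(torch.round(t / scale), -128, 127) * scale

per_tensor_scale = x.abs().max() / 127.0                      # one scale for all tokens
per_token_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0  # one scale per token

err_tensor = (fake_quant_int8(x, per_tensor_scale) - x).abs().mean().item()
err_token = (fake_quant_int8(x, per_token_scale) - x).abs().mean().item()
print(f"per-tensor scale: {per_tensor_scale.item():.3f} (set by the spike)")
print(f"mean abs error  per-tensor: {err_tensor:.4f}  per-token: {err_token:.4f}")
```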
Figure 3. Overview of QFeM and QFeP. (Left): QFeM excludes from quantization the modules whose r ( m ) exceeds the hyperparameter α. (Right): QFeP precomputes the KV cache of a prefix that triggers the activation spikes and reuses only that cache during the quantization phase, effectively preventing further activation spikes in subsequent sequences. In both figures, the y-axis represents transformer layers, where i < j < k indicate early, middle, and late layers, respectively. The x-axis corresponds to the input sequence, where tokens highlighted in yellow are general tokens and tokens highlighted in red trigger activation spikes.
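A minimal sketch of the QFeM-style selection rule on the left of Figure 3: given calibrated token-wise activation scales for each linear module, compute a spread ratio r(m) and leave modules with r(m) > α unquantized. Using the max-to-median ratio for r(m) follows the criterion described for Table 1 and is an assumption about its exact definition; the module names and statistics below are illustrative.

```python
import torch

def spread_ratio(token_scales: torch.Tensor) -> float:
    # Max-to-median ratio of per-token activation scales collected during calibration.
    return (token_scales.max() / token_scales.median()).item()

def split_modules(token_scales_by_module: dict, alpha: float):
    quantized, unquantized = [], []
    for name, scales in token_scales_by_module.items():
        (unquantized if spread_ratio(scales) > alpha else quantized).append(name)
    return quantized, unquantized

# Illustrative calibration statistics for three hypothetical modules.
stats = {
    "layers.0.mlp.down_proj":     torch.tensor([1.2, 0.9, 350.0, 1.1]),  # spike present
    "layers.15.self_attn.o_proj": torch.tensor([2.0, 1.8, 2.2, 1.9]),
    "layers.31.mlp.down_proj":    torch.tensor([1.0, 410.0, 1.3, 0.8]),  # spike present
}
quantized, kept_fp16 = split_modules(stats, alpha=10.0)
print("quantized modules:", quantized)
print("unquantized modules (M_unq):", kept_fp16)
```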
Figure 4. Trade-off between perplexity and | M u n q | for LLaMA-2-13B as the threshold α varies. A higher α quantizes more modules, reducing | M u n q | (left y-axis) but raising perplexity (right y-axis), where a lower perplexity value indicates better performance. A lower α restricts quantization, preserving performance but reducing latency gains.
Figure 5. Average zero-shot accuracy on other GLU-implemented LLMs. Most models recover substantially compared to W8A8, with performance close to FP16. The x-axis shows bar plots for each method: W8A8, QFeM, QFeP, and QFeM + QFeP. The y-axis indicates the average zero-shot accuracy across evaluation tasks. The red dashed line shows the performance of the original FP16 model for reference.
Figure 6. Prefix ablation. The y-axis represents the average accuracy over four zero-shot tasks. The x-axis shows bar plots for different prefix types: random, BOS only, QFeP without context, and QFeP with context.
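The QFeP variants compared in Figure 6 differ only in how the prefix is chosen; in every case the prefix is run once and only its KV cache is retained. The sketch below shows how such a prefix cache could be precomputed and reused with the Hugging Face transformers API; the checkpoint, prefix string, and follow-up text are placeholders (the actual prefixes used are listed in Table 2).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"        # assumed (gated) checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# 1) Run the chosen prefix once (in full precision) and keep its KV cache.
prefix_ids = tok(" all.", return_tensors="pt").input_ids      # tokenizer prepends BOS
with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

# 2) Later (quantized) forward passes start from the cached prefix, so the
#    spike-triggering prefix tokens never pass through the quantized path again.
new_ids = tok("The quick brown fox", return_tensors="pt",
              add_special_tokens=False).input_ids
attn_mask = torch.ones(1, prefix_ids.shape[1] + new_ids.shape[1], dtype=torch.long)
with torch.no_grad():
    out = model(new_ids, past_key_values=prefix_cache,
                attention_mask=attn_mask, use_cache=True)
print(out.logits.shape)   # [1, number of new tokens, vocab size]
```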
Figure 7. Accuracy–latency comparison of different activation quantization schemes: dynamic per-token (AQ1), dynamic per-tensor (AQ2), and static per-tensor (AQ3). The x-axis represents latency, and the y-axis represents accuracy.
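The three schemes in Figure 7 differ only in how the activation scale is obtained. A minimal sketch with illustrative tensors; the calibrated maximum used for the static case is an assumed value.

```python
import torch

def fake_quant_int8(x, scale):
    return torch.clamp(torch.round(x / scale), -128, 127) * scale

x = torch.randn(6, 4096)                                  # activations for 6 tokens

# AQ1 - dynamic per-token: one scale per token, computed at inference time.
aq1 = fake_quant_int8(x, x.abs().amax(dim=-1, keepdim=True) / 127.0)

# AQ2 - dynamic per-tensor: one scale for the whole tensor, computed at inference time.
aq2 = fake_quant_int8(x, x.abs().max() / 127.0)

# AQ3 - static per-tensor: one scale fixed in advance from calibration data.
calibrated_max = 4.2                                      # assumed calibration statistic
aq3 = fake_quant_int8(x, calibrated_max / 127.0)

for name, q in (("AQ1", aq1), ("AQ2", aq2), ("AQ3", aq3)):
    print(name, "mean abs error:", (q - x).abs().mean().item())
```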
Table 1. Perplexity and MSE of partial activation quantization of LLMs. Quantization is applied to the Top 4, Middle 4, and Bottom 4 layers, selected by the max-to-median ratio of token-wise activation scales. Both perplexity and MSE are measured on the calibration dataset; MSE is computed between the last hidden states of the FP16 and partially quantized models. A code sketch of the MSE measurement follows the table.
Model | Perplexity (↓) FP16 | Perplexity Top 4 | Perplexity Middle 4 | Perplexity Bottom 4 | MSE (↓) Top 4 | MSE Middle 4 | MSE Bottom 4
LLaMA-2-7B | 7.37 | 11.77 | 7.38 | 7.40 | 1908.80 | 1.03 | 12.90
LLaMA-2-13B | 6.84 | 15.09 | 6.84 | 6.84 | 4762.11 | 0.91 | 10.38
Mistral-7B | 8.35 | 69.45 | 8.35 | 8.36 | 218.60 | 0.02 | 0.18
Gemma-7B | 10.85 | 85.83 | 10.94 | 10.87 | 213.93 | 1.60 | 1.07
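A minimal sketch of the MSE measurement reported in Table 1, assuming both models are Hugging Face causal LMs that expose hidden states; how the partially quantized copy is produced is outside the scope of this sketch.

```python
import torch

def last_hidden_mse(model_fp16, model_quantized, input_ids):
    """MSE between the last hidden states of the reference and quantized models."""
    with torch.no_grad():
        h_ref = model_fp16(input_ids, output_hidden_states=True).hidden_states[-1]
        h_q = model_quantized(input_ids, output_hidden_states=True).hidden_states[-1]
    return torch.mean((h_ref.float() - h_q.float()) ** 2).item()
```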
Table 2. Specifications for QFeM and QFeP used in the experiments. | M | denotes the total number of linear layers in the LLM, and | M u n q | the number of layers that QFeM leaves unquantized.
Model | Prefix | α | Unquantized/total layers
LLaMA-2-7B | [BOS] all. | 6.68 | 17/128
LLaMA-2-13B | [BOS] then, | 12.91 | 6/160
LLaMA-2-70B | [BOS] I′ | 9.16 | 25/320
Mistral-7B | [BOS] how\n | 49.00 | 3/128
Mixtral-8x7B | [BOS]) .\n | 4.03 | 191/608
SOLAR-10.7B | [BOS] a | 16.48 | 11/192
Gemma-7B | [BOS]. Più | 10.65 | 5/112
LLaMA-3-8B | [BOS]-nd | 6.64 | 6/128
LLaMA-3-70B | [BOS] and, | 78.37 | 3/320
Table 3. Perplexity and zero-shot evaluation for the quantization of LLaMA-2 models. FP16 denotes the original model precision, and W8A8 denotes the model quantized to INT8 for both weights and activations. Colored cells in the original table indicate results obtained by applying our proposed methods to existing quantization approaches. Column headers represent different benchmarks, with the values in parentheses denoting ppl (perplexity) and acc (accuracy), respectively. Downward arrows (↓) indicate that lower values are better (e.g., for perplexity), while upward arrows (↑) indicate that higher values are better (e.g., for accuracy). Bold text in the main table highlights the improvements over the baseline quantization method (W8A8 without our method). A code sketch of the perplexity measurement follows the table.
Method | WikiText-2 (ppl ↓) | PIQA (acc ↑) | LAMBADA (acc ↑) | HellaSwag (acc ↑) | WinoGrande (acc ↑) | Avg (acc ↑)
LLaMA-2-7B
FP16 | 5.268 | 78.18% | 73.67% | 57.13% | 69.46% | 69.61%
W8A8 | 8.634 | 72.80% | 62.27% | 49.57% | 63.69% | 62.08%
+QFeM | 5.758 [−2.876] | 78.02% | 73.86% | 56.32% | 68.35% | 69.14% [+7.06]
+QFeP | 5.758 [−2.876] | 76.44% | 73.57% | 55.55% | 69.22% | 68.69% [+6.61]
+QFeM+QFeP | 5.573 [−3.061] | 77.86% | 74.58% | 56.05% | 69.38% | 69.47% [+7.39]
LLaMA-2-13B
FP16 | 4.789 | 79.49% | 76.54% | 60.20% | 72.38% | 72.15%
W8A8 | 34.089 | 70.13% | 49.66% | 42.65% | 58.72% | 55.29%
+QFeM | 5.241 [−28.848] | 77.58% | 75.68% | 59.13% | 72.61% | 71.25% [+15.96]
+QFeP | 6.000 [−28.089] | 77.53% | 73.94% | 57.23% | 70.96% | 69.91% [+14.62]
+QFeM+QFeP | 5.126 [−28.963] | 78.51% | 75.86% | 59.44% | 72.61% | 71.61% [+16.32]
LLaMA-2-70B
FP16 | 3.218 | 81.45% | 79.45% | 65.29% | 80.43% | 76.65%
W8A8 | 8.055 | 74.05% | 70.27% | 55.21% | 67.96% | 66.87%
+QFeM | 3.830 [−4.225] | 81.23% | 77.66% | 64.15% | 78.14% | 75.30% [+8.43]
+QFeP | 6.007 [−2.048] | 77.64% | 73.26% | 63.40% | 76.16% | 72.62% [+5.75]
+QFeM+QFeP | 3.708 [−4.347] | 81.23% | 77.82% | 64.65% | 77.11% | 75.20% [+8.33]
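A minimal sketch of the WikiText-2 perplexity measurement used in Table 3: the mean token-level negative log-likelihood over the test split, exponentiated. Chunking is simplified (no overlapping stride), and the dataset and checkpoint identifiers are assumptions.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"                     # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

seq_len, nll_sum, n_tokens = 2048, 0.0, 0
with torch.no_grad():
    for start in range(0, ids.shape[1], seq_len):
        chunk = ids[:, start:start + seq_len]
        if chunk.shape[1] < 2:
            continue
        loss = model(chunk, labels=chunk).loss            # mean NLL over the chunk
        nll_sum += loss.item() * (chunk.shape[1] - 1)
        n_tokens += chunk.shape[1] - 1

print("perplexity:", math.exp(nll_sum / n_tokens))
```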
Table 4. Perplexity on WikiText-2 for outlier alleviation methods with QFeM and QFeP under different activation quantization granularities. Activation quantization is applied at two levels: fine-grained (per-token) and coarse-grained (per-tensor). We employ per-channel granularity for weight quantization. For low-bit quantization (W6A6 and W4A4), we adopt asymmetric scale estimation. The bold values in the original table indicate the best performance within each group. “#Bits” denotes the bit-width configuration, i.e., the precision to which the weights and activations are quantized. A code sketch of asymmetric quantization follows the table.
Model | #Bits | Method | Per-Token: Base | Per-Token: QFeM | Per-Token: QFeP | Per-Tensor: Base | Per-Tensor: QFeM | Per-Tensor: QFeP
LLaMA-2-7B (FP16: 5.268)
W8A8 | SQ | 5.296 | 5.288 | 5.302 | 9.907 | 5.534 | 5.715
W8A8 | OSP | 5.293 | 5.288 | 5.303 | 38.490 | 5.493 | 5.642
W4A4 | OSP | 48.199 | 16.151 | 44.635 | 23491 | 4772 | 10940
LLaMA-2-13B (FP16: 4.789)
W8A8 | SQ | 4.809 | 4.808 | 4.818 | 34.869 | 5.118 | 6.551
W8A8 | OSP | 4.813 | 4.812 | 4.819 | 5.148 | 5.099 | 5.144
W4A4 | OSP | 21.037 | 11.535 | 11.860 | 9340 | 8680 | 9362
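The low-bit rows of Table 4 use asymmetric scale estimation. Below is a minimal sketch of asymmetric (zero-point) fake quantization with illustrative inputs; the exact estimator used by SQ/OSP may differ.

```python
import torch

def asym_fake_quant(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    # Asymmetric (zero-point) quantization: map [min, max] onto the full
    # unsigned grid [0, 2^n - 1] instead of a symmetric range around zero.
    qmax = 2 ** n_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

x = torch.randn(4, 8) * 3 + 1.5          # skewed, non-symmetric toy distribution
for bits in (8, 6, 4):
    err = (asym_fake_quant(x, bits) - x).abs().mean().item()
    print(f"{bits}-bit asymmetric fake-quant, mean abs error: {err:.4f}")
```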
Table 5. Memory footprint (in MiB) for different quantization methods at sequence lengths of 1 K and 2 K. Results are shown for both the LLaMA-2-7B and LLaMA-2-70B models. A measurement sketch follows the table.
Method | SeqLen 1 K | SeqLen 2 K
LLaMA-2-7B
AQ1 | 8185 MiB | 9516 MiB
AQ2 | 8148 MiB | 9474 MiB
+QFeP | 8149 MiB | 9478 MiB
+QFeM | 8148 MiB | 9474 MiB
LLaMA-2-70B
AQ1 | 67,756 MiB | 69,037 MiB
AQ2 | 67,648 MiB | 68,820 MiB
+QFeP | 67,651 MiB | 68,822 MiB
+QFeM | 67,838 MiB | 68,819 MiB
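A minimal sketch of how peak-memory figures like those in Table 5 could be obtained from PyTorch's CUDA memory statistics; the model, inputs, and device placement are assumptions, and a CUDA device is required.

```python
import torch

def peak_memory_mib(model, input_ids):
    """Peak GPU memory (MiB) for a single forward pass; requires a CUDA device."""
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(input_ids.cuda())
    return torch.cuda.max_memory_allocated() / (1024 ** 2)
```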
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
