Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models
Abstract
1. Introduction
- We find that the GLU architecture in modern LLMs systematically generates excessive activation values, which are responsible for significant performance degradation when the entire model and all input tokens are quantized.
- Based on these observations, we propose two empirical methods, QFeM and QFeP, which effectively exclude the activation spikes from quantization, incur negligible computational overhead, and remain compatible with existing quantization techniques (see the illustrative sketch after this list).
- Our experiments confirm the detrimental impact of the activation spikes on activation quantization and show that the proposed methods consistently improve quantization performance.
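To make the QFeM idea concrete, the following minimal PyTorch sketch (not the authors' released implementation; the function names, the hook-based scale collection, and the threshold value of 16 are illustrative assumptions) records per-token input-activation scales for every linear module on a small calibration set and flags the modules whose maximum scale far exceeds the typical (median) scale, so that their inputs can be left unquantized.

```python
import torch

@torch.no_grad()
def collect_input_scales(model, calib_batches, device="cuda"):
    """Record the per-token max |input activation| at every nn.Linear during calibration."""
    scales, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0]                          # (batch, seq, hidden)
            per_token = x.abs().amax(dim=-1)       # max magnitude per token
            scales.setdefault(name, []).append(per_token.flatten().float().cpu())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    for input_ids in calib_batches:                # batches of calibration token ids
        model(input_ids.to(device))

    for h in hooks:
        h.remove()
    return {name: torch.cat(chunks) for name, chunks in scales.items()}


def select_unquantized_modules(scales, ratio_threshold=16.0):
    """Flag modules whose extreme input scales dwarf the typical (median) token scale."""
    skip = []
    for name, s in scales.items():
        ratio = s.max() / s.median().clamp_min(1e-6)
        if ratio.item() > ratio_threshold:
            skip.append(name)                      # keep these inputs unquantized (e.g., FP16)
    return skip
```

At inference time, the flagged modules would keep full-precision inputs while the remaining linear layers use the chosen low-bit activation scheme.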
2. Related Works
2.1. Outlier Values in LLMs
2.2. Post-Training Quantization for LLMs
3. Activation Spikes: Excessive Magnitude of GLU Activations
3.1. Existence of Activation Spikes in GLU Variants
3.1.1. GLU-Implemented LLMs Exhibit Activation Spikes at Specific Layers
3.1.2. Non-GLU-Implemented LLMs Show Modest Scale Distribution
3.2. Token-Level Scale Analysis Within Activation Spikes
3.3. Effect of Quantization on Activation Spikes
4. Mitigating Quantization Quality Degradation Based on the Observation
4.1. Overview of Adapting Our Techniques for Quantization
4.2. Quantization-Free Module (QFeM)
Optimizing the Threshold
4.3. Quantization-Free Prefix (QFeP)
4.3.1. Prefix Search
4.3.2. Implementation Details
5. Experiments
5.1. Experimental Setup
5.1.1. Models and Environment
5.1.2. Quantization
5.1.3. Evaluations
5.2. Main Results
5.2.1. LLaMA-2 Models
5.2.2. Other GLU-Implemented LLMs
5.3. Combining Outlier Alleviation Methods
5.4. Prefix Ablation Study
5.5. Computational Cost Analysis
5.5.1. Inference Latency
5.5.2. Memory Footprint
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Transformer Architecture
Appendix B. Additional Calibration Results
Appendix B.1. Detailed Specification of LLMs
Model | Size | FFN Activation | Normalization | Positional Encoding | Vocabulary Size
---|---|---|---|---|---
GLU-implemented LLMs: | |||||
LLaMA-2 [47] | 7 B, 13 B, 70 B | SwiGLU | RMSNorm | RoPE | 32,000 |
LLaMA-3 | 8 B, 70 B | SwiGLU | RMSNorm | RoPE | 128,256 |
Mistral [49] | 7 B | SwiGLU | RMSNorm | RoPE | 32,000 |
Mixtral [6] | 8 × 7 B | SwiGLU | RMSNorm | RoPE | 32,000 |
SOLAR [50] | 10.7 B | SwiGLU | RMSNorm | RoPE | 32,000 |
StableLM-2 [58] | 12 B | SwiGLU | LayerNorm | RoPE | 100,352 |
Gemma [51] | 7 B | GeGLU | RMSNorm | RoPE | 256,000 |
Non-GLU-implemented LLMs: | |||||
OPT [29] | 6.7 B, 13 B, 30 B, 66 B | ReLU | LayerNorm | Learned | 50,272 |
MPT [59] | 7 B, 30 B | GeLU | LayerNorm | ALiBi | 50,432 |
Pythia [60] | 6.9 B, 12 B | GeLU | LayerNorm | RoPE | 50,432, 50,688 |
Falcon [61] | 7 B, 40 B | GeLU | LayerNorm | RoPE | 65,024 |
Phi-2 [62] | 2.7 B | GeLU | LayerNorm | RoPE | 51,200 |
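For context, the GLU-implemented models listed above share the gated feed-forward structure sketched below. This is a generic SwiGLU block in the style of open-source LLaMA implementations, shown only to illustrate where the gating product (the source of the activation spikes studied in this paper) occurs; the class and layer names are illustrative, not taken from this article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Standard SwiGLU feed-forward block (LLaMA-style); a GeGLU variant swaps SiLU for GELU."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The elementwise product of the gated branch and the linear branch feeds down_proj;
        # the paper attributes the observed activation spikes to this gated path.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```

A GeGLU variant such as Gemma replaces SiLU with GELU, whereas the non-GLU models in the lower half of the table use a plain two-layer feed-forward network with ReLU or GELU and no gating product.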
Appendix B.2. Additional Calibration Results on GLU-Implementation
Appendix B.3. Additional Calibration Results on Non-GLU Implementation
Appendix C. Additional Results for Token-Level Scale Analysis
Appendix D. Low-Bit Activation Quantization
Method | LLaMA-2-7B: Wiki2 | LLaMA-2-7B: C4 | LLaMA-2-13B: Wiki2 | LLaMA-2-13B: C4
---|---|---|---|---
Atom (W4A4) | 5.710 | 7.601 | 5.081 | 6.878 |
+QFeM | 5.634 | 7.538 | 5.071 | 6.854 |
+QFeP | 5.685 | 7.493 | 5.089 | 6.872 |
+QFeM+QFeP | 5.607 | 7.442 | 5.073 | 6.855 |
OmniQuant (W4A4) | 14.208 | 19.005 | 10.416 | 14.103 |
+QFeM | 9.483 | 13.569 | 9.499 | 12.348 |
+QFeP | 11.926 | 15.796 | 10.808 | 13.584 |
+QFeM+QFeP | 8.818 | 12.244 | 9.901 | 12.294 |
Appendix E. BMM Quantization
Model | Method | Without BMM Quantization | With BMM Quantization
---|---|---|---
7B | W8A8 | 62.08% | 61.66%
7B | +QFeP | 68.69% | 68.30%
13B | W8A8 | 55.29% | 55.43%
13B | +QFeP | 69.91% | 69.77%
70B | W8A8 | 66.87% | 66.75%
70B | +QFeP | 72.62% | 72.69%
References
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
- Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682. [Google Scholar]
- Shazeer, N. GLU variants improve transformer. arXiv 2020, arXiv:2002.05202. [Google Scholar]
- Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
- Ainslie, J.; Lee-Thorp, J.; de Jong, M.; Zemlyanskiy, Y.; Lebrón, F.; Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv 2023, arXiv:2305.13245. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Hanna, E.B.; Bressand, F.; et al. Mixtral of experts. arXiv 2024, arXiv:2401.04088. [Google Scholar]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Narang, S.; Chung, H.W.; Tay, Y.; Fedus, L.; Fevry, T.; Matena, M.; Malkan, K.; Fiedel, N.; Shazeer, N.; Lan, Z.; et al. Do Transformer Modifications Transfer Across Implementations and Applications? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 5758–5773. [Google Scholar] [CrossRef]
- Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713. [Google Scholar]
- Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision; Chapman and Hall/CRC: Boca Raton, FL, USA, 2022; pp. 291–326. [Google Scholar]
- Nagel, M.; Fournarakis, M.; Amjad, R.A.; Bondarenko, Y.; Van Baalen, M.; Blankevoort, T. A white paper on neural network quantization. arXiv 2021, arXiv:2106.08295. [Google Scholar]
- Dettmers, T.; Lewis, M.; Belkada, Y.; Zettlemoyer, L. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Adv. Neural Inf. Process. Syst. 2022, 35, 30318–30332. [Google Scholar]
- Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; Han, S. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
- Ahmadian, A.; Dash, S.; Chen, H.; Venkitesh, B.; Gou, Z.S.; Blunsom, P.; Üstün, A.; Hooker, S. Intriguing properties of quantization at scale. Adv. Neural Inf. Process. Syst. 2023, 36, 34278–34294. [Google Scholar]
- Wei, X.; Zhang, Y.; Li, Y.; Zhang, X.; Gong, R.; Guo, J.; Liu, X. Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 1648–1665. [Google Scholar] [CrossRef]
- Bondarenko, Y.; Nagel, M.; Blankevoort, T. Quantizable transformers: Removing outliers by helping attention heads do nothing. Adv. Neural Inf. Process. Syst. 2023, 36, 75067–75096. [Google Scholar]
- Sun, M.; Chen, X.; Kolter, J.Z.; Liu, Z. Massive Activations in Large Language Models. arXiv 2024, arXiv:2402.17762. [Google Scholar]
- Zhao, Y.; Lin, C.Y.; Zhu, K.; Ye, Z.; Chen, L.; Zheng, S.; Ceze, L.; Krishnamurthy, A.; Chen, T.; Kasikci, B. Atom: Low-bit quantization for efficient and accurate LLM serving. Proc. Mach. Learn. Syst. 2024, 6, 196–209. [Google Scholar]
- Shao, W.; Chen, M.; Zhang, Z.; Xu, P.; Zhao, L.; Li, Z.; Zhang, K.; Gao, P.; Qiao, Y.; Luo, P. OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv 2023, arXiv:2308.13137. [Google Scholar]
- Yao, Z.; Yazdani Aminabadi, R.; Zhang, M.; Wu, X.; Li, C.; He, Y. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Adv. Neural Inf. Process. Syst. 2022, 35, 27168–27183. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Kovaleva, O.; Kulshreshtha, S.; Rogers, A.; Rumshisky, A. BERT Busters: Outlier Dimensions that Disrupt Transformers. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event, 1–6 August 2021; pp. 3392–3405. [Google Scholar] [CrossRef]
- Bondarenko, Y.; Nagel, M.; Blankevoort, T. Understanding and Overcoming the Challenges of Efficient Transformer Quantization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7947–7969. [Google Scholar] [CrossRef]
- Luo, Z.; Kulmizev, A.; Mao, X. Positional Artefacts Propagate Through Masked Language Model Embeddings. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 5312–5327. [Google Scholar] [CrossRef]
- Timkey, W.; van Schijndel, M. All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual Event, Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 4527–4546. [Google Scholar] [CrossRef]
- Puccetti, G.; Rogers, A.; Drozd, A.; Dell’Orletta, F. Outlier Dimensions that Disrupt Transformers are Driven by Frequency. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 1286–1304. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. OPT: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
- Xiao, G.; Tian, Y.; Chen, B.; Han, S.; Lewis, M. Efficient streaming language models with attention sinks. arXiv 2023, arXiv:2309.17453. [Google Scholar]
- Kovaleva, O.; Romanov, A.; Rogers, A.; Rumshisky, A. Revealing the Dark Secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 4365–4374. [Google Scholar] [CrossRef]
- Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv 2022, arXiv:2210.17323. [Google Scholar]
- Kim, S.; Hooper, C.; Gholami, A.; Dong, Z.; Li, X.; Shen, S.; Mahoney, M.W.; Keutzer, K. SqueezeLLM: Dense-and-sparse quantization. arXiv 2023, arXiv:2306.07629. [Google Scholar]
- Lin, J.; Tang, J.; Tang, H.; Yang, S.; Dang, X.; Han, S. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv 2023, arXiv:2306.00978. [Google Scholar] [CrossRef]
- Chee, J.; Cai, Y.; Kuleshov, V.; De Sa, C.M. QuIP: 2-bit quantization of large language models with guarantees. Adv. Neural Inf. Process. Syst. 2023, 196, 4396–4429. [Google Scholar]
- Yao, Z.; Li, C.; Wu, X.; Youn, S.; He, Y. A comprehensive study on post-training quantization for large language models. arXiv 2023, arXiv:2303.08302. [Google Scholar]
- Dettmers, T.; Svirschevski, R.; Egiazarian, V.; Kuznedelev, D.; Frantar, E.; Ashkboos, S.; Borzunov, A.; Hoefler, T.; Alistarh, D. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. arXiv 2023, arXiv:2306.03078. [Google Scholar]
- Liu, R.; Bai, H.; Lin, H.; Li, Y.; Gao, H.; Xu, Z.; Hou, L.; Yao, J.; Yuan, C. IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact. arXiv 2024, arXiv:2403.01241. [Google Scholar]
- Son, S.; Park, W.; Han, W.; Kim, K.; Lee, J. Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization. arXiv 2024, arXiv:2406.12016. [Google Scholar]
- Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T. On layer normalization in the transformer architecture. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event, 13–18 July 2020; pp. 10524–10533. [Google Scholar]
- Baevski, A.; Auli, M. Adaptive Input Representations for Neural Language Modeling. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 630–645. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 11, 6000–6010. [Google Scholar]
- Pope, R.; Douglas, S.; Chowdhery, A.; Devlin, J.; Bradbury, J.; Heek, J.; Xiao, K.; Agrawal, S.; Dean, J. Efficiently scaling transformer inference. Proc. Mach. Learn. Syst. 2023, 5, 606–624. [Google Scholar]
- Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. arXiv 2019, arXiv:1904.01038. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
- Kim, D.; Park, C.; Kim, S.; Lee, W.; Song, W.; Kim, Y.; Kim, H.; Kim, Y.; Lee, H.; Kim, J.; et al. SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling. arXiv 2023, arXiv:2312.15166. [Google Scholar]
- Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.S.; Love, J.; et al. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295. [Google Scholar]
- Bisk, Y.; Zellers, R.; Gao, J.; Choi, Y. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7432–7439. [Google Scholar]
- Paperno, D.; Kruszewski, G.; Lazaridou, A.; Pham, N.Q.; Bernardi, R.; Pezzelle, S.; Baroni, M.; Boleda, G.; Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1525–1534. [Google Scholar] [CrossRef]
- Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; Choi, Y. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 4791–4800. [Google Scholar] [CrossRef]
- Sakaguchi, K.; Bras, R.L.; Bhagavatula, C.; Choi, Y. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv 2019, arXiv:1907.10641. [Google Scholar] [CrossRef]
- Gao, L.; Tow, J.; Abbasi, B.; Biderman, S.; Black, S.; DiPofi, A.; Foster, C.; Golding, L.; Hsu, J.; Le Noac’h, A.; et al. A Framework for Few-Shot Language Model Evaluation. 2023. Available online: https://github.com/EleutherAI/lm-evaluation-harness (accessed on 2 September 2021). [CrossRef]
- Merity, S.; Xiong, C.; Bradbury, J.; Socher, R. Pointer sentinel mixture models. arXiv 2016, arXiv:1609.07843. [Google Scholar]
- Bellagente, M.; Tow, J.; Mahan, D.; Phung, D.; Zhuravinskyi, M.; Adithyan, R.; Baicoianu, J.; Brooks, B.; Cooper, N.; Datta, A.; et al. Stable LM 2 1.6 B Technical Report. arXiv 2024, arXiv:2402.17834. [Google Scholar]
- Team, M.N. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. 2023. Available online: https://www.databricks.com/blog/mpt-7b (accessed on 5 May 2023).
- Biderman, S.; Schoelkopf, H.; Anthony, Q.G.; Bradley, H.; O’Brien, K.; Hallahan, E.; Khan, M.A.; Purohit, S.; Prashanth, U.S.; Raff, E.; et al. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 2397–2430. [Google Scholar]
- Almazrouei, E.; Alobeidli, H.; Alshamsi, A.; Cappelli, A.; Cojocaru, R.; Debbah, M.; Goffinet, É.; Hesslow, D.; Launay, J.; Malartic, Q.; et al. The falcon series of open language models. arXiv 2023, arXiv:2311.16867. [Google Scholar]
- Javaheripi, M.; Bubeck, S. Phi-2: The Surprising Power of Small Language Models. 2023. Available online: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ (accessed on 12 December 2023).
Model | Perplexity (↓): FP16 | Perplexity (↓): Top 4 | Perplexity (↓): Middle 4 | Perplexity (↓): Bottom 4 | MSE (↓): Top 4 | MSE (↓): Middle 4 | MSE (↓): Bottom 4
---|---|---|---|---|---|---|---
LLaMA-2-7B | 7.37 | 11.77 | 7.38 | 7.40 | 1908.80 | 1.03 | 12.90 |
LLaMA-2-13B | 6.84 | 15.09 | 6.84 | 6.84 | 4762.11 | 0.91 | 10.38 |
Mistral-7B | 8.35 | 69.45 | 8.35 | 8.36 | 218.60 | 0.02 | 0.18 |
Gemma-7B | 10.85 | 85.83 | 10.94 | 10.87 | 213.93 | 1.60 | 1.07 |
Model | Prefix | ||
---|---|---|---
LLaMA-2-7B | [BOS] all. | 6.68 | 17/128 |
LLaMA-2-13B | [BOS] then, | 12.91 | 6/160 |
LLaMA-2-70B | [BOS] I′ | 9.16 | 25/320 |
Mistral-7B | [BOS] how\n | 49.00 | 3/128 |
Mixtral-8x7B | [BOS]) .\n | 4.03 | 191/608 |
SOLAR-10.7B | [BOS] a 1 | 6.48 | 11/192 |
Gemma-7B | [BOS]. Più | 10.65 | 5/112 |
LLaMA-3-8B | [BOS]-nd | 6.64 | 6/128 |
LLaMA-3-70B | [BOS] and, | 78.37 | 3/320 |
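As an illustration of how such a prefix could be used in the spirit of QFeP, the hypothetical sketch below precomputes the prefix's key/value cache once in full precision and reuses it for every request, so the tokens that attract activation spikes stay outside the quantized inference path. It assumes a recent Hugging Face Transformers API (DynamicCache, generate with past_key_values); the exact integration details differ from the authors' implementation.

```python
import copy
import torch
from transformers import DynamicCache

@torch.no_grad()
def precompute_prefix_cache(model, tokenizer, prefix_text, device="cuda"):
    """Run the prefix once in full precision and keep its key/value cache."""
    prefix_ids = tokenizer(prefix_text, return_tensors="pt").input_ids.to(device)
    cache = DynamicCache()
    cache = model(prefix_ids, past_key_values=cache, use_cache=True).past_key_values
    return prefix_ids, cache

@torch.no_grad()
def generate_with_prefix(model, tokenizer, prefix_ids, prefix_cache, prompt, device="cuda"):
    """Prepend the prefix ids and reuse its cached states for generation."""
    prompt_ids = tokenizer(prompt, return_tensors="pt",
                           add_special_tokens=False).input_ids.to(device)
    input_ids = torch.cat([prefix_ids, prompt_ids], dim=1)
    # generate() mutates the cache in place, so reuse a copy per request.
    return model.generate(input_ids,
                          past_key_values=copy.deepcopy(prefix_cache),
                          max_new_tokens=64)
```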
Method | WikiText-2 (ppl ↓) | PIQA (acc ↑) | LAMBADA (acc ↑) | HellaSwag (acc ↑) | WinoGrande (acc ↑) | Avg (acc ↑) |
---|---|---|---|---|---|---
LLaMA-2-7B | ||||||
FP16 | 5.268 | 78.18% | 73.67% | 57.13% | 69.46% | 69.61% |
W8A8 | 8.634 | 72.80% | 62.27% | 49.57% | 63.69% | 62.08% |
+QFeM | 5.758 [−2.876] | 78.02% | 73.86% | 56.32% | 68.35% | 69.14% [+7.06] |
+QFeP | 5.758 [−2.876] | 76.44% | 73.57% | 55.55% | 69.22% | 68.69% [+6.61] |
+QFeM+QFeP | 5.573 [−3.061] | 77.86% | 74.58% | 56.05% | 69.38% | 69.47% [+7.39] |
LLaMA-2-13B | ||||||
FP16 | 4.789 | 79.49% | 76.54% | 60.20% | 72.38% | 72.15% |
W8A8 | 34.089 | 70.13% | 49.66% | 42.65% | 58.72% | 55.29% |
+QFeM | 5.241 [−28.848] | 77.58% | 75.68% | 59.13% | 72.61% | 71.25% [+15.96] |
+QFeP | 6.000 [−28.089] | 77.53% | 73.94% | 57.23% | 70.96% | 69.91% [+14.62] |
+QFeM+QFeP | 5.126 [−28.963] | 78.51% | 75.86% | 59.44% | 72.61% | 71.61% [+16.32] |
LLaMA-2-70B | ||||||
FP16 | 3.218 | 81.45% | 79.45% | 65.29% | 80.43% | 76.65% |
W8A8 | 8.055 | 74.05% | 70.27% | 55.21% | 67.96% | 66.87% |
+QFeM | 3.830 [−4.225] | 81.23% | 77.66% | 64.15% | 78.14% | 75.30% [+8.43] |
+QFeP | 6.007 [−2.048] | 77.64% | 73.26% | 63.40% | 76.16% | 72.62% [+5.75] |
+QFeM+QFeP | 3.708 [−4.347] | 81.23% | 77.82% | 64.65% | 77.11% | 75.20% [+8.33] |
Model | #Bits | Method | Per-Token Quant.: Base | Per-Token Quant.: QFeM | Per-Token Quant.: QFeP | Per-Tensor Quant.: Base | Per-Tensor Quant.: QFeM | Per-Tensor Quant.: QFeP
---|---|---|---|---|---|---|---|---
LLaMA-2-7B (FP16: 5.268) | W8A8 | SQ | 5.296 | 5.288 | 5.302 | 9.907 | 5.534 | 5.715
LLaMA-2-7B (FP16: 5.268) | W8A8 | OSP | 5.293 | 5.288 | 5.303 | 38.490 | 5.493 | 5.642
LLaMA-2-7B (FP16: 5.268) | W4A4 | OSP | 48.199 | 16.151 | 44.635 | 23491 | 4772 | 10940
LLaMA-2-13B (FP16: 4.789) | W8A8 | SQ | 4.809 | 4.808 | 4.818 | 34.869 | 5.118 | 6.551
LLaMA-2-13B (FP16: 4.789) | W8A8 | OSP | 4.813 | 4.812 | 4.819 | 5.148 | 5.099 | 5.144
LLaMA-2-13B (FP16: 4.789) | W4A4 | OSP | 21.037 | 11.535 | 11.860 | 9340 | 8680 | 9362
Method | SeqLen 1 K | SeqLen 2 K
---|---|---
LLaMA-2-7B | ||
AQ1 | 8185 MiB | 9516 MiB |
AQ2 | 8148 MiB | 9474 MiB |
+QFeP | 8149 MiB | 9478 MiB |
+QFeM | 8148 MiB | 9474 MiB |
LLaMA-2-70B | ||
AQ1 | 67,756 MiB | 69,037 MiB |
AQ2 | 67,648 MiB | 68,820 MiB |
+QFeP | 67,651 MiB | 68,822 MiB |
+QFeM | 67,838 MiB | 68,819 MiB |
Share and Cite
Yang, J.; Kim, H.; Ji, J.; Kim, Y. Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models. Future Internet 2025, 17, 185. https://doi.org/10.3390/fi17040185