Accelerating Inference in Retrieval-Augmented Generation Models for Long-Form Question Answering via Dynamic Token Pruning
Abstract
1. Introduction
2. Related Work
2.1. Token Pruning
2.2. Efficient FiD
3. Method
3.1. Fusion-in-Decoder
3.2. Layer-Wise Pruning Network
3.3. Generation-Aware Pruning
3.4. Training Strategy
3.4.1. Pruning Rate Control
3.4.2. Cross-Attention Alignment
3.4.3. Overall Loss
3.4.4. Weight Initialization
4. Experiments
4.1. Experimental Setup
4.1.1. Dataset
4.1.2. Training Details
4.1.3. Evaluation Metrics
4.2. Results
4.3. Inference Efficiency
4.4. Case Study
5. Discussion and Limitations
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Hyperparameters
Hyperparameter | ASQA/CLAPNQ
---|---
Learning rate | 
Optimizer | AdamW
LR scheduler | Linear
Warm-up steps | 1000
Total training steps | 20,000
Per-GPU batch size | 1
Accumulation steps | 32
Effective batch size | 64
Max input length (passage) | 250
Max output length | 128
Number of passages (K) | 50
Initial retention rate | 0.9
Final retention rate | 0.3
Generation-aware retention rate | 0.05
Gumbel temperature (initial) | 1.0
Temperature retain steps | 1000
Temperature reduction steps | 2000
 | 2.0
 | 1.0
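The schedule entries above (initial/final retention rate, Gumbel temperature, retain and reduction steps) can be read together as an annealing recipe. The sketch below is a minimal illustration of one plausible realization, not the authors' implementation: it assumes linear annealing for both schedules and a final temperature of 0.1 (the final temperature is not reported in the table), and it uses PyTorch's `torch.nn.functional.gumbel_softmax` in the spirit of the Gumbel-Softmax reparameterization [19].

```python
# Minimal sketch of the appendix schedules, assuming linear annealing for both
# the retention rate and the Gumbel temperature; tau_final=0.1 is an
# assumption, as the table does not report a final temperature.
import torch
import torch.nn.functional as F

def retention_rate(step, total_steps=20_000, r_init=0.9, r_final=0.3):
    """Target fraction of tokens kept at a given training step."""
    frac = min(step / total_steps, 1.0)
    return r_init + (r_final - r_init) * frac

def gumbel_temperature(step, tau_init=1.0, tau_final=0.1,
                       retain_steps=1_000, reduce_steps=2_000):
    """Hold tau constant for retain_steps, then decay it over reduce_steps."""
    if step < retain_steps:
        return tau_init
    frac = min((step - retain_steps) / reduce_steps, 1.0)
    return tau_init + (tau_final - tau_init) * frac

def gumbel_token_gate(keep_logits, tau):
    """Differentiable keep/prune decision per token.

    keep_logits: (batch, seq_len, 2) scores for the {prune, keep} classes.
    Returns a (batch, seq_len) straight-through binary keep mask.
    """
    # hard=True yields discrete one-hot samples in the forward pass while
    # gradients flow through the soft relaxation (straight-through estimator).
    y = F.gumbel_softmax(keep_logits, tau=tau, hard=True, dim=-1)
    return y[..., 1]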
Appendix B. Error Analysis
- Incorrect Entity and Attribute Association: The proposed model often misidentified the core entity or, when the entity was correct, attributed inaccurate properties to it (e.g., numerical data or roles). This points to imprecise use of retrieved information or the generation of unsupported facts.
- Misinterpretation of Relational Information: Even when entities were correctly identified, the model sometimes misrepresented the relationships between them (e.g., familial or professional connections), distorting the context of the answer.
Question | Answers |
---|---|
Who has scored most goals in international football? | Gold: [“The record (85 goals) for the most international goals scored in men’s football is held by Iranian professional footballer Ali Daei. On the women’s side, this record (187 goals) belongs to the Canadian soccer player Christine Sinclair.”, “In total, 67 male footballers to date have managed to score at least 50 goals with their national team at senior level. Ali Daei has scored the most goals in men’s international football and is the only player to score over 100 goals in international men’s football with 109 goals. Christine Sinclair scored the most goals in international women’s football with 187 goals.”] Ours: The player who has scored the most goals in international football is Alessandro Rosa Vieira, a Czech-Austrian professional footballer. He is also the world’s all-time leading goalscorer with 159 goals. |
Who is fruma sarah in fiddler on the roof? | Gold: [’In the original 1964 Broadway musical cast of Fiddler on the Roof, the ghostly depiction of the late wife of Lazar Wolf, Fruma Sarah is played by Carol Sawyer Yussel. In the 1971 film version of the musical, Fruma Sarah is portrayed by Ruth Madoc. In the 2015–2016 Broadway Revival of Fiddler on the Roof, American singer and actress Jessica Vosk plays the character of Fruma Sarah.’, ’There are several versions of Fiddler on the Roof. The original Broadway production of the show, which opened in 1964, had the first musical theatre run in history to surpass 3000 performances. The character of Fruma Sarah, a ghostly depiction of the late wife of Lazar Wolf, is played by Carol Sawyer Yussel. In the 1971 film, Fiddler on the Roof, which theatrically released on 3 November 1971, by United Artists to critical and commercial success, Ruth Madoc played Fruma Sarah. In the 2015–2016 Broadway Revival of Fiddler on the Roof, American singer and actress Jessica Vosk played Fruma Sarah.’] Ours: Fiddler on the Roof is a 1971 American musical comedy-drama film produced and directed by Norman Jewison. It is an adaptation of the 1964 Broadway musical of same name, with music composed by Jerry Bock, lyrics by Sheldon Harnick, and screenplay by Joseph Stein, based on stories by Sholem Aleichem. The film centers on Tevye the Dairyman and his three daughters, Fruma Sarah, who rises to warn Lazar Wolf about the possibility of marriage. |
References
- Chen, D.; Fisch, A.; Weston, J.; Bordes, A. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; Barzilay, R., Kan, M.Y., Eds.; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 1870–1879.
- Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 21–23 April 2021; Merlo, P., Tiedemann, J., Tsarfaty, R., Eds.; Association for Computational Linguistics: Online, 2021; pp. 874–880.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
- Stelmakh, I.; Luan, Y.; Dhingra, B.; Chang, M.W. ASQA: Factoid Questions Meet Long-Form Answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 8273–8288.
- Rosenthal, S.; Sil, A.; Florian, R.; Roukos, S. CLAPnq: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems. Trans. Assoc. Comput. Linguist. 2025, 13, 53–72.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
- Goyal, S.; Choudhury, A.R.; Raje, S.M.; Chakaravarthy, V.T.; Sabharwal, Y.; Verma, A. PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020.
- Kim, S.; Shen, S.; Thorsley, D.; Gholami, A.; Kwon, W.; Hassoun, J.; Keutzer, K. Learned Token Pruning for Transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; Association for Computing Machinery: New York, NY, USA, 2022; KDD ’22; pp. 784–794.
- Guan, Y.; Li, Z.; Leng, J.; Lin, Z.; Guo, M. Transkimmer: Transformer Learns to Layer-wise Skim. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 7275–7286.
- Kim, Y.; Lee, S. SparseFlow: Accelerating Transformers by Sparsifying Information Flows. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 5937–5948.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186.
- Yu, D.; Zhu, C.; Fang, Y.; Yu, W.; Wang, S.; Xu, Y.; Ren, X.; Yang, Y.; Zeng, M. KG-FiD: Infusing Knowledge Graph in Fusion-in-Decoder for Open-Domain Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 4961–4974.
- de Jong, M.; Zemlyanskiy, Y.; FitzGerald, N.; Ainslie, J.; Sanghai, S.; Sha, F.; Cohen, W.W. Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023.
- de Jong, M.; Zemlyanskiy, Y.; FitzGerald, N.; Sanghai, S.; Cohen, W.W.; Ainslie, J. GLIMMER: Generalized late-interaction memory reranker. arXiv 2023, arXiv:2306.10231.
- Huang, Y.; Han, X.; Sun, M. FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 6262–6276.
- Berchansky, M.; Izsak, P.; Caciularu, A.; Dagan, I.; Wasserblat, M. Optimizing Retrieval-augmented Reader Models via Token Elimination. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 1506–1524.
- de Jong, M.; Zemlyanskiy, Y.; Ainslie, J.; FitzGerald, N.; Sanghai, S.; Sha, F.; Cohen, W. FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 11534–11547.
- Izacard, G.; Lewis, P.; Lomeli, M.; Hosseini, L.; Petroni, F.; Schick, T.; Dwivedi-Yu, J.; Joulin, A.; Riedel, S.; Grave, E. Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv 2022, arXiv:2208.03299.
- Jang, E.; Gu, S.; Poole, B. Categorical Reparameterization with Gumbel-Softmax. arXiv 2017, arXiv:1611.01144.
- Min, S.; Michael, J.; Hajishirzi, H.; Zettlemoyer, L. AmbigQA: Answering Ambiguous Open-domain Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Online, 2020; pp. 5783–5797.
- Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural Questions: A Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguist. 2019, 7, 452–466.
- Izacard, G.; Grave, E. Distilling Knowledge from Reader to Retriever for Question Answering. arXiv 2022, arXiv:2012.04584.
- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Online, 2020; pp. 6769–6781.
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81.
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv 2019, arXiv:1904.09675.
- Langedijk, A.; Mohebbi, H.; Sarti, G.; Zuidema, W.; Jumelet, J. DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 4764–4780.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
Dataset | Split | QAs | Avg. Words per Answer
---|---|---|---
ASQA | Train | 4353 | 73.3
ASQA | Dev | 948 | 64.8
CLAPNQ | Train | 1954 | 53.0
CLAPNQ | Dev | 300 | 51.7
Model | ASQA F1 | ASQA R-L | ASQA BS | ASQA TPQ | ASQA Speed | ASQA RR | CLAPNQ F1 | CLAPNQ R-L | CLAPNQ BS | CLAPNQ TPQ | CLAPNQ Speed | CLAPNQ RR
---|---|---|---|---|---|---|---|---|---|---|---|---
FiD [2] | 40.46 | 34.45 | 88.26 | 2423.02 | 1.00× | 100.00% | 30.69 | 27.68 | 86.57 | 1916.56 | 1.00× | 100.00%
FastFiD [15] | 38.91 | 34.39 | 87.78 | 1390.71 | 1.74× | 8.23% | - | - | - | - | - | -
Token Elimination [16] | 33.77 | 30.65 | 88.31 | 1496.88 | 1.62× | 10.00% | 25.00 | 20.37 | 86.46 | 993.94 | 1.93× | 10.00%
Ours | 40.39 | 34.75 | 88.69 | 1416.14 | 1.71× | 13.06% | 30.25 | 27.46 | 86.90 | 1136.37 | 1.74× | 11.52%

R-L = ROUGE-L; BS = BERTScore; TPQ = time per query; RR = retention rate.
Model | ASQA F1 | ASQA R-L | ASQA BS | ASQA TPQ | ASQA Speed | ASQA RR | CLAPNQ F1 | CLAPNQ R-L | CLAPNQ BS | CLAPNQ TPQ | CLAPNQ Speed | CLAPNQ RR
---|---|---|---|---|---|---|---|---|---|---|---|---
Ours | 40.39 | 34.75 | 88.69 | 1416.14 | 1.71× | 13.06% | 30.25 | 27.46 | 86.90 | 1136.37 | 1.74× | 11.52%
Ours w/o importance score | 38.89 | 33.61 | 88.57 | 1320.07 | 1.84× | 10.97% | 28.39 | 25.93 | 86.66 | 1096.56 | 1.74× | 10.13%
Ours w/o | 40.04 | 34.39 | 88.73 | 1356.12 | 1.79× | 11.65% | 28.97 | 26.26 | 86.69 | 1098.76 | 1.75× | 9.44%
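The R-L and BS columns correspond to ROUGE-L [24] and BERTScore [25]. As a reproducibility aid, the following sketch computes both metrics with the rouge-score and bert-score packages; the package choice and the placeholder strings are our assumptions, not the authors' evaluation script.

```python
# Minimal sketch (assumed tooling) for reproducing the R-L and BS columns.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

predictions = ["Ali Daei holds the men's record with 109 goals."]
references = ["Ali Daei has scored the most goals in men's international football."]

# ROUGE-L F-measure, averaged over the dev set in practice.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
) / len(predictions)

# BERTScore F1; absolute values depend on the underlying scoring model.
_, _, f1 = bert_score(predictions, references, lang="en")

print(f"ROUGE-L: {rouge_l:.4f}, BERTScore F1: {f1.mean().item():.4f}")
```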
Model | ASQA Tokens/sec | ASQA Peak GPU Memory (GB) | CLAPNQ Tokens/sec | CLAPNQ Peak GPU Memory (GB)
---|---|---|---|---
FiD [2] | 32.00 | 11.44 | 27.91 | 11.40
FastFiD [15] | 38.91 | 3.93 | - | -
Token Elimination [16] | 32.34 | 6.80 | 30.43 | 6.80
Ours | 43.35 | 2.39 | 37.91 | 2.23
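For context on how such numbers can be obtained, the sketch below measures generation throughput and peak GPU memory for a Hugging Face-style model such as the T5-based FiD reader. It is our assumption about the measurement procedure, not the authors' benchmark harness.

```python
# Minimal sketch (assumed harness) for measuring tokens/sec and peak GPU memory.
import time
import torch

def benchmark_generation(model, inputs, max_new_tokens=128):
    """Return (tokens_per_sec, peak_gib) for a single generate() call."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # FiD readers are T5-based encoder-decoder models, so the returned
    # sequences contain only decoder-side tokens; for decoder-only models
    # the prompt length would have to be subtracted.
    new_tokens = out.shape[0] * out.shape[-1]
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    return new_tokens / elapsed, peak_gib
```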