Article

Accelerating Inference in Retrieval-Augmented Generation Models for Long-Form Question Answering via Dynamic Token Pruning

School of Computing, Gachon University, 1342, Seongnam-daero, Sujeong-gu, Seongnam-si 13120, Republic of Korea
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(14), 2231; https://doi.org/10.3390/math13142231
Submission received: 28 May 2025 / Revised: 30 June 2025 / Accepted: 7 July 2025 / Published: 9 July 2025

Abstract

Fusion-in-Decoder (FiD), a prominent retrieval-augmented generation model, has demonstrated outstanding performance in open-domain question answering by effectively leveraging multiple passages. However, processing multiple passages significantly increases computational costs at both the encoder and decoder components. In particular, in Long-Form Question Answering (LFQA) scenarios, the decoder’s cross-attention computation scales proportionally with the length of the generated answer, severely impacting the overall inference speed. In this paper, we propose a novel dynamic token pruning mechanism to alleviate the computational bottleneck of the FiD decoder. Our method selectively identifies and removes tokens predicted to have low contributions to answer generation by jointly considering their contextual information and attention scores within the FiD encoder. The resulting pruned representations are then passed to the decoder, significantly reducing the cross-attention computations and thereby accelerating the overall inference process. Experimental evaluations on two LFQA benchmarks, ASQA and CLAPNQ, demonstrate that the proposed method achieves up to a 1.74-fold speed-up with only minimal degradation in answer quality, effectively enhancing computational efficiency compared to the original FiD model.

1. Introduction

Question answering (QA), a major task in natural language processing, aims to provide accurate and relevant answers to given questions and is widely used as an important measure for evaluating natural language understanding capabilities. In particular, open-domain question answering (ODQA), which leverages large-scale external knowledge sources such as Wikipedia to answer arbitrary questions, has become increasingly crucial due to its close relevance to real-world applications.
Typically, ODQA systems are based on a retriever–reader framework [1]. In this architecture, the retriever retrieves passages relevant to a given query from external knowledge sources, while the reader model generates an answer based on the retrieved passages. Recent advancements have enabled reader models to move beyond simply extracting answer spans; they now synthesize and reconstruct information from multiple sources to produce coherent, natural-sounding answers.
Fusion-in-Decoder (FiD) [2] is a representative generative reader model that utilizes an encoder–decoder architecture similar to T5 [3]. The FiD encoder independently processes each passage retrieved by the retriever along with the question, subsequently concatenating hidden states from all passages. The decoder then employs cross-attention on these combined hidden states to synthesize information from all retrieved passages to generate the final answer. Such a structure excels at integrating dispersed information across multiple passages, facilitating the generation of high-quality responses.
Although FiD tends to achieve better performance when incorporating a larger number of passages, this introduces substantial computational overhead for both the encoder and decoder components. Specifically, the encoder must process multiple passages in parallel, which significantly increases memory usage and computational complexity as the number of input passages rises. Furthermore, for each token generated in the answer, the decoder repetitively executes cross-attention computations over all tokens from every passage received from the encoder. This auto-regressive decoding mechanism causes the computational cost of cross-attention to grow in proportion to both the length of the generated answer and the number of input passages. Consequently, this computational inefficiency is exacerbated in Long-Form Question Answering (LFQA) tasks, which require detailed and lengthy responses, leading to slower inference speeds and limitations in real-time applicability.
To address the computational inefficiencies of FiD, particularly the excessive cross-attention computations in LFQA scenarios, we propose a novel dynamic token pruning method that selectively removes tokens predicted to contribute minimally to answer generation during the encoding stage. The proposed approach jointly leverages the FiD encoder’s hidden states and attention scores to effectively identify and prune tokens with low contributions to the QA task. By reducing the number of tokens delivered to the decoder, our approach significantly decreases the computational complexity of cross-attention operations, thereby accelerating the entire inference process. Experimental results conducted on LFQA datasets, ASQA [4] and CLAPNQ [5], demonstrate that the proposed method achieves substantial improvements in inference speed compared to the baseline FiD model while minimizing degradation in answer quality.

2. Related Work

2.1. Token Pruning

Token pruning has emerged as one of the primary approaches to improving the computational efficiency of Transformer-based [6] models, and various token pruning techniques have been proposed in recent years. For example, POWER-BERT [7] observed that, due to the self-attention operations in the encoder, token representations become increasingly similar in deeper layers, leading to redundant information. Based on this observation, POWER-BERT proposed a method that selectively removes less significant tokens using a top-k approach based on attention scores, reducing redundant information and thereby improving inference speed.
LTP [8] pointed out several limitations of POWER-BERT’s top-k-based token pruning, including additional computational overhead caused by memory sorting and swapping operations, manual tuning required for the value of k, and difficulty in adaptively adjusting k based on input length. To address these issues, LTP introduced a layer-wise learnable threshold to prune tokens through a simple comparison between attention scores and the threshold, enhancing adaptability and efficiency.
Unlike previous studies that relied on attention scores, Transkimmer [9] proposed a pruning method based on a skim predictor utilizing hidden states, thereby performing token pruning based on contextually encoded token information.
SparseFlow [10] aimed to mitigate inefficiencies arising from repeated transmission of similar information due to self-attention operations in Transformer encoders. It proposed a Mixture-of-Experts (MoE)-based token pruning method, transforming information flow from dense to sparse based on token positions.
Most token pruning studies have primarily focused on improving the computational efficiency of encoder models, such as BERT [11]. In contrast, this work extends token pruning methodologies to an encoder–decoder architecture, aiming to enhance the computational efficiency of the generative question answering model FiD [2].

2.2. Efficient FiD

Several studies have addressed the computational efficiency issues of the FiD model. For instance, KG-FiD [12] utilized knowledge graphs and graph neural networks (GNNs) to model relationships among passages relevant to a given query. It then optimized efficiency at the passage level by reranking highly relevant passages or pruning irrelevant ones, effectively reducing the number of input passages for FiD.
LUMEN [13] and GLIMMER [14] decomposed the FiD encoder into two parts: a memory encoder computed offline and a live encoder operating online. LUMEN precomputed passages offline via the memory encoder, while GLIMMER improved real-time encoder efficiency through reranking the precomputed representations.
FastFiD [15] trained an additional classifier at the final stage of the encoder to predict sentences containing answers. During inference, this classifier filtered out a subset of relevant sentences to pass to the decoder, thereby shortening the input sequence length at the sentence level and improving the computational efficiency of FiD.
Additionally, previous work [16] improved the efficiency of the FiD decoder by combining token pruning, retaining only the top-k% tokens with high attention scores during decoder cross-attention, with layer early-exit strategies to skip unnecessary computations.
Previous approaches have aimed to improve FiD efficiency from various perspectives, such as passage-level processing, sentence-level filtering, and internal computation optimization. In contrast, our study proposes a novel method that improves computational efficiency by dynamically pruning tokens within the encoder, thereby significantly reducing the volume of information passed to the decoder at the token level.

3. Method

3.1. Fusion-in-Decoder

To clarify our proposed method, we first describe the operation of the FiD [2] model.
FiD, proposed for ODQA, adopts a T5-based [3] encoder–decoder architecture. Specifically, it independently encodes multiple retrieved passages together with the given query in the encoder. The encoded representations from all passages are then concatenated into a single representation and passed to the decoder, which integrates information from all passages to generate the answer.
Given a question $q$ and a set of $K$ retrieved passages $P = \{p_1, p_2, \ldots, p_K\}$, each passage $p_i$, consisting of a title $t_i$ and a context $c_i$, is combined with the question $q$ to create $K$ input sequences $X_i$, formulated as follows: $X_i = (\text{question: } q,\ \text{title: } t_i,\ \text{context: } c_i)$, for $1 \le i \le K$. Special prefixes such as “question:”, “title:”, and “context:” are prepended to clearly delineate each segment.
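For illustration, a minimal sketch of this input construction (the function and field names are hypothetical, not part of any released code):

```python
def build_fid_inputs(question, passages):
    """Combine a question with each retrieved passage (title + context) into the
    prefixed input strings that FiD encodes independently, one per passage."""
    return [
        f"question: {question} title: {p['title']} context: {p['context']}"
        for p in passages
    ]

# Example: one question paired with K retrieved passages yields K input strings.
inputs = build_fid_inputs(
    "who has scored most goals in international football?",
    [{"title": "Ali Daei", "context": "Ali Daei is an Iranian former footballer ..."}],
)
```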
Each input sequence $X_i$ undergoes tokenization to generate token sequences consisting of $N$ tokens per passage: $X_i = \{x_1^i, \ldots, x_N^i\}$. Subsequently, each tokenized sequence $X_i$ is independently passed through the encoder, resulting in hidden states $H^i$:
$$H^i = \mathrm{Encoder}(X_i)$$
where $H^i = \{H_1^i, H_2^i, \ldots, H_N^i\}$, and $H_n^i \in \mathbb{R}^d$ represents the hidden state of the n-th token in the i-th passage from the last encoder layer. Here, $d$ denotes the dimension of the hidden states. For clarity, the hidden state of the n-th token in the i-th passage at a specific encoder layer $l$ is denoted $H_n^{i,l}$.
The resulting encoded representations of the $K$ passages, $H^1, H^2, \ldots, H^K$, are concatenated into a single long sequence $E$:
$$E = [H^1; H^2; \ldots; H^K] \in \mathbb{R}^{KN \times d}$$
This concatenated sequence E is then used as the key and value in the cross-attention layers of the decoder. The decoder generates answers auto-regressively by performing cross-attention between previously generated tokens and the sequence E.
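A minimal PyTorch sketch of this encode-then-concatenate step, assuming a HuggingFace T5 encoder (e.g., T5EncoderModel) whose output exposes last_hidden_state; shapes follow the notation above ($K$ passages of $N$ tokens, hidden size $d$):

```python
import torch

def fid_encode(encoder, input_ids, attention_mask):
    """input_ids, attention_mask: (K, N) tensors holding one question's K passages.
    Each passage is encoded independently; the resulting hidden states are then
    concatenated into a single (1, K*N, d) sequence E used as the decoder's
    cross-attention keys and values."""
    K, N = input_ids.shape
    hidden = encoder(input_ids=input_ids,
                     attention_mask=attention_mask).last_hidden_state   # (K, N, d)
    E = hidden.reshape(1, K * N, -1)              # concatenated encoder output
    E_mask = attention_mask.reshape(1, K * N)     # matching mask for cross-attention
    return E, E_mask
```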
The theoretical computational complexity (FLOPs) of the FiD model is expressed as follows:
$$\mathrm{FLOPs}_{\mathrm{FiD}} = \underbrace{O(L_e K N^2 d)}_{\text{encoder self-attention}} + \underbrace{O(L_d T^2 d)}_{\text{decoder self-attention}} + \underbrace{O(L_d T K N d)}_{\text{decoder cross-attention}}$$
where $L_e$ and $L_d$ denote the number of FiD encoder and decoder layers, respectively, $T$ denotes the length of the output sequence, and $K$, $N$, and $d$ represent the number of retrieved passages, the length of the input sequence per passage, and the hidden dimension, respectively.
However, from the perspective of actual inference latency, the decoder incurs significantly higher latency than the encoder. The encoder processes all $K$ passages simultaneously in parallel, approximating its latency as $O(L_e N^2 d)$. In contrast, the decoder sequentially generates $T$ tokens auto-regressively. During each decoding step, the decoder repeatedly performs cross-attention computations over the entire concatenated encoder output $E$. According to FiDO [17], such repeated queries to key/value vectors induce frequent memory accesses, becoming a major bottleneck for decoder latency. Consequently, the decoder latency can be represented as $O(L_d T (KN + T) d)$. Hence, the overall inference latency of the FiD model can be expressed as follows:
$$\mathrm{Latency}_{\mathrm{FiD}} \approx \underbrace{O(L_e N^2 d)}_{\text{one parallel pass}} + \underbrace{O(L_d T (T + KN) d)}_{T\ \text{serial steps}}$$
Consequently, the inference latency of the FiD model primarily stems from the decoder’s cross-attention computations, and this bottleneck is significantly exacerbated in LFQA scenarios, where the answers to be generated are long. Therefore, in this study, we propose a dynamic token pruning method applied during the encoder phase to alleviate the inference latency issue of FiD. Specifically, our proposed method aims to shorten the length of the encoder hidden states ($KN$) that are passed to the decoder’s cross-attention layers, thereby effectively reducing the overall inference time.
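As an illustrative estimate rather than a result from the paper, if only a fraction $\rho$ of the $KN$ encoder tokens is forwarded to the decoder, the dominant cross-attention term in the latency above shrinks accordingly:

```latex
\mathrm{Latency}_{\mathrm{pruned}} \;\approx\; O(L_e N^2 d) \;+\; O\!\big(L_d\, T\, (T + \rho K N)\, d\big), \qquad 0 < \rho \le 1 .
```

With the average retention rates of roughly 12% reported in Section 4.2 ($\rho \approx 0.12$), the $KN$ portion of the decoder term is reduced to about one-eighth of its original size.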

3.2. Layer-Wise Pruning Network

Figure 1 summarizes the architecture of our proposed method, which dynamically prunes tokens expected to have low contributions to answer generation, thereby accelerating FiD inference speed. The key idea is to progressively eliminate less informative tokens across encoder layers, reducing the volume of information passed to the decoder and alleviating the computational burden of the cross-attention operation.
The core component of our model is the pruning network, inserted at the output of each encoder layer. This network takes the hidden states and self-attention scores from the corresponding encoder layer as input and generates a pruning mask that determines whether each token is retained or discarded.
The pruning network utilizes two primary types of information. First, hidden states directly reflect contextually encoded semantic representations, capturing the informational content of each token. Second, attention scores indicate each token’s importance from the perspective of the target task.
Meanwhile, ATLAS [18] has noted that attention scores may overestimate the actual token contribution when the corresponding value vectors have a small norm. To address this issue, ATLAS proposed multiplying the attention score by the L2 norm of the value vectors. Following this approach, we compute token importance scores accordingly.
Specifically, at the $l$-th encoder layer, the average attention score $\tilde{\alpha}_n^{i,l}$ and the average L2 norm of the value vectors $\|\tilde{v}_n^{i,l}\|_2$ for the n-th token of the i-th passage are calculated as follows:
$$\tilde{\alpha}_n^{i,l} = \frac{1}{N_q N_h} \sum_{m=1}^{N_q} \sum_{h=1}^{N_h} \alpha_{mn}^{i,l,h}$$
$$\|\tilde{v}_n^{i,l}\|_2 = \frac{1}{N_h} \sum_{h=1}^{N_h} \|v_n^{i,l,h}\|_2$$
where $\alpha_{mn}^{i,l,h}$ is the attention score between the m-th query token and the n-th key token in the h-th attention head at the l-th encoder layer for the i-th passage, $N_h$ is the number of attention heads, and $N_q$ denotes the length of the query sequence. $\|v_n^{i,l,h}\|_2$ represents the L2 norm of the value vector for the n-th token in the h-th head at the l-th encoder layer for the i-th passage.
Finally, following the ATLAS approach, the token importance score $\gamma_n^{i,l}$ of the n-th token at the l-th layer for the i-th passage is computed by multiplying the previously calculated average attention score $\tilde{\alpha}_n^{i,l}$ (Equation (5)) by the average L2 norm of the corresponding value vectors $\|\tilde{v}_n^{i,l}\|_2$ (Equation (6)):
$$\gamma_n^{i,l} = \tilde{\alpha}_n^{i,l} \cdot \|\tilde{v}_n^{i,l}\|_2$$
This token importance score $\gamma_n^{i,l}$ is used together with the hidden state to decide whether each token is pruned.
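A minimal PyTorch sketch of Equations (5)–(7) for a single passage at a single encoder layer, assuming the attention probabilities and value vectors of that layer have been cached (tensor names are illustrative):

```python
import torch

def token_importance(attn_probs, value_vectors):
    """attn_probs:    (N_h, N_q, N_k) attention probabilities of one passage at one
                      encoder layer (heads x query tokens x key tokens).
       value_vectors: (N_h, N_k, d_head) per-head value vectors at that layer.
       Returns a (N_k,) importance score per token: the attention weight averaged
       over queries and heads (Eq. 5), scaled by the mean L2 norm of the token's
       value vectors (Eq. 6), following the ATLAS-style correction (Eq. 7)."""
    avg_attn = attn_probs.mean(dim=(0, 1))               # (N_k,)
    avg_value_norm = value_vectors.norm(dim=-1).mean(0)  # (N_k,)
    return avg_attn * avg_value_norm
```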
The pruning network consists of a layer normalization operation followed by a single linear layer. The pruning network at the $l$-th layer, $\mathrm{PruningNetwork}^{(l)}$, receives the hidden state $H_n^{i,l}$ of each token and outputs a pruning score $z_n^{i,l}$, determining whether the token should be retained:
$$z_n^{i,l} = \mathrm{PruningNetwork}^{(l)}(H_n^{i,l}) = \mathrm{Linear}(\mathrm{LayerNorm}(H_n^{i,l}))$$
Here, $z_n^{i,l}$ is a 2-dimensional vector, $z_n^{i,l} = [z_{n,0}^{i,l}, z_{n,1}^{i,l}]$, where $z_{n,0}^{i,l}$ represents the score for discarding the token and $z_{n,1}^{i,l}$ the score for retaining it. The previously computed token importance score $\gamma_n^{i,l}$ (Equation (7)) is then added as a bias to compute the final pruning score $\tilde{z}_n^{i,l} \in \mathbb{R}^2$:
$$\tilde{z}_n^{i,l} = z_n^{i,l} + [-\gamma_n^{i,l}, +\gamma_n^{i,l}]$$
By subtracting and adding the token importance score $\gamma_n^{i,l}$ to the discard and retain scores, respectively, the pruning network is biased towards preserving tokens with higher importance.
Furthermore, since preserving query information is critical in QA tasks, tokens corresponding to the “question:” prefix in the input sequence $X_i$ are excluded from pruning and always retained.
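A minimal sketch of the pruning network in Equations (8) and (9); the module and parameter names are assumptions, and the rule that tokens of the “question:” prefix are always retained is not shown:

```python
import torch
import torch.nn as nn

class PruningNetwork(nn.Module):
    """LayerNorm + a single linear layer producing a 2-dim score [discard, keep]
    per token (Eq. 8), biased by the attention-based importance score (Eq. 9)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.linear = nn.Linear(d_model, 2)

    def forward(self, hidden, importance):
        # hidden: (..., N, d) token hidden states; importance: (..., N) scores gamma
        z = self.linear(self.norm(hidden))                        # Eq. (8)
        bias = torch.stack((-importance, importance), dim=-1)     # [-gamma, +gamma]
        return z + bias                                           # Eq. (9)
```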
Generating discrete pruning masks from the pruning scores $\tilde{z}_n^{i,l}$ typically complicates gradient-based optimization via backpropagation. To address this challenge, we employ the Gumbel–Softmax trick [19], which approximates the binary pruning mask $m_n^{i,l}$, indicating token retention or removal, as a continuous and differentiable operation:
$$m_n^{i,l} = \text{Gumbel-Softmax}(\tilde{z}_n^{i,l})$$
This approximation enables gradient flow during training, allowing end-to-end optimization of the pruning network parameters. Here, $m_n^{i,l} = [m_{n,0}^{i,l}, m_{n,1}^{i,l}]$, where $m_{n,0}^{i,l}$ and $m_{n,1}^{i,l}$ represent the probabilities of discarding and retaining the token, respectively.
To ensure that a token, once pruned, remains pruned in subsequent layers, we apply a cumulative pruning mask $M_n^{i,l}$ for each token:
$$M_n^{i,l} = M_n^{i,l-1} \odot m_n^{i,l}$$
This cumulative masking strategy effectively prevents reactivation of previously pruned tokens. The initial cumulative mask $M_n^{i,0}$ is set to $[0, 1]$ for all tokens, indicating that they are all initially retained.
To ensure the effects of pruning, the cumulative mask $M_n^{i,l}$ is utilized in two ways. First, it is directly applied to the output hidden state $H_n^{i,l}$, setting the representations of pruned tokens to zero:
$$\hat{H}_n^{i,l} = H_n^{i,l} \cdot M_{n,1}^{i,l}$$
The masked hidden state $\hat{H}_n^{i,l}$ is then passed as input to the next layer.
Second, the cumulative pruning mask $M_n^{i,l}$ acts directly within the self-attention operation of the next encoder layer $l+1$, ensuring that attention weights assigned to pruned tokens become zero. This prevents unnecessary information flow from tokens identified as irrelevant.
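The mask generation and propagation in Equations (10)–(12) might then be sketched as follows (the temperature and the hard flag are assumptions; during training the soft, differentiable mask is used):

```python
import torch
import torch.nn.functional as F

def update_pruning_mask(z_tilde, cumulative_mask, tau=1.0, hard=False):
    """z_tilde:         (..., N, 2) importance-biased pruning scores (Eq. 9).
       cumulative_mask: (..., N, 2) cumulative mask from the previous layer,
                        initialised to [0, 1] (all tokens retained).
       Returns the layer mask (Eq. 10) and the updated cumulative mask (Eq. 11)."""
    m = F.gumbel_softmax(z_tilde, tau=tau, hard=hard)  # Eq. (10): [drop, keep] per token
    M = cumulative_mask * m                            # Eq. (11): once pruned, stays pruned
    return m, M

def apply_mask(hidden, M):
    """Zero out the representations of pruned tokens (Eq. 12); the keep entry
    M[..., 1] is broadcast over the hidden dimension. The same keep entries are
    also used to mask the next layer's self-attention weights."""
    return hidden * M[..., 1:]
```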

3.3. Generation-Aware Pruning

The layer-wise pruning approach described earlier operates primarily based on the local importance of tokens and the internal information flow within the encoder. However, since the FiD encoder independently processes each (question $q$, passage $p_i$) pair $X_i$, it inherently struggles to effectively identify tokens from passages with less direct relevance to the question or to incorporate a global perspective for answer generation that spans all passages.
To address these limitations and incorporate the decoder’s perspective into pruning decisions, we introduce an additional generation-aware pruning network, $\mathrm{PruningNetwork}^{(L)}$, positioned atop the final encoder layer. Although structurally similar to the layer-wise pruning networks, this network differs mainly in how it computes token importance and in the scope of its operations. Specifically, to emulate the decoder’s focus at the onset of answer generation (i.e., when processing the first token, typically the <BOS> token), the network performs a multi-head cross-attention operation. While the layer-wise pruning computations (importance scores, pruning scores, and mask generation) are performed locally within each passage and each encoder layer, the generation-aware pruning operations are executed globally, once, on the concatenated encoder sequence.
In detail, we first obtain the last encoder output sequence after cumulative layer-wise pruning, $\hat{E} = [\hat{H}^1; \ldots; \hat{H}^K] \in \mathbb{R}^{KN \times d}$. We then mimic the decoder’s initial cross-attention step by performing multi-head cross-attention using the embedding of the <BOS> token, $e_{BOS}$, as the query and the concatenated encoder representations $\hat{E}$ as both the key and value.
Based on the resulting cross-attention (attention scores and value vectors), we calculate a generation-aware importance score $\gamma_n^{BOS}$ for each token $n$ in the concatenated sequence $\hat{E}$. This score is calculated by adapting the principles outlined in Equations (5)–(7) to a global, single-pass context: the attention is computed with respect to the single <BOS> token as the query, and the operations are performed over the entire concatenated sequence. Since this process occurs only once, at the generation-aware pruning network $\mathrm{PruningNetwork}^{(L)}$, the passage index $i$ and layer index $l$ used in the preceding layer-wise pruning equations are omitted here.
In this way, $\mathrm{PruningNetwork}^{(L)}$ evaluates token importance based not only on internal encoder information but also on the decoder’s perspective at the beginning of generation.
Subsequently, for each token $n$ in the concatenated sequence (where $1 \le n \le KN$), its representation $\hat{E}_n$ and corresponding importance score $\gamma_n^{BOS}$ are used as inputs to the generation-aware pruning network, $\mathrm{PruningNetwork}^{(L)}$. Following the score calculation and biasing described in Equations (8) and (9), a final pruning score $\tilde{z}_n^{(L)}$ is computed. The Gumbel–Softmax function, as in Equation (10), is then applied to these scores to produce a binary pruning mask $m_n^{(L)}$ for each token.
This generated pruning mask is combined with the cumulative mask from the previous stage, $M^{(L-1)}$, to form the final pruning mask $M^{(L)}$, following the principle of Equation (11).
To produce the final encoder representations for the decoder, the final pruning mask is applied to the concatenated sequence $\hat{E}$, following the principle of Equation (12), resulting in the final encoder representations $\hat{E}_{keep}$.
During training, soft pruning is employed, passing both the final encoder representations, $\hat{E}_{keep}$, and the final pruning mask, $M^{(L)}$, to the decoder. In contrast, during inference, hard pruning is applied, effectively transmitting a shortened sequence representation to the decoder, thereby significantly accelerating inference speed.
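A condensed sketch of this generation-aware step, using a standard multi-head attention module as the <BOS> probe; the value-norm scaling of Equations (6) and (7) is omitted for brevity, and all module and parameter names are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerationAwarePruning(nn.Module):
    """Probe the concatenated, layer-wise-pruned encoder output E_hat with the
    <BOS> embedding and emit a final [drop, keep] mask over all K*N tokens."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.probe_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.score = nn.Linear(d_model, 2)   # [discard, keep] scores per token

    def forward(self, bos_embedding, E_hat, tau=1.0):
        # bos_embedding: (1, 1, d);  E_hat: (1, K*N, d)
        _, attn = self.probe_attn(bos_embedding, E_hat, E_hat, need_weights=True)
        gamma = attn.squeeze(0).squeeze(0)                    # (K*N,) probe importance
        z = self.score(self.norm(E_hat)).squeeze(0)           # (K*N, 2) raw scores
        z_tilde = z + torch.stack((-gamma, gamma), dim=-1)    # importance bias, as in Eq. (9)
        return F.gumbel_softmax(z_tilde, tau=tau)             # per-token [drop, keep] mask
```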

3.4. Training Strategy

To ensure that the proposed dynamic token pruning mechanism functions as intended and effectively integrates with the entire FiD [2] model during training, we employ the following training strategies.

3.4.1. Pruning Rate Control

To prevent situations where either too many tokens are pruned, causing critical information loss, or too few tokens are pruned, yielding limited efficiency improvement, we introduce a Pruning Rate Loss ($\mathcal{L}_{rate}$). We set a target token retention rate $r^{(s)}$ as a hyperparameter for each pruning step $s$, comprising $L_e$ layer-wise pruning steps and one generation-aware pruning step, for a total of $S = L_e + 1$ steps. The actual token retention rate at step $s$ is calculated as follows:
$$p_{keep}^{(s)} = \frac{1}{KN} \sum_{n=1}^{KN} M_{n,1}^{(s)}$$
where $K$ is the number of passages, $N$ is the number of input tokens per passage, and thus $KN$ is the total number of input tokens. $M_{n,1}^{(s)}$ denotes the effective pruning mask (1 if retained, 0 if pruned) for the n-th token at pruning step $s$, considering all passages as a single concatenated sequence. We then compute the Mean Squared Error (MSE) between the actual retention rate $p_{keep}^{(s)}$ and the target rate $r^{(s)}$, defined as follows:
$$\mathcal{L}_{rate} = \frac{1}{S} \sum_{s=1}^{S} \left(p_{keep}^{(s)} - r^{(s)}\right)^2$$
This loss term guides the model towards maintaining the desired pruning proportion at each step.
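A minimal sketch of this loss, assuming the cumulative masks of every pruning step have been collected into a list (names are illustrative):

```python
import torch

def pruning_rate_loss(cumulative_masks, target_rates):
    """cumulative_masks: list of S tensors of shape (K*N, 2), one per pruning step,
                         whose second column holds the keep probability per token.
       target_rates:     list of S target retention rates r^(s).
       Returns the mean squared error between actual and target retention rates."""
    losses = []
    for M, r in zip(cumulative_masks, target_rates):
        p_keep = M[:, 1].mean()            # actual fraction of tokens retained
        losses.append((p_keep - r) ** 2)
    return torch.stack(losses).mean()
```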

3.4.2. Cross-Attention Alignment

To ensure that the generation-aware pruning step’s token importance scores $\gamma_n^{BOS}$ closely align with the decoder’s actual importance scores $\gamma_n^{dec}$, computed at the beginning of answer generation, we introduce a Cross-Attention Alignment Loss ($\mathcal{L}_{KL}$). This loss encourages the pruning criteria at the generation-aware pruning step to align closely with the decoder’s information requirements, thus minimizing unnecessary information loss and enabling more effective pruning. Specifically, we measure the difference between these two distributions using the Kullback–Leibler Divergence (KL Divergence):
$$\mathcal{L}_{KL} = \frac{1}{KN} \sum_{n=1}^{KN} D_{KL}\!\left(\mathrm{stopgrad}(\gamma_n^{dec}) \,\middle\|\, \gamma_n^{BOS}\right)$$
Here, $\gamma_n^{dec}$ represents token importance scores derived from the decoder’s cross-attention distribution during the initial generation step (i.e., using the <BOS> token as a query). These scores are computed following the principle of Equation (7) and are subsequently averaged over all decoder layers. Since $\gamma_n^{dec}$ serves as a fixed target distribution, we apply the stop-gradient operation to prevent gradient flow into this tensor.
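A minimal sketch of this alignment loss, assuming the two importance-score vectors over the $KN$ tokens are non-negative; normalizing them into distributions before the KL term is an implementation assumption:

```python
import torch
import torch.nn.functional as F

def cross_attention_alignment_loss(gamma_bos, gamma_dec, eps=1e-8):
    """gamma_bos: (K*N,) importance scores from the generation-aware pruning probe.
       gamma_dec: (K*N,) decoder-side importance scores at the first generation step,
                  treated as a fixed target (detach acts as the stop-gradient).
       Returns KL(target || predicted) between the normalized distributions."""
    p = gamma_dec.detach().clamp_min(eps)
    p = p / p.sum()                                # fixed target distribution
    q = gamma_bos.clamp_min(eps)
    q = q / q.sum()                                # distribution from the pruning probe
    return F.kl_div(q.log(), p, reduction="sum")   # = sum_n p_n * (log p_n - log q_n)
```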

3.4.3. Overall Loss

The total loss function $\mathcal{L}_{total}$, which serves as the final objective for model training, is constructed as a weighted sum of the primary QA loss $\mathcal{L}_{QA}$ and the auxiliary pruning-related losses, $\mathcal{L}_{rate}$ and $\mathcal{L}_{KL}$. We introduce hyperparameters $\lambda_{rate}$ and $\lambda_{KL}$ to balance the relative contributions of these auxiliary losses:
$$\mathcal{L}_{total} = \mathcal{L}_{QA} + \lambda_{rate}\,\mathcal{L}_{rate} + \lambda_{KL}\,\mathcal{L}_{KL}$$
The QA loss $\mathcal{L}_{QA}$ is typically defined as the Negative Log-Likelihood (NLL) over all tokens in the target answer sequence:
$$\mathcal{L}_{QA} = -\sum_{t=1}^{T_{ans}} \log P_\theta\!\left(y_t \mid y_{<t}, \hat{E}_{keep}\right)$$
Here, $T_{ans}$ denotes the answer length, $y_t$ is the t-th answer token, $y_{<t}$ represents the previously generated tokens before step $t$, and $\hat{E}_{keep}$ denotes the encoder outputs after applying the final pruning mask $M^{(L)}$.
The full set of parameters $\theta$, including both the original FiD parameters and the newly introduced pruning network parameters, is jointly optimized in an end-to-end manner to minimize the total loss $\mathcal{L}_{total}$.

3.4.4. Weight Initialization

For stable initial training, we follow a similar strategy to Transkimmer [9] for pruning network initialization. Randomly initialized pruning networks may excessively prune tokens at early training stages, causing instability. In our preliminary experiments, this caused the model to be fed with mostly empty or meaningless inputs, and as a consequence, the training loss failed to converge.
To mitigate this, we initialize the pruning network parameters from a normal distribution $\mathcal{N}(0, \sigma^2)$ with mean 0 and small variance $\sigma^2$. Additionally, we impose a large positive initial bias (e.g., $b_{keep} = +5$) on the score representing token retention and a large negative initial bias (e.g., $b_{discard} = -5$) on the token discard score. This initialization strategy encourages the model to retain most tokens at the beginning of training, facilitating gradual learning of effective pruning behaviors.
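A minimal sketch of this keep-biased initialization (the bias magnitudes follow the ±5 values given above; sigma = 0.02 is an assumed "small variance", and treating index 1 as the keep score mirrors Equation (9)):

```python
import torch
import torch.nn as nn

def init_pruning_head(linear: nn.Linear, sigma=0.02, bias_keep=5.0, bias_discard=-5.0):
    """Initialise the final linear layer of a pruning network so that nearly all
    tokens are kept at the start of training (Section 3.4.4)."""
    nn.init.normal_(linear.weight, mean=0.0, std=sigma)
    with torch.no_grad():
        linear.bias[0] = bias_discard   # score for discarding a token
        linear.bias[1] = bias_keep      # score for retaining a token
    return linear
```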
By integrating these methodological components, the proposed dynamic token pruning mechanism enables FiD to effectively identify and prune tokens at the encoder stage, considering both the characteristics of the target task and anticipated information demands of the decoder. As a result, our method significantly enhances computational efficiency in the decoder, thus accelerating inference speed, especially in computationally intensive LFQA scenarios.

4. Experiments

4.1. Experimental Setup

4.1.1. Dataset

To evaluate both the effectiveness and computational efficiency of our proposed method in LFQA, we conducted experiments on two benchmark datasets: ASQA [4] and CLAPNQ [5]. Data statistics are specified in Table 1.
ASQA pairs the ambiguous factoid questions of AmbigQA [20] with crowdsourced, paragraph-length reference answers. AmbigQA goes beyond the classical Who/When/Where interrogatives and introduces additional sources of ambiguity—namely event references, properties, entity references, answer types, temporal dependence, and multiple sub-questions—so each item demands a comprehensive, multi-faceted answer.
CLAPNQ was created to benchmark retrieval-augmented generation (RAG) systems. Drawing from Natural Questions (NQ) [21], its curators filtered out queries with a short-form answer and kept only those that demand long answers. Each remaining question was paired with a long answer and a single gold passage, while a balanced set of unanswerable queries was deliberately included. In addition to the common What and Where prompts, the dataset introduces boolean, conjunctive, descriptive, and explanatory questions—each designed to elicit extended answers.
Since neither dataset officially provides a test set, all model evaluations were conducted using the provided validation sets, following prior work.

4.1.2. Training Details

For the baseline model, we employed a Fusion-in-Decoder (FiD) [2] initialized from the pretrained “t5-base” checkpoint, available via the HuggingFace Transformers library. In the retrieval phase, we adopted the FiD-KD [22] retriever, which leverages a Dense Passage Retriever [23] model pretrained on the NQ dataset. Due to GPU memory constraints, the number of passages retrieved per question was limited to a maximum of 50.
The proposed dynamic token pruning method was integrated into the same “t5-base”-based FiD architecture and trained under identical retrieval conditions. Training-related hyperparameters such as the learning rate, batch size, pruning-related loss weights ($\lambda_{rate}$ and $\lambda_{KL}$), and target pruning rates $r^{(s)}$ were selected based on tuning conducted on subsets of each dataset’s validation set. Detailed hyperparameter configurations are provided in Appendix A. All training experiments were conducted on a single machine equipped with either two NVIDIA TITAN RTX GPUs or two NVIDIA GeForce RTX 3090 GPUs. Evaluation was performed under identical conditions on a machine equipped with two NVIDIA GeForce RTX 3090 GPUs.

4.1.3. Evaluation Metrics

Model evaluation was carried out from two main perspectives: answer quality and inference efficiency.
To measure answer quality, we adopted standard automatic evaluation metrics used in LFQA: ROUGE-L [24] and token-level F1 scores. The ROUGE-L metric computes an F1 score based on the longest common subsequence (LCS) between the generated answer and the reference answer. Additionally, we report token-level F1 scores, which reflect the harmonic mean of token-level precision and recall between the generated answer and the reference. In addition to these n-gram based metrics, we also report BERTScore [25], which evaluates the semantic similarity between the generated and reference answers using contextual embeddings. This allows for a more robust evaluation of answer quality by capturing meaning beyond simple lexical overlap.
For evaluating inference efficiency, we measured the average time required to generate answers for individual questions using the Time Per Question (TPQ) metric. TPQ was calculated in milliseconds (ms) per question, with a batch size set to 1. Furthermore, we evaluated the retention rate (RR), which represents the average ratio of tokens remaining after pruning in the encoder, calculated across the entire evaluation dataset.

4.2. Results

Table 2 shows the experimental results on the ASQA and CLAPNQ LFQA datasets. Compared to the FiD baseline, our proposed dynamic token pruning method changed answer quality only marginally: on ASQA, F1 decreased by 0.07 points while ROUGE-L increased by 0.30 points, and on CLAPNQ, F1 and ROUGE-L decreased by 0.44 and 0.22 points, respectively. In terms of inference speed, our method achieved up to a 1.71× improvement on ASQA and up to a 1.74× improvement on CLAPNQ over the FiD baseline. Analyzing the actual token retention rates, we observed that on average only about 12% of the tokens remained after pruning. Compared to prior FiD efficiency approaches [15,16], our method achieves comparable inference acceleration while significantly reducing the degradation in answer quality. Moreover, our method can be trained in an end-to-end manner without relying on any short-answer supervision. These results confirm that our dynamic token pruning effectively reduces inference costs by dynamically pruning tokens irrelevant to QA while preserving answer quality.
Table 3 illustrates ablation experiments conducted to analyze the influence of input information provided to the pruning network and the auxiliary loss functions. Specifically, we compared two pruning methods: one similar to Transkimmer [9], which utilizes only token hidden states, and our proposed method, which integrates attention-based importance scores as a bias. The experimental results confirmed that leveraging attention-based importance scores as biases significantly improved QA performance compared to using hidden-state information alone. Unlike LTP [8], which applies a learnable threshold directly to raw attention, we first translate each token’s average attention weight into a dynamic importance bias and add it to the hidden-state based pruning scores; this fusion retains high-importance tokens and prevents information loss. This result suggests that attention-based importance scores provide essential cues for preserving crucial information and identifying core tokens.
To evaluate the impact of the auxiliary loss functions, we performed experiments excluding the Cross-Attention Alignment Loss $\mathcal{L}_{KL}$. The results indicated that omitting $\mathcal{L}_{KL}$, which is designed to align the encoder-side probe cross-attention distribution at the final pruning stage with the decoder’s actual cross-attention distribution, degraded the model’s ability to clearly differentiate essential tokens from non-essential ones, resulting in decreased QA performance.

4.3. Inference Efficiency

Table 4 compares the efficiency of the proposed method in terms of inference throughput and peak GPU memory. Our dynamic token pruning scheme generates roughly 10 more tokens per second than FiD on both datasets, giving it the highest throughput among all baselines. It is also the most memory-efficient: its peak GPU usage is about one-fifth that of FiD, outperforming every other system in both speed and memory consumption.

4.4. Case Study

Figure 2 shows the actual tokens retained after applying our proposed pruning approach on the ASQA dataset. Figure 2a shows the results on a gold passage, clearly indicating retention of tokens directly relevant to QA, such as ‘playing’, ‘Charlie’, and ‘Always’. Conversely, Figure 2b illustrates fewer tokens retained from a negative passage. This highlights that our dynamic pruning method effectively preserves tokens directly relevant to QA, while successfully filtering out tokens from irrelevant passages. Consequently, our method demonstrates its effectiveness in preemptively eliminating irrelevant tokens at the encoder stage, maintaining QA performance, and accelerating inference speed.

5. Discussion and Limitations

We initially hypothesized that tokens would be gradually pruned as they sequentially passed through the encoder layers, ultimately retaining only tokens directly relevant to the QA task. However, the experimental results depicted in Figure 3 showed an unexpected pattern, with pruning primarily concentrated near the top encoder layers. This observation aligns with hierarchical information processing patterns reported in encoder–decoder interpretability studies such as DecoderLens [26]. According to such studies, the encoder follows a hierarchical pattern in which lower layers process superficial linguistic features, middle layers focus on local semantic information, and higher layers integrate fine-grained, task-specific representations. Given this, middle encoder layers might be limited in clearly distinguishing between essential and non-essential tokens from the final QA perspective. Consequently, the model appears to adopt a strategy of retaining most tokens in early to middle layers to minimize the risk of information loss while selectively removing unnecessary tokens at higher layers once information has been sufficiently refined and integrated. This behavior represents a reasonable trade-off learned by the model, effectively reducing the decoder’s cross-attention computational cost while maintaining QA performance.
Nevertheless, our study presents several limitations. The introduced pruning network and additional cross-attention computations result in a modest increase in model parameters and training time. Additionally, contrary to our intended design, the actual token pruning predominantly occurred at the top encoder layers. This finding implies that the exact form of the target-retention decay schedule (e.g., exponential vs. linear) may be a secondary factor; once the model learns its top-layer-focused pruning strategy, any reasonable, monotonically decreasing schedule could yield similar results. Therefore, the primary challenge is tuning the more influential hyperparameters, specifically where to insert pruning layers and what task-specific target-retention rates to set, in order to strike a better balance between efficiency and performance. Understanding the interaction among these factors remains an important avenue for future research.

6. Conclusions

In this paper, we proposed a novel dynamic token pruning approach designed to alleviate the high computational cost problem faced by decoder modules of retrieval-augmented generation models, specifically FiD [2], within the LFQA setting. The proposed methodology leverages contextual token representations and attention-based importance scores at the encoder stage to preemptively identify and prune tokens with low contributions to answer generation. Our method is seamlessly integrated into FiD and trained in an end-to-end manner alongside all model parameters. This encoder-level dynamic token pruning effectively reduces the volume of information passed to the decoder, thereby significantly accelerating inference speed.
Experimental evaluations conducted on two LFQA benchmark datasets, ASQA [4] and CLAPNQ [5], demonstrated that our dynamic token pruning method substantially improves computational efficiency while minimally affecting the answer generation quality compared to the original FiD model. These results indicate that our proposed approach effectively alleviates the computational bottleneck in the decoder, maintains minimal information loss, and can even lead to slight performance gains. Furthermore, the generalizability and end-to-end trainability of our token pruning approach suggest its potential applicability beyond T5-based [3] architectures and LFQA tasks, extending to diverse natural language generation tasks involving multiple-input processing such as document summarization and dialogue generation.

Author Contributions

Conceptualization, W.K. and G.K.; methodology, W.K.; software, W.K.; writing—original draft preparation, W.K.; writing—review and editing, G.K. and S.K.; visualization, W.K.; supervision, G.K. and S.K.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A2C1005316) and in part by the Gachon University research fund of 2024 (GCU-202400460001).

Data Availability Statement

Our code is available at https://github.com/kws9208/dynamic_token_pruning (accessed on 28 May 2025). Publicly available datasets were used in this study: CLAPNQ Dataset [5] (https://github.com/primeqa/clapnq/tree/main/annotated_data, accessed on 28 May 2025); ASQA Dataset [4] (https://storage.googleapis.com/gresearch/ASQA/data/ASQA.json, accessed on 28 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Hyperparameter

The key hyperparameters used in our experiments are summarized in Table A1. All models were trained using the AdamW optimizer [27], and a linear scheduler was employed for learning rate scheduling. Training was conducted for a total of 20,000 steps. When constructing input sequences, the maximum token length per passage was limited to 250, and the maximum length of the generated answer was restricted to 128 tokens.
Table A1. Hyperparameter configuration for experiments.

Hyperparameter | ASQA/CLAPNQ
Learning rate | 1 × 10^-4
Optimizer | AdamW
LR scheduler | Linear
Warm-up steps | 1000
Total training steps | 20,000
Per-GPU batch size | 1
Accumulation steps | 32
Effective batch size | 64
Max input length (passage) | 250
Max output length | 128
Number of passages ($K$) | 50
Initial retention rate ($r_{init}$) | 0.9
Final retention rate ($r_{final}$) | 0.3
Generation-aware retention rate ($r^{(S)}$) | 0.05
Gumbel temperature ($\tau$, initial) | 1.0
Temp. retain steps | 1000
Temp. reducing steps | 2000
$\lambda_{rate}$ | 2.0
$\lambda_{KL}$ | 1.0
For training the proposed dynamic token pruning network, we set the target retention rate $r^{(s)}$ at each pruning step $s$ (over a total of $S = L_e + 1$ steps) to gradually decrease from an initial retention rate $r_{init}$ to a final retention rate $r_{final}$. This gradual reduction strategy preserves most tokens in the initial layers, facilitating stable training, and progressively enables the model to prune unnecessary tokens more aggressively at later layers. Specifically, an exponential decay was applied for the layer-wise pruning steps $s = 1, \ldots, L_e$:
$$r^{(s)} = r_{init} \cdot \beta^{\,s-1}, \qquad \text{where } \beta = \left(\frac{r_{final}}{r_{init}}\right)^{\frac{1}{L_e - 1}}$$
Through this approach, we encouraged increasingly aggressive pruning as the network progressed deeper into the encoder. The target retention rate for the final generation-aware pruning step, $r^{(S)}$, was set to 5%, ensuring that the model retains only the minimum essential information required for accurate QA generation. Additionally, the weights of the pruning-related auxiliary loss functions were set to $\lambda_{rate} = 2$ and $\lambda_{KL} = 1$, respectively.
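A small sketch of this schedule, assuming $L_e = 12$ encoder layers for the t5-base backbone and the values from Table A1:

```python
def target_retention_rates(r_init=0.9, r_final=0.3, num_layers=12, r_gen_aware=0.05):
    """Exponentially decaying target retention rate for each layer-wise pruning
    step (r^(1) = r_init, r^(L_e) = r_final), plus the fixed target for the
    final generation-aware step."""
    beta = (r_final / r_init) ** (1.0 / (num_layers - 1))
    layer_rates = [r_init * beta ** (s - 1) for s in range(1, num_layers + 1)]
    return layer_rates + [r_gen_aware]
```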
To enhance the stability of training the pruning network, we applied temperature scheduling for Gumbel–Softmax. Specifically, the temperature τ was initialized at 1.0, linearly decreased from step 1000 to step 3000 down to 0.1, and maintained at 0.1 thereafter. This schedule encourages exploration in the early stages of training and promotes increasingly deterministic pruning decisions as training progresses.
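A small sketch of this temperature schedule (step counts and values follow Table A1):

```python
def gumbel_temperature(step, tau_init=1.0, tau_final=0.1,
                       retain_steps=1000, reduce_steps=2000):
    """Hold tau at tau_init for the first `retain_steps`, decay it linearly to
    tau_final over the next `reduce_steps`, then keep it at tau_final."""
    if step <= retain_steps:
        return tau_init
    progress = min((step - retain_steps) / reduce_steps, 1.0)
    return tau_init + progress * (tau_final - tau_init)
```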

Appendix B. Error Analysis

Table A2 presents the representative cases uncovered in our error analysis. Our qualitative error analysis revealed that the proposed model’s generated answers frequently suffered from factual inaccuracies, significantly impacting their reliability. These issues primarily fall into two categories:
  • Incorrect Entity and Attribute Association: The proposed model often misidentified the core entity or provided inaccurate attributes (e.g., numerical data or roles) for the correct one. This suggests issues with precise information retrieval or the generation of unsupported facts.
  • Misinterpretation of Relational Information: Even when entities were correctly identified, the proposed model sometimes failed to accurately represent the relationships between them (e.g., familial or professional connections), leading to distorted contextual understanding.
Table A2. Error analysis.
Question: Who has scored most goals in international football?
Gold: [“The record (85 goals) for the most international goals scored in men’s football is held by Iranian professional footballer Ali Daei. On the women’s side, this record (187 goals) belongs to the Canadian soccer player Christine Sinclair.”, “In total, 67 male footballers to date have managed to score at least 50 goals with their national team at senior level. Ali Daei has scored the most goals in men’s international football and is the only player to score over 100 goals in international men’s football with 109 goals. Christine Sinclair scored the most goals in international women’s football with 187 goals.”]
Ours: The player who has scored the most goals in international football is Alessandro Rosa Vieira, a Czech-Austrian professional footballer. He is also the world’s all-time leading goalscorer with 159 goals.

Question: Who is fruma sarah in fiddler on the roof?
Gold: [’In the original 1964 Broadway musical cast of Fiddler on the Roof, the ghostly depiction of the late wife of Lazar Wolf, Fruma Sarah is played by Carol Sawyer Yussel. In the 1971 film version of the musical, Fruma Sarah is portrayed by Ruth Madoc. In the 2015–2016 Broadway Revival of Fiddler on the Roof, American singer and actress Jessica Vosk plays the character of Fruma Sarah.’, ’There are several versions of Fiddler on the Roof. The original Broadway production of the show, which opened in 1964, had the first musical theatre run in history to surpass 3000 performances. The character of Fruma Sarah, a ghostly depiction of the late wife of Lazar Wolf, is played by Carol Sawyer Yussel. In the 1971 film, Fiddler on the Roof, which theatrically released on 3 November 1971, by United Artists to critical and commercial success, Ruth Madoc played Fruma Sarah. In the 2015–2016 Broadway Revival of Fiddler on the Roof, American singer and actress Jessica Vosk played Fruma Sarah.’]
Ours: Fiddler on the Roof is a 1971 American musical comedy-drama film produced and directed by Norman Jewison. It is an adaptation of the 1964 Broadway musical of same name, with music composed by Jerry Bock, lyrics by Sheldon Harnick, and screenplay by Joseph Stein, based on stories by Sholem Aleichem. The film centers on Tevye the Dairyman and his three daughters, Fruma Sarah, who rises to warn Lazar Wolf about the possibility of marriage.

References

  1. Chen, D.; Fisch, A.; Weston, J.; Bordes, A. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; Barzilay, R., Kan, M.Y., Eds.; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 1870–1879. [Google Scholar] [CrossRef]
  2. Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 21–23 April 2021; Merlo, P., Tiedemann, J., Tsarfaty, R., Eds.; Association for Computational Linguistics: Vancouver, BC, Canada, 2021; pp. 874–880. [Google Scholar] [CrossRef]
  3. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  4. Stelmakh, I.; Luan, Y.; Dhingra, B.; Chang, M.W. ASQA: Factoid Questions Meet Long-Form Answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 8273–8288. [Google Scholar] [CrossRef]
  5. Rosenthal, S.; Sil, A.; Florian, R.; Roukos, S. CLAPnq: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems. Trans. Assoc. Comput. Linguist. 2025, 13, 53–72. [Google Scholar] [CrossRef]
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  7. Goyal, S.; Choudhury, A.R.; Raje, S.M.; Chakaravarthy, V.T.; Sabharwal, Y.; Verma, A. PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020. [Google Scholar]
  8. Kim, S.; Shen, S.; Thorsley, D.; Gholami, A.; Kwon, W.; Hassoun, J.; Keutzer, K. Learned Token Pruning for Transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; Association for Computing Machinery: New York, NY, USA, 2022. KDD’22. pp. 784–794. [Google Scholar] [CrossRef]
  9. Guan, Y.; Li, Z.; Leng, J.; Lin, Z.; Guo, M. Transkimmer: Transformer Learns to Layer-wise Skim. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 7275–7286. [Google Scholar] [CrossRef]
  10. Kim, Y.; Lee, S. SparseFlow: Accelerating Transformers by Sparsifying Information Flows. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 5937–5948. [Google Scholar] [CrossRef]
  11. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  12. Yu, D.; Zhu, C.; Fang, Y.; Yu, W.; Wang, S.; Xu, Y.; Ren, X.; Yang, Y.; Zeng, M. KG-FiD: Infusing Knowledge Graph in Fusion-in-Decoder for Open-Domain Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 4961–4974. [Google Scholar] [CrossRef]
  13. De Jong, M.; Zemlyanskiy, Y.; FitzGerald, N.; Ainslie, J.; Sanghai, S.; Sha, F.; Cohen, W.W. Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  14. de Jong, M.; Zemlyanskiy, Y.; FitzGerald, N.; Sanghai, S.; Cohen, W.W.; Ainslie, J. GLIMMER: Generalized late-interaction memory reranker. arXiv 2023, arXiv:2306.10231. [Google Scholar]
  15. Huang, Y.; Han, X.; Sun, M. FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 6262–6276. [Google Scholar] [CrossRef]
  16. Berchansky, M.; Izsak, P.; Caciularu, A.; Dagan, I.; Wasserblat, M. Optimizing Retrieval-augmented Reader Models via Token Elimination. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 1506–1524. [Google Scholar] [CrossRef]
  17. de Jong, M.; Zemlyanskiy, Y.; Ainslie, J.; FitzGerald, N.; Sanghai, S.; Sha, F.; Cohen, W. FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 11534–11547. [Google Scholar] [CrossRef]
  18. Izacard, G.; Lewis, P.; Lomeli, M.; Hosseini, L.; Petroni, F.; Schick, T.; Dwivedi-Yu, J.; Joulin, A.; Riedel, S.; Grave, E. Atlas: Few-shot Learning with Retrieval Augmented Language Models. arXiv 2022, arXiv:2208.03299. [Google Scholar]
  19. Jang, E.; Gu, S.; Poole, B. Categorical Reparameterization with Gumbel-Softmax. arXiv 2017, arXiv:1611.01144. [Google Scholar]
  20. Min, S.; Michael, J.; Hajishirzi, H.; Zettlemoyer, L. AmbigQA: Answering Ambiguous Open-domain Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2020; pp. 5783–5797. [Google Scholar] [CrossRef]
  21. Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural Questions: A Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguist. 2019, 7, 452–466. [Google Scholar] [CrossRef]
  22. Izacard, G.; Grave, E. Distilling Knowledge from Reader to Retriever for Question Answering. arXiv 2022, arXiv:2012.04584. [Google Scholar]
  23. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2020; pp. 6769–6781. [Google Scholar] [CrossRef]
  24. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
  25. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  26. Langedijk, A.; Mohebbi, H.; Sarti, G.; Zuidema, W.; Jumelet, J. DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 4764–4780. [Google Scholar] [CrossRef]
  27. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Figure 1. Overall architecture of the proposed FiD with dynamic token pruning. h denotes hidden states, m the pruning masks, and a the mean attention scores. Layer-wise pruning networks progressively prune tokens, while a final generation-aware pruning network refines the mask using probe cross-attention with the BOS query. The pruned encoder output is then consumed by the decoder, reducing its cross-attention workload.
Figure 2. Examples of dynamic token pruning on the ASQA dataset: (a) in the gold passage, critical tokens directly relevant to question answering are retained, whereas (b) in the negative passage, most tokens are pruned. Colors indicate the specific pruning layer at which each token is pruned. Black tokens represent those retained and passed to the decoder’s cross-attention layer.
Figure 3. Layer-wise average token retention rates of the proposed pruning network. Most tokens remain preserved throughout the early and middle encoder layers, whereas intensive pruning occurs predominantly near the top layers, resulting in only approximately 12% of tokens retained on average at the final layer. Notably, <PAD> tokens are removed immediately after the first pruning step.
Table 1. Dataset statistics.

Dataset | Split | QAs | Words in Answer
ASQA | Train | 4353 | 73.3
ASQA | Dev | 948 | 64.8
CLAPNQ 1 | Train | 1954 | 53.0
CLAPNQ | Dev | 300 | 51.7
1 Only answerable samples of the CLAPNQ dataset are used.
Table 2. Results of Long-Form QA experiments; SPEED indicates the relative inference speedup over the baseline, while RR (retention rate) denotes the percentage of input tokens forwarded to the decoder out of the total input tokens. R-L corresponds to ROUGE-L, and BS denotes BERTScore. FastFiD results on CLAPNQ are omitted because the model requires supervision on short answers, which this dataset does not provide. The best scores in each column are highlighted in bold.

ASQA:
Model | F1 | R-L | BS | TPQ (ms) | Speed | RR
FiD [2] | 40.46 | 34.45 | 88.26 | 2423.02 | 1.00× | 100.00%
FastFiD [15] | 38.91 | 34.39 | 87.78 | 1390.71 | 1.74× | 8.23%
Token Elimination [16] | 33.77 | 30.65 | 88.31 | 1496.88 | 1.62× | 10.00%
Ours | 40.39 | 34.75 | 88.69 | 1416.14 | 1.71× | 13.06%

CLAPNQ:
Model | F1 | R-L | BS | TPQ (ms) | Speed | RR
FiD [2] | 30.69 | 27.68 | 86.57 | 1916.56 | 1.00× | 100.00%
FastFiD [15] | - | - | - | - | - | -
Token Elimination [16] | 25.00 | 20.37 | 86.46 | 993.94 | 1.93× | 10.00%
Ours | 30.25 | 27.46 | 86.90 | 1136.37 | 1.74× | 11.52%
Table 3. Ablation study on the effects of input features and auxiliary loss in pruning decisions. The best scores in each column are highlighted in bold.

ASQA:
Model | F1 | R-L | BS | TPQ (ms) | Speed | RR
Ours | 40.39 | 34.75 | 88.69 | 1416.14 | 1.71× | 13.06%
Ours w/o importance score | 38.89 | 33.61 | 88.57 | 1320.07 | 1.84× | 10.97%
Ours w/o $\mathcal{L}_{KL}$ | 40.04 | 34.39 | 88.73 | 1356.12 | 1.79× | 11.65%

CLAPNQ:
Model | F1 | R-L | BS | TPQ (ms) | Speed | RR
Ours | 30.25 | 27.46 | 86.90 | 1136.37 | 1.74× | 11.52%
Ours w/o importance score | 28.39 | 25.93 | 86.66 | 1096.56 | 1.74× | 10.13%
Ours w/o $\mathcal{L}_{KL}$ | 28.97 | 26.26 | 86.69 | 1098.76 | 1.75× | 9.44%
Table 4. Inference efficiency comparison. Tokens/sec indicates tokens generated per second, and Peak Mem denotes the peak GPU memory usage (GB) during inference over the entire dataset. FastFiD results on CLAPNQ are omitted because the model requires supervision on short answers, which this dataset does not provide. The best scores in each column are highlighted in bold.

ASQA:
Model | Tokens/sec | Peak Mem (GB)
FiD [2] | 32.00 | 11.44
FastFiD [15] | 38.91 | 3.93
Token Elimination [16] | 32.34 | 6.80
Ours | 43.35 | 2.39

CLAPNQ:
Model | Tokens/sec | Peak Mem (GB)
FiD [2] | 27.91 | 11.40
FastFiD [15] | - | -
Token Elimination [16] | 30.43 | 6.80
Ours | 37.91 | 2.23