Article

CacheFormer: High-Attention-Based Segment Caching

Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA
* Author to whom correspondence should be addressed.
Submission received: 10 March 2025 / Revised: 3 April 2025 / Accepted: 13 April 2025 / Published: 18 April 2025

Abstract

Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Numerous recent approaches, such as Linformer, Longformer, Performer, and structured state space models (SSMs), have not fully resolved this problem. All of these models strive to reduce the quadratic time complexity of the attention mechanism while minimizing the quality loss caused by compressing the long context. Inspired by the cache and virtual memory principle in computers, where on a cache miss not only the needed data but also the adjacent data are retrieved from memory, we apply this concept to long-context handling by dividing the context into small segments. In our design, when high segment-level attention occurs at the compressed level, we retrieve the corresponding nearby segments in uncompressed form. Our enhancements for handling long contexts aggregate four attention mechanisms: short sliding window attention, long compressed segmented attention, dynamically retrieved top-k high-attention uncompressed segments, and overlapping segments in long attention to avoid segment fragmentation. The resulting architecture outperforms existing SOTA architectures of similar model size with an average perplexity improvement of 8.5%.

1. Introduction

Deep convolutional neural networks (CNNs) were fundamental in revolutionizing the field of computer vision. Similarly, the introduction of the transformer [1] architecture in natural language processing [2] sparked the current AI revolution, with large language models (LLMs) such as ChatGPT [3], Bard [4], and Llama [5,6], among others, yielding impressive performance. The transformer uses a simple similarity computation in the form of an inner product on the learned, positionally encoded embeddings of a sequence of n input words. If the matrices $Q$ and $K$ contain rows representing the embedding of each word ($1 \times d$), then $A = \mathrm{softmax}(QK^T)$, referred to as the "attention", contains the dot-product similarity of each input word with every other word in the input sequence. If there are $n$ input words, referred to as the context, then $Q, K \in \mathbb{R}^{n \times d}$ and $A \in \mathbb{R}^{n \times n}$.
Like parallel feature maps in a CNN, each layer in the transformer divides the attention calculation into parallel heads. The output from a transformer layer has the same dimensionality as the input and is obtained by the matrix product $(A \times V) \in \mathbb{R}^{n \times d}$, where $V \in \mathbb{R}^{n \times d}$ is similar to $K$ and contains rows of learned, positionally encoded embeddings of the input words. For language models, where text generation is carried out based on a given context, the attention matrix is masked triangularly so that future tokens are not visible during training. Multiple transformer layers are stacked before feeding the result of the last layer to a classification head. Because the attention computation in each head is $O(n^2)$, it becomes a computational bottleneck for long contexts. Many approaches have been proposed in recent years to reduce the quadratic time complexity of attention to either linear or sub-quadratic complexity. Some of the notable works include Transformer-XL [7], Linformer [8], Longformer [9], Reformer [10], Performer [11], Perceiver-AR [12], LaMemo [13], and ∞-former [14], among others. We provide a brief background on the above approaches to reducing attention complexity and then elaborate on Transformer LS [15], which we further enhance in this work.
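The following minimal sketch, written in PyTorch with illustrative shapes of our own choosing, shows the standard causal attention described above and makes the explicit $O(n^2)$ score matrix visible; it is not the implementation of any particular model.

```python
# Minimal sketch of standard causal (masked) attention, assuming PyTorch;
# the explicit (n, n) score matrix is the source of the O(n^2) cost.
import torch

def causal_attention(Q, K, V):
    # Q, K, V: (n, d) learned, position-encoded projections of the n input tokens
    n, d = Q.shape
    scores = Q @ K.T / d**0.5                              # (n, n) dot-product similarities
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))       # hide future tokens
    A = torch.softmax(scores, dim=-1)                      # attention matrix A
    return A @ V                                           # (n, d) layer output

out = causal_attention(torch.randn(1024, 64), torch.randn(1024, 64), torch.randn(1024, 64))
```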
To handle long sequences efficiently, transformer models employ approximations such as sparse, low-rank, and linear attention, which reduce computational cost but can compromise accuracy. Sparse attention [16,17] approximates full attention by computing attention only between a subset of token pairs, which drastically cuts computation costs. However, this can miss crucial long-range dependencies, as it assumes the discarded token pairs are insignificant (e.g., Longformer). Therefore, the choice of attention pattern is critical to balancing efficiency and context capture.
Low-rank attention approximates the full attention matrix using smaller matrices, reducing complexity from $O(n^2)$ to $O(nr)$. Here, performance depends on how well the lower rank $r$ captures the context (e.g., Transformer LS [15], Linformer [8], and Reformer [10]). This method assumes that the full attention matrix's essential information can be captured by a lower-dimensional representation, balancing computational efficiency against potential information loss.
Linear attention achieves $O(n)$ attention complexity via kernel tricks, but its transformations limit accuracy on long sequences, as they rely on the assumption that the simplified computation retains the essential information (e.g., Performer [11], linear transformers [18]). This trade-off between speed and expressiveness necessitates careful selection of the approximation method to maintain performance on long-range tasks. Hence, as described above, the trade-off between computation and performance remains a key challenge in transformer architectures.
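As an illustration of the kernel trick behind linear attention, the hedged sketch below uses the elu + 1 feature map of linear transformers [18] (Performer instead uses random features); the non-causal case is shown for brevity, and all shapes are illustrative assumptions rather than any model's actual configuration.

```python
# Hedged sketch of linear attention via the kernel trick (non-causal case):
# softmax(QK^T)V is approximated by phi(Q) (phi(K)^T V), so the cost is O(n*d^2)
# rather than O(n^2*d). The elu + 1 feature map follows linear transformers [18].
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    phi_q = F.elu(Q) + 1                                   # (n, d) non-negative features
    phi_k = F.elu(K) + 1
    kv = phi_k.T @ V                                       # (d, d) summary of keys/values
    z = phi_q @ phi_k.sum(dim=0, keepdim=True).T           # (n, 1) normalization term
    return (phi_q @ kv) / (z + eps)                        # (n, d) approximate output
```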
One effective design for overcoming the above challenge was proposed in Transformer LS. Although Transformer LS performs efficient compression of the input sequence, this compression results in segment fragmentation, which leads to two key shortcomings:
  • It reduces the dimensionality of the input sequence, resulting in a loss of context;
  • The input sequence is divided into smaller, potentially non-overlapping segments. Since information is isolated within segments, this disrupts the natural flow and continuity of information.
These two factors make it more challenging for the language model to capture the overall context and the relationships between distant elements in the sequence. Our CacheFormer architecture makes the following contributions:
  • We develop an innovative attention mechanism where the highly attentive segments are dynamically cached and retrieved in an uncompressed form. This enhances the model’s performance by retrieving the most relevant information;
  • Long attention uses chunked segments in many existing models; however, this results in loss of information due to segment fragmentation. We improve on this shortcoming by implementing an overlapping attention mechanism via projections of segments that have an s / 2 overlap with the adjacent segments;
  • We effectively combine the short attention, segment-based compressed long attention, highly attentive dynamically cached attention, and the overlapping segment attention. This results in an architecture that can efficiently handle long sequences and leads to an improved performance in language modeling.
Long-context language models excel in tasks that require deep understanding of contextual information, such as document summarization, complex question answering, and code analysis. These models enable more accurate and nuanced analysis of lengthy documents in legal, scientific, and medical fields. Further, they serve well in creative content generation and advanced customer support by preserving crucial context and improving reasoning. Our work further enhances the long context handling for the above applications.

2. Background and Related Work

An important earlier work on handling long contexts was presented in Transformer-XL. The authors divided the context into segments and used segment-level recurrence with a corresponding positional encoding to handle longer contexts, achieving impressive perplexity and BPC results at the time. Linformer [8] accomplished $O(n)$ complexity through linear self-attention. The authors demonstrate that the attention matrix is typically low rank and thus can be approximated by a low-rank matrix. Here, from the original $Q, K, V \in \mathbb{R}^{n \times d}$, the matrices $K$ and $V$ are projected to lower-dimensional matrices $\bar K, \bar V \in \mathbb{R}^{k \times d}$, where $k < n$. Thus, the attention is $A = Q\bar K^T \in \mathbb{R}^{n \times k}$, and the output $(A \times \bar V) \in \mathbb{R}^{n \times d}$ has the same shape as in the original transformer. Since $k$ is fixed, the attention complexity is $O(n)$.
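A minimal sketch of the Linformer-style projection, assuming PyTorch and random matrices in place of the learned projections, illustrates how the attention matrix shrinks from n × n to n × k:

```python
# Minimal sketch of Linformer-style low-rank attention, assuming PyTorch.
# K and V are projected along the sequence dimension from n to a fixed k,
# so the attention matrix is n x k instead of n x n.
import torch

n, d, k = 1024, 64, 128
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
E = torch.randn(k, n) / n**0.5        # learned projections in the real model;
Fp = torch.randn(k, n) / n**0.5       # random placeholders here

K_bar, V_bar = E @ K, Fp @ V                               # (k, d) compressed keys/values
A = torch.softmax(Q @ K_bar.T / d**0.5, dim=-1)            # (n, k) low-rank attention
out = A @ V_bar                                            # (n, d), same shape as full attention
```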
Although Linformer [8] reduces the attention complexity significantly, especially when $k \ll n$, it cannot be used effectively for autoregressive training and generation, because the projection of $K$ and $V$ mixes information along the context dimension, making causal masking of future tokens invalid. However, for classification problems where attention masking is not needed, the architecture is effective in reducing complexity.
Another approach, introduced by Longformer [9], used sparse attention patterns instead of full dense attention. The authors proposed sliding window attention, where tokens attend only to the nearby past, a dilated sliding window, and a mix of global and sliding window attention where some tokens attend to all tokens while others attend only to nearby tokens. For autoregressive modeling, Longformer [9] used dilated sliding window attention. Another notable work on reducing attention complexity is Reformer [10]. The authors' key idea was to use locality-sensitive hashing, which reduces the attention complexity to $O(n \log n)$. Note that, because of the hashing process, the architecture is not well suited for autoregressive modeling.
A different approach to reducing attention complexity was taken by Performer [11], where the attention is decomposed as a product of non-linear functions of the original query and key matrices, referred to as random features. This allows the attention to be encoded more efficiently via the transformer query and key matrices. Further efficient handling of long contexts was accomplished by Perceiver AR [12], which divided the input sequence into smaller key/value and query components. These components undergo cross attention in the first layer with a latent of size $\mathbb{R}^{l \times d}$, where $l$ is the length chosen when splitting the input sequence into the query part. The remaining layers operate on the $l \times d$ size instead of the usual $n \times d$ size of a standard transformer. Although this cross attention on the partitioned input sequence handles long sequences efficiently, because of the reduced query size the equivalent effect is closer to a sliding window attention.
More recently, a different approach to handling long contexts was proposed via structured state space models. The structured state space sequence (S4) model [19] proposed an architecture based on a new parameterization that can be computed much more efficiently. A variation of the state space approach proposed by Mega [20] uses a single-head gated attention mechanism equipped with an exponential moving average to incorporate an inductive bias of position-aware local dependencies into the position-agnostic attention mechanism; the authors also present a variant with linear time complexity for handling long sequences. Further progress on state space models yielded better results, as demonstrated by Hungry Hungry Hippos [21] and Mamba [22], which achieved very low perplexity scores. Most recently, xLSTM [23] introduced exponential gating and parallelization in LSTMs to achieve extended memory. Some of these models consist of several billion parameters. We outperform the smaller versions of these models, with sizes similar to ours, on the perplexity metric, as shown in Table 1 and Table 2.
An interesting concept for handling long sequences was presented by Transformer LS [15]. Here, a sliding window approach is used for near-term attention, while a set of compressed segments covering the entire past context serves as long-term attention. Both short and long attention are combined into the overall attention. The slight drawback of this approach is that the longer context is effectively used in compressed form and may therefore lose some key contextual information when generating output in an autoregressive setting.
The challenges in long-range sequence modeling are primarily the computational cost of self-attention, the difficulty of learning and maintaining long-range dependencies, and the difficulty of evaluating long-range understanding. These factors make learning diluted or less precise, as the model is unable to capture the global context, which limits its ability to handle longer sequences. As explained above, current approaches still have limitations when it comes to handling global context.
We address this problem by further augmenting the long–short attention with uncompressed, highly attentive segments. Since long–short attention divides the context into equal-size segments before projecting each segment to a smaller size, there is potential for a loss of information due to segment fragmentation. We also improve this aspect by using overlapping segments and augmenting the existing long–short model with them. Thus, our enhanced long–short architecture involves four components in the overall attention: a sliding window attention, a long attention based on compressed segments, a long attention based on overlapping segments, and an uncompressed segmented attention over a few highly attentive segments beyond the sliding window. We describe the details of our design in Section 5. For completeness, we first summarize the composition of a transformer, followed by the ideas of the long–short transformer that we build upon in our work.
A recent architecture, the bi-directional cross-attention transformer (BiXT) [24], uses cross attention between different segments of the long sequence in both directions. This enables a more comprehensive understanding of the entire sequence and captures long-range dependencies. However, this cross-attention mechanism is complex, which adds computational overhead and still contributes to some information loss at the boundaries between segments.
LongVQ [25] presents the potential to improve computational efficiency and long-range dependency handling. It uses vector quantization (VQ) that inherently results in some information loss during the quantization process. LongVQ also uses segmentation and therefore results in fragmentation that could affect performance on long-range tasks requiring fine-grained details.
In another recent development (attention tensorization [26]), the key innovation is that the input is a higher-dimensional tensor transformation of a simple sequence, enabling the capture of more complex relationships and dependencies among the sequences. However, working with high-dimensional tensors is computationally expensive, and the model’s increased complexity could lead to overfitting.
Capturing long context using innovative attention methods is a very active area of research in language modeling. In the latest research, DeepSeek's native sparse attention [27] bears remarkable similarity to our work: it combines three attention mechanisms to enhance long-range performance and implements a 'top-k' technique to discard the less relevant segments. Moreover, our 'top-k' retrieval is superior to that of the DeepSeek model in that we retrieve the most similar segments in uncompressed form.

3. Canonical Transformer

In normal multi-headed attention, if $Q, K, V \in \mathbb{R}^{n \times d}$ are the query, key, and value transformations of the input embeddings for a sequence length of $n$ and embedding dimension $d$, then the scaled dot-product attention in the $i$-th head, $H_i \in \mathbb{R}^{n \times d_k}$, is given as follows:
$H_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) = \mathrm{softmax}\!\left(\frac{QW_i^Q (KW_i^K)^T}{\sqrt{d_k}}\right) VW_i^V = A_i V W_i^V$  (1)
where $d_k = d/h$ is the dimension of each head. The output of each transformer layer is obtained by concatenating the outputs of all heads and projecting them further via the matrix $W^O \in \mathbb{R}^{d \times d}$:
$\mathrm{Layer}_j = \mathrm{Concat}(H_0, H_1, \ldots, H_{h-1})\, W^O$  (2)
After feeding the embedding of a sequence of one-hot encoded words $x$ (with positional encoding $PE$ added) through $p$ transformer layers, a classification layer is applied to the output of the last layer to decide the output produced by the transformer. For autoregressive text generation, the size of the classification layer's final output equals the size of the dictionary of unique words in the corpus:
$out = \mathrm{classifier}\big[\mathrm{layer}_{p-1}(\mathrm{layer}_{p-2}(\cdots \mathrm{layer}_0(\mathrm{embedding}(x) + PE(x))\cdots))\big]$  (3)
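For concreteness, the following is a hedged PyTorch sketch of one layer's masked multi-head attention as described above (single sequence, no dropout or feed-forward block); the hyperparameters are illustrative defaults, not a prescribed configuration.

```python
# Hedged sketch of one transformer layer's masked multi-head attention, assuming
# PyTorch; h heads of dimension d_k = d / h are concatenated and projected by W^O.
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d=768, h=12):
        super().__init__()
        self.h, self.d_k = h, d // h
        self.Wq, self.Wk, self.Wv, self.Wo = (nn.Linear(d, d) for _ in range(4))

    def forward(self, x):                                  # x: (n, d) embedded tokens + PE
        n, d = x.shape
        split = lambda t: t.view(n, self.h, self.d_k).transpose(0, 1)  # -> (h, n, d_k)
        Q, K, V = split(self.Wq(x)), split(self.Wk(x)), split(self.Wv(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k**0.5               # (h, n, n)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        A = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
        heads = (A @ V).transpose(0, 1).reshape(n, d)                  # concat H_0..H_{h-1}
        return self.Wo(heads)                                          # (n, d) layer output

layer = MultiHeadSelfAttention()
y = layer(torch.randn(1024, 768))
```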

4. Long–Short Transformer

Transformer LS [15] aggregates local attention over a smaller (sliding) window with a projection of the full-sequence attention to a smaller size, so that long sequences can be handled efficiently without quadratic attention complexity. For short attention, the approach is to use a segment-level sliding window attention, where the input sequence is divided into disjoint segments of length w (e.g., w = 128 for a sequence length of 1024). For non-autoregressive applications, all tokens within a segment attend to all tokens within their home segment, as well as to w/2 consecutive tokens on the left and right of the home segment (zero-padding when necessary), resulting in an attention span over a total of 2w key–value pairs. This is depicted in Figure 1.
For each query $Q_t$ at position $t$ within the $i$-th head, the $2w$ key–value pairs within its window are $\tilde K_t, \tilde V_t \in \mathbb{R}^{2w \times d}$. The short attention $\bar A_s^i \in \mathbb{R}^{2w \times d_k}$ is then given by the following equation:
$\bar A_s^i = \mathrm{softmax}\!\left(\frac{QW_i^Q \tilde K_i^T}{\sqrt{d_k}}\right)$  (4)
Execution-wise, the segment-level sliding window attention (referred to as short attention) is more time-efficient than per-token sliding window attention, where each token attends to itself and to w tokens on its left and right, and its memory consumption scales linearly with the sequence length. For autoregressive applications, the future tokens in the current segment are masked, and only the previous segment is used.
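The autoregressive segment-level sliding window pattern can be visualized with the following sketch, which builds the corresponding dense boolean mask for clarity rather than efficiency (an assumption on our part; an actual implementation would gather the 2w keys per segment instead of materializing an n × n mask).

```python
# Sketch of the autoregressive segment-level sliding-window pattern, assuming
# PyTorch: a token in segment j may attend to earlier tokens of segment j and to
# all of segment j - 1. Built as a dense (n, n) mask for clarity, not efficiency.
import torch

def short_attention_mask(n, w):
    idx = torch.arange(n)
    seg = idx // w                                             # segment id per position
    same_or_prev = (seg[None, :] == seg[:, None]) | (seg[None, :] == seg[:, None] - 1)
    causal = idx[None, :] <= idx[:, None]                      # no future tokens
    return same_or_prev & causal                               # True where attention is allowed

mask = short_attention_mask(n=1024, w=128)                     # (1024, 1024) boolean mask
```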
The compression is performed via a projection applied on the feature dimension, $p \in \mathbb{R}^{d_k \times r}$, where $d_k$ is the per-head embedding dimensionality and $r$ is the target length. The dynamic projection matrix $P_i \in \mathbb{R}^{n \times r}$ is computed by multiplying the length-$n$ key matrix $K \in \mathbb{R}^{n \times d_k}$ with $p$ (where $r \ll n$), i.e., $P_i = K p$. The transpose of this projection matrix, $P_i^T$, is then applied to the key matrix, giving a modified key $\bar K = P_i^T K \in \mathbb{R}^{r \times d_k}$, thereby compressing its sequence length. A similar compression is performed for the value matrix. This is a standard dimensionality-reduction technique, illustrated in Figure 2, and is used in popular models such as Performer and Transformer LS.
For long attention, the key and value transformations of the input sequence are first divided into segments of fixed size $s$ and then projected to a smaller dimension $r$, using the projection $P_l^i \in \mathbb{R}^{n \times r}$.
Mathematically, the long attention $\bar A_l^i$ (in each head $i$), as used by the long–short transformer, can be described as follows:
$P_l^i = \mathrm{softmax}(K W_i^P), \quad \bar K_l^i = P_l^{i\,T} K W_i^K, \quad \bar V_l^i = P_l^{i\,T} V W_i^V$  (5)
$\bar A_l^i = \mathrm{softmax}\!\left(\frac{QW_i^Q \bar K_l^{i\,T}}{\sqrt{d_k}}\right)$  (6)
The output of the $i$-th head is the following:
$\bar H_i = \bar A_l^i P_l^{i\,T} V W_i^V$  (7)
Note that the long attention is effectively performed on a compressed form of $K$ and $V$, as the projection compresses the input sequence from size $n$ to size $r$. Full attention is thus replaced by the implicit product of two low-rank matrices, $P_l^{i\,T} \in \mathbb{R}^{r \times n}$ and $QW_i^Q \in \mathbb{R}^{n \times d}$, and the computational complexity of long attention is reduced from $O(n^2)$ to $O(rn)$.
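A minimal single-head sketch of this compressed long attention follows, assuming PyTorch, a random stand-in for the learned projection weight $W_i^P$, and softmax normalization along the sequence dimension (our reading of the equations); for simplicity it applies the projection globally rather than segment-wise as in Figure 2.

```python
# Hedged single-head sketch of the compressed long attention, assuming PyTorch and
# a random stand-in for the learned projection weight W^P; softmax normalization
# along the sequence dimension is our reading of the equations above.
import torch

n, d_k, r = 1024, 64, 256
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_k)
W_P = torch.randn(d_k, r) / d_k**0.5

P_l = torch.softmax(K @ W_P, dim=0)                            # (n, r) dynamic projection
K_bar, V_bar = P_l.T @ K, P_l.T @ V                            # (r, d_k) compressed keys/values
A_l = torch.softmax(Q @ K_bar.T / d_k**0.5, dim=-1)            # (n, r) compressed long attention
H_l = A_l @ V_bar                                              # (n, d_k) long-attention output
```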
The long–short transformer [15] integrates the short and long attentions into a single attention. While the short attention can attend to the most recent input, the long attention is in compressed form. Further, the long attention is based on segmentation of the input sequence that may suffer from segment fragmentation as the information in each segment is compressed via the projection mechanism.

5. CacheFormer: Enhanced Long Attention Transformer

The long-term attention in the existing long–short transformer is performed at a compressed level (projection to r causes an effective compression of the input context). Therefore, one of our enhancements is to augment the long attention with an attention that is based on a subset of highly attentive uncompressed segments.

5.1. CacheFormer Long Attention with Segment Caching

The subset of segments selected for attention at the uncompressed level is completely dynamic and is obtained from the vector magnitudes of the compressed segment-wise attention. In simple words, we examine the segment-wise long attention $\bar A_l^i$ given by Equation (6). Since $\bar A_l^i \in \mathbb{R}^{n \times r}$, if there are $n_s$ segments, then each row of $\bar A_l^i$ contains a set of row vectors of size $r/n_s$, as denoted by the segmented attention $\bar A_{seg}^i$ in Equation (8). Here, rows represent the original queries, and columns represent the compressed target-length keys; the resulting attention learns the similarity between the two.
The magnitude of each vector $a_{i,j} \in \mathbb{R}^{1 \times r/n_s}$ in Equation (8) indicates the attention of word $i$ to the $j$-th compressed segment in the long attention. This mechanism is also explained in Appendix A.1.
$\bar A_{seg}^i = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n_s} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} & a_{n,2} & \cdots & a_{n,n_s} \end{bmatrix}$  (8)
$A_{seg\,avg}^i = \begin{bmatrix} \mathrm{topk}\!\left(\frac{1}{p}\sum_{t=1}^{p} \bar A_{seg}^i[t,:]\right) \\ \mathrm{topk}\!\left(\frac{1}{p}\sum_{t=p+1}^{2p} \bar A_{seg}^i[t,:]\right) \\ \vdots \\ \mathrm{topk}\!\left(\frac{1}{p}\sum_{t=n-p+1}^{n} \bar A_{seg}^i[t,:]\right) \end{bmatrix}$  (9)
For execution efficiency, we average the segment attention vectors $\bar A_{seg}^i$ over $p$ consecutive rows, i.e., rows $[1 \ldots p]$, $[p+1 \ldots 2p]$, and so on until $[n-p+1 \ldots n]$. This results in a segment attention matrix $A_{seg\,avg}^i \in \mathbb{R}^{m \times n_s}$, where $m = n/p$. We then choose the top-k segments by the magnitude of each vector in each row of $A_{seg\,avg}^i$. Note that each entry $A_{seg\,avg}^i[i,j]$ indicates a segment that has high attention to the sequence of $p$ words positioned from $(i-1) \times p$ to $i \times p$ in the input context, as shown in Equation (9). Rather than using these attentive segments in compressed form, we extract them from the segmented $K$ and $V$ matrices before any compression is performed on them. The example in Appendix A.2 provides further explanation.
As in cache memory design (in computer architecture), in the case of a cache miss, we not only retrieve the needed data from RAM but also bring in the adjacent data, as there is a high probability that they will be needed in the near future. Similarly, for the segments that we determine to be most attentive (in top-k order), we also retrieve their neighbors, for a total of $u$ consecutive segments per attentive segment.
To clarify our approach, if the sequence length is n = 1024 and the long attention segment size is s = 16, then there are 64 segments in the uncompressed K and V matrices. If the projection size is r = 256 (a compression ratio of 1024/256 = 4), then each segment of size 16 is compressed to a size of 4, resulting in a long attention matrix $\bar A_l^i$ of size 1024 × (64 × 4), i.e., 1024 × 256. If we average p = 32 consecutive rows of $\bar A_l^i$ and take the magnitude of each of the 1 × 4 vectors in each row (corresponding to the 64 segments), then the segment attention matrix $A_{seg\,avg}^i$ is 32 × 64. Taking the indices of the top-k entries in each row of $A_{seg\,avg}^i$ gives the indices of the k segments most attentive to the corresponding set of 32 words in the input sequence. If k = 5 is chosen and u = 3, indicating that u − 1 nearby segments are used for each attentive segment (one before and one after), then assembling these segments yields 15 segments per row. Thus, the cache K and V matrices $K_c, V_c \in \mathbb{R}^{(n/p) \times (k \times u \times s)}$ (e.g., 32 × (15 × 16) = 32 × 240 in this case) contain the 15 most attentive segments in uncompressed form. Note that we stack $K_c$ p times to match the dimensionality of Q. From the most attentive k × u segments in $K_c$, we obtain the cache attention $\bar A_c^i \in \mathbb{R}^{n \times (k \times u \times s)}$ as follows:
$\bar A_c^i = \mathrm{softmax}\!\left(\frac{QW_i^Q\, [K_c;\, K_c;\, \ldots;\, K_c]^T}{\sqrt{d_k}}\right)$  (10)
Further pictorial representation is available in Appendix A.3.
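The selection procedure of Section 5.1 can be summarized with the hedged sketch below, using the example sizes above (n = 1024, s = 16, r = 256, p = 32, k = 5, u = 3). Variable names are illustrative, future-segment masking and duplicate handling (discussed in Section 7) are omitted, and random tensors stand in for the model's activations.

```python
# Hedged sketch of the cached-segment selection with the example sizes above
# (n = 1024, s = 16, n_s = 64, r = 256, p = 32, k = 5, u = 3). Variable names are
# illustrative; future-segment masking and duplicate handling are omitted.
import torch

n, s, r, p, k, u = 1024, 16, 256, 32, 5, 3
n_s, c = n // s, r // (n // s)                       # 64 segments, each compressed to c = 4
d_k = 64
A_l = torch.rand(n, r)                               # compressed long attention (one head)
K = torch.randn(n, d_k)                              # uncompressed keys (V handled likewise)

# Magnitude of each 1 x c attention vector -> attention of word i to segment j
seg_mag = A_l.view(n, n_s, c).norm(dim=-1)           # (n, n_s)

# Average p consecutive rows for efficiency -> (m, n_s) with m = n / p
A_seg_avg = seg_mag.view(n // p, p, n_s).mean(dim=1)             # (32, 64)

# Top-k most attentive segments per averaged row, plus (u - 1) neighbours each
topk_idx = A_seg_avg.topk(k, dim=-1).indices                     # (32, k)
offsets = torch.arange(-(u // 2), u // 2 + 1)                    # [-1, 0, 1] for u = 3
cache_idx = (topk_idx.unsqueeze(-1) + offsets).clamp(0, n_s - 1) # (32, k, u)
cache_idx = cache_idx.reshape(n // p, k * u)                     # 15 segment indices per row

# Gather those segments from the *uncompressed* K to form the cache K_c
K_segs = K.view(n_s, s, d_k)                                     # (64, 16, d_k)
K_c = K_segs[cache_idx].reshape(n // p, k * u * s, d_k)          # (32, 240, d_k)
```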

5.2. CacheFormer Long Attention with Overlapping Segments

In addition to the original long attention in the long–short transformer, which uses projections on each segment, we augment the existing long attention with overlapping segments (with a 50% overlap in the augmented long attention), as shown in Figure 3. The motivation behind the overlap is to provide context continuity and reduce the effect of segment fragmentation that occurs in long attention. Zero-padding is added to the beginning segment to ensure the same dimensionality for the overlapped long segment attention. Here, all the keys $K$ are transformed through a projection matrix $P_o^{i\,T}$ built from segments consisting of the second half of the previous segment and the first half of the following segment. The overlapped long segment attention $\bar A_o^i \in \mathbb{R}^{n \times r}$, analogous to Equation (5), is given below and further explained in Appendix A.4:
$P_o^i = \mathrm{softmax}(K W_i^{P_o}), \quad \bar K_o^i = P_o^{i\,T} K W_i^K, \quad \bar V_o^i = P_o^{i\,T} V W_i^V$  (11)
$\bar A_o^i = \mathrm{softmax}\!\left(\frac{QW_i^Q \bar K_o^{i\,T}}{\sqrt{d_k}}\right)$  (12)
$\bar H_o^i = \bar A_o^i P_o^{i\,T} V W_i^V$  (13)
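A small sketch of how the 50%-overlapped segments can be formed (assuming PyTorch, zero-padding of s/2 positions at the start, and a stride of s) is shown below; the projection and attention over these segments then follow the same equations as above.

```python
# Sketch of forming 50%-overlapped segments of K (segment size s, shift s/2), with
# zero-padding at the start so the number of segments matches the non-overlapped case.
import torch
import torch.nn.functional as F

def overlapped_segments(K, s):
    # K: (n, d_k) -> (n // s, s, d_k) segments shifted by s/2, i.e. each one spans the
    # second half of an original segment and the first half of the next.
    K_pad = F.pad(K, (0, 0, s // 2, 0))            # pad s/2 zero rows at the beginning
    windows = K_pad.unfold(0, s, s)                # (n // s, d_k, s) windows of length s
    return windows.permute(0, 2, 1)                # (n // s, s, d_k)

segs = overlapped_segments(torch.randn(1024, 64), s=16)
print(segs.shape)                                  # torch.Size([64, 16, 64])
```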

5.3. Aggregated Attention in CacheFormer

The final attention in our enhanced architecture is obtained by aggregating the four attentions discussed earlier:
(1) The segment-based compressed long attention, $\bar A_l^i \in \mathbb{R}^{n \times r}$, as proposed in Transformer LS;
(2) The short attention, $\bar A_s^i \in \mathbb{R}^{n \times 2w}$, which uses the segment-wise sliding window of Transformer LS;
(3) Our cache attention, $\bar A_c^i \in \mathbb{R}^{n \times (k \times u \times s)}$, based on dynamically retrieving uncompressed high-attention segments;
(4) Our overlapping segment-based compressed attention, $\bar A_o^i \in \mathbb{R}^{n \times r}$.
We add the two similar-sized long and overlapping attentions, $\bar A_l^i$ and $\bar A_o^i$, while ‖ denotes the concatenation of the differently sized attentions $\bar A_s^i$ and $\bar A_c^i$. Thus, the final enhanced attention $A_e^i \in \mathbb{R}^{n \times f}$, where $f = 2w + r + k \times u \times s$, is expressed as follows:
$A_e^i = [\,\bar A_s^i \;\|\; (\bar A_l^i + \bar A_o^i) \;\|\; \bar A_c^i\,]$  (14)
where:
  • w is the window size in the short (i.e., sliding window) attention;
  • r is the compressed projection target size in the long attention;
  • k is the top-k factor for retrieving the most attentive segments;
  • u − 1 is the number of neighboring segments retrieved for each cached segment;
  • s is the segment size in the long attention.
For example, with top-k = 5, u = 3, a short attention segment size of w = 128, a long attention segment size of s = 16, and a compression target length of r = 256, the size of our combined attention matrix for an input sequence length of 2048 is 2048 × 752. The time complexity of the different components in CacheFormer's attention is as follows:
  • For the short attention $\bar A_s^i$ → $O(w \times n)$, where w is the sliding window size;
  • For both the long and overlapping long attentions, $\bar A_l^i$ and $\bar A_o^i$ → $O(p \times m \times n)$, where p is the compressed output size from each of the m long segments;
  • For the cached attention $\bar A_c^i$ → $O(k \times u \times s \times n)$, where k × u is the number of top attentive segments and s is the long attention segment size.
Since the dominant term among the above four components is the long attention, the overall time complexity of our enhanced attention is $O(p \times m \times n)$; effectively, this is very close to that of sliding window attention. To further elaborate on our attention computation in Equation (14), note that the dimensionality of the short sliding attention $\bar A_s^i$ in the LS Transformer is n × 2w, and its compressed long attention $\bar A_l^i$ has dimensionality n × r. In our caching mechanism, we augment the long–short attention with the attentions $\bar A_o^i$ and $\bar A_c^i$, of dimensionalities n × r and n × (k × u × s), respectively. Since $\bar A_l^i$ and $\bar A_o^i$ operate on sequence lengths compressed to the same dimension, they have the same shape; therefore, we can sum the two attention matrices to conserve size and overall attention complexity. Our caching attention $\bar A_c^i$ and the short attention $\bar A_s^i$ have different shapes, so they cannot be summed, and concatenation is the only choice. One can refer to Figure A5 and Figure A6 in Appendix A.5 for a pictorial representation.
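The aggregation in Equation (14) amounts to a sum of the two equal-shaped long attentions and a concatenation with the short and cache attentions; the sketch below only checks the resulting shapes with random stand-ins under the example configuration above.

```python
# Shape check of the aggregation in Equation (14) with random stand-ins: the two
# equal-shaped long attentions are summed, and the short and cache attentions are
# concatenated, giving width f = 2w + r + k*u*s.
import torch

n, w, r, k, u, s = 2048, 128, 256, 5, 3, 16
A_s = torch.rand(n, 2 * w)          # short sliding-window attention
A_l = torch.rand(n, r)              # compressed segmented long attention
A_o = torch.rand(n, r)              # overlapping-segment long attention
A_c = torch.rand(n, k * u * s)      # uncompressed cached-segment attention

A_e = torch.cat([A_s, A_l + A_o, A_c], dim=-1)
print(A_e.shape)                    # torch.Size([2048, 752]) = n x (2w + r + k*u*s)
```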

6. Results

Perplexity is a key metric in natural language processing (NLP) that measures how well a model predicts text. It is calculated as the exponential of the average negative log-likelihood per token; lower perplexity indicates better predictions. Instead of focusing on the absolute best results for perplexity and BPC, which are often achieved through extremely refined training schedules and large model sizes, we focus on the improvements over the baseline, i.e., Transformer LS. Therefore, the results we show are a more accurate reflection of the architectural improvements of our design. The baseline architecture is also implemented by us, and the enhancements we propose are programmed in the same implementation and can be selectively turned on or off to see the contribution of each enhancement. We also use similar training schedules for the different architectures being compared. Table 1 shows the perplexity results on the WikiText-103 dataset. It uses a sequence length of 1024, a short attention segment size of 128, a long attention segment size of 16, compression of the long sequence by a factor of 4 (i.e., r = 256), different values of k in the top-k cache attention, and neighboring segment retrieval u of 1 or 3, where u = 3 indicates that the segment before and the segment after each attentive segment are also retrieved.
Table 1. CacheFormer outperforms Transformer LS across all configurations.

Model | Model Size | Perplexity
Long–Short Baseline | 122.52 million | 23.74
CacheFormer (k = 3, u = 1) | 122.52 million | 23.31
CacheFormer (k = 5, u = 1) | 122.52 million | 22.75
CacheFormer (k = 7, u = 1) | 122.52 million | 21.32
CacheFormer (k = 5, u = 3) | 122.52 million | 21.26
Note that our enhanced architecture does not increase the number of model parameters over the baseline long–short transformer. The models used for the results in Table 1 have 12 layers, 12 heads, and an embedding size of 768 (for all architectural variations). For a sequence length of 1024 (the same as used in GPT-2), using seven segments (k = 7, u = 1) yielded a considerable improvement in perplexity. Increasing k beyond 7 did not reduce perplexity considerably further.
Since we have two major enhancements of cache attention and overlapping segment-based attention over the baseline, Table 3 shows an ablation study of the effects of each architectural improvement. An ablation study uses controlled experiments to systematically remove or modify components of a proposed model, thereby isolating and quantifying their individual contributions to overall performance.
Figure 4 depicts the 64 attention vectors for each segment (from compressed long attention after averaging p = 256 rows) corresponding to the 64 segments during the beginning of training. The highest top-k magnitude vectors then determine the segment to use in uncompressed form for our cache attention. The darker red color depicts higher magnitude attention vectors, while the blue color indicates lower magnitude vectors.
Table 4 shows the BPC results on the enwik-8 dataset, a benchmark used for character-level language modeling, where models predict the next character in a sequence rather than the next word. Bits per character (BPC) quantifies a language model's predictive accuracy by measuring the average number of bits needed to encode each character in a text sequence, with lower values indicating more precise predictions. In Table 4, the 23 million parameter model uses eight layers, eight heads, and an embedding size of 512; the 34.88 million parameter model uses 12 layers.
It is interesting to note that the relative improvement in BPC from our enhanced architecture is less pronounced than the perplexity improvements. This could be attributed to the fact that the majority of the improvements come from cache attention, which uses a few highly attentive uncompressed segments in the long attention. This benefits perplexity, which measures the model's prediction capability, more than BPC, which is closer to a compression-efficiency measure of the model.
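For reference, a minimal sketch of how the two reported metrics are computed from model outputs (assuming PyTorch logits and target ids; this is not the authors' evaluation script):

```python
# Minimal sketch of the two metrics, assuming PyTorch logits over the vocabulary
# (or character set) and the corresponding target ids.
import math
import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    # logits: (n, vocab), targets: (n,) next-token ids
    nll = F.cross_entropy(logits, targets, reduction="mean")   # average NLL per token
    return math.exp(nll.item())

def bits_per_character(logits, targets):
    nll = F.cross_entropy(logits, targets, reduction="mean")   # average NLL per character
    return nll.item() / math.log(2)                            # convert nats to bits
```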
Table 2. CacheFormer outperforms several modern language models of comparable size on perplexity.

Architecture | Model Size (Millions) | Perplexity
Long–Short (Baseline) | 122.52 | 23.74
Transformer-XL (Standard) | 151 | 24
∞-former | 160 | 24.22
LaMemo | 151 | 23.77
H3 (Hungry Hungry Hippos) | 125 | 23.7
Llama | 125 | 23.16
Mamba | 125 | 22.49
xLSTM [7:1] | 125 | 21.47
CacheFormer with Overlapping Segments and Enhanced Caching (k = 7, u = 1) | 122.52 | 21.32
Our CacheFormer models were implemented on a machine with an NVIDIA RTX 4090 GPU. The 122.52 million parameter models use an embedding dimension of 768 with 12 layers and 12 attention heads per layer. These model parameters were chosen so that we could make a fair comparison with equivalent-size models in the reported literature. Increasing the embedding dimension and/or the number of layers yields a bigger model with better language modeling capability, provided it is trained on enough data. The Adam optimizer was used with an initial learning rate of 1 × 10−4. We used a batch size of eight and trained our models for 500,000 iterations. Our GitHub repository provides all the code necessary to reproduce the results reported in the paper.
We use the perplexity metric on the popular WikiText-103 dataset, which is designed for training and assessing large language models on tasks that require capturing long-term dependencies. It consists of over 100 million tokens extracted from verified "good" and "featured" articles on Wikipedia.
Table 3. Ablation study of CacheFormer's architectural enhancements.

Configuration | Model Size | Perplexity
Long–Short baseline | 122.52 million | 23.74
CacheFormer with overlapping segments only | 122.52 million | 23.47
CacheFormer with cache attention only (k = 7, u = 1) | 122.52 million | 21.67
CacheFormer with overlapping segments and cache attention (k = 7, u = 1) | 122.52 million | 21.32
Table 4. Comparison of BPC on the enwik-8 benchmark.

Model | Model Size | BPC
Long–Short Baseline | 23 million | 1.192
CacheFormer (k = 7, u = 1) | 23 million | 1.188
Long–Short Baseline | 34.88 million | 1.173
CacheFormer (k = 7, u = 1) | 34.88 million | 1.167

7. Discussion

Since the uncompressed segments used in our cache attention are decided dynamically based on the input sequence, the execution time increases as more segments (i.e., higher k) are used. With a sequence length of 1024, compression r = 256, k = 7, u = 1, and a short attention segment size of 128, the size of the aggregated attention (short, long, cache, overlapping) is 1024 × 624. Since our cache attention mechanism is completely dynamic and uses the most attentive segments in uncompressed form, we average the attention vectors over p rows (to improve execution efficiency), as given by Equation (9).
If we use a sequence length of 1024 and average over 256 rows, then the segments determined by our cache attention mechanism partway through training appear as shown in Table 5. Note that, to implement autoregressive behavior, the input sequence cannot attend to a future segment; our implementation guarantees that the input sequence can only attend to previous segments. For example, when attending to words 768–1023 of the input sequence, the maximum segment that the cache attention can use is 47 (with a long segment size of 16, there are 64 segments in the 1024-length sequence).
“Lost in the middle” [28], one of the important recent papers in handling long contexts, has indicated that current language models do not robustly make use of information in long input contexts. They studied different models and concluded that “performance is often highest when relevant information occurs at the beginning or at the end of the input context and significantly degrades when models must access relevant information in the middle of long contexts”.
Note that our cache attention addresses this issue nicely, in the sense that it uses attentive segments dynamically regardless of whether they occur at the beginning or the middle of the input context. For example, the last row in Table 5 indicates the highest-attention segments used; segments 32, 35, and 37 are relatively in the middle of the input context. When determining the most attentive segments for our cache attention, if the neighboring segment parameter u > 1, then looking at the index of the next or previous segment may produce a duplicate, as that segment may already be one of the highly attentive segments. Similarly, if a highly attentive segment is a future segment, we replace it with one of the allowed segments. Since information fragmentation should be avoided, the replacement segment we select is one contiguous to an existing high-attention segment, as sketched below.
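The following sketch illustrates one plausible reading of this rule (an assumption on our part, not the authors' exact procedure): candidate segment indices are clamped to the maximum allowed past segment, and duplicates are replaced by the nearest unused segment contiguous to an already selected one.

```python
# Illustrative sketch of the segment-legalization rule described above (our reading,
# not the authors' exact procedure): indices are clamped to the maximum allowed past
# segment, and duplicates move to the nearest unused contiguous earlier segment.
def legalize_segments(candidates, max_allowed):
    chosen = []
    for seg in candidates:
        seg = min(seg, max_allowed)                # never attend to a future segment
        while seg in chosen and seg > 0:           # duplicate: slide to a contiguous
            seg -= 1                               # earlier segment instead
        if seg not in chosen:
            chosen.append(seg)
    return sorted(chosen)

print(legalize_segments([44, 45, 46, 60, 61, 62, 63], max_allowed=47))
# -> [41, 42, 43, 44, 45, 46, 47]
```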

8. Conclusions

Handling long contexts efficiently without loss of performance is an important area of research in language models. Although many approaches have recently been proposed to address this problem, we present a new solution motivated by the cache and virtual memory concepts in computer architecture, where, on a cache or page miss, the needed data are retrieved from RAM or disk. We handle long contexts by dividing them into small segments. From the magnitudes of the compressed attention vectors, we determine the most attentive segments and then use these in uncompressed form.
Like the cache memory design, we also use consecutive segments near the high-attention segments to improve the language model’s predictive performance. Our results concerning perplexity indicate significant improvement over the baseline architecture that uses short and long compressed attention.
For BPC, the cache attention mechanism does not show a remarkable improvement over the baseline. We conjecture that BPC, which favors compression capability, does not benefit as much from the relevant-segment usage that our model provides, which mainly helps the model's prediction capability. Another advantage of our approach is that the use of high-attention segments is dynamic and depends on the input sequence. Thus, if the model needs to use information in the middle of, or anywhere in, the input context, it is provided in uncompressed form via the high-attention determination over the compressed segments.
As demonstrated in Table 2, CacheFormer outperforms several similar-sized SOTA language models by 8.5% on average. In our future work, we plan to work on enhancing the efficiency of our implementation that aggregates the four different attention mechanisms. This will enable us to train on larger models and work on diverse task-specific applications of our design.

9. Limitations

The main shortcoming of our approach, we feel, is that the dynamic segment attention is relatively slow during training. We partially overcome this by initially pretraining the model without dynamic attention and then fine-tuning it with our dynamic cached attention. Our future work involves applying the cache attention to reduce the model complexity of large language models. Further, we are in the process of creating a hierarchical cache design so that very long contexts can be handled efficiently.
Further, our model sizes and datasets were constrained by the computational resources available to us. We used an RTX 4090 GPU and therefore could not use larger datasets, such as PG-19, or run larger models with larger embedding sizes, more layers, and more heads.

Author Contributions

Conceptualization, S.S. and A.M.; methodology, S.S.; software, S.S.; validation, S.S. and A.M.; formal analysis, S.S. and A.M.; investigation, S.S. and A.M.; resources, S.S.; data curation, S.S.; writing—original draft preparation, S.S. and A.M.; writing—review and editing, S.S. and A.M.; visualization, S.S.; supervision, A.M.; project administration, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

The authors received no financial support for this research.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code is accessible via our GitHub repository: https://github.com/sushantsing/CacheFormer (accessed on 12 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Further Details on CacheFormer

In our caching protocol, we compress the input and dynamically retrieve the most relevant compressed segments for any given input. Based on the design constraints, an appropriate amount of input sequence compression is performed. Thereafter, the sequence is split into the desired segments, and we choose the most similar segments for each query and retrieve them in their original uncompressed form. This ensures that only the most relevant information is picked, which not only helps reduce the context size but also preserves key information. This enhanced caching attention technique is explained in greater detail in the subsequent sections.

Appendix A.1. Enhanced CacheFormer Attention

Consider an input sequence of 1024 tokens that needs to be compressed and down-projected to 256 tokens. Here we choose to divide each row into $n_s = 64$ segments, which yields a compression ratio ($r/n_s$) of 4. The attention matrix is of size $\bar A_{seg}^i \in \mathbb{R}^{n \times r}$. Therefore, for $n_s$ segments, each row of $\bar A_{seg}^i$ consists of row vectors of size $r/n_s$.
Further, the magnitude of the vector $a_{i,j} \in \mathbb{R}^{1 \times r/n_s}$ represents the attention of the $i$-th word token to the $j$-th compressed segment in the long attention, as shown in Figure A1. Thereafter, we compute the root mean square of each of the 1 × 4 attention vectors $a_{i,j}$; hence, the dimension across each row is reduced from 256 to 64. We use this size in the subsequent attention-processing steps, as demonstrated in the following section.
Figure A1. Downsized compression of the attention matrix along $K_c$, $V_c$.

Appendix A.2. Averaging in CacheFormer’s Segment Caching

Attention computation and top-k segment retrieval across all 1024 rows turned out to be computationally cumbersome and time-intensive. Therefore, to achieve execution efficiency, we averaged the 1024 input vectors over p consecutive rows of the attention matrix $\bar A_{seg}^i \in \mathbb{R}^{n \times r}$, where p is a hyperparameter.
This segment attention matrix is further reshaped and compressed into $A_{seg\,avg}^i \in \mathbb{R}^{m \times n_s}$, where m = n/p = 32, as shown in Figure A2. This implementation was key for our model to achieve superior results, outperforming other popular language models of similar size, as shown in Table 2, and it also resulted in a faster run time.
Figure A2. Averaged compression of the attention matrix along the input length.

Appendix A.3. Top-k Retrieval in CacheFormer’s Segment Caching

After the compression and averaging, the top-k most similar segments are chosen for retrieval according to the magnitude of the attention between the modified input and the key/value matrices. These segments are picked for each row m, which corresponds to an averaged input sequence of 32 consecutive words (averaged down from 1024), of the segment attention matrix $A_{seg\,avg}^i$. The hyperparameter k is chosen based on performance needs, and along with each of the k attentive segments, we also extract one segment before and one after it. We therefore define u as the hyperparameter that regulates the number of adjacent segments retrieved around each of the top-k segments. For instance, k = 5 and u = 3 result in a total of 15 uncompressed extracted segments of length 16 for each row, as shown in Figure A3.
Figure A3. Enhanced attention matrix after top-k retrieval.

Appendix A.4. Overlapping Segments in CacheFormer’s Long Attention

As discussed earlier, the segmentation of the input into chunks leads to fragmentation of long-term information, which becomes a challenge when building long-term dependencies. This issue has not been addressed in prior transformer-based language models. Therefore, we augment the long attention with segments that have a 50% overlap to maintain the continuity of data, as shown in Figure A4. The model is trained with the overlapping data as the query, which must learn the original chunks as keys and values.
Figure A4. Long attention with overlapping segments.

Appendix A.5. Aggregated CacheFormer’s Enhanced Attention

Thereafter, we add the overlapping attention $\bar A_o^i$ to the compressed long attention $\bar A_l^i$, which have the same shape. The sliding window (short) attention $\bar A_s^i$ and our caching attention $\bar A_c^i$ are concatenated with the above summed attention, as demonstrated pictorially in Figure A5.
Here, ‖ indicates the concatenation of the different attentions; w is the window size in the short (sliding window) attention; r is the projection size used in compressing the long attention; k is the top-k factor for retrieving high-attention segments; s is the segment size in the long attention; and u determines the number of segments retrieved adjacent to each top-k segment.
Figure A5. Complexity of CacheFormer's enhanced attention.
Finally, the figure below illustrates the four attention mechanisms that are simultaneously aggregated and successfully inducted in our model architecture.
Figure A6. CacheFormer's aggregated enhanced attention.

References

  1. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  2. Singh, S.; Mahmood, A. The NLP cookbook: Modern recipes for transformer based deep learning architectures. IEEE Access 2021, 9, 68675–68702. [Google Scholar] [CrossRef]
  3. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  4. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar]
  5. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  6. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  7. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.; Salakhutdinov, R. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the ACL 2019, Florence, Italy, 28 July–2 August 2019. [Google Scholar]
  8. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. In Proceedings of the NIPS 2020, Virtual, 6–12 December 2020. [Google Scholar]
  9. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. In Proceedings of the EMNLP, Online, 16–20 November 2020. [Google Scholar]
  10. Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The Efficient Transformer. In Proceedings of the ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  11. Choromanski, K.M.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. In Proceedings of the ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  12. Hawthorne, C.; Jaegle, A.; Cangea, C.; Borgeaud, S.; Nash, C.; Malinowski, M.; Dieleman, S.; Vinyals, O.; Botvinick, M.; Simon, I.; et al. General-purpose, long-context autoregressive modeling with Perceiver AR. In Proceedings of the ICML 2022, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  13. Ji, H.; Zhang, R.; Yang, Z.; Hu, Z.; Huang, M. LaMemo: Language Modeling with Look-Ahead Memory. In Proceedings of the NAACL 2022, Seattle, WA, USA, 10 July 2022. [Google Scholar]
  14. Martins, P.H.; Marinho, Z.; Martins, A. ∞-former: Infinite Memory Transformer. In Proceedings of the ACL 2022, Dublin, Ireland, 22–27 May 2022. [Google Scholar]
  15. Zhu, C.; Ping, W.; Xiao, C.; Shoeybi, M.; Goldstein, T.; Anandkumar, A.; Catanzaro, B. Long-short transformer: Efficient transformers for language and vision. In Proceedings of the NIPS 2021, Virtual, 6–14 December 2021. [Google Scholar]
  16. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating Long Sequences with Sparse Transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar]
  17. Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient content-based sparse attention with routing transformers. In Proceedings of the ACL 2021, Bangkok, Thailand, 1–6 August 2021. [Google Scholar]
  18. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the ICML. PMLR 2020, Virtual, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
  19. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. In Proceedings of the ICLR 2022, Virtual, 25–29 April 2022. [Google Scholar]
  20. Ma, X.; Zhou, C.; Kong, X.; He, J.; Gui, L.; Neubig, G.; May, J.; Zettlemoyer, L. Mega: Moving average equipped gated attention. In Proceedings of the ICLR 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  21. Fu, D.Y.; Dao, T.; Saab, K.K.; Thomas, A.W.; Rudra, A.; Ré, C. Hungry hungry hippos: Towards language modeling with state space models. In Proceedings of the ACL 2023, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
  22. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the ICML 2024, Vienna, Austria, 21 July 2024. [Google Scholar]
  23. Beck, M.; Pöppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. In Proceedings of the NIPS 2024, New Orleans, LA, USA, 10–16 December 2024. [Google Scholar]
  24. Zhang, H.; Liu, J.; Zhou, S.; Wang, X.; Lyu, M.R. Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers. In Proceedings of the NIPS 2024, New Orleans, LA, USA, 10–16 December 2024. [Google Scholar]
  25. Liu, Z.; Wang, L.; Li, S.; Wang, Z.; Lin, H.; Li, S.Z. LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory. In Proceedings of the IJCAI 2024, Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar]
  26. Feng, A.; Ying, R.; Tassiulas, L. Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning. In Proceedings of the EMNLP 2024, Miami, FL, USA, 12–16 November 2024. [Google Scholar]
  27. Yuan, J.; Gao, H.; Dai, D.; Luo, J.; Zhao, L.; Zhang, Z.; Xie, Z.; Wei, Y.X.; Wang, L.; Xiao, Z.; et al. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. arXiv 2025, arXiv:2502.11089. [Google Scholar]
  28. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the middle: How language models use long contexts. In Proceedings of the ACL 2024, Bangkok, Thailand, 11–16 August 2024. [Google Scholar]
Figure 1. Segment-based sliding window attention.
Figure 2. Segmented long attention with compressed segments.
Figure 3. Long attention with overlapping compressed segments.
Figure 4. Attention vectors from CacheFormer's compressed long attention.
Table 5. Most attentive segments used by CacheFormer's attention partway through training.

Input Sequence | Top-k Attentive Segments (k = 7, u = 1) | Comments
0–255 words | [−1, −1, −1, −1, −1, −1, −1] | No cache segments are used to prevent future token leakage
256–511 words | [7, 8, 11, 12, 13, 14, 15] | Maximum segment allowed = 15
512–767 words | [7, 8, 27, 28, 29, 30, 31] | Maximum segment allowed = 31
768–1023 words | [8, 29, 32, 35, 37, 44, 47] | Maximum segment allowed = 47
