A Simplified Query-Only Attention for Encoder-Based Transformer Models

Department of Electronics Engineering, Chosun University, 309 Pilmundae-ro, Dong-gu, Gwangju 61452, Republic of Korea
Interdisciplinary Program in IT-Bio Convergence System, Chosun University, Gwangju 61452, Republic of Korea
Centre for Human Brain Health, School of Psychology, University of Birmingham, Birmingham B15 2TT, UK
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8646
Submission received: 23 July 2024 / Revised: 28 August 2024 / Accepted: 20 September 2024 / Published: 25 September 2024


Transformer models have revolutionized fields like Natural Language Processing (NLP) by enabling machines to accurately understand and generate human language. However, these models’ inherent complexity and limited interpretability pose barriers to their broader adoption. To address these challenges, we propose a simplified query-only attention mechanism specifically for encoder-based transformer models to reduce complexity and improve interpretability. Unlike conventional attention mechanisms, which rely on query (Q), key (K), and value (V) vectors, our method uses only the Q vector for attention calculation. This approach reduces computational complexity while maintaining the model’s ability to capture essential relationships, enhancing interpretability. We evaluated the proposed query-only attention on an EEG conformer model, a state-of-the-art architecture for EEG signal classification. We demonstrated that it performs comparably to the original QKV attention mechanism, while simplifying the model’s architecture. Our findings suggest that query-only attention offers a promising direction for the development of more efficient and interpretable transformer-based models, with potential applications across various domains beyond NLP.

1. Introduction

Transformers have revolutionized Natural Language Processing (NLP), enabling machines to accurately understand and generate human language [1]. The transformer-based models, like the generative pre-trained transformer (GPT) [2,3,4,5] or Gemini [6,7], have empowered users worldwide to interact with artificial intelligence (AI) and conveniently access information. A recent multi-modal language model can recognize not only text but also images or videos [8]. Furthermore, the state-of-the-art model allows us to communicate in real-time with voice instead of text [9]. These transformer-based models leverage self-attention mechanisms to effectively capture the interactions between words in a sentence, moving beyond traditional sequential methods. The attention mechanism allows the transformer to understand a context better and perform well in NLP tasks, such as machine translation [10,11,12,13,14], summarization [13,14,15], and question answering [4,6,13,14]. As a result, the transformer has become the most dominant architecture in NLP [14,16,17,18,19].
The transformer is widely used not only in NLP but also in various fields, such as image processing [20,21,22,23], speech recognition [18,24,25], and neural signal processing [26,27,28,29]. In this regard, modified versions of transformer algorithms have been proposed so that the model can adequately extract features from different data types [18,19,20,21,26,27,28,29]. The vision transformer (ViT) outperformed recent convolutional neural network (CNN) algorithms by splitting an image into patches and feeding the patches to an encoder of a standard transformer [20]. The conformer could capture an audio sequence’s local and global dependencies by combining CNNs and transformers [18]. Similarly, the EEG conformer performed excellently by combining CNNs and transformers for neural signal processing [26]. The D-FaST applied attention to frequency, space, and time domains [29].
The transformer is composed of encoders and decoders [1]. The encoder consists of multi-head attention and a feed-forward network, and the decoder comprises masked multi-head attention, multi-head attention, and a feed-forward network. The feed-forward network is a conventional neural network, and multi-head attention is the core of the transformer. Attention is calculated from query, key, and value inputs by computing the dot product, scaling, softmax, and another dot product. Multi-head attention performs the process several times to take different attention views. Masked multi-head attention computes attention by masking a part of the input data to prevent the use of future data as input.
Numerous modified versions of the transformer algorithms have been proposed, using only the encoder part for classification [20,26,27,28,29]. Most of them use query (Q), key (K), and value (V) to calculate the attention as initially proposed [20,26,27,28,29]. The K and V can be redundant in the encoder. However, many modified algorithms for the encoder-based transformer still use all the Q, K, and V components.
In this paper, we propose a simplified query-only attention mechanism to enhance the efficiency and interpretability of transformer models by removing redundant elements. A transformer using only Q performed similarly to one using Q, K, and V, despite its simpler structure. To show that the proposed method plays the same role as the conventional one, we applied the simple attention and the conventional attention to the same transformer structure and compared their accuracies. An EEG conformer was used as the transformer structure because this state-of-the-art model performs well by extracting local features with a convolutional layer and recognizing global relations with attention. Two open datasets that are commonly used for evaluation (BCI competition IV datasets 2a and 2b) were used [30].

2. Related Works

Transformers were proposed for sequence-to-sequence tasks, such as machine translation, in which a sentence in one language is converted into another [1]. The transformer architecture consists of an encoder and a decoder. The encoder is responsible for encoding the input sequence into a compact representation, while the decoder generates the output sequence (e.g., a translated sentence) from this representation. The attention mechanism plays a crucial role by allowing the model to focus on different parts of the input sequence at each step of the output generation, modeling the relationships within and between the input and output sequences.
Owing to their exceptional performance in NLP, transformers have been widely adopted across domains beyond text processing [20,26,27,28,29,31,32,33,34]. In computer vision, where convolutional neural networks (CNNs) were traditionally dominant, the vision transformer (ViT) introduced a novel approach by dividing an image into patches and processing them using a transformer without any convolutional layers [20]. The ViT could achieve superior performance compared to state-of-the-art (SOTA) CNNs. This breakthrough led to the widespread adoption of transformers in image recognition tasks [31,32,33,34].
Transformers have also been increasingly applied in neural signal processing [26,27,28,29]. For instance, the EEG conformer model combined convolutional layers and a transformer [26]. The EEG conformer extracts features from EEG signals through convolutional layers and processes them within a transformer to predict the user’s intent. The EEG conformer has shown superior performance compared to previous state-of-the-art models in this domain. Many BCI studies have employed transformers for various classification tasks [27,28,29].
In many classification tasks, such as image recognition and neural signal processing, the encoder is often used independently, without the decoder. This is because the encoder compresses the high-dimension input sequences into the low-dimension outputs, whereas the decoder expands the low-dimension inputs into the high-dimension output sequences [20,26,27,28,29]. As a result, the encoder-based transformer does not require attention between input and output sequences, as used in sequence-to-sequence tasks like translation. In the encoder, the Q, K, and V vectors are all derived from the same input sequence through linear transformations, while in the decoder, Q, K, and V are generated from the input and output sequences [1]. Consequently, the Q, K, and V vectors in the encoder can be similar to each other, leading to redundancy. Despite this, many applications use distinct Q, K, and V components, even when only the encoder is utilized.
In this paper, we propose a simplified query-only attention mechanism for encoder-only transformer models. Our experiments demonstrated that the query-only attention mechanism achieves good performance comparable to traditional attention, which uses Q, K, and V, even in a simple structure. The proposed mechanism saves computational time and memory by eliminating the need to generate and learn the K and V components. This approach can be applied to most transformer models that rely solely on the encoder.

3. Materials and Methods

Unlike previous attention mechanisms that utilize all three Q, K, and V vectors, our proposed method employs only the Q vector, reducing computational complexity and model complexity. To evaluate the proposed method, we applied simple attention and previous attention to the EEG conformer structure [26]. We utilized the open-source EEG conformer codes to implement simple attention. The EEG conformer codes are available at (accessed on 20 September 2024). The data are available at (accessed on 20 September 2024).

3.1. Dataset

The proposed method was evaluated with two widely used EEG datasets, BCI competition IV dataset 2a and 2b. Details of the data can be found in paper [30]. The sampling rate was 250 Hz for both datasets. The signals were bandpass filtered between 0.5 and 100 Hz and notch filtered at 50 Hz in both cases.
Dataset I: BCI competition IV 2a was acquired from nine subjects performing a cue-based motor imagery task with four classes (left hand, right hand, both feet, and tongue). The EEG signals were recorded using 22 Ag/AgCl electrodes. The left mastoid was a reference, and the right mastoid was a ground. Two sessions were conducted on separate days for each subject. Each session included six runs, with 48 trials per run (12 trials per class), resulting in 288 trials per session. Each trial began with a fixation cross on a black screen and a brief auditory cue (t = 0 s), as shown in Figure 1a. After two seconds (t = 2 s), a directional cue (arrow) appeared for 1.25 s, indicating the desired motor imagery. The subjects performed the imagery task until the fixation cross disappeared at t = 6 s. No feedback was provided during the trials.
Dataset II: BCI competition IV 2b consists of EEG data from nine subjects performing a cue-based motor imagery task with two classes (left and right hand). Three bipolar electrodes (C3, Cz, and C4) were recorded. The left mastoid was a reference, and the Fz electrode was a ground. For each subject, five sessions were recorded. The first two sessions were acquired without feedback, and the last three sessions were measured with online feedback.
Each of the two sessions without feedback comprised six runs, with 20 trials per run (60 trials per class per session) for each subject. A fixation cross was presented on a screen at the beginning of each trial (t = 0 s), as shown in Figure 1b. After two seconds, a beep sound was given (t = 2 s). Three seconds after the beginning (t = 3 s), a directional cue (arrow) appeared for 1.25 s, indicating the desired motor imagery. The subjects were instructed to imagine the corresponding hand movement for 4 s.
Each of the three sessions with online feedback comprised four runs, with 40 trials per run (80 trials per class per session) for each subject. A gray smiley was presented on a screen at the beginning (t = 0 s), as shown in Figure 1c. A beep sound was given after two seconds (t = 2 s). The directional cue was presented from 3 s to 7.5 s, instructing the motor imagery.
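The trial counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only re-derives the per-session and per-class totals from the runs and trials per run stated in the text; the helper name `trials_per_session` is introduced here for illustration, and balanced classes are assumed.

```python
def trials_per_session(runs: int, trials_per_run: int) -> int:
    """Total number of trials recorded in one session."""
    return runs * trials_per_run

# Figures taken from the dataset description above.
sessions = {
    "2a": trials_per_session(runs=6, trials_per_run=48),
    "2b, no feedback": trials_per_session(runs=6, trials_per_run=20),
    "2b, feedback": trials_per_session(runs=4, trials_per_run=40),
}
n_classes = {"2a": 4, "2b, no feedback": 2, "2b, feedback": 2}

for name, total in sessions.items():
    # Assumes the classes are balanced within a session.
    per_class = total // n_classes[name]
    print(f"{name}: {total} trials per session, {per_class} per class")
```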

3.2. Transformer Architecture

We applied the proposed simple attention and the previous attention to the structure of the EEG conformer and compared the accuracies. The EEG conformer is a state-of-the-art deep learning model for EEG signal classification [26]. It extracts local and global features using convolution and attention modules. The EEG conformer consists of three main modules: a convolutional module, a self-attention module, and a classifier module, as illustrated in Figure 2.
(1) Convolutional module: This module consists of two convolutional layers and one pooling layer. The first convolutional layer extracts temporal features using 40 kernels (1 × 25) with a stride of (1, 1). The second convolutional layer acts as a spatial filter, using 40 kernels (22 × 1) with a stride of (1, 1), where 22 is the number of electrode channels. The average pooling layer reduces the feature dimension using a kernel (1 × 75) with a stride of (1, 15). Consequently, the convolutional module effectively captures the local spatio-temporal features of EEG signals.
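To make the effect of these kernel sizes concrete, the following sketch traces the feature-map size through the convolutional module for dataset I. It assumes a 4 s epoch at 250 Hz (1000 samples) and no padding in any layer; both are assumptions made for this illustration, and `conv_out` is a hypothetical helper.

```python
def conv_out(length: int, kernel: int, stride: int = 1) -> int:
    """Output length of a convolution or pooling window without padding."""
    return (length - kernel) // stride + 1

channels, samples = 22, 1000  # 22 electrodes; 4 s at 250 Hz (assumed epoch)

t = conv_out(samples, kernel=25)             # temporal convolution (1 x 25)
c = conv_out(channels, kernel=22)            # spatial convolution (22 x 1)
tokens = conv_out(t, kernel=75, stride=15)   # average pooling (1 x 75), stride 15

print(f"time steps after temporal conv: {t}")
print(f"channel rows after spatial conv: {c}")
print(f"tokens passed to self-attention: {tokens} (40 features each)")
```

Under these assumptions, the self-attention module receives a short sequence of tokens, each carrying 40 features (one per kernel).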
(2) Self-attention module: This module utilizes a self-attention mechanism to capture global interactions. Self-attention learns relationships between distant EEG signals, enabling more accurate classification.
The previous self-attention used query (Q), key (K), and value (V) to calculate the attention, as in Figure 2a. The previous attention is computed using Equation (1).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$
where $d_k$ denotes the length of Q, K, and V. A single-layer, feed-forward network generates the Q, K, and V from processed EEG signals. These values can be considered latent vectors of the input (EEG). Therefore, the K and V are redundant with the Q, although the Q, K, and V might differ slightly. The Q can substitute the K and V.
The proposed query-only attention mechanism calculates attention scores using only the Q vector, as in Figure 2b. It can be formulated using Equation (2).
$$\mathrm{Attention}(Q) = \mathrm{softmax}\left(\frac{QQ^{T}}{\sqrt{d_k}}\right)Q \tag{2}$$
In the encoder, the Q, K, and V are linear combinations of the same input $X$. Therefore, the Q, K, and V can be represented as $AX$, $BX$, and $CX$, respectively. The expression $QK^{T}V$ in Equation (1) can be rewritten as $AXX^{T}B^{T}CX$. If we represent Q in Equation (2) as $DX$, where $D$ is a matrix trained from a different linear network, then $QQ^{T}Q$ becomes $DXX^{T}D^{T}DX$. Thus, when $A$ equals $D$ and $B^{T}C$ equals $D^{T}D$, Equations (1) and (2) become equivalent. Therefore, in this case, Equation (1) can be entirely replaced by Equation (2). However, a matrix $D$ that mathematically satisfies the conditions $A = D$ and $B^{T}C = D^{T}D$ may not always exist, and $A$, $B$, and $C$ are not fixed matrices with specific values. The matrices can be changed based on the initial conditions and the training iterations. Therefore, a $D$ matrix may be found that produces outputs similar to those generated by the flexible combination $AXX^{T}B^{T}CX$.
This replacement makes the model more efficient and easier to understand intuitively. Figure 2 illustrates the procedures of the previous and the proposed attention.
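The two attention variants can be compared side by side in a minimal, single-head sketch. It uses only the Python standard library and is not the EEG conformer implementation: the toy input, the projection weights, and all function names are hypothetical, and tokens are stored as rows, so a projection is written as `matmul(x, w)`.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_weights(q, k):
    """softmax(Q K^T / sqrt(d_k)); each row belongs to one attending token."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    return softmax_rows([[s / math.sqrt(d_k) for s in row] for row in scores])

def qkv_attention(q, k, v):
    """Conventional attention with separate Q, K, and V."""
    return matmul(attention_weights(q, k), v)

def query_only_attention(q):
    """Query-only attention: Q takes the place of K and V."""
    return matmul(attention_weights(q, q), q)

# Toy input: 3 tokens with 2 features each (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[0.5, -0.2], [0.1, 0.8]]   # hypothetical projection weights
w_k = [[0.3, 0.4], [-0.6, 0.2]]
w_v = [[1.0, 0.0], [0.5, 0.5]]

q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
out_qkv = qkv_attention(q, k, v)       # three projections
out_q = query_only_attention(q)        # one projection
out_shared = qkv_attention(q, q, q)    # conventional form with K = V = Q
```

Here `out_q` and `out_shared` coincide, which is the special case in which the key and value projections equal the query projection. One visible consequence of the query-only form is that the pre-softmax score matrix $QQ^{T}$ is symmetric.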
Multi-head attention performs self-attention h times to capture diverse interactions. In this study, h was 10. The multi-head attention calculation was repeated N times, with N = 6.
(3) Classifier module: This module classifies the category of the EEG signals, based on the features extracted from the previous modules. Two fully connected layers were applied. A cross-entropy loss function was used for model training.

3.3. Evaluation

We evaluated our proposed method on an EEG conformer model and compared its performance with that of the original attention mechanism. For consistency, we applied the same training and testing methods as the EEG conformer study [26]. For dataset I, the first session of the experiment was used for model training, and the second was used for testing. The EEG signals were segmented from 2 to 6 s of each trial, based on the visual cue, which corresponds to the period of movement imagination. For dataset II, the first three sessions were used for training, and the last two sessions were used for testing. The signals were epoched from 3 to 7 s of each trial, based on the visual cue. The epoched signals were used as input for the EEG conformer model for both datasets I and II. Data augmentation was used because a large dataset was required to train the model. Segmentation and reconstruction (S&R) was performed in the time domain for the data augmentation, with the number of segments, Ns, set to 8. An Adam optimizer was used to train the EEG conformer, with a learning rate of 0.0002, β1 = 0.5, and β2 = 0.999. The number of self-attention computations (N) and the number of heads were 6 and 10, respectively. The number of epochs and the batch size were 2000 and 72, respectively. We evaluated the prediction accuracy at each epoch of training and testing. The average accuracy over all the epochs and a kappa value are reported for evaluation. A paired-sample t-test was performed between the prediction accuracies of the previous and the proposed methods. The kappa value can be calculated using Equation (3).
$$\mathrm{kappa} = \frac{p_o - p_e}{1 - p_e} \tag{3}$$
where $p_o$ denotes the average accuracy of all the epochs and $p_e$ denotes the accuracy of random guessing.
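The S&R augmentation mentioned in this section can be sketched as follows. The text states only that S&R operates in the time domain with Ns = 8; the details below (drawing each segment from a randomly chosen trial of the same class while keeping the temporal order) follow the way S&R is commonly described and are an assumption here. Trials are simplified to single-channel lists of samples, and the function name is hypothetical.

```python
import random

def segment_and_reconstruct(trials, n_segments, n_new, rng):
    """Build n_new artificial trials from trials of one class.

    trials: equal-length 1-D signals (lists of samples) of the same class.
    Segment i of a new trial is copied from segment i of a randomly chosen
    source trial, so the temporal order of the segments is preserved.
    """
    seg_len = len(trials[0]) // n_segments
    new_trials = []
    for _ in range(n_new):
        signal = []
        for i in range(n_segments):
            source = rng.choice(trials)
            signal.extend(source[i * seg_len:(i + 1) * seg_len])
        new_trials.append(signal)
    return new_trials

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Three toy trials of 1000 samples (4 s at 250 Hz); each trial is filled with
# its own index so the origin of every segment stays visible.
toy_trials = [[float(t)] * 1000 for t in range(3)]
augmented = segment_and_reconstruct(toy_trials, n_segments=8, n_new=5, rng=rng)
```

For multi-channel EEG, the same segment boundaries would be applied to every channel of a trial.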

4. Results

The results indicate that our method achieves a performance that is comparable to the original, while also simplifying the model’s architecture. The accuracy transitions on dataset I were similar between the previous attention and the simple attention, as shown in Figure 3. In both methods, the prediction accuracy increased rapidly as training progressed and then saturated. The blue and red lines show the average accuracy among the subjects. The sky-blue and light-red shaded bands illustrate the standard deviation among the subjects. The X-axis shows the number of training epochs, and the Y-axis represents the prediction accuracy. The accuracy of the previous method is depicted in Figure 3a, and the results of the proposed method are shown in Figure 3b.
As shown in Table 1 and Table 2, the accuracy across the subjects was similar between the previous and simple attention methods. Table 1 and Table 2 present the average accuracy over all the epochs for each subject on datasets I and II, respectively. PrevConformer denotes the EEG conformer with the previous attention, and SimpleConformer denotes the EEG conformer with the simple attention. The SimpleConformer shows similar accuracy to the PrevConformer, despite its simpler structure. Table 1 and Table 2 also list the accuracy of state-of-the-art models [26], such as FBCSP [35], ConvNet [36], EEGNet [37], and DRDA [38]. The results show that the attention-based transformer model outperforms the other machine learning algorithms.
The prediction accuracies of the conventional and proposed methods were not statistically different in either dataset I or II. For dataset I, the PrevConformer and the SimpleConformer accuracies were 75.82 ± 10.33% and 76.13 ± 10.09%, respectively. For dataset II, the PrevConformer and the SimpleConformer accuracies were 84.72 ± 9.59% and 84.62 ± 10.45%, respectively. The p-values of the paired-sample t-tests for datasets I and II were 0.513 and 0.770, respectively.
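As a worked example of the kappa definition, kappa = (p_o − p_e) / (1 − p_e), the sketch below applies it to the mean SimpleConformer accuracies quoted above. The chance levels assume balanced classes (four for dataset I, two for dataset II), and the resulting values are derived here from the reported means rather than quoted from the tables.

```python
def kappa(p_o: float, p_e: float) -> float:
    """Kappa from observed accuracy p_o and chance-level accuracy p_e."""
    return (p_o - p_e) / (1.0 - p_e)

# Mean SimpleConformer accuracies from the text, expressed as fractions.
kappa_i = kappa(p_o=0.7613, p_e=1 / 4)    # dataset I: four classes
kappa_ii = kappa(p_o=0.8462, p_e=1 / 2)   # dataset II: two classes

print(f"dataset I:  kappa = {kappa_i:.4f}")
print(f"dataset II: kappa = {kappa_ii:.4f}")
```

Under these assumptions, the two values come out at roughly 0.68 and 0.69.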

5. Discussion

This study proposed a simplified query-only attention mechanism for encoder-based transformer models. The primary goal was to improve the model’s efficiency and interpretability by reducing the complexity of the attention mechanism, without compromising performance. Our experimental results indicate that the query-only attention mechanism can achieve a performance that is comparable to the traditional QKV attention mechanism, which uses separate query, key, and value vectors.

5.1. Model Simplification and Its Implications

The simplification of model architectures has often driven advancements in technology. Using only the attention mechanism, the original transformer achieved a superior performance by eliminating recurrence and convolution [1]. Similarly, the performance of AlphaGo, the first algorithm to win over humans in the game of Go, was improved by simplifying the architecture. AlphaGo employed separate policy and value networks [39]. Its successors, AlphaGo Zero, AlphaZero, and MuZero, utilized a unified network for policy and value [40,41,42].
The proposed query-only attention mechanism could simplify the architecture by eliminating the redundant K and V vectors while achieving a performance that is comparable to the traditional model. As a result, the number of linear models needed to generate the Q, K, and V vectors is reduced from three to one, that is, to one-third. This implies that the training time and memory space required for these linear models are also reduced to about one-third.
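The size of this saving can be estimated by counting the parameters of the projection layers alone. The sketch assumes an embedding width of 40 (matching the 40 convolution kernels) and square projections with bias; both are assumptions, and `linear_params` is a hypothetical helper.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Number of trainable parameters in one linear layer."""
    return d_in * d_out + (d_out if bias else 0)

d_model = 40   # assumed embedding width (one feature per convolution kernel)
n_layers = 6   # N, the number of stacked attention blocks in the text

per_projection = linear_params(d_model, d_model)
qkv_total = 3 * per_projection * n_layers      # separate Q, K, and V projections
q_only_total = 1 * per_projection * n_layers   # a single Q projection

print(f"QKV projections:    {qkv_total} parameters")
print(f"Q-only projection:  {q_only_total} parameters")
print(f"ratio: {q_only_total / qkv_total:.3f}")
```

The count covers only the Q, K, and V projections. The convolutional module, the feed-forward networks, and the classifier are unchanged, so the relative saving for the whole model is smaller.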
The relatively complex structure of conventional attention mechanisms can make it challenging to understand the underlying principles of the algorithm. The meanings and roles of the Q, K, and V vectors differ between the encoder and decoder, which can further complicate intuitive understanding. Particularly in the case of encoder-based transformers, the Q, K, and V vectors are generated through the same process from the same input, making it unclear what each of these vectors signifies and why different values are necessary. In contrast, the proposed method simplifies this process by replacing the ambiguous K and V vectors with Q, as illustrated in Figure 2, potentially making the algorithm more straightforward to comprehend. The Q vector can be considered as a transformed input. Therefore, the proposed method clarifies that attention values are calculated based on the relationships within the input itself. In conclusion, the proposed attention mechanism enhances interpretability by replacing the unnecessary and ambiguous K and V values with the interpretable Q value. This enhanced intuitiveness could facilitate the development of new algorithms.

5.2. Limitation

While the experimental results showed that the proposed attention was comparable to the traditional attention, it should be noted that the experiments were conducted only with the EEG conformer and EEG datasets. The temporal relationships in EEG signals are relatively local. Therefore, additional experiments with other types of data and models are required in future work. Moreover, query-only attention can be applied only to encoder-based models; it cannot be applied to encoder–decoder or decoder-based models.

6. Conclusions

In this paper, we proposed the simplified query-only attention mechanism to improve the efficiency and interpretability of encoder-based transformer models. Our method, which utilizes only the Q vector for attention calculation, significantly reduces the model’s complexity while maintaining its performance. The experimental results on an EEG conformer model demonstrate that the proposed query-only attention mechanism performs similarly to the original QKV attention mechanism, while simplifying the model’s architecture. These findings suggest that query-only attention offers a promising approach for developing more interpretable transformer-based models. Future work could involve applying our proposed method to a broader range of transformer-based models and comparing its performance to other attention mechanisms. The query-only attention mechanism could also be used to develop new transformer models.

Author Contributions

Conceptualization, H.-g.Y. and K.-m.A.; methodology, H.-g.Y.; software, H.-g.Y.; validation, H.-g.Y.; writing—original draft preparation, H.-g.Y. and K.-m.A.; writing—review and editing, H.-g.Y. and K.-m.A.; supervision, H.-g.Y. and K.-m.A.; funding acquisition, H.-g.Y. All authors have read and agreed to the published version of the manuscript.


Funding

This research was supported by the Basic Science Research Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2017R1A6A1A03015496).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We used open EEG data for our evaluation. The data are shared at (accessed on 20 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. Experimental paradigm. (a) Paradigm of dataset I. (b) Paradigm without feedback of dataset II. (c) Paradigm with feedback of dataset II.
Figure 2. EEG conformer structure. The convolutional module is illustrated in sky blue, the self-attention module in light green, and the fully connected layer module in orange. (a) EEG conformer with the conventional attention mechanism using query (Q), key (K), and value (V) vectors. (b) EEG conformer with the proposed simplified attention mechanism using only a query (Q).
Figure 3. Accuracy transition on dataset I. The X-axis indicates the number of training epochs, and the Y-axis represents prediction accuracy. (a) The results from the EEG conformer with the conventional attention mechanism. The blue line represents the average accuracy across subjects, with the sky-blue shaded area showing the standard deviation. (b) The results from the EEG conformer with the simplified attention mechanism. The red line represents the average accuracy among subjects, with the light red shaded area indicating the standard deviation among subjects.
Table 1. Prediction accuracy of dataset I.
Table 2. Prediction accuracy of dataset II.
