4.1.1. Multi-Head Self Attention
The attention mechanism has been used in deep learning for a long time. For example, the work in [27] simulates the way humans process visual information by focusing on selected regions and combines an RNN with attention to obtain, at that time, better image classification results. The work in [28] applies the attention mechanism to joint translation and alignment in machine translation and was the first to introduce attention to the field of NLP. The attention mechanisms mentioned above can effectively improve the generalization ability of a model. Recently, a new deep learning architecture, the Transformer, built entirely on the self-attention mechanism [26], has been proposed. It combines the advantages of convolutional and recurrent neural networks in a unified design and has achieved outstanding results in natural language processing [29] and image processing [30].
Transformer-based models are mostly built from two modules, the Encoder and the Decoder, which were first conceived for the translation task in natural language processing. In this article we only use the Encoder, so we explain the self-attention mechanism and the Encoder built from it.
The essence of the attention mechanism is to identify which features of the input are important for a given target and which are not, by generating weighting coefficients that are used to form a weighted sum of the input. To implement it, the input data are treated as <Key, Value> pairs: for the query value Query associated with the task objective, a similarity coefficient between Key and Query is computed to obtain the weight coefficient of the corresponding Value, and the weighted Values are then summed to produce the output. Before the appearance of Transformer-based models, attention was not a unified model but merely a mechanism: Query, Key and Value were obtained in different ways in different application domains, i.e., each domain had its own implementation. For convenience, Q, K and V are used below to denote Query, Key and Value.
The self-attention mechanism, proposed by the Google team and applied in the Transformer language model, can be used separately in encoding or decoding. Compared with the original attention mechanism, it focuses more on the internal connections of the input; the difference is that Q, K and V come from the same data source, i.e., they are derived from the same matrix through different linear transformations, as shown in Figure 2. The output of self-attention is calculated as follows:
\mathrm{Attention}(Q, K, V) = AV = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \quad (3)

where A is the attention matrix and \sqrt{d_k} is the scaling factor. Equation (3) is named scaled dot-product attention; the purpose of the scaling is to alleviate the gradient vanishing problem caused by the softmax.
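To make Equation (3) concrete, the following is a minimal NumPy sketch of scaled dot-product attention; the function names, array shapes and toy data are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Equation (3): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # attention matrix A, shape (n, n)
    return A @ V                          # weighted sum of the value vectors

# Toy usage: a sequence of n = 4 elements with dimension d_k = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```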
For example, let the input sequence be X = (x_1, x_2, \ldots, x_n) and the output sequence be B = (b_1, b_2, \ldots, b_n); the process of obtaining B from X by self-attention is shown in Figure 3. First, the input sequence is mapped to generate different q_i, k_i and v_i vectors for each element. For each element, its q_i is combined with all k_j through a series of calculations to obtain the attention weights of that element with respect to every other element, and these weights are combined with the corresponding v_j. The sum of these weighted values is the output b_i corresponding to that element. The other elements of the output sequence B are calculated in the same way. This process is fully parallelizable, without waiting for the previous time slice to be computed as in an LSTM or RNN. At the same time, the attention mechanism considers all elements of the input sequence during the operation, so its computation is global.
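As a hedged illustration of this per-element view, the following NumPy sketch computes each b_i independently of the others, which is what makes the computation parallelizable; the projection matrices Wq, Wk, Wv and all shapes are assumed for illustration only.

```python
import numpy as np

def self_attention_elementwise(X, Wq, Wk, Wv):
    # Map the same input X to Q, K, V through different linear transformations.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    outputs = []
    for i in range(X.shape[0]):              # each position i is independent
        scores = Q[i] @ K.T / np.sqrt(d_k)   # similarity of q_i with every k_j
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()             # attention weights of element i
        outputs.append(weights @ V)          # weighted sum of all v_j -> b_i
    return np.stack(outputs)                 # B = (b_1, ..., b_n)
```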
Table 2 summarizes, for an input sequence of length n and dimension d, the computational complexity of different kinds of networks, the number of sequential operations, and the path length required for two elements at distance n to interact.
The above discussion shows that the disadvantage of traditional sequence models such as the LSTM, namely that they must process the sequence step by step, is well addressed: the computation of Q, K and V is performed for all positions of the sequence at once, i.e., the processing is parallelized.
Analogous to using multiple filters simultaneously in a CNN, multi-head attention is used here to increase the diversity of the network, as shown in Figure 4. Multi-head attention allows the model to jointly attend to information from different positions and different representation subspaces.
Equation (4) gives the mathematical expression of the output of multi-head attention:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \quad (4)
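The following is a minimal NumPy sketch of the multi-head computation in Equation (4); the per-head projection matrices Wq, Wk, Wv (one slice per head), the output projection Wo and the head split are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Sketch of Equation (4): run attention in several learned subspaces
    (heads), concatenate the head outputs and project them with Wo."""
    heads = []
    for h in range(num_heads):
        # Per-head projections of the same input X into a smaller subspace.
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]   # each (n, d_head)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        scores -= scores.max(axis=-1, keepdims=True)
        A = np.exp(scores)
        A /= A.sum(axis=-1, keepdims=True)          # softmax over the keys
        heads.append(A @ V)                          # head_h
    return np.concatenate(heads, axis=-1) @ Wo       # Concat(head_1..head_h) W^O
```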
4.1.2. Composition of the Encoder and Its Improvement with GLU
The original Encoder in [26], with the multi-head attention described above as its core unit, has the structure shown in Figure 5.
The Encoder block consists of a multi-head attention layer and a feed-forward network (FFN). In the original Encoder, a positional embedding is first added to the input sequence: in natural language processing and image processing tasks there is no absolute positional relationship between words or between image patches, so trainable position information has to be assigned artificially. The embedded sequence, obtained from the embedding of each word in NLP or from the mapping of image patches in image processing, is fed, after the positional encoding has been added, into the multi-head attention layer, which maps the input sequence to an attention-weighted output sequence that is then passed to the feed-forward network. The feed-forward network in the original Encoder consists of two fully connected layers and is defined in Equation (5):

\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \quad (5)
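As a hedged illustration, Equation (5) can be sketched in NumPy as a two-layer, position-wise network with a ReLU in between; the weight names W1, b1, W2, b2 are assumptions for the example.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network of Equation (5):
    FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```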
However, when performing the related task on wireless signals, the signals are sampled in temporal order, so they themselves already contain position information and no additional positional encoding needs to be added manually. Therefore, the N Encoder layers used in this paper directly take the feature matrix extracted by the convolution layers as input.
The gated linear unit (GLU) was presented in [31]; its structure, although simple, is defined in Equation (6):

\mathrm{GLU}(X) = (XW + b) \otimes \sigma(XV + c) \quad (6)
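A minimal NumPy sketch of Equation (6), with W, b and V, c assumed as the weights of the linear branch and the gating branch, respectively:

```python
import numpy as np

def glu(X, W, b, V, c):
    """Equation (6): GLU(X) = (X W + b) * sigmoid(X V + c).
    The sigmoid branch gates the linear branch element-wise."""
    gate = 1.0 / (1.0 + np.exp(-(X @ V + c)))
    return (X @ W + b) * gate
```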
In [31], this gating structure makes CNNs more competitive with recurrent neural networks for language modeling, while taking full advantage of the fast, parallelized computation of CNNs. According to [32], in terms of its computation the self-attention mechanism learns from the data which parts of the features need attention, whereas a convolutional network learns features within a fixed spatial range specified by the convolution kernel, so a CNN can be seen as a special attention mechanism to some extent. Therefore, we try to use the GLU, which is used in CNNs to capture contextual information, to replace the FFN in the Encoder so that it reinforces contextual relations, as shown in Figure 6.
We found that, if the activation function is not considered, the first linear transformations of the GLU and the FFN are similar, so we tried to replace the FFN with a GLU and to change the activation function from the sigmoid to the GELU, a stronger activation function for sequence processing. The transformed module is formulated as follows:

(\mathrm{GELU}(xU) \otimes xV)W_o

Compared with the original FFN, the number of trainable weight matrices changes from two to three; however, by reducing the second dimension of U and V and the first dimension of W_o, the number of parameters of the modified module is reduced compared with the FFN.
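The following is a sketch of this modified feed-forward block under the formulation given above, i.e., a GELU-gated GLU with the matrices U, V and W_o mentioned in the text; the hidden width and the tanh approximation of GELU are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def glu_ffn(x, U, V, Wo):
    """GLU-based replacement of the FFN: (GELU(x U) * (x V)) Wo.
    Assumed shapes: U, V of (d_model, d_hidden); Wo of (d_hidden, d_model)."""
    return (gelu(x @ U) * (x @ V)) @ Wo
```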
The effect of this modification on the improvement of the Encoder will be discussed in Section 5.