*3.2. Transformer and Attention*

Before explaining our transformer approaches, we first introduce the concepts of the transformer model and attention. The transformer was designed for sequence-to-sequence tasks. It uses stacked self-attention layers to encode the contextual information of an input sequence. Attention is a mechanism that enables a model to focus on relevant parts of the input sequence to enhance the meaning of the word of interest [32]. The inputs to the transformer model are word embedding vectors. The model weighs these vectors according to their neighboring context within the sentence. For example, in the sentence "He swam across the river to the other bank", the word 'bank' has a contextualized vector that is closer to the meaning of 'sloping raised land' than to 'a financial institution' because attention focuses on the words "swam" and "river".

The attention provides a contextualized representation for each word and captures its relatedness to the other words occurring in the sequence. BERT processes input tokens through transformer encoder blocks and returns a hidden state vector for each token. These hidden state vectors encapsulate information about each input token and the context of the entire sequence.
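For illustration, the following minimal sketch (not the authors' code) shows how such per-token hidden state vectors can be obtained from a pretrained BERT encoder using the Hugging Face `transformers` library; the checkpoint name is only an example.

```python
# Minimal sketch: per-token hidden state vectors from a pretrained BERT
# encoder (checkpoint name is an example, not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "He swam across the river to the other bank"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One hidden state vector per input token (including [CLS] and [SEP]):
# shape = (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```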

The attention score, as represented by Equation (1), is computed after creating a query(*Qi*), key(*Ki*), and value(*Vi*) embedding vector for each token in a sentence. The calculation involves three parts: (1) computing the attention score between query and key using a dot-product similarity function, (2) normalizing the attention score using softmax, and (3) weighting the original word vectors according to surrounding context using the normalized attention weights.

$$\begin{aligned} Q\_i &= Q W\_i^Q,\quad K\_i = K W\_i^K,\quad V\_i = V W\_i^V\\ head\_i &= \text{Attention}(Q\_i, K\_i, V\_i) = \text{softmax}\left(\frac{Q\_i K\_i^T}{\sqrt{d\_k}}\right) V\_i\\ \text{softmax}(s\_i) &= \frac{e^{s\_i}}{\sum\_{j=1}^n e^{s\_j}}\\ \text{MultiHead}(Q, K, V) &= \text{Concat}(head\_1, head\_2, \dots, head\_h)\, W^O \end{aligned} \tag{1}$$

In Equation (1), *d<sub>k</sub>* is the dimension of the query/key/value vectors and *n* is the sequence length. The matrix multiplication *QK<sup>T</sup>* computes the dot product for every possible pair of queries and keys. If two token vectors are close (similar) to each other, their dot product will be large. The resulting attention-score matrix has shape *n* × *n*, where each row contains the attention scores between a specific token and all other tokens in the sequence. The softmax followed by multiplication with the value matrix represents a weighted mean, and √*d<sub>k</sub>* is a scaling factor. With multi-headed self-attention, multiple sets of *Q*/*K*/*V* weight matrices are used to capture different representations of the input sequence.
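As a concrete illustration of Equation (1), the following NumPy sketch (ours, for exposition only) computes scaled dot-product attention for a single head; the matrix shapes follow the notation in the text.

```python
# Scaled dot-product attention for one head, following Equation (1).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (n, d_k) for a sequence of n tokens."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                              # weighted mean of value vectors

n, d_k = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)                             # (5, 8) (5, 5)
```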

As a result, the attention operation helps the model focus more on the values associated with keys that have higher similarity and capture important contextual information in the sequence. It produces a contextualized representation of the whole sequence and can be interpreted as connection weights between each word token and all other words in a given sequence. Figure 3 shows how multi-head self-attention is computed for an example sentence: "concomitant administration of other @DRUG\$ may potentiate the undesirable effect of @DRUG\$." In this case, "concomitant" might be highly associated with "administration" by the self-attention. The outputs of the attention heads are concatenated before being further processed and fed to a FFNN (feed-forward neural network). The transformer encoder takes the input sequence and maps it into a representational space. It generates a *d<sub>embed</sub>*-dimensional vector representation for each position of the input, as shown in Figure 3, which is then sent to the decoder.
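The multi-head combination step can be sketched as follows (again an illustrative example, reusing the `scaled_dot_product_attention` function from the previous sketch): each head uses its own projection matrices, and the concatenated head outputs are projected with *W<sup>O</sup>*.

```python
# Multi-head self-attention: h heads with separate projections,
# concatenated and projected with W_O as in Equation (1).
import numpy as np

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_embed); W_q/W_k/W_v: per-head lists of (d_embed, d_k) matrices."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        head, _ = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
        heads.append(head)
    return np.concatenate(heads, axis=-1) @ W_o               # (n, d_embed)

n, d_embed, h = 5, 16, 4
d_k = d_embed // h
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_embed))
W_q = [rng.normal(size=(d_embed, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_embed, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_embed, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_embed))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o).shape) # (5, 16)
```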

**Figure 3.** Visualization of multi-head self-attention for an example sentence.

In addition to word embeddings, the transformer also employs positional embeddings to represent a token's positional information. This allows for parallel processing with causal masking, which restricts the use of future information during training by masking future tokens that appear after the current position in the input. The positional embedding vector for each input token can be easily computed using sine and cosine functions with Equation (2), where *d<sub>model</sub>* represents the dimension of the input embedding vector.

The transformer consists of a stacked encoder and decoder, both of which are built with two sublayers: the multi-head self-attention layer mentioned earlier and a fully connected FFN (feed-forward neural network) layer. The FFN consists of two linear transformations with a ReLU (Rectified Linear Unit) activation, as shown in Equation (3). To prevent the model from losing important features of the input data during training, residual connections, as shown in Equation (4), are employed around each of the sub-layers, followed by layer normalization:

$$\text{PE}\_{\text{(pos,2i)}} = \sin\left(\frac{pos}{10000^{\frac{2i}{d\_{\text{model}}}}}\right), \text{PE}\_{\text{(pos,2i+1)}} = \cos\left(\frac{pos}{10000^{\frac{2i}{d\_{\text{model}}}}}\right) \tag{2}$$

$$\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}\mathbf{W}\_1 + \mathbf{b}\_1)\mathbf{W}\_2 + \mathbf{b}\_2 \tag{3}$$

$$\text{LayerNorm}(\mathbf{x} + \text{Sublayer}(\mathbf{x})) \tag{4}$$
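For concreteness, the following PyTorch sketch (ours, not the authors' implementation) combines Equations (2)–(4): sinusoidal positional embeddings, the position-wise FFN with ReLU, and the residual connection followed by layer normalization.

```python
# Illustrative sketch of Equations (2)-(4).
import math
import torch
import torch.nn as nn

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)."""
    pos = torch.arange(max_len).unsqueeze(1)                         # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                               # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                               # odd dimensions
    return pe

class FFNSublayer(nn.Module):
    """FFN(x) = max(0, xW1 + b1)W2 + b2, wrapped as LayerNorm(x + FFN(x))."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.ffn(x))                            # Equation (4)

x = torch.randn(2, 10, 512) + positional_encoding(10, 512)          # embeddings + PE
print(FFNSublayer()(x).shape)                                        # torch.Size([2, 10, 512])
```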

Besides the two sub-layers, the decoder has an additional sublayer called multi-head cross-attention, which considers the relationship between the output of the encoder and the input of the decoder. The output of the encoder is transformed into a set of *K* and *V* vectors and utilized in the cross-attention. The cross-attention takes its *Q* matrix from the self-attention layer of the decoder, and its *K* and *V* matrices from the encoder. Unlike its counterpart in the encoder, the self-attention layer in the decoder is modified to prevent positions from attending to subsequent positions by masking. This masking ensures that the predictions for position *i* can depend only on the known outputs at positions less than *i*.
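A small sketch of such a causal (look-ahead) mask, in the same illustrative spirit as the examples above:

```python
# Causal mask for decoder self-attention: position i may only attend to
# positions <= i, so "future" scores are set to -inf before the softmax.
import torch

def causal_mask(seq_len):
    # True above the diagonal marks the future positions to be masked.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(5, 5)                                # raw attention scores
masked = scores.masked_fill(causal_mask(5), float("-inf"))
weights = torch.softmax(masked, dim=-1)                   # each row sums to 1, no look-ahead
print(weights)
```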

In practice, the encoder maps an input sequence to a sequence of continuous contextual representations. Given these representations, the decoder auto-regressively generates an output sequence, one element at a time, using the previously generated elements as additional input when generating the next.
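Conceptually, the decoding loop looks like the following sketch; `model`, `bos_id`, and `eos_id` are placeholders for exposition rather than a real API.

```python
# Conceptual greedy auto-regressive decoding loop (placeholders, not a real API).
import torch

def greedy_decode(model, encoder_states, bos_id, eos_id, max_len=50):
    generated = [bos_id]
    for _ in range(max_len):
        inputs = torch.tensor([generated])            # tokens generated so far
        logits = model(inputs, encoder_states)        # (1, len, vocab_size)
        next_id = int(logits[0, -1].argmax())         # most probable next token
        generated.append(next_id)
        if next_id == eos_id:                         # stop at end-of-sequence
            break
    return generated
```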

#### **4. Methods**

In this section, we first describe the three transformers used as baseline models and then introduce the proposed models, BERTGAT and T5slim\_dec, for relation extraction.

#### *4.1. Baseline Methods*

As baseline models for our research on interaction extraction, we employed three types of transformer: BERT (encoder-only) [8], GPT-3 (decoder-only) [9], and T5 (encoder–decoder) [10]. First, BERT is a bidirectional transformer which uses only the encoder block of the transformer. For its detailed structure and implementation, please refer to the study [22]. BERT is pretrained on two unsupervised tasks: (1) masked language modeling (MLM), where some of the input tokens are randomly masked and the model is trained to predict the masked tokens, and (2) next sentence prediction (NSP), where the model is trained to predict whether one sentence follows another, as shown in Figure 4. It uses a WordPiece tokenizer and places a special classification token '[CLS]' as the first token of every sequence, which corresponds to the aggregated representation of the whole sequence.

**Figure 4.** Pretraining methods of transformers.

We initialized the model with SCIBERT [23] for drug-related relationship extraction in order to leverage domain-specific knowledge and then fine-tuned all of the parameters using the labeled ChemProt and DDI datasets. SCIBERT has the same architecture as BERT but was pretrained on scientific text, consisting of 1.14 million papers from the computer science domain (18%) and the broad biomedical domain (82%), sourced from Semantic Scholar [33]. In addition, an in-domain WordPiece vocabulary was newly constructed on the scientific corpus. Ultimately, we fed the special '[CLS]' token vector of the final hidden layer into a linear classification layer with softmax output to classify the interaction types.
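The following Hugging Face sketch illustrates this setup; the checkpoint name and the number of interaction classes are assumptions for illustration, not our exact configuration.

```python
# Illustrative SCIBERT-based relation classifier: the final-layer '[CLS]'
# representation feeds a linear + softmax classification head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "allenai/scibert_scivocab_uncased"     # assumed SCIBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=5)                       # e.g., 5 interaction types (example value)

sentence = ("concomitant administration of other @DRUG$ may potentiate "
            "the undesirable effect of @DRUG$.")
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits                 # scores over interaction types
print(torch.softmax(logits, dim=-1))                # fine-tuning on ChemProt/DDI would precede this
```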

Secondly, we employed the text-to-text transfer transformer (T5) [10], which is an encoder–decoder model. In that research, the authors experimented with various types of transformers and demonstrated that the encoder–decoder transformer architecture, combined with the denoising (masked language modeling) objective, yielded the best performance on most NLP tasks. T5 was pretrained with self-supervision through a learning objective called span-based language masking, in which sets of consecutive tokens are masked with sentinel tokens and the target sequence is predicted as the concatenation of the real masked spans, as shown in Figure 5. The tokens for pretraining were randomly sampled, dropping out 15% of the tokens in the input sequence. It uses the SentencePiece tokenizer [34] to encode text.

**Figure 5.** T5's pretraining scheme.

In general, encoder-only models such as BERT are easily applicable to classification or prediction tasks by using the '[CLS]' token, which provides a summary representation of the entire input sentence. In contrast, T5 casts every text processing problem as a text-to-text generation problem that takes text as input and produces new text as output. Therefore, our relation classification problem is treated as a generation task for interaction types. Initially, we used the pretrained parameters of the SciFive [30] model and then fine-tuned it on our specific dataset for the relation extraction tasks. The SciFive model was retrained on various text combinations, consisting of the C4 corpus [35], PubMed abstracts, and PMC full-text articles, to optimize the pretrained weights from T5 in the context of biomedical literature. Consistent with the original T5 model [10], SciFive learned to generate a target text sequence for a given input text sequence using a learning objective known as span-based masked language modeling. The output sequence is generated during the decoding phase by applying a beam search algorithm. This involves maintaining the top *n* most probable output sequences at each timestep and finally generating the output sequence with the highest probability.
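As a rough illustration (our own sketch; the checkpoint name, prompt format, and label string are assumptions, not our exact setup), relation extraction as text-to-text generation with beam search can look like this:

```python
# Relation extraction as text-to-text generation with a T5-style model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "razent/SciFive-base-Pubmed_PMC"   # assumed SciFive checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = ("ddi: concomitant administration of other @DRUG$ may potentiate "
        "the undesirable effect of @DRUG$.")    # hypothetical task-prefix format
inputs = tokenizer(text, return_tensors="pt")

# Beam search keeps the top-n candidate sequences at each decoding step
# and returns the most probable completed sequence.
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g., an interaction label
```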

Finally, we employed GPT-3 (Generative Pretrained Transformer 3) [9], which utilizes constrained self-attention in which every token can only attend to its left context. As a decoder-only transformer, it was pretrained on a diverse range of web text to predict the next token in an autoregressive manner given the preceding text. It can generate words conditioned only on the left context, so it cannot learn bidirectional interactions.

Previous pretrained models have a limitation in that they need additional large, labeled datasets for a task-specific fine-tuning process to achieve desirable performance. Thus, GPT-2 was designed as a general language model for various NLP tasks without the need for extensive fine-tuning. It is capable of performing downstream tasks with little or no fine-tuning, including zero-shot and few-shot learning scenarios, where only a few labeled examples are available. However, the results were not satisfactory on some tasks; such models still need fine-tuning on task-specific labeled data to improve performance.

In contrast, GPT-3 increased the capacity of transformer language models to 175 billion parameters, thereby allowing the model to utilize its language skills to comprehend tasks from a few examples or natural language instructions. GPT-3 has demonstrated strong performance across a wide range of downstream tasks with a meta-learning technique called 'in-context learning', which allows a language model to develop a broad set of skills and pattern recognition abilities during unsupervised pretraining. This enables the model to rapidly adapt to a desired task at inference time. This large-scale, autoregressive language model, trained on a massive amount of text data, has a deep understanding of the rich context of language, which enables it to generate text similar to human writing.

To achieve this, example sequences for various tasks are used as text input to the pretrained model. For instance, sequences for addition can provide a context for performing arithmetic addition, while error correction sequences can demonstrate how to correct spelling mistakes. Given the context, the model can learn how to perform the intended task and utilize the language skills learned during the pretraining phase.

Recently, OpenAI announced ChatGPT (GPT-3.5) and GPT-4, generative AI models based on reinforcement learning from human feedback (RLHF) and ultra-large language models, which have shown very impressive results in generating responses. In this paper, we partially evaluated the potential of GPT-3-style models on relation extraction using the GPT-Neo 125M and GPT-Neo 1.3B models [36], which are dense autoregressive transformer-based language models with 125 million and 1.3 billion parameters trained on large-scale web text.
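A minimal illustration of in-context (few-shot) prompting with GPT-Neo follows; the prompt format and relation label are assumptions for exposition, not the exact prompts used in our experiments.

```python
# Few-shot, in-context relation extraction prompt with GPT-Neo (illustrative).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-125M"          # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Sentence: @DRUG$ may increase the serum concentration of @DRUG$.\n"
    "Relation: mechanism\n\n"                   # hypothetical in-context example
    "Sentence: concomitant administration of other @DRUG$ may potentiate "
    "the undesirable effect of @DRUG$.\n"
    "Relation:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))  # model's predicted label
```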
