*4.3. T5 with Non-Autoregressive Decoder: T5slim\_dec*

As mentioned earlier, T5 [10] converts all text-based language tasks into a text-to-text format. As a result, our interaction classification problem is transformed into a relation-type generation task, where the model generates the interaction label between the mentioned entities for a given input sentence. For example, the output label "DDI-effect" is tokenized in T5 as '<s>', '\_DD', 'I', '-', 'effect', '</s>', and "AGONIST" as '<s>', '\_AG', 'ON', 'IST', '</s>'. These tokens serve as the decoder's inputs. As in the encoder, the target sequence fed into the decoder is embedded, and a positional encoding is added to indicate the position of each token. The self-attention layer in the decoder allows each position to attend only to earlier positions of the output sequence by masking future positions. This means that the decoder generates output tokens auto-regressively, predicting one token at a time based on the previous tokens, as shown in Equation (7), until the special end symbol '</s>' is produced, indicating that the decoder has completed its output. For a given input sequence X, the target sequence Y of length *m* is generated through a chain of conditional probabilities based on left-to-right sequential dependencies, where $y_{<i}$ denotes the target tokens before position *i*.

$$P(Y|X) = \prod_{i=1}^{m} p(y_i \mid y_{<i}, X) \tag{7}$$
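
As a concrete illustration of this label tokenization, the following sketch runs a T5-style SentencePiece tokenizer over two relation labels. It is a minimal sketch assuming the public 't5-base' vocabulary, so the exact subword pieces may differ from those quoted above and from SciFive's vocabulary.

```python
# Minimal sketch of the label tokenization described above (assumes "t5-base").
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

for label in ["DDI-effect", "AGONIST"]:
    pieces = tokenizer.tokenize(label)    # subword pieces the decoder must emit one by one
    ids = tokenizer(label).input_ids      # piece ids followed by the end symbol '</s>'
    print(label, pieces, ids)
```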

During the training phase, the model uses teacher forcing, feeding the decoder the actual target tokens from the ground-truth data rather than its own previously generated tokens, which helps it learn to predict the next token more accurately. The output sequence is generated by searching for the most likely sequence of tokens; by incorporating beam search, T5 can produce more coherent, accurate, and contextually appropriate text outputs. However, to perform a classification task under the text-to-text framework, the target label is treated as output text, which is typically a single word or short string. Thus, the autoregressive decoding typically used for generating sequences of output text is not required for class inference. In our work, the output of T5 corresponds to a single interaction string representing a label such as "DDI-effect" or "AGONIST". The decoder generates output tokens, each of which represents a specific class from a limited set of class labels. As noted in Liu et al.'s study [36], the decoder parameters of the T5 model are highly under-utilized for classification, in contrast to typical encoder–decoder models, where the decoder layers account for more than half of the total parameters. Moreover, when there is only one output token, the decoder has few previously generated tokens as input, which reduces the role of the self-attention mechanism. In such cases, most of the information is passed from the encoder to the decoder and is processed in the cross-attention layer.

Thus, we removed the self-attention block in the decoder, as shown in Figure 6b, and tailored the T5 model to fit our interaction-type classification task in a non-autoregressive manner. This approach is inspired by the EncT5 model [39], an encoder-only transformer architecture that reuses T5 encoder layers without code changes. However, we retained the cross-attention layers to capture the relationship between the input sentence and the output interaction category. Cross-attention combines two embedding sequences of the same dimension: it transfers information from the input sequence to the decoder layer to generate the output token, which represents the interaction label. The decoder processes the representation of the input sequence through the cross-attention mechanism, yielding a new context-sensitive representation. The embedded vector of the interaction label serves as the query, while the output representation of the encoder is used as both the key and the value in the cross-attention layer.
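
The following PyTorch sketch illustrates the resulting decoder block: only a cross-attention sublayer and a feed-forward sublayer remain, with the embedded label token as the query and the encoder output as key and value. It is an illustrative simplification (standard multi-head attention, post-layer normalization, hypothetical dimensions), not the exact T5slim\_dec implementation, which modifies the T5 decoder layers directly.

```python
import torch
import torch.nn as nn

class CrossAttentionOnlyDecoderBlock(nn.Module):
    """Decoder block without self-attention: cross-attention + feed-forward only."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 4096):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, label_query: torch.Tensor, encoder_states: torch.Tensor) -> torch.Tensor:
        # label_query:    (batch, 1, d_model) embedded interaction-label token (the query)
        # encoder_states: (batch, src_len, d_model) encoder output (keys and values)
        attn_out, _ = self.cross_attn(label_query, encoder_states, encoder_states)
        x = self.norm1(label_query + attn_out)
        return self.norm2(x + self.ffn(x))

# Toy usage: one label query attends over an encoded 30-token sentence.
block = CrossAttentionOnlyDecoderBlock()
out = block(torch.randn(2, 1, 1024), torch.randn(2, 30, 1024))   # -> (2, 1, 1024)
```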

For this, we add the target labels to the vocabulary so that they are handled as whole tokens rather than split into subword tokens. We also opt for lexically meaningful labels such as 'ACTIVATOR', 'AGONIST', 'AGONIST-ACTIVATOR', and 'AGONIST-INHIBITOR' instead of generic labels such as 'CPR:1' or 'CPR:2'. The model learns an embedding for each label token, and this learned embedding then determines how to optimally pool or aggregate information from the encoder. Finally, the decoder's output is fed into a linear classifier (a fully connected layer), which transforms the high-dimensional context representation into a vector whose size equals the number of possible labels. The linear classifier generates decoder\_output\_logits, the raw, unnormalized output values associated with each label in the vocabulary. These logits are passed through a softmax function to convert them into a probability distribution over the entire set of possible labels, and the label with the highest probability is selected as the output text. We refer to this model as T5slim\_dec. Figure 6b presents the overall architecture of T5slim\_dec.
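
A sketch of this vocabulary-extension step with the HuggingFace API is shown below; the base checkpoint and the label subset are placeholders, and the extension simply registers each label as a single token before fine-tuning.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

LABELS = ["ACTIVATOR", "AGONIST", "AGONIST-ACTIVATOR", "AGONIST-INHIBITOR"]  # subset, for illustration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

tokenizer.add_tokens(LABELS)                   # each label becomes a single, whole token
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix accordingly

# Each label is now one piece, so its learned embedding can serve as the
# cross-attention query, and the linear head produces one logit per vocabulary entry.
print(tokenizer.tokenize("AGONIST-ACTIVATOR"))  # -> ['AGONIST-ACTIVATOR']
```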

Figure 8 visually compares the operational mechanisms of T5 and T5slim\_dec, highlighting their differences. As shown in Figure 8, T5 generates one token at a time based on the input sequence and the previously generated tokens in the auto-regressive decoding process. At each step, the model calculates decoder\_output\_logits for all tokens in the vocabulary; the token with the highest probability is selected and appended to the output sequence, and the resulting tokens are then combined to form the final readable output text.

**Figure 8.** Comparison of the T5 and T5slim\_dec models.

#### **5. Results and Discussion**

#### *5.1. Experimental Setup*

In this section, we discuss the results of the transformers suggested in the previous section and how they can be interpreted in comparison to previous studies. All code was implemented with HuggingFace's Transformers [40], a platform that provides APIs and libraries to access and train state-of-the-art pretrained models available from the HuggingFace hub. We used the AdamW optimizer in conjunction with the cross-entropy loss function for training the models.
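
For reference, a minimal sketch of this optimization setup (AdamW with a cross-entropy objective) is shown below; the linear classifier, hidden size, and label count are stand-ins for the actual fine-tuned models.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(1024, 14)     # hidden size and number of labels are illustrative
optimizer = torch.optim.AdamW(classifier.parameters(), lr=2e-5, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

hidden = torch.randn(8, 1024)        # dummy pooled sentence representations
labels = torch.randint(0, 14, (8,))  # dummy gold interaction labels

loss = loss_fn(classifier(hidden), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```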

The experimental results were obtained in a GPU-accelerated computing environment using an NVIDIA Tesla V100 32 GB GPU and Google Colab Pro+ with an NVIDIA A100 SXM4 80 GB GPU. To evaluate model performance, accuracy and F1-score were adopted as evaluation metrics. Accuracy is the proportion of correctly predicted instances out of the total data, and the F1-score is the harmonic mean of precision and recall, designed to balance the two values, as in Equation (8).

$$\begin{aligned}
Accuracy &= \frac{TP + TN}{TP + TN + FP + FN} \\
Precision &= \frac{TP}{TP + FP}, \qquad Recall\ (sensitivity) = \frac{TP}{TP + FN} \\
F1\text{-}score &= 2 \times \frac{Precision \times Recall}{Precision + Recall}
\end{aligned} \tag{8}$$
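
The same metrics can be computed directly with scikit-learn, as in the toy sketch below (micro averaging shown; the labels are illustrative only).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["DDI-effect", "DDI-false", "DDI-mechanism", "DDI-effect"]  # toy gold labels
y_pred = ["DDI-effect", "DDI-false", "DDI-effect", "DDI-effect"]     # toy predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="micro"))
print("recall   :", recall_score(y_true, y_pred, average="micro"))
print("F1-score :", f1_score(y_true, y_pred, average="micro"))
```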

#### *5.2. Baseline Models*

We begin by presenting the experimental results for the baseline models. For the encoder transformer, the 'SCIBERT-uncased' pretrained model [23], which has the same structure as BERT [8], was utilized. The model was trained from scratch with SCIVOCAB, a new WordPiece vocabulary built from a scientific corpus using the SentencePiece library. Unlike BERT, the model allows a maximum sentence length of up to 512 tokens. In our relation classification, the final vector of the '[CLS]' token was fed into a linear classification layer with softmax outputs to classify interactions. According to the original SCIBERT study [23], the model achieved a micro F1-score of 0.8364 on the ChemProt dataset; in our own experiments, we observed a slightly lower score of 0.8169. In classification tasks in which every case is guaranteed to be assigned to exactly one class, micro F1 is equivalent to accuracy.
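
A minimal sketch of this baseline setup is given below; the hub checkpoint 'allenai/scibert_scivocab_uncased' is the public SciBERT release, while the label count and example sentence are placeholders.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=6   # e.g., 5 CPR groups + 'false' (illustrative)
)

enc = tokenizer("@CHEM$ markedly reduced the activity of @PROT$.",
                truncation=True, max_length=512, return_tensors="pt")
logits = model(**enc).logits    # head built on the final '[CLS]' vector; softmax gives class probabilities
```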

For T5 [10], our tasks were fine-tuned using the 'SciFive-large-Pubmed\_PMC' pretrained model [30]. The model was first initialized with pretrained weights from the base T5 model and then re-trained on C4 [35], PubMed abstracts, and PMC full-text articles. It has 24 encoder/decoder layers and 16 attention heads; the input length, target length, and d\_model are 512, 16, and 1024, respectively. SciFive [30] used the SentencePiece model [34] for the base vocabulary. Its relation extraction performance on the ChemProt and DDI sets was reported as 0.8895 and 0.8367 (micro F1-score), respectively; in our experiments, the SciFive pretrained model achieved 0.9100 and 0.8808 on the same sets. The number of beams was set to 2 during the decoding phase.
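
A sketch of this generative baseline is shown below. We assume the public SciFive checkpoint id 'razent/SciFive-large-Pubmed_PMC' on the HuggingFace hub and an illustrative task prefix; a checkpoint fine-tuned on the relation extraction data is needed to obtain meaningful labels.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL_ID = "razent/SciFive-large-Pubmed_PMC"   # assumed hub id of the pretrained SciFive model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)

inputs = tokenizer("chemprot: @CHEM$ markedly reduced the activity of @PROT$.",
                   truncation=True, max_length=512, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=2, max_length=16)   # beam search, as in our decoding phase
print(tokenizer.decode(outputs[0], skip_special_tokens=True))    # the generated interaction label string
```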

GPT-3 is one of the largest generative language models, with 175 billion parameters, trained on a massive text dataset and capable of generating high-quality text on a wide range of tasks. However, GPT-3 is not open-source and is only available through OpenAI's API. Therefore, for our experiments, we fine-tuned our tasks using EleutherAI's pretrained models instead. EleutherAI has released several open-source language models, called GPT-Neo, which perform similarly to GPT-3 but with fewer parameters. Nevertheless, GPT-NeoX-20B still has 20 billion parameters and requires a large amount of RAM to load the model as well as substantial computing power to run efficiently. In these experiments, smaller models, such as GPT-Neo1.3b and GPT-Neo125M, were used to reduce resource requirements. For future work, the performance of ChatGPT or GPT-4 will be evaluated in the context of biomedical relation extraction to further explore their potential in this domain. Table 5 presents the number of entities in the datasets.

**Table 5.** The number of entities.


### *5.3. Results of the Proposed Models*

Table 6 displays the overall performance (accuracy) of the five attempted methods, including BERTGAT and T5slim\_dec. To simplify parsing and reduce the unnecessary complexity caused by multi-word entity terms in a sentence, entities were masked with the entity-class tokens @CHEM\$ (chemical), @PROT\$ (protein), and @DRUG\$ (drug). The term "entity masking" in Table 6 refers to these entity replacements. Experiments were conducted on both the original datasets, in which entity mentions are kept, and the datasets with masked entity names. In general, entity masking is known to benefit the generalization capability of relation extraction models by encouraging them to focus on context rather than on specific entity mentions; this yields better performance on new, unseen entities and mitigates the risk of overfitting. Table 6 shows that entity masking proved somewhat effective for DDI interaction extraction. On the other hand, for interaction extraction in ChemProt, using the actual entity tokens rather than their classes resulted in better performance. One possible reason is that the training and evaluation datasets are drawn from the same domain, so similar entities are likely to appear more frequently, which can contribute to better performance when entities are not masked.
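
A sketch of this masking step is shown below; the annotation format (character spans with entity types) and the example sentence are illustrative, not the actual corpus format.

```python
def mask_entities(sentence: str, entities: list) -> str:
    """entities: list of (start, end, type) character spans, type in {'CHEMICAL', 'GENE', 'DRUG'}."""
    mask = {"CHEMICAL": "@CHEM$", "GENE": "@PROT$", "DRUG": "@DRUG$"}
    # Replace from right to left so earlier character offsets stay valid.
    for start, end, etype in sorted(entities, key=lambda e: e[0], reverse=True):
        sentence = sentence[:start] + mask[etype] + sentence[end:]
    return sentence

print(mask_entities("Aspirin inhibits COX-1.", [(0, 7, "CHEMICAL"), (17, 22, "GENE")]))
# -> "@CHEM$ inhibits @PROT$."
```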


**Table 6.** Experimental results.

Note that although the ChemProt corpus contains 10 relation group classes, only 5 relation types (CPR:3, CPR:4, CPR:5, CPR:6, and CPR:9) were designated for evaluation in the BioCreative task. In this experiment, two evaluations were conducted: one using the CPR-format group classes to which the interaction types belong, and the other using the actual relation types directly instead of the group classes. Consequently, recognizing the interaction class group led to a higher F1-score.

In the case of DDI, '4classes' in Table 6 indicates that training and testing were conducted on the four classes (advise, effect, mechanism, int) following the 2013 DDIExtraction shared task evaluation, whereas '5classes' refers to the results of training and testing on five classes, including 'DDI-false'. In the table, '-false' indicates the accuracy on interaction labels excluding cases where the gold label is 'DDI-false' during evaluation. In practice, because there were many instances of DDI-false and they were relatively easy to predict, the model achieved a higher F1-score in the 5classes evaluation.

Even though BERTGAT showed some improvement over BERT when using entity classes, its performance was still not satisfactory. One reason is that the parser is more likely to produce parsing errors when faced with complicated biomedical entities and expressions. Although the attention mechanism in GAT allows the model to consider indirectly connected nodes as well as directly connected ones, and BERT's contextual representation was used as the input feature vector for each node, which makes the method more robust to parsing errors, this approach still partially depends on a correct parse tree to extract crucial information from sentences. Thus, the true performance gain of this approach can only be assessed given the availability of human-annotated parses for both training and inference. Currently, the effect of incorporating dependency-tree information into pretrained transformers remains uncertain. BERTGAT was evaluated only on the ChemProt dataset due to the parsing problem.

Another reason could be that the token-based multi-head attention model already encodes syntax implicitly well enough, since it allows the model to learn from the input sequence in multiple aspects simultaneously, with each head collecting information from a different representation subspace. This multi-head structure enables the model to analyze the input from various perspectives and make more accurate predictions without the restriction of an external dependency structure. Thus, implicit syntactic knowledge within sentences may already be learned well by transformer models based solely on tokens.

As a result, T5slim\_dec exhibited the best performance on both the ChemProt and DDI datasets, and the T5 model fine-tuned from SciFive also performed well on these datasets. Notably, T5slim\_dec demonstrated clear improvements in F1-score compared to the original T5 model: a 6.36% increase from 0.8223 to 0.8746 on the ChemProt task and a 2.4% increase from 0.89 to 0.9115 on the DDI task. These results indicate that tailoring the decoder structure allows the T5slim\_dec model to perform well on the interaction classification task.

Tables 7 and 8 show the F1-scores per interaction type. In addition to the standard F1-score, the macro, micro, and weighted F1-scores were considered as evaluation metrics. Analyzing these metrics provides a more comprehensive understanding of the models' performance in multiclass classification by taking into account different aspects of the class distribution and the relative importance of each class. In terms of per-class recognition rate, 'DDI-int' had the lowest recognition rate in the DDI dataset, while 'DOWNREGULATOR' had the lowest in the ChemProt dataset. One possible reason for the low performance is that the 'DDI-int' relation has relatively few instances (5.6%) in the DDI corpus compared to other relations. Similarly, the classes 'AGONIST-ACTIVATOR', 'AGONIST-INHIBITOR', and 'SUBSTRATE\_\_PRODUCT-OF' appeared infrequently in the training dataset, with only 10, 4, and 14 occurrences, respectively. This limited number of training examples may impact the model's ability to accurately recognize the related interactions.
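
The per-class and averaged scores can be obtained as in the sketch below (toy labels only; the reported numbers come from the full test sets).

```python
from sklearn.metrics import classification_report, f1_score

y_true = ["DDI-effect", "DDI-int", "DDI-mechanism", "DDI-effect", "DDI-advise"]    # toy gold labels
y_pred = ["DDI-effect", "DDI-effect", "DDI-mechanism", "DDI-effect", "DDI-advise"]

print(classification_report(y_true, y_pred, zero_division=0))                 # per-class F1-scores
for avg in ("macro", "micro", "weighted"):
    print(avg, f1_score(y_true, y_pred, average=avg, zero_division=0))
```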


**Table 7.** F1-score per DDI type.

**Table 8.** F1-score per ChemProt interaction.


Additionally, Figure 9 shows that 'DDI-int' was frequently confused with 'DDI-effect' or 'DDI-false'. The reason may be that this type is assigned when a drug–drug interaction appears in the text without any additional information, which can lead to confusion. As shown in Figure 10, 'DOWNREGULATOR' interactions in the ChemProt dataset were frequently misclassified as different interaction types belonging to the same class group, such as 'INDIRECT-DOWNREGULATOR' or 'INHIBITOR', just as 'AGONIST-ACTIVATOR' was often misclassified as 'AGONIST' within the same CPR group. Since there are likely semantic similarities among these interactions, it is difficult for the model to distinguish between them. For example, 'DOWNREGULATOR' represents a chemical that decreases a protein's activity, while 'INHIBITOR' refers to a chemical that suppresses a specific protein's function; both classes describe decreasing or inhibiting a protein's activity.
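
For reference, confusion matrices such as those in Figures 9 and 10 can be generated from the gold and predicted labels as sketched below (placeholder labels; requires matplotlib).

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = ["DDI-int", "DDI-effect", "DDI-false", "DDI-int", "DDI-mechanism"]     # toy gold labels
y_pred = ["DDI-effect", "DDI-effect", "DDI-false", "DDI-false", "DDI-mechanism"]

ConfusionMatrixDisplay.from_predictions(y_true, y_pred, xticks_rotation=45)
plt.tight_layout()
plt.show()
```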

#### *5.4. Comparisons with Other Systems*

We also compared T5slim\_dec, which showed the best performance, with previous studies in terms of per-class F1-score for DDI extraction. As shown in Table 9, T5slim\_dec outperformed the other two approaches for DDI interaction extraction across all DDI types in the '4classes' evaluation. Additionally, in the '5classes' evaluation, our model compared well with the others, except on 'DDI-int'. Since few studies report per-class F1-scores, only limited comparisons are presented in Tables 9 and 10. Zhu et al. [28] constructed three different drug entity-aware attentions to obtain sentence representations using external drug description information, mutual drug entity information, and drug entity information, based on BioBERT. Sun et al. [41] proposed a recurrent hybrid convolutional neural network for DDI extraction and introduced an improved focal loss function to handle class imbalance in the multiclass classification task.

**Figure 9.** Confusion matrix for T5slim\_dec on DDI test dataset.

**Figure 10.** Confusion matrix for T5slim\_dec on ChemProt test dataset.



**Table 10.** Comparisons of per-class F1-scores with other methods (ChemProt dataset).


Table 10 shows the comparison of per-class F1-scores on the ChemProt dataset. Asada et al. [26] encoded sentence representation vectors by concatenating drug knowledge-graph embeddings with word token embeddings; the knowledge-graph embedding took into account various kinds of external information, such as hierarchical categorical information, interacting protein information, related pathway information, textual drug information, and drug molecular structure information. Our T5slim\_dec model achieved better classification results for all ChemProt interaction types compared to this current state-of-the-art (SOTA) system [26]. Based on the F1-score evaluation metric, our system showed very promising performance in both interaction extraction tasks.

Consequently, T5slim\_dec extracted drug-related interactions more effectively than previous state-of-the-art systems without utilizing external entity information, simply by tailoring the encoder–decoder transformer architecture to suit the classification task and by treating the target labels as whole, untokenized decoder inputs.

Finally, Table 11 shows an overall performance comparison of our T5slim\_dec model with previous systems on DDI and ChemProt relation extraction. The notation 'CPR' indicates that the model determines the interaction type by CPR class group, as mentioned earlier. Our experiments showed that SciFive [30], a T5 model trained on large biomedical corpora for domain-specific tasks, performed competitively on both the DDI and ChemProt datasets, achieving an accuracy of 0.90 on the 4classes DDI evaluation and 0.91 on the CPR class groups of ChemProt. To our knowledge, SciFive is a state-of-the-art system for drug-related interaction extraction.


**Table 11.** Comparisons with previous SOTA systems.

As a result, our T5slim\_dec model outperformed SciFive, with an accuracy of 0.91 for the '4classes' evaluation and 0.95 for the '5classes' evaluation on the DDI dataset. Additionally, our model achieved an accuracy of 0.94 for the CPR-based class groups and 0.87 for the 13 interaction types. As shown in the table, encoder-only transformers such as BioBERT, SCIBERT, PubMedBERT, BioM-BERT, and BioLinkBERT exhibited lower performance than encoder–decoder transformer models such as T5 and T5slim\_dec. Moreover, the PubMedBERT + HKG model, which leverages external knowledge, also showed strong classification accuracy.

#### *5.5. Limitations*

In this section, we address several limitations that need to be considered for future improvements. The BERTGAT model encoded dependencies between tokens by converting each dependency tree into a corresponding adjacency matrix. Although the model utilized an attention mechanism to calculate the importance of words within the input graph structure and incorporated BERT's contextualized representations as embedding feature vectors for the input graph nodes, it still requires more sophisticated techniques for incorporating syntactic and semantic information to enhance biomedical relation extraction performance. This is further complicated by errors in the dependency tree, which can introduce confusion in relation classification and emphasize the need for a method that is robust to such issues. Even though the attention mechanism used in GAT allows the model to consider indirectly connected nodes and capture complex relationships in the graph, strategies that effectively address these challenges still need to be developed.

In addition, as shown in Figure 10, the T5slim\_dec occasionally misclassifies terms with opposite meanings, such as confusing ACTIVATOR with INHIBITOR and AGONIST with ANTAGONIST. This indicates a need for further in-depth research and investigation regarding negation handling to improve the model's performance in such cases.

Furthermore, due to computing limitations, we were unable to fully validate the performance of GPT-3 in this study, and GPT-Neo1.3b did not outperform the T5 model. Recently, ultra-large language models such as ChatGPT (GPT-3.5) and GPT-4 have demonstrated remarkable performances in text generation. Therefore, further research to explore the potential of ChatGPT or GPT-4 APIs on biomedical interaction extraction is needed.

Finally, the transformer models we proposed were currently designed to perform sentence-level relation extraction, even though transformers can handle multiple sentences simultaneously by using [SEP] to separate them. Thus, they have limitations in handling *n*-ary relation or cross-sentence *n*-ary relation extraction tasks, as there could be more than two entities across multiple sentences.

#### **6. Conclusions**

In this work, we demonstrated the effectiveness of transfer learning, which utilizes transformer models pretrained on large-scale language datasets and fine-tunes their parameters on relation extraction datasets.

Although we did not compare the performance of high-capacity models such as GPT-3 or GPT-3.5 (InstructGPT, ChatGPT) on the relation extraction task, the encoder–decoder transformer T5 consistently demonstrated strong performance in drug-related interaction classification.

We proposed T5slim\_dec, a version of T5 modified for interaction classification tasks by removing the self-attention layer from the decoder and adding the target labels to the vocabulary. As a result, T5slim\_dec can handle the target labels as whole tokens rather than predicting them sequentially in an autoregressive manner. The model demonstrated its effectiveness on the DDI and ChemProt interaction extraction tasks and achieved improved classification performance compared to state-of-the-art models.

Relation extraction can be a challenging task for transformer models when dealing with complex sentence structures. This difficulty arises from several factors, including long or nested sentences, entities spanning multiple sentences, and domain-specific language structures. To address this difficulty, we incorporated explicit syntactic information to enhance the context vector representation of a sentence using its structural information, presenting BERTGAT to augment the transformer with dependency parsing results. However, this model did not demonstrate a significant performance improvement, and additional research is required.

The proposed DDI extraction method can be applied to pharmacovigilance and drug safety surveillance by identifying potential adverse drug interactions. The ChemProt extraction can be utilized in drug discovery and development by facilitating the identification of potential protein targets for new drugs.

**Author Contributions:** Conceptualization, S.K., J.Y. and O.K.; methodology, S.K., J.Y. and O.K.; software, S.K. and J.Y.; validation, S.K., J.Y. and O.K.; formal analysis, S.K. and J.Y.; investigation, S.K.; resources, S.K., J.Y. and O.K.; data curation, S.K.; writing—original draft preparation, S.K.; writing—review and editing, S.K., J.Y. and O.K.; visualization, S.K. and J.Y.; supervision, S.K.; project administration, S.K.; funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2020R1I1A1A01073125).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.
