*4.2. Self-Attention Using Dependency Graph: BERTGAT*

In this section, we describe BERTGAT, which encodes the syntactic structure with a graph attention network (GAT) [15]. It leverages the overall graph structure to learn complex relationships between entities, enabling the classification of various types of relationships. In general, dependency trees provide a rich structure that can be exploited for relation extraction. However, parse trees can have varying structures depending on the input sentences, which may differ in length, complexity, and syntactic construction. Thus, organizing these trees into a fixed-size batch can be difficult. Unlike linear sequences, where tokens can be easily aligned and padded, the hierarchical structure of parse trees complicates this process. In sequence models, padding is used to create equal-length inputs for efficient batch processing. For parse trees, however, padding is not straightforward, as it involves adding artificial tree nodes that might disrupt the tree's structure and introduce noise into the model. Owing to these difficulties, it is usually hard to parallelize neural models that operate on parse trees.

In contrast, models based on the shortest dependency path (SDP) between two entities are computationally more efficient, but they might exclude crucial information by removing tokens outside the path. In addition, some studies have noted that not all tokens in the dependency tree are needed to express the relation of the target entity pair; they used the SDP [37] or the subtree rooted at the lowest common ancestor (LCA) of the two entities [14] to remove irrelevant information. However, pruning to the SDP can discard crucial information and easily hurt robustness. For instance, according to Zhang et al. [14], in the sentence "She was diagnosed with cancer last year, and succumbed this June", the dependency path 'She←diagnosed→cancer' is not sufficient to establish that cancer is the cause of death of the subject unless the conjunction dependency to succumbed is also present. To incorporate crucial information off the dependency path, they proposed a path-centric pruning strategy that keeps nodes directly attached to the dependency path, as sketched below.
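To make the pruning idea concrete, the following is a minimal sketch, assuming the dependency tree is available as a list of word-index pairs and using networkx for path computations; the function name `prune_tree` and the toy indices are illustrative and not taken from the cited papers.

```python
# A minimal sketch of path-centric pruning (K = 1), assuming the dependency
# tree is given as (head, dependent) word-index pairs; names are illustrative.
import networkx as nx

def prune_tree(edges, subj_idx, obj_idx, k=1):
    """Keep tokens within distance k of the shortest dependency path."""
    g = nx.Graph(edges)                       # treat the tree as undirected
    path = nx.shortest_path(g, subj_idx, obj_idx)
    keep = set(path)
    for node in g.nodes:
        # distance from this node to the closest token on the path
        if min(nx.shortest_path_length(g, node, p) for p in path) <= k:
            keep.add(node)
    return sorted(keep)

# Toy example with indices for "She(0) diagnosed(1) cancer(2) succumbed(3)"
edges = [(1, 0), (1, 2), (1, 3)]
print(prune_tree(edges, subj_idx=0, obj_idx=2))   # keeps "succumbed" as well
```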

To address these issues, we adapt the graph attention network to consider the syntactic dependency tree structure by converting each tree into a corresponding adjacency matrix. Graph attention [15] is considered jointly with the self-attention sublayer to encode the dependency structure between tokens into vector representations. This helps to capture relevant local structures of dependency edge patterns that are informative for classifying relations: the model considers the relationships between each node and its neighbors and assigns greater weights to the more important neighboring nodes. This approach allows for more effective learning of node representations from graph data, ultimately helping to represent node features more accurately.

For this, the Stanford dependency parser [38] is utilized to retrieve universal dependencies for each sentence. A dependency tree is a type of directed graph where nodes correspond to words and edges indicate the syntactic relations between the head and dependent words. In this work, if there is a dependency edge from node *i* to node *j*, then the edge in the opposite direction, from node *j* to node *i*, is also included. Dependency edge types such as 'subj' and 'obj' are not considered. A self-loop is also added for each node in the tree. Since BERT takes as input subword units generated by its tokenizer rather than the word-level linguistic tokens of a parse tree, we introduce additional edges to handle the unit mismatch. Figure 6a shows the architecture of BERTGAT.

**Figure 6.** BERTGAT and T5slim\_dec architecture.
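As an illustration of the graph construction described above, the following is a minimal sketch, assuming word-level dependency edges and a word-to-subword alignment produced by the tokenizer; the alignment format and function name are assumptions for illustration only.

```python
# A minimal sketch of building the adjacency matrix used by the graph-attention
# sublayer: both edge directions and self-loops are added, edge labels ignored.
import numpy as np

def build_adjacency(num_subwords, dep_edges, word2subwords):
    """dep_edges: (head_word, dependent_word) index pairs; labels are ignored.
    word2subwords: maps each word index to its subword token indices (assumed)."""
    A = np.zeros((num_subwords, num_subwords), dtype=np.float32)
    for head, dep in dep_edges:
        for i in word2subwords[head]:
            for j in word2subwords[dep]:
                A[i, j] = 1.0   # dependency edge
                A[j, i] = 1.0   # and its reverse direction
    np.fill_diagonal(A, 1.0)    # self-loop for every token
    return A
```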

Given a graph with *n* nodes, we can represent the graph with an *n* × *n* adjacency matrix A, where *A*<sub>*ij*</sub> is 1 if there is a direct edge going from node *i* to node *j*. The encoder consists of two sublayers: a multi-head self-attention layer and a multi-head graph self-attention layer. The final hidden layer of the encoder is fed into a linear classification layer to predict a relation type, followed by a softmax operation. That is, the output layer is a one-layer task-specific feed-forward network for relation classification.
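A minimal sketch of this classification head is shown below, assuming (as an illustrative choice) that the representation at the first token position is used as the pooled sentence representation; this is a sketch under those assumptions, not the authors' exact implementation.

```python
# A minimal relation-classification head: one linear layer followed by softmax.
import torch.nn as nn

class RelationClassifier(nn.Module):
    def __init__(self, hidden_size: int, num_relations: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_relations)

    def forward(self, encoder_output):
        # encoder_output: (batch, seq_len, hidden_size) from the final encoder layer
        pooled = encoder_output[:, 0]          # first-position pooling (assumed)
        logits = self.linear(pooled)           # one-layer feed-forward network
        return logits.softmax(dim=-1)          # relation-type probabilities
```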

The output of the BERT model is a contextualized representation for each word in the given text, expressed as the hidden state vector of that word; this output vector contains contextual information about the corresponding word. The input to the GAT consists of the set of hidden state vectors obtained from BERT, h = {*h*<sub>1</sub>, *h*<sub>2</sub>, ..., *h*<sub>*V*</sub>}, which serve as the initial feature vectors for each token in the text.
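As an illustration, the initial node features could be obtained as follows; this sketch assumes the Hugging Face `transformers` API and the `bert-base-uncased` checkpoint, which are illustrative choices rather than the exact setup used in this work.

```python
# Obtaining the initial node features h = {h_1, ..., h_V} from BERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentence = "Concomitant administration of other Sympathomimetic Agents ..."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden_states = bert(**inputs).last_hidden_state   # (1, V, hidden_size)
# hidden_states[0] gives one contextualized vector per token; these vectors
# serve as the initial node features for the graph-attention layer.
```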

The GAT layer in Figure 6a produces a new set of node features, h′ = {*h*′<sub>1</sub>, *h*′<sub>2</sub>, ..., *h*′<sub>*V*</sub>}, as its output, where *V* is the number of nodes. In this study, we follow the formulation of the graph attention network proposed in the original paper by Veličković et al. (2018) [15]; the GAT representation is obtained by Equations (5) and (6).

In the beginning, a shared linear transformation, parameterized by a weight matrix **W**, is applied to each node to transform the input features into higher-level features. Here, **W** is a learnable linear projection matrix. Subsequently, a self-attention mechanism *a* is performed on the nodes, and attention coefficients *e* are computed for every pair of nodes. To calculate the importance of the connection from node *j* to node *i*, the masked attention coefficient *e*<sub>*i*,*j*</sub> is computed according to Equation (5) only when *j* is a neighbor of node *i* in the graph. N<sub>*i*</sub> represents the set of node *i*'s one-hop neighbors, including node *i* itself, as self-loops are permitted.

While the multi-head self-attention layer in Figure 6a uses a scaled dot product as its similarity function, the GAT layer uses a one-layer feedforward neural network, denoted as *a*, applied after concatenating the key and query. The scoring function *e* computes a score for every edge (*j*, *i*), which indicates the importance of the neighbor *j* to the node *i*. A large negative value is assigned when there is no connection, and the resulting *α*<sub>*i*,*j*</sub> is then normalized with softmax, as shown in Equation (5), which makes the coefficients easily comparable across different nodes. In the equation, the attention mechanism *a* is a single-layer FFNN parameterized by a weight vector **a** with a LeakyReLU nonlinearity, T represents transposition, and || is the concatenation operation.

$$\begin{aligned} e_{i,j} &= a\left(\mathbf{W}h_i, \mathbf{W}h_j\right), \quad j \in \mathcal{N}_i \\ a &: \ \text{LeakyReLU}\left(\text{Linear}\left(\text{concat}\left(\mathbf{W}h_i, \mathbf{W}h_j\right)\right)\right) \\ \alpha_{i,j} &= \text{softmax}\left(e_{i,j}\right) = \frac{\exp\left(e_{i,j}\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e_{i,k}\right)} = \frac{\exp\left(\text{LeakyReLU}\left(\mathbf{a}^{\mathrm{T}}\left[\mathbf{W}h_i \,\|\, \mathbf{W}h_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\text{LeakyReLU}\left(\mathbf{a}^{\mathrm{T}}\left[\mathbf{W}h_i \,\|\, \mathbf{W}h_k\right]\right)\right)} \end{aligned} \tag{5}$$
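The following is a minimal sketch of Equation (5), assuming dense tensors and masking non-neighbors with negative infinity before the softmax; tensor shapes and the function name are illustrative.

```python
# Masked attention coefficients of Equation (5): a shared projection W, a
# single-layer scorer a with LeakyReLU, and softmax over each node's neighbors.
import torch
import torch.nn.functional as F

def attention_coefficients(h, W, a, adj):
    """h: (V, d_in) node features, W: (d_in, d_out) projection,
    a: (2 * d_out,) attention vector, adj: (V, V) adjacency matrix (torch)."""
    wh = h @ W                                            # shared linear transformation
    V, d = wh.shape
    # concatenate [Wh_i || Wh_j] for every ordered pair (i, j)
    pairs = torch.cat([wh.unsqueeze(1).expand(V, V, d),
                       wh.unsqueeze(0).expand(V, V, d)], dim=-1)
    e = F.leaky_relu(pairs @ a)                           # raw scores e_{i,j}
    e = e.masked_fill(adj == 0, float("-inf"))            # keep only neighbors N_i
    return F.softmax(e, dim=-1)                           # alpha_{i,j}
```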

The normalized attention coefficients *α* are used to compute a weighted sum of the corresponding neighbors and to select the most relevant neighbors, as shown in Equation (6). The attention mechanism is used to aggregate neighborhood representations with different weights; that is, each node gathers and summarizes information from its neighboring nodes in the graph, and the aggregated information serves as the final output representation for every node. In this way, a node iteratively aggregates the information from its neighbors and updates its representation. To perform multi-head attention, *K* heads are used. Here, *σ* refers to the ReLU activation function, and *α*<sub>*i*,*j*</sub><sup>*k*</sup> denotes the normalized attention coefficients computed by the *k*-th attention mechanism. Finally, we average the outputs of the heads, apply the activation function, and then add a linear classifier to predict the relation type.

$$\begin{aligned} h'_i &= \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{i,j}\, \mathbf{W} h_j\Big) \\ h'_i &= \big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{i,j}^{k}\, \mathbf{W}^{k} h_j\Big) \quad (\textit{multi-head}) \\ h'_i &= \text{LeakyReLU}\left(\frac{1}{K}\sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{i,j}^{k}\, \mathbf{W}^{k} h_j\right) \end{aligned} \tag{6}$$
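A minimal multi-head GAT layer following Equation (6) is sketched below: each head computes its own attention coefficients and aggregates neighbor features, and the head outputs are averaged before the final nonlinearity. This is an illustrative implementation under the assumptions above, not the authors' code.

```python
# Multi-head graph attention with averaging of heads, as in Equation (6).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, d_in: int, d_out: int, num_heads: int):
        super().__init__()
        self.W = nn.Parameter(torch.empty(num_heads, d_in, d_out))
        self.a = nn.Parameter(torch.empty(num_heads, 2 * d_out))
        nn.init.xavier_uniform_(self.W)
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        # h: (V, d_in) node features, adj: (V, V) adjacency matrix
        outputs = []
        for k in range(self.W.shape[0]):                  # K attention heads
            wh = h @ self.W[k]                            # (V, d_out)
            V, d = wh.shape
            pairs = torch.cat([wh.unsqueeze(1).expand(V, V, d),
                               wh.unsqueeze(0).expand(V, V, d)], dim=-1)
            e = F.leaky_relu(pairs @ self.a[k])           # scores e_{i,j}
            alpha = F.softmax(e.masked_fill(adj == 0, float("-inf")), dim=-1)
            outputs.append(alpha @ wh)                    # weighted sum of neighbors
        # average the K heads, then apply the nonlinearity (last line of Eq. (6))
        return F.leaky_relu(torch.stack(outputs).mean(dim=0))
```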

Figure 7 visualizes an example of graph self-attention for the entity node "Sympathomimetic Agents" in the sentence, "Concomitant administration of other Sympathomimetic Agents may potentiate the undesirable effects of FORADL." The interaction type between the two entities, "Sympathomimetic Agents" and "FORADL", is classified as "DDI-effect". In the figure, (a) displays the sentence's dependency structure, (b) shows the same dependency structure in the form of a graph, (c) presents the adjacency table reflecting the dependency relationships among words, and (d) illustrates the transformation of the vector representation of node 5, "Sympathomimetic Agents", through graph attention. In addition, this model can incorporate useful information that lies off the dependency connections by employing a residual connection around each of the two sublayers, followed by layer normalization. That is, the output of the GAT sublayer is LayerNorm(x + GAT\_Sublayer(x)), where x is the output of BERT's self-attention sublayer.
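A minimal sketch of this residual sublayer is given below, assuming a graph-attention module such as the GATLayer sketched above; the class and argument names are illustrative.

```python
# Residual connection and layer normalization around the graph-attention
# sublayer, i.e. LayerNorm(x + GAT_Sublayer(x)).
import torch.nn as nn

class ResidualGATSublayer(nn.Module):
    def __init__(self, gat_sublayer: nn.Module, hidden_size: int):
        super().__init__()
        self.gat = gat_sublayer           # e.g., the GATLayer sketched above
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x, adj):
        # x: output of BERT's self-attention sublayer, shape (V, hidden_size)
        return self.norm(x + self.gat(x, adj))
```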

Thus, this model reflects both contextual relatedness and syntactic relatedness between tokens. In addition, the GAT model applies attention to the features of each node's neighbors to combine them and create a new representation of the node. By utilizing attention weights that reflect the importance of edge connections, and through iterative aggregation, the neighbor information covers not only directly connected nodes but also indirectly connected nodes, effectively capturing local substructures within the graph.

**Figure 7.** An example of multi-head graph-attention network.
