Before introducing our proposed event extraction model, we first define the terminology used in this paper. The input document D = {S1, …, Si, …, Sn} consists of n sentences, where Si denotes the i-th sentence. The goal of the DEE task is to extract m event records from document D, where each event record consists of an event type j, multiple event arguments a, and the corresponding argument roles k, with a ∈ E, j ∈ J, and k ∈ K. Here E denotes the candidate entity set, while J and K denote the predefined sets of event types and argument roles, respectively.
3.3. Construction of Heterogeneous Graph
We construct a heterogeneous graph G = (V, E), where V is the set of nodes and E the set of edges, to capture the interactions between sentences and entities; the graph contains entity, sentence, and document nodes. In graph G, the interactions among entities, between entities and sentences, and between the document and sentences are all modeled, as shown in Figure 3.
For an entity node e, its initial embedding is the average pooling of the word vectors within the entity:

h_e = Mean({g_j | j ∈ e}),

where g_j represents the vector of word j and Mean denotes the average pooling operation. For a sentence node s, its initial embedding is the maximum pooling of all word vectors in the sentence plus the position embedding of the sentence:

h_s = Max({g_j | j ∈ s}) + SentPos(s),

where Max denotes the maximum pooling operation and SentPos(s) represents the position embedding of the sentence.
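The two initializations above can be sketched in a few lines of NumPy (the function names are our own; the paper only specifies the pooling operations):

```python
import numpy as np

def entity_embedding(word_vecs):
    """Initial entity-node embedding: average pooling over the
    entity's word vectors (rows of word_vecs)."""
    return np.mean(word_vecs, axis=0)

def sentence_embedding(word_vecs, sent_pos_emb):
    """Initial sentence-node embedding: max pooling over the sentence's
    word vectors plus the sentence position embedding SentPos(s)."""
    return np.max(word_vecs, axis=0) + sent_pos_emb
```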
The document node is jointly represented by the initial embedding vectors of the sentences and entities in the document. A multi-head attention mechanism first transforms the input sentence and entity embedding vectors into a query matrix Q, a key matrix K, and a value matrix V through linear mapping layers. Next, the Q, K, and V tensors are split into m attention heads, where m is the number of heads. For each head, matrix multiplication and scaling are applied to the query and key matrices, producing an attention tensor; the softmax function then weights this tensor, which is applied to the value matrix to generate the attention-based output. Finally, the outputs of the m heads are concatenated and passed through a fully connected layer, producing the attention-based embedded features. The multi-head attention mechanism can be represented by the following formula:

MultiHead(Q, K, V) = FC([head_1; …; head_m]), head_t = Attention(Q_t, K_t, V_t),

where FC represents the fully connected layer, m denotes the number of attention heads, and Attention denotes the scaled dot-product attention:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where d_k represents the dimensionality of the query vectors.
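The head-splitting, scaled dot-product, and final projection described above can be sketched as follows (a minimal self-attention version in NumPy; the weight matrices Wq, Wk, Wv, Wo stand in for the paper's linear mapping and FC layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, m):
    """Map X to Q/K/V, split into m heads, apply scaled dot-product
    attention per head, concatenate heads, and project with Wo (the FC)."""
    n, d = X.shape
    dk = d // m                                   # per-head dimensionality
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # split into heads: shape (m, n, dk)
    Qh = Q.reshape(n, m, dk).transpose(1, 0, 2)
    Kh = K.reshape(n, m, dk).transpose(1, 0, 2)
    Vh = V.reshape(n, m, dk).transpose(1, 0, 2)
    # softmax(Q K^T / sqrt(dk)) V, per head
    scores = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dk))
    heads = scores @ Vh                           # (m, n, dk)
    concat = heads.transpose(1, 0, 2).reshape(n, d)
    return concat @ Wo
```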
After performing multi-head attention over the initial sentence embedding vectors, weighted pooling is applied along the document dimension to obtain the sentence-level document vector. For the initial entity embedding vectors, the vectors of each word are averaged to form a matrix of equally sized word embeddings; multi-head attention is then applied to this matrix, and the results are weighted and summed along the document dimension to obtain the word-level document vector. Finally, concatenating these two vectors gives the feature representation of the document node:
X_doc = [Pool(Attention(X_sent)), Pool(Attention(X_token))],

where X_sent consists of all sentence embedding vectors in the document, X_token consists of all entity embedding vectors in the document, Attention(·) denotes the multi-head attention operation, Pool(·) denotes weighted pooling along the document dimension, and [·,·] denotes vector concatenation.
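A compact sketch of this document-node construction (the attention operation is passed in as a callable; the weight vectors that implement the pooling are our own illustration):

```python
import numpy as np

def document_node(X_sent, X_token, attn, weights_s, weights_t):
    """Build the document-node feature: attend over sentence and entity
    embeddings, pool each along the document dimension, concatenate."""
    Hs = attn(X_sent)            # multi-head attention over sentences
    Ht = attn(X_token)           # multi-head attention over entities/tokens
    ds = weights_s @ Hs          # weighted pooling -> sentence-level vector
    dt = weights_t @ Ht          # weighted pooling -> word-level vector
    return np.concatenate([ds, dt])
```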
Our model includes five types of edges: sentence–sentence edges (S–S), sentence–entity edges (S–E), intra-sentence entity edges (E–E intra), inter-sentence edges linking mentions of the same entity (E–E inter), and document–sentence edges (doc–s). Sentence nodes are connected through S–S edges to capture long-range dependencies between sentences in the document. S–E edges connect each sentence with all entities it contains, modeling the context of entities within sentences. E–E intra edges connect different entities within the same sentence, indicating that these entities may be related to the same event. E–E inter edges connect mentions of the same entity across sentences, allowing the entity's occurrences at different positions to be tracked.
Document–sentence edges (doc–s) connect the document node with the sentence nodes, enabling interaction between the document and its sentences. The document node can thus attend to information from all other nodes, fusing textual information from different levels and better modeling long-distance dependencies between sentences. Our heterogeneous graph simulates the interaction between sentences and entities from a global perspective and strengthens the connections between the document and its sentences, better capturing the event information within the document.
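The five edge types can be enumerated from the per-sentence entity mentions. A rough sketch (the node keys, and the choice to represent each entity mention as a node keyed by (entity, sentence), are our own illustration):

```python
from itertools import combinations

def build_edges(sent_entities):
    """sent_entities[i] is the list of entity ids mentioned in sentence i.
    Returns the five edge sets of the heterogeneous graph."""
    edges = {"S-S": [], "S-E": [], "E-E intra": [], "E-E inter": [], "doc-s": []}
    # S-S: connect every pair of sentence nodes
    for i, j in combinations(range(len(sent_entities)), 2):
        edges["S-S"].append((("sent", i), ("sent", j)))
    mentions = {}  # entity id -> sentences where it appears
    for i, ents in enumerate(sent_entities):
        edges["doc-s"].append(("doc", ("sent", i)))  # doc-s edges
        uniq = sorted(set(ents))
        for e in uniq:
            edges["S-E"].append((("sent", i), ("ent", e, i)))
            mentions.setdefault(e, []).append(i)
        # E-E intra: different entities co-occurring in one sentence
        for a, b in combinations(uniq, 2):
            edges["E-E intra"].append((("ent", a, i), ("ent", b, i)))
    # E-E inter: the same entity mentioned in different sentences
    for e, sents in mentions.items():
        for i, j in combinations(sents, 2):
            edges["E-E inter"].append((("ent", e, i), ("ent", e, j)))
    return edges
```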
We apply multi-layer graph convolutional networks to model global interactions. For each node i with feature representation h_i^(l) at the l-th layer, the representation at the next layer is computed as:

h_i^(l+1) = ReLU( Σ_{r∈R} Σ_{j∈N_i^r} (1 / c_{i,r}) W_r^(l) h_j^(l) ),

where R represents the set of all edge relation types, N_i^r represents the set of neighboring nodes connected to node i via relation type r, c_{i,r} is a normalization constant, W_r^(l) denotes the weight matrix corresponding to edge relation type r at layer l, and ReLU represents the activation function. We then derive the final hidden state of node i by concatenating its output features h_i^(l) from each GCN layer along the column direction and linearly transforming the result with a learnable weight matrix W_a:

h_i = W_a [h_i^(0); h_i^(1); …; h_i^(L)],

where h_i^(0) is the initial embedding representation of node i and L is the number of GCN layers. In this way we obtain sentence embedding vectors and entity embedding vectors that interact in a context-aware manner.
3.5. Argument Extraction
We adopt an ordered expansion tree [32] to decode documents containing multiple event records and extract event records of specific types. After detecting the event types, we perform argument role detection, dynamically adjusting the detection order based on the vector representations and labels of all candidate entities in the text: roles with fewer arguments are detected first, gradually transitioning to roles with more arguments. The record-filling process shown in Figure 4 involves five argument roles. Company Name, Highest Trading Price, and Lowest Trading Price each have only one event argument, while Repurchased Shares and Closing Date each have two. Since roles with fewer associated arguments are identified first, these last two argument roles are detected last.
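The reordering itself reduces to a stable sort of the roles by their argument counts; a one-line sketch (the function name and dict-based interface are our own):

```python
def detection_order(role_arg_counts):
    """Order argument roles so that roles with fewer arguments are detected
    first; ties keep the original role order (Python's sort is stable)."""
    return sorted(role_arg_counts, key=role_arg_counts.get)
```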
Specifically, when extracting arguments we start from a virtual root node and expand according to the reordered sequence of argument roles. We introduce the Tracker module [32]; each path from the root node to a leaf node is an event record. For the i-th record path, represented by an entity sequence U_i = [E_i1, E_i2, …], the Tracker encodes this sequence with an LSTM and adds the event type embedding. The compressed information is then stored in the global memory G_i for sharing across different event types, as shown in Figure 5, which illustrates the decoding process after the argument role detection order has been adjusted. In the example, two event records have already been extracted from the document. Using the information held by the real-time Tracker and its globally tracked memory, the argument roles of each event can be predicted accurately: Entity B is assigned to Role1, and the subsequent argument roles are predicted from this child node onward until a complete event record is extracted.
During inference, we predict the k-th role by incorporating the feature of the argument role into the entity representation:

E′ = E + Role_k,

where Role_k is the embedding of the k-th role, E is the feature matrix of the entities at the previous time step, and E′ refers to the feature matrix at the current time step.
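This role-conditioning step is a simple broadcast addition; a sketch (function name ours):

```python
import numpy as np

def inject_role(E, role_emb):
    """E' = E + Role_k: add the k-th role embedding to every candidate
    entity's representation. Broadcasts (num_entities, d) + (d,)."""
    return E + role_emb
```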
Next, the Tracker concatenates the entity feature matrix E′, the sentence feature matrix S, the current record path U_i, and the global memory G_i, and uses a Transformer to update the feature information. The updated features encode role-specific information for all candidate entities globally. The record loss L_record is represented as:

L_record = − Σ_{n∈N} Σ_i [ y_{i,n} log p_{i,n} + (1 − y_{i,n}) log(1 − p_{i,n}) ],

where N is the set of nodes in the event record tree, p_{i,n} is the predicted probability that the i-th entity is the next event argument of node n, and y_{i,n} refers to the golden label: if the i-th entity is the next event argument of node n, then y_{i,n} = 1, otherwise y_{i,n} = 0.
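Assuming the loss takes the standard binary cross-entropy form over tree nodes and candidate entities (the source specifies only the 0/1 labels, so this exact form is our assumption), it can be sketched as:

```python
import numpy as np

def record_loss(probs, labels):
    """Binary cross-entropy over event-record-tree nodes: probs[n, i] is the
    predicted probability that entity i is the next argument of node n;
    labels holds the golden y_{i,n} in {0, 1}. Clipping avoids log(0)."""
    eps = 1e-9
    p = np.clip(probs, eps, 1 - eps)
    return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
```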