In this section, we first present the overall architecture of DEEDP. We then describe the detailed implementation of each module that constitutes DEEDP.
4.2. Sentence-Level Encoder
Transformer Encoder. Following most pre-trained language models, such as BERT [29], MBERT [30], MLRIP [31], and SpanBERT [32], and document-level event and event-element joint extraction models such as Doc2EDAG [8] and MMR [16], we applied the Transformer [14] as our basic encoder. It utilizes multi-head attention and a masking strategy to capture the lexical and syntactic information of each token in the input sequence and to generate the corresponding contextual representation for each token.
Input Embeddings. Modelling more features of the input tokens facilitates downstream tasks. Compared with Doc2EDAG, MMR, and DEEB-RNN [33], we modeled part-of-speech (POS), entity type, and document topic features to represent each token of the input sentence, rather than using only the token and position features. Formally, given a document $d$ containing $N_s$ sentences, sentence $s_i$ can be denoted as a sequence of tokens $s_i = \{w_{i,1}, w_{i,2}, \dots, w_{i,l}\}$, where $l$ is the length of the sentence and $w_{i,j}$ denotes the representation of the $j$th token of $s_i$. The representations of the token sequence were fed into the Transformer encoder, TF-1, and the final hidden state $H_i$ of $s_i$ could be used for downstream tasks, which can be calculated as follows:

$$H_i = \text{TF-1}(w_{i,1}, w_{i,2}, \dots, w_{i,l}),$$

where $H_i = [h_{i,1}, h_{i,2}, \dots, h_{i,l}]$, $h_{i,j}$ is the final hidden state for the $j$th token of $s_i$, and $H_i$ is the final hidden state of $s_i$. Then $H_i$ was fed into a CRF layer for sequence labelling, after which we could obtain the candidate event arguments.
Given a token, its input representation is the sum of the embeddings for token, segment, position, POS, entity type, and document topic. A visualization of the input embedding is illustrated in Figure 3.
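To make this construction concrete, the following is a minimal PyTorch sketch of the six-way embedding sum; all vocabulary sizes, the hidden size, and module names are illustrative assumptions rather than the paper's actual configuration:

```python
import torch
import torch.nn as nn

class DEEDPInputEmbedding(nn.Module):
    """Sums token, segment, position, POS, entity-type, and topic embeddings.

    All vocabulary sizes and the hidden size are placeholder assumptions.
    """
    def __init__(self, vocab_size=30000, n_pos_tags=50, n_entity_types=20,
                 n_topics=30, max_len=512, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(2, hidden)
        self.position = nn.Embedding(max_len, hidden)
        self.pos_tag = nn.Embedding(n_pos_tags, hidden)
        self.entity_type = nn.Embedding(n_entity_types, hidden)
        self.topic = nn.Embedding(n_topics, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, tokens, segments, pos_tags, entity_types, topics):
        # tokens, segments, pos_tags, entity_types: (batch, seq_len)
        # topics: (batch,) -- one topic id per document
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = (self.token(tokens) + self.segment(segments)
             + self.position(positions) + self.pos_tag(pos_tags)
             + self.entity_type(entity_types)
             + self.topic(topics).unsqueeze(1))  # broadcast topic over tokens
        return self.norm(x)
```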
Entity Recognition. Entity recognition aims to extract all candidate elements of the events described in a given document and is typically treated as a sequence tagging task. In this work, we performed entity recognition at the sentence level and, following GIT and Doc2EDAG, added a CRF layer on top of the final Transformer encoder to conduct sequence tagging for the input sentences. Specifically, given a sentence $s_i$, we applied a Transformer encoder (TF-1) to encode it, and the final hidden states $H_i$ were fed into the CRF layer to conduct entity recognition with the BIO (Begin, Inside, Other) schema. For training, we minimized the sequence tagging loss as follows:

$$\mathcal{L}_{er} = -\sum_{i=1}^{N_s} \log P(y_i \mid s_i),$$

where $y_i$ is the golden-label tagging sequence for $s_i$. For inference, we applied the Viterbi decoding algorithm to obtain the most probable label sequence.
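A sketch of this TF-1 + CRF tagging stage is given below, using the third-party pytorch-crf package as a stand-in CRF implementation (the paper does not name one); layer counts and sizes are assumptions:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class SentenceTagger(nn.Module):
    def __init__(self, hidden=768, num_tags=9, layers=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)  # TF-1
        self.emission = nn.Linear(hidden, num_tags)  # per-token BIO tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, embeds, tags, mask):
        # embeds: (batch, seq, hidden); tags: (batch, seq); mask: (batch, seq) bool
        h = self.encoder(embeds, src_key_padding_mask=~mask)
        # Negative log-likelihood of the golden tag sequence
        return -self.crf(self.emission(h), tags, mask=mask)

    def decode(self, embeds, mask):
        h = self.encoder(embeds, src_key_padding_mask=~mask)
        # Viterbi decoding: most probable tag sequence per sentence
        return self.crf.decode(self.emission(h), mask=mask)
```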
4.3. Document-Level Encoder
Document-level contextualized representations of entities and sentences take advantage of the global semantics of the document and can benefit DEE.
Entity & Sentence Embedding. To acquire the sentence-level representations of entities, we aggregated the word representations by performing a mean pooling operation over the consecutive tokens that mention an entity. Specifically, given an entity $e$ and its span in sentence $s_i$, which starts at the $k$th token and ends at the $t$th token, the representation for $e$ can be computed as:

$$\hat{e} = \text{MeanPool}(h_{i,k}, \dots, h_{i,t}),$$

where MeanPool(·) denotes the mean pooling operation and $\hat{e}$ denotes the representation for $e$. The variables $h_{i,k}$ and $h_{i,t}$ are the tensors for the start and end tokens of entity $e$, which were used to calculate its sentence-level representation. Similarly, we performed a mean pooling operation for all entities contained in document $d$ and obtained a series of sentence-level entity representations $\{\hat{e}_1, \dots, \hat{e}_{N_e}\}$, with $N_e$ being the number of entities contained in document $d$.

For sentences, we also performed a mean pooling operation over the tokens of each sentence and obtained the sentence-level sentence representations $\{\hat{s}_1, \dots, \hat{s}_{N_s}\}$, where $\hat{s}_i$ denotes the representation for sentence $s_i$.
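For illustration, span-level mean pooling can be implemented in a few lines of PyTorch; the tensor shapes below are assumptions:

```python
import torch

def span_mean_pool(hidden, start, end):
    """Mean pooling over the tokens of an entity mention.

    hidden: (seq_len, hidden) final TF-1 states of one sentence;
    start/end: inclusive token indices of the mention span.
    """
    return hidden[start:end + 1].mean(dim=0)

# Illustrative usage: a mention covering tokens 3..5 of a 40-token sentence.
h = torch.randn(40, 768)
entity_rep = span_mean_pool(h, 3, 5)   # (768,) sentence-level entity rep
sentence_rep = h.mean(dim=0)           # (768,) sentence-level sentence rep
```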
Document-level Encoding. To obtain document-level contextualized representations for all sentences and entities, we applied a Transformer encoder, TF-2, to model the information interaction between them, enabling awareness of document-level contexts. Following Doc2EDAG [8], we added sentence position embeddings to the sentence representations to encode the sentence order in the given document before feeding them into TF-2. Document-aware representations could then be obtained through document-level encoding:

$$[e^d_1, \dots, e^d_{N_e}; s^d_1, \dots, s^d_{N_s}] = \text{TF-2}(\hat{e}_1, \dots, \hat{e}_{N_e}; \hat{s}_1, \dots, \hat{s}_{N_s}),$$

where TF-2(·) takes the sentence-level representations of entities and sentences as inputs and produces their document-level representations; $e^d_i$ denotes a document-level entity representation, and $s^d_i$ denotes a document-level sentence representation. As an entity may be mentioned by different word spans in a document, we performed a max pooling operation over all mention embeddings referring to the same entity to obtain a fused embedding. We thereby acquired the distinct document-aware context representation $e^d$ for each entity.
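A minimal sketch of this document-level encoding step, assuming a vanilla nn.TransformerEncoder stands in for TF-2 and using placeholder sizes:

```python
import torch
import torch.nn as nn

class DocumentEncoder(nn.Module):
    """TF-2 sketch: document-level interaction between entity and sentence
    representations. Sizes and layer counts are assumptions."""
    def __init__(self, hidden=768, max_sents=64, layers=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.sent_pos = nn.Embedding(max_sents, hidden)  # sentence-order info

    def forward(self, entity_reps, sent_reps):
        # entity_reps: (1, n_entities, hidden); sent_reps: (1, n_sents, hidden)
        order = torch.arange(sent_reps.size(1), device=sent_reps.device)
        sent_reps = sent_reps + self.sent_pos(order)
        h = self.encoder(torch.cat([entity_reps, sent_reps], dim=1))
        n_e = entity_reps.size(1)
        return h[:, :n_e], h[:, n_e:]  # document-level entity / sentence reps

# Mentions of the same entity are then fused by max pooling, e.g.:
# fused = torch.stack(mention_reps, dim=0).max(dim=0).values
```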
In this study, we performed the event type classification task at the document-level encoding stage. We first applied a mean pooling operation over the document-level sentence representations to obtain the document representation $\hat{d}$, and then fed $\hat{d}$ into multiple feed-forward networks (FFNs) to perform event type predictions. Concretely, event type $t$ can be predicted by:

$$p_t = \text{softmax}(\text{FFN}_t(\hat{d})), \quad t \in \mathcal{T},$$

where $p_t$ indicates the probability for event type $t$, calculated by the softmax(·) function; $\text{FFN}_t$ contains the learnable parameters for predicting event type $t$; and $\mathcal{T}$ denotes the pre-defined event type set.
For training TF-2, we minimized the following loss function:

$$\mathcal{L}_{ec} = -\sum_{t \in \mathcal{T}} \big[\, \mathbb{1}(y_t = 1) \log p_t + \mathbb{1}(y_t = 0) \log (1 - p_t) \,\big], \tag{6}$$

Equation (6) applies multiple binary classification tasks to construct the loss function, where $y_t$ represents whether event type $t$ is contained in the document, which is usually regarded as the golden label, and $\mathbb{1}(\cdot)$ is the indicator function.
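The event type prediction and its indicator-based loss might be sketched as follows; the per-type FFN depth (a single linear layer here) and the number of event types are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventTypeClassifier(nn.Module):
    """One small FFN per pre-defined event type; each performs a binary
    'contained / not contained' prediction from the document vector."""
    def __init__(self, hidden=768, num_event_types=5):
        super().__init__()
        self.ffns = nn.ModuleList(
            nn.Linear(hidden, 2) for _ in range(num_event_types))

    def forward(self, doc_rep):
        # doc_rep: (hidden,) mean-pooled document representation
        return torch.stack([ffn(doc_rep) for ffn in self.ffns])  # (types, 2)

def event_type_loss(logits, labels):
    # labels[t] = 1 if event type t occurs in the document (golden label);
    # summing per-type cross-entropies realizes the multiple binary tasks.
    return F.cross_entropy(logits, labels, reduction="sum")

# Example: 5 event types, types 0 and 3 present in the document.
clf = EventTypeClassifier()
logits = clf(torch.randn(768))
loss = event_type_loss(logits, torch.tensor([1, 0, 0, 1, 0]))
```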
4.4. Sentence-Document Feature Fusion
At this stage, we introduced a novel network to enhance the document-aware representations of entities by integrating sentence-document features. We stacked an advanced Transformer encoder, named Transformer-M, and a Transformer encoder, TF-3, together to achieve this goal, where Transformer-M helped to extract more sentence-level features and TF-3 was applied to integrate the different kinds of features. Figure 2 depicts the novel network architecture.
Transformer-M. To explicitly model syntactic features and aggregate sentence-aware and document-aware representations for each entity, we designed a novel encoder based on the Transformer, named Transformer-M (TF-M). Figure 4d depicts the network architecture of TF-M. We added a Syntactic Feature Attention mechanism, SF-ATT, to the Transformer, which was used to capture the syntactic features of the input sentences and enhance the representations of each token contained in them. In addition, we concatenated the embedding of each token used in TF-1 with the document-level representation of $s_i$ as the input of TF-M, which helped us make full use of both document-level and local sentence information to model long-distance dependencies. Specifically, the input embedding of the $j$th token can be calculated as:

$$x'_{i,j} = x_{i,j} \oplus s^d_i,$$

where $x_{i,j}$ is the token input embedding used in TF-1; $s^d_i$ is the document-level representation for $s_i$; $x'_{i,j}$ represents the input embedding vector of the $j$th token of $s_i$, which is fed into TF-M to obtain a feature-enhanced representation; and $\oplus$ denotes the concatenation operation.
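A small sketch of this input construction (shapes are assumptions):

```python
import torch

# x_tokens: (seq_len, hidden) TF-1 input embeddings of sentence s_i;
# s_doc: (hidden,) document-level representation of s_i from TF-2.
x_tokens = torch.randn(40, 768)
s_doc = torch.randn(768)

# Concatenate the sentence-level and document-level views per token.
tfm_input = torch.cat(
    [x_tokens, s_doc.unsqueeze(0).expand(x_tokens.size(0), -1)], dim=-1)
# tfm_input: (seq_len, 2 * hidden), fed into TF-M
```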
As shown in Figure 4a, for a given document and its sentences $s_i$, we first applied the spaCy tool to construct the dependency relation between each token $w_{i,j}$ and its parent. We were then able to obtain the dependency paths, from which the link matrix could be constructed. Subsequently, we constructed a syntactic feature mask matrix for the input sentence, which was used in TF-M. To be specific, in the $k$th layer of TF-M, the Syntactic Feature Attention and Multi-head Attention (MH-ATT) hidden states were taken together to enhance the representations of each token in the input sentence; the former mainly focuses on the syntactic relations between $w_{i,j}$ and its parent, and the latter is responsible for modelling the lexical and semantic features of all tokens in the sentence, which can be formalized as follows:

$$h^{MH,k}_{i,j} = \text{MH-ATT}(h^{k-1}_{i,j}), \qquad h^{SF,k}_{i,j} = \text{SF-ATT}(h^{k-1}_{i,j}), \qquad h^{k}_{i,j} = \text{FFN}(h^{MH,k}_{i,j} + h^{SF,k}_{i,j}),$$

where $h^{MH,k}_{i,j}$ and $h^{SF,k}_{i,j}$ denote the hidden states calculated by the $k$th layer of MH-ATT and SF-ATT for the $j$th token of $s_i$, respectively, and $h^{k}_{i,j}$ denotes the hidden state output by the $k$th layer. The SF-ATT hidden states were thus integrated with those calculated by MH-ATT to enhance the representation of each token $w_{i,j}$.
The inputs were fed into SF-ATT to capture their explicit syntactic features, and the calculation can be formalized as follows:

$$h^{SF}_{i,j} = \sum_{m=1}^{l} \alpha_{j,m}\,(V x'_{i,m}),$$

$$\alpha_{j,m} = \text{softmax}_m\!\left(\frac{(Q x'_{i,j})^{\top}(K x'_{i,m})}{\sqrt{d}} + M_{j,m}\right), \tag{13}$$

where $h^{SF}_{i,j}$ denotes the encoding vector of SF-ATT for token $w_{i,j}$, which corresponds to $x'_{i,j}$; $l$ denotes the length of $s_i$; and $\alpha_{j,m}$ and $V x'_{i,m}$ are the attention weight of token $w_{i,m}$ and the linear transformation of its token embedding $x'_{i,m}$, which are used to calculate the attention value for token $w_{i,j}$. The Masking function in Equation (13) constrains the dependency relations among the input tokens: only linked tokens and the current token itself are involved in updating the token embedding, and this is controlled by the masking matrix $M$, where $M_{j,m} = 0$ represents that there is a dependency path between the $j$th and $m$th tokens, and $M_{j,m} = -\infty$ represents that there is none. Figure 4a illustrates the dependency paths for the input tokens. Similar to $V$, $Q$ and $K$ are independent linear transformations for the token embeddings.
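The following sketch illustrates the two TF-M ingredients described above: building the dependency mask with spaCy and a single-head SF-ATT that applies it. How the paper combines SF-ATT with MH-ATT inside a layer is not fully specified, so the summation noted in the closing comment is an assumption:

```python
import math
import torch
import torch.nn as nn
import spacy  # requires: python -m spacy download en_core_web_sm

def dependency_mask(sentence, nlp):
    """Mask M: 0 where tokens are linked by a dependency arc (or m == j),
    -inf elsewhere, following the Masking function in Equation (13)."""
    doc = nlp(sentence)
    n = len(doc)
    mask = torch.full((n, n), float("-inf"))
    for tok in doc:
        mask[tok.i, tok.i] = 0.0        # the token itself
        mask[tok.i, tok.head.i] = 0.0   # its parent
        mask[tok.head.i, tok.i] = 0.0   # and the reverse direction
    return mask

class SFAtt(nn.Module):
    """Single-head syntactic feature attention (SF-ATT) sketch."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, mask):
        # x: (seq_len, dim); mask: (seq_len, seq_len) with 0 / -inf entries
        scores = self.q(x) @ self.k(x).T / math.sqrt(x.size(-1)) + mask
        return torch.softmax(scores, dim=-1) @ self.v(x)

# Usage sketch (token counts of tfm_input and the parse must match):
# nlp = spacy.load("en_core_web_sm")
# x_sf = SFAtt(1536)(tfm_input, dependency_mask("Company A acquired B .", nlp))
# A TF-M layer would then sum x_sf with the MH-ATT output per token.
```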
Feature integration. As shown in Figure 2, the final hidden states of TF-M and the document-aware representations of the entities were fed into TF-3, from which we obtained multi-turn and multi-granularity representations for the entities. Formally, given a sentence $s_i$, TF-M encodes it to obtain $H^{M}_i$, where $H^{M}_i = [h^{M}_{i,1}, \dots, h^{M}_{i,l}]$; then $H^{M}_i$ and $e^d$ are fed into TF-3, from which we obtain the integrated representation $e^f$ for each recognized entity in the input document:

$$e^f = \text{TF-3}(H^{M}_i, e^d), \tag{15}$$

where $e^f$ denotes the multi-granularity, document-aware, and local-feature-enhanced representation for entity $e$. As the enhanced representation models more local features and long-distance dependency information, it facilitates event expanding, which is discussed in Section 4.5. Using Equation (15), we obtained fused representations for each token and entity in the document.
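A minimal sketch of this integration step, assuming the TF-M states are first projected back to the shared hidden size and that TF-3 is a vanilla Transformer encoder:

```python
import torch
import torch.nn as nn

# TF-3 sketch: integrate TF-M token states with document-aware entity reps.
hidden = 768
layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
tf3 = nn.TransformerEncoder(layer, num_layers=2)

h_tfm = torch.randn(1, 40, hidden)   # TF-M final states (projected to hidden)
e_doc = torch.randn(1, 6, hidden)    # document-aware entity representations

out = tf3(torch.cat([h_tfm, e_doc], dim=1))
e_fused = out[:, -6:]                # integrated entity representations e^f
```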
4.5. Event Expanding
In realistic scenarios, the number of event records described in a document is unknown in advance; thus, we need to perform event detection as described in Section 2.2 and then fill the arguments of each specific event role pre-defined in the event table. In this work, we performed event expanding to fulfil the event role filling task as previously described [8,16]. For each triggered event, the event expanding subtask can be formalized as a set of binary classification tasks, i.e., predicting whether to expand (1) or not (0) for every candidate entity. The expanded event path state information can distinctly guide the remaining role filling; thus, an event memory mechanism was designed to memorize the extracted event paths and their arguments. To take advantage of the useful current states, such as the event path state, processed contexts, and event role, the memory tensor $m$ and the fused entity tensor $E^f$ were concatenated, and a trainable event role indicator embedding $r$ was then added to form a new tensor for the next event role prediction. This tensor was fed into the fourth Transformer module, TF-4, to facilitate event path prediction. Finally, the context-aware entity tensor $E^c$, i.e., the final hidden states of TF-4, was fed into a linear layer to conduct the event expanding classification. Figure 5 illustrates the event expanding process, which can be formalized as follows:

$$E^c = \text{TF-4}([\,m; E^f\,] + r), \qquad p = \text{softmax}(W_r E^c),$$

where $m$ denotes the memory tensor, which is used to integrate the current path and history context information; we initialized it with the document-level sentence tensor and updated it during expansion by filling in either the associated entity tensor or a zero-padded one for each argument. $E^c$ denotes the enriched entity representations, which were fed into a linear layer to conduct the path-expanding classification, and $W_r$ is a learnable parameter for role prediction.
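A sketch of one expansion step is given below; how the role indicator embedding is added, along with all sizes, is an assumption rather than the paper's exact formulation:

```python
import torch
import torch.nn as nn

class EventExpander(nn.Module):
    """One path-expanding step (sketch): TF-4 over [memory; entities + role]."""
    def __init__(self, hidden=768, num_roles=10, layers=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.tf4 = nn.TransformerEncoder(layer, num_layers=layers)
        self.role_emb = nn.Embedding(num_roles, hidden)  # event role indicator
        self.classifier = nn.Linear(hidden, 2)           # expand (1) or not (0)

    def forward(self, memory, entities, role_id):
        # memory: (1, m, hidden) path / history contexts;
        # entities: (1, n_entities, hidden) fused entity tensor E^f
        entities = entities + self.role_emb(role_id)     # add role indicator
        h = self.tf4(torch.cat([memory, entities], dim=1))
        ent_ctx = h[:, memory.size(1):]                  # context-aware E^c
        return self.classifier(ent_ctx)                  # per-entity logits

# Example: 8 memory slots, 6 candidate entities, role id 2.
expander = EventExpander()
logits = expander(torch.randn(1, 8, 768), torch.randn(1, 6, 768),
                  torch.tensor(2))
```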