1. Introduction
Textual Summarization (TS) is the task of compressing a text into a short form (the summary) that preserves all the relevant information, where the compression rate is chosen by a domain expert. There exist two approaches to TS:
extractive TS, where sentences are extracted from the input text and re-arranged to create the summary, and
abstractive TS, where the summary is generated word by word. With the ever-growing adoption of neural networks (NNs) in Natural Language Processing (NLP), researchers have found that NN-based encoder–decoder (ED) models [
1,
2] are well-suited to generating abstractive summaries. In such models, the encoder is trained to select the salient information from the text, whereas the decoder is trained to condense such information into a fixed-length summary. To the best of our knowledge, Refs. [
3,
4,
5] represent the first research works that have successfully adopted NNs for abstractive TS.
The main drawback of using NN-based models for TS is their inability to capture the salience of each sentence. They are unable to discern the relevant and informative sentences (or words) in a summary from those that are secondary for the context, treating them equally. For instance, suppose that a document talks about both a big snow storm and a man who started to sell snow online; suppose that the correct summary is related to the online snow selling. The network may decide to focus on the snow storm, instead of the online snow selling, creating a summary of this topic and completely missing the content of the article. We think that this problem is generated by the attention mechanism [
6,
7] of the neural network, i.e., a score distribution over the document words that the network uses to focus on the relevant ones. Since this mechanism was defined for machine translation, where the focus is limited to about three words at a time, in our opinion it is not able to capture the salient words present in longer texts (such as those handled in summarization).
To solve this drawback, we examined the recent works on neural network models based on knowledge bases (KBs) [
8,
9,
10,
11,
12,
13,
14]. In these research works, the neural network explored a knowledge graph created from an external KB and extracted the relevant information (i.e., it selected one or more vertexes) via an attention mechanism that focused on the relevant entities of the graph. These models have shown very high performance, coping with both very large graphs (e.g., millions of vertexes) and missing connections between vertexes, for which the neural networks are able to exploit latent information.
However, adopting an external KB in the summarization task was a challenge. We could construct it using the entities that were present in the document to summarize, but the neural network could suffer from the problem described above, selecting external entities that were neither relevant nor correct for the summary; thus, we decided to create the graph using the sentences present in the input document. The graph selected the sentences that were fundamental to constructing the summary and from which the salient words (i.e., entities) could be copied into the output.
Our contribution is twofold:
a latent knowledge graph based on document sentences, which was changed according to the generated summary. In detail, the model calculated the sentence-level attention scores using the graph, which were used by the word-level attention to select the relevant entities from the sentences;
an inference-time score function that removed repetitions and attention errors, integrating two existing techniques: coverage vectors [
6,
7,
15] and reinforcement learning.
We tested our proposed model on the CNN/Dailymail dataset, obtaining interesting results both in the selection of the relevant sentences and the creation of summaries that contained all the salient information of the input document.
The remainder of the article is organized as follows:
Section 2 describes the latest research works in neural-based knowledge bases and summarization;
Section 3 describes the proposed model, the sentence-level attention, and the inference-time score function;
Section 4 reports the results obtained using the proposed model;
Section 5 concludes the article.
2. Related Works
Our model belongs to the recent research field that exploits document contents to generate a better summary. In this context, to the best of our knowledge, there are two research directions: (1) retrieve, rank, and rewrite summarization models [
16,
17,
18] and (2) content selection methods [
19,
20,
21,
22,
23,
24]. In the former, models are trained to select sentences from the input document or from an external resource (retrieve and rank part) that are salient, i.e., that express the information in a very concise manner. The selected sentences are then rewritten (substituting, adding, and deleting words) to generate the summary. In the latter, researchers have defined layers, masks, and network structures to uncover salient sentences and words while removing redundant information. In this context, our proposed model fuses the sentence graph proposed in Tan et al. [
20] with the attention re-score method of Hsu et al. [
22]; our idea was to find the relevant sentences via the graph, which takes into account the information shared between sentences, and to select the salient words from them using Hsu et al.’s method. Furthermore, instead of applying the softmax function in the PageRank method as Tan et al. did, we decided to use the sigmoid function, which can select more than one sentence.
Other interesting works integrated further information into sequence-to-sequence models. Refs. [
25,
26,
27] used a pre-trained language model to improve the summary generation. Their idea was that the language model carries both fluency and domain style since it is able to deeply understand the meaning of each token. In detail, Liu and Lapata [
26] and Song et al. [
27] used BERT-based models [
28] to solve the summarization task. Kryściński et al. [
25], instead, proposed to extend the sequence-to-sequence model proposed by Paulus et al. [
6] with a language model that integrates external knowledge. Other authors, such as Narayan et al. [
29], preferred to use a topic model to tie the model output to the document.
Currently, researchers are studying how to integrate neural networks with knowledge bases. Some research works focused on table-to-text [
8,
9,
10,
30], where a table that depicts a “plain” knowledge base is read by a sequence-to-sequence model and transformed into a text that reports (almost) all information occurring in the table. In Wang et al. [
30], the authors constructed both a latent graph over the entries of the table, based on their position in the input, and an attention over them. This attention was refined in their subsequent work [
8], where they proposed a hybrid attention based on both the
<slot type, slot value> attention and the
link attention of the latent graph; the latter represented a sort of embedding and was learned via backpropagation rather than being computed by the model. Hayashi et al. [
9] and Liu et al. [
10] defined a language model based on a knowledge graph. In the former work, their model first selected from the knowledge base the entity that had to be copied in the output; then, a mechanism in the language model forced it to generate the words of the entity. In the case where a word of the entity was not present in the vocabulary, they used a character-based language model to generate it. In the latter work, however, their model first selected all the entities that could continue a phrase excerpt and encoded them using an LSTM [
31]; the resulting vector was then passed as input to the language model, which copied the words of a chosen entity. Hu et al. [
11] used a neural network with a copying mechanism [
6,
7] over a knowledge graph to generate the text. Their model first encoded the RDF triplets using a stacked Gated Convolutional Neural Network (GCN) in order to obtain a more complex latent graph; then, the vertexes obtained by the GCN were fed as input to the language model.
Other works, instead, used a knowledge graph in the Question Answering (QA) task [
12,
13,
14,
32]. Given a question, they retrieve from a knowledge base all the entities that are associated with the question; then, both the question and the retrieved entities are passed to a neural network, which selects the answer (i.e., the best retrieved entity). These research works reason on triplets (h, r, t), where h and t are the entities and r is the relation that connects them. In [
13,
32], Saxena et al. calculated a score for each triplet by multiplying together its elements (i.e.,
h,
r, and
t); the triplet with the highest score was then selected as the answer to the input question. Huang et al. [
12] defined a knowledge embedding based on QA, where their idea was to represent the relation and the entity as low-dimensional vectors. To accomplish their task, they trained their model on a simple QA dataset, where the question was easily answered if the correct entity and predicate (the relation) were identified. Finally, Bosselut and Leskovec [
14] proposed an interesting research work: they constructed a knowledge graph using the entities present in the question and the possible answers; then, they defined a model that used attention to focus on the relevant entities of the graph in order to create the context vector. The latter, the relevance of each vertex of the graph, and their relation type were used to predict the answer.
3. Our Approach: A Model with Graph Sentence-Level Attention
Our proposed model is similar to Nallapati et al. [
4]’s model. It consists of a hierarchical bidirectional LSTM encoder and an attention-based LSTM decoder. In detail, unlike the classic encoder, the hierarchical one is composed of two encoders, each one formed by a bidirectional LSTM. In this paper, we call the first encoder the
word encoder and the second one the
sentence encoder.
The model works as follows: first, the word encoder generates sentence representations by reading all the M words in a sentence; then, the sentence representations are read by the sentence encoder to generate the document representation.
Let
be the tokens that compose the first sentence of the document and
the set of
K sentences that compose the document. At each step
i, the word encoder is fed with the input tokens
and produces two encoded states
and
, where the former is generated by reading the sentence’s words from left to right and the latter by reading them from right to left. These two states are called the forward state and the backward state, respectively. Equation (
1) represents the forward and backward states’ construction:
In (
1),
represents the word-level forward LSTM and
the backward one. In addition,
is a function that returns the word embedding [
33,
34] of an input word.
Once the encoded states of a sentence are generated, the last forward state
and the last backward state
are concatenated together to generate the sentence representation
. Such an operation is represented in Equation (
2), in which
is the concatenation operator.
Then, the sentence representations
are passed to the sentence encoder, which produces a forward state
and backward state
. The document representation
is constructed by concatenating the last forward state with the last backward one:
Similar to the encoder, at each step
t, the decoder (a single unidirectional LSTM) receives as input the embedding of the previous word from the decoder state
and the context vector
, and uses them to emit a word in the output (the network emits a probability distribution over the vocabulary that is used to select the next word). During training, the input embedding is that of the target word
, whereas in testing it is the embedding of the previous word emitted by the decoder. In the model, the decoder state is initialized with the document representation. The context vector
is calculated as in Bahdanau et al. [
35]:
where
represents an attention score calculated via the attention approach and
is the representation of the
t-th word.
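To make the weighted sum behind the context vector concrete, the following minimal Python sketch computes it for toy values. It assumes the standard attention formulation, in which the context vector is the attention-weighted sum of the encoder word representations; the names `context_vector`, `attention`, and `states` are illustrative and are not taken from the original implementation.

```python
from typing import List


def context_vector(attention: List[float], states: List[List[float]]) -> List[float]:
    """Attention-weighted sum of the encoder word representations."""
    dim = len(states[0])
    return [sum(a * h[k] for a, h in zip(attention, states)) for k in range(dim)]


# Toy example: two 2-dimensional word representations (hypothetical values).
ctx = context_vector([0.25, 0.75], [[2.0, 0.0], [0.0, 4.0]])
```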
Inspired by the works of [
4,
20,
22], we decided to combine the word-level attention with the sentence-level scores to select only those words coming from sentences that were relevant for the summary. Given
as the sentence-level scores and
as the word-level ones, we computed the attention scores
as follows:
In (
5), the
scores were computed by Equation (
6) (the symbol ⊺ in
is the transpose operator). On the other hand, the
scores were calculated using a graph-based neural network, which took into account both the current decoder state
and the sentence representations. The graph-based network is described in
Section 3.1. The full model is depicted in
Figure 1.
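The combination of the two attention levels can be illustrated with a short Python sketch. It assumes the form used by Hsu et al., in which each word-level score is multiplied by the score of the sentence that contains the word and the result is renormalized over all words; the function name, the argument names, and the toy values are hypothetical.

```python
from typing import List


def combine_attention(word_scores: List[float],
                      sentence_scores: List[float],
                      sentence_of_word: List[int]) -> List[float]:
    """Scale each word-level score by the score of its sentence, then renormalize."""
    weighted = [w * sentence_scores[s]
                for w, s in zip(word_scores, sentence_of_word)]
    total = sum(weighted)
    if total == 0.0:
        # Degenerate case: fall back to the unmodified word-level scores.
        return list(word_scores)
    return [x / total for x in weighted]


# Two sentences of two words each; the second sentence is scored as more relevant.
word_scores = [0.25, 0.25, 0.25, 0.25]
sentence_scores = [0.2, 0.8]
sentence_of_word = [0, 0, 1, 1]
combined = combine_attention(word_scores, sentence_scores, sentence_of_word)
```

With uniform word-level scores, the attention mass moves toward the words of the higher-scored sentence, which is the behavior the paragraph above describes.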
For the output, we adopted the mixture model proposed by See et al. [
7], which combines the probability of a vocabulary word with its attention scores. Thus, the probability of emitting a word
at step
t is defined as follows:
where
,
,
, and
are learnable parameters.
is a value within the range [0, 1]
used to mix the two distributions and it is calculated using Equation (
8), where
and
are learnable parameters and
is the previously emitted word.
In Equation (
7),
controls how much information coming from the attention distribution enters the vocabulary distribution. If
is close to 1, the model generates the next word only using the vocabulary (i.e., selecting the word with the highest probability); if
is close to 0, then the model uses the attention distribution on the input document.
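A minimal Python sketch of this mixture is given below, assuming the pointer-generator formulation of See et al.: the vocabulary probability is weighted by the mixing value and the attention mass of every source position holding the word is weighted by its complement. The names `final_distribution` and `p_gen` and all the numbers are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def final_distribution(p_gen: float,
                       vocab_dist: Dict[str, float],
                       source_tokens: List[str],
                       attention: List[float]) -> Dict[str, float]:
    """Mix the vocabulary distribution with the copy (attention) distribution."""
    mixed: Dict[str, float] = defaultdict(float)
    for word, prob in vocab_dist.items():
        mixed[word] += p_gen * prob
    # A source token that occurs several times accumulates all of its attention mass.
    for token, score in zip(source_tokens, attention):
        mixed[token] += (1.0 - p_gen) * score
    return dict(mixed)


vocab_dist = {"the": 0.5, "snow": 0.3, "storm": 0.2}
source_tokens = ["snow", "seller", "snow"]  # "seller" is out of vocabulary
attention = [0.2, 0.5, 0.3]
dist = final_distribution(0.6, vocab_dist, source_tokens, attention)
```

In this toy case the out-of-vocabulary token "seller" can only be emitted through the attention term, which is why a mixing value close to 1 effectively disables copying.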
We trained the model to minimize the sum of the negative log likelihood of the sequence of target words
:
We found that the model tended to generate summaries that had repeated words (or sentences). Thus, in order to minimize the repeated words (sentences), we decided to apply some heuristics both in the training and inference steps. In the training step, we applied the coverage method defined by See et al. [
7] (it is described in
Section 3.2). At inference time, we adopted the scoring function of Gehrmann et al. [
21] for the beam search, which uses a hyperparameter
to control the length of the summary. We also set a minimum summary length based on the training data. Additionally, following Paulus et al. [
6], we restricted the beam search to never repeat the same trigram in the summary.
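The trigram restriction can be sketched in a few lines of Python. This is only an illustration of the idea, not the beam search used in the experiments; the helper names `repeats_trigram` and `allowed_extensions` are hypothetical.

```python
from typing import List, Sequence


def repeats_trigram(prefix: Sequence[str], candidate: str) -> bool:
    """Return True if appending `candidate` repeats a trigram already in `prefix`."""
    if len(prefix) < 2:
        return False
    new_trigram = (prefix[-2], prefix[-1], candidate)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return new_trigram in seen


def allowed_extensions(prefix: Sequence[str], candidates: List[str]) -> List[str]:
    """Keep only the candidate tokens that do not create a repeated trigram."""
    return [c for c in candidates if not repeats_trigram(prefix, c)]


prefix = ["the", "man", "sells", "snow", "and", "the", "man"]
allowed = allowed_extensions(prefix, ["sells", "buys"])
```

In a beam search, a hypothesis extended with a blocked token would simply be discarded (or given zero probability) before the beam is pruned.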
3.1. Sentence-Level Scores
Inspired by recent works on knowledge bases [
9,
10,
13,
14], our idea was to create a knowledge graph (KG) from which the neural network could select the relevant entities to insert in the summary. However, creating a KG for the summarization task was challenging: if we extracted all the entities present in the text and created the KG from them, the model could select entities that were not relevant for the summary, generating a text that was not related to the document’s content.
As mentioned in the Introduction, we decided to create a latent graph looking at the informativeness of each sentence. The salient words could be copied into the summary via the pointer network [
6,
7]; the sentences were then seen as containers whose information must be unveiled.
The importance of a sentence to the summary was calculated by comparing it with both the current decoder state and the other sentences. The graph was, in particular, needed to compare the sentences to one another, specifically to understand whether some sentences were more informative than others.
For instance, let us suppose that we want to calculate the score for a first sentence, s1. If we compare s1 with the decoder state d, we obtain a score that highlights the relevance of s1 with respect to the generated summary. Now, let us suppose that we want to calculate the score for a second sentence, s2, which contains the same facts as s1 as well as some additional ones. If we compare s2 and d, we obtain its relevance with respect to the summary but not with respect to s1. In other words, in this simpler setting, we cannot know whether a sentence is more informative than another one.
We thus constructed a graph where the vertexes were the sentences and the edge between two sentences was weighted by their cosine similarity. For simplicity, we treated the current decoder state, i.e., what the model summarized so far, as a sentence. Following the idea of Tan et al. [
20], we then applied the PageRank algorithm to the graph to calculate the score of each vertex, assigning high scores to informative and relevant sentences while pushing the scores of redundant sentences down; the pointer network of the model (defined in Equation (
7)) then copied the salient words coming from the high-score sentences.
Figure 2 depicts an example of the graph score calculation:
We defined the adjacency matrix
A of the graph via a bi-linear product of the sentences over which we applied the cosine similarity function, similar to the work proposed by Yao et al. [
36]. The cosine similarity ensured that the diagonal of
A only contained values equal to 1 (because we were comparing a vector with itself), as required by graph-based models (they require that the identity matrix be added to the adjacency matrix in order to normalize the values).
Figure 3 shows an example of a graph and its adjacency matrix.
In detail,
A is defined as follows, in which
and
are two sentence representations.
We also applied the ReLU function to the matrix
A in order to remove all edges between dissimilar sentences (i.e., those with a negative cosine score). The mathematical calculation of the sentence scores is defined by Equation (
11):
where
I is the identity matrix,
A is the adjacency matrix of the graph,
D is the diagonal matrix where the
i-th diagonal element is equal to the sum of the
i-th column of the matrix
A, and
y is a one-hot vector of size
. It has all values equal to zero, except for the one that represents the decoder state. The scalar
is the damping factor, whereas
is the sigmoid function that assigns a score in the range (0, 1)
to each sentence. Unlike the softmax function that could assign the probability mass to a single sentence, the sigmoid function selects more than one sentence, allowing the model to use them either for the context vector or the pointer network.
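The following Python sketch illustrates the overall procedure under stated assumptions: the adjacency matrix holds ReLU-clipped cosine similarities, the decoder state is appended as an extra vertex, and the scores are obtained with a personalized PageRank that restarts from the decoder state, computed here by power iteration rather than in closed form. The exact form of the scoring equation is not reproduced; the function names, the column normalization, and the toy vectors are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def sentence_scores(sentences: List[Vector], decoder_state: Vector,
                    damping: float = 0.9, iterations: int = 100) -> List[float]:
    # The decoder state is treated as one more vertex (placed last here).
    nodes = sentences + [decoder_state]
    n = len(nodes)
    # ReLU over cosine similarities drops edges between dissimilar vertexes.
    adj = [[max(0.0, cosine(nodes[i], nodes[j])) for j in range(n)]
           for i in range(n)]
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    # One-hot restart vector pointing at the decoder state.
    y = [0.0] * (n - 1) + [1.0]
    rank = list(y)
    for _ in range(iterations):
        rank = [
            damping * sum(adj[i][j] / col_sums[j] * rank[j]
                          for j in range(n) if col_sums[j] > 0.0)
            + (1.0 - damping) * y[i]
            for i in range(n)
        ]
    # Sigmoid squashing, so several sentences can receive a high score at once.
    return [1.0 / (1.0 + math.exp(-r)) for r in rank[:-1]]


# Toy example: the first two sentences are close to the decoder state, the third is not.
scores = sentence_scores([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], [1.0, 0.1])
```

Because the sigmoid is applied to each score independently, the first two sentences both keep a higher score than the third instead of competing for a single probability mass.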
3.2. Coverage Method
In order to avoid word and sentence repetitions in our model, we adopted the coverage vector of See et al. [
7]. In other words, we modified Equation (
6) to include it, thus obtaining:
Finally, we used the auxiliary loss proposed by See et al. [
7] to penalize the overlapping between the attention distribution and the coverage vector:
During the experiments with the coverage loss, we noticed that the model was producing semantically incorrect summaries. In our opinion, this was because the coverage loss had an impact on the attention and an indirect impact on the sentence-level scores and representations, invalidating part of the training.
We therefore decided to use this loss at inference time in order to guide the model in selecting the words that minimized the overlap between the attention and the coverage vector. We combined
covloss with the reinforcement learning Q-function since the latter is capable of adapting to the environment, selecting the best action (in this case, the best word to generate) at each time step. Note that the resulting function is not a Q-learning one, i.e., it does not calculate a score for each state (time step) and action (generated word). We then defined the score function, called
covscore, as in Equation (
14).
Covscore was added to the score function of the beam search; at
,
.
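As an illustration of the coverage mechanism this section builds on, the sketch below follows the definition of See et al.: the coverage vector at a step is the sum of the attention distributions of the previous steps, and the loss at that step is the sum of the element-wise minimum between the attention and the coverage vector. It does not reproduce the covscore function; the function name and the toy attention values are hypothetical.

```python
from typing import List


def coverage_loss(attention_history: List[List[float]]) -> float:
    """Sum over decoding steps of sum_i min(a_t[i], c_t[i]),
    where c_t is the sum of the attention distributions of the previous steps."""
    if not attention_history:
        return 0.0
    coverage = [0.0] * len(attention_history[0])
    loss = 0.0
    for attention in attention_history:
        loss += sum(min(a, c) for a, c in zip(attention, coverage))
        coverage = [c + a for c, a in zip(coverage, attention)]
    return loss


# Attending to the same source word twice is penalized more than moving on.
repeated = coverage_loss([[0.9, 0.1], [0.9, 0.1]])
spread = coverage_loss([[0.9, 0.1], [0.1, 0.9]])
```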
4. Evaluation
We trained and tested the proposed model on the
CNN/DailyMail dataset [
37], which contains online articles paired with multi-sentence summaries. We used the script supplied by Nallapati et al. [
4] to obtain the same version of their dataset that consists of 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs. Each article has 781 tokens on average, whereas each summary has 3.75 sentences and 56 tokens on average. We used the
non-anonymized version of the dataset (please see Chen et al. [
38] for problems regarding the anonymized version).
The model has 256-dimensional hidden layers and 128-dimensional embedding layers. Following See et al. [
7], we used a small vocabulary of 50k tokens for both the source and target. We set the damping factor
to 0.9. All the weights were initialized with a normal distribution, with a mean of 0 and a standard deviation of 0.1. The model was trained using AdaGrad [
39], with a learning rate of 0.15. We also used gradient clipping with a gradient norm of 2.0 and dropout [
40] with a probability of 0.2 (keep probability of 0.8) to improve model generalization. We used the loss on the validation set for early stopping.
The training was performed on a single RTX 2080 Ti GPU, with a batch size of 8. At testing time, we used a beam size of 5 and an
value of 1.4 for Gehrmann et al. [
21]’s score function. We set the maximum number of encoder input sentences to 8. We truncated the length of each sentence to 50 tokens and the length of the summary to 100 tokens in order to speed up the convergence of the model. We trained the model for about 1,200,000 iterations. The training of the model took about 5 days of computation.
Table 1 reports the hyperparameters and their setting.
We compared the proposed model with the following state-of-the-art works:
See et al. [
7]’s pointer-generator model, which used the pointer network and the coverage method as in our proposal;
Nallapati et al. [
4]’s models, as they employed the hierarchical encoder with the two-level attention mechanism and a pointer network;
Hsu et al. [
22]’s model, as it used the same equation to combine the two attention distributions;
Tan et al. [
20]’s abstractive model, for its hierarchical structure, which was also extended to the decoder.
4.1. Results
Table 2 shows the Rouge scores (Rouge-1, Rouge-2, and Rouge-L) [
41] of the models. Our models “
our model + coverage” and “
our model + coverage + covscore” obtained higher R-1 and R-2 scores than Nallapati et al.’s models. We suspect that our models had a low R-L because of their high abstraction power: since the generated summaries contained tokens that were not present in the reference ones, the longest common subsequence was short.
We noticed that the auxiliary loss did not improve the Rouge scores. Since the auxiliary loss modified the attention scores, it also affected the sentence-level scores and perturbed the sentence representations, invalidating part of the training. This was supported by the fact that the covloss increased the loss on the validation set from to rather than decreasing it.
In light of these results, we can conclude that the proposed graph-based sentence-level attention generally performed better than the sentence-level attention proposed by Nallapati et al. [
4]. It allowed for the generation of more abstractive summaries, as reported in
Section 4.3, at the cost of sacrificing recall. In more detail, since the Rouge score only measures the token overlap between the generated summary and the reference one, it indirectly penalizes models that use synonyms and periphrases. Indeed, our best model had very high precision (
for R-1,
for R-2, and
for R-L) but very low recall (
for R-1,
for R-2, and
for R-L). This is also the reason why our models had lower Rouge scores than those of See et al. [
7], whose models were more oriented toward copying words from the input document than generating novel ones.
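The trade-off between precision and recall can be made concrete with a simplified Rouge-N computation. The sketch below uses clipped n-gram counts and omits the stemming and other preprocessing of the official Rouge toolkit, so it is an approximation rather than the scorer used for Table 2; the function name and the example sentences are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def rouge_n(generated: List[str], reference: List[str],
            n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) based on clipped n-gram overlap."""
    gen = Counter(tuple(generated[i:i + n]) for i in range(len(generated) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum((gen & ref).values())  # clipped n-gram matches
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


generated = "man sells snow".split()
reference = "a man is selling snow online after the storm".split()
precision, recall, f1 = rouge_n(generated, reference, n=1)
```

In this toy case the short summary matches the reference on two of its three tokens but covers only two of the nine reference tokens, and the use of "sells" instead of "selling" is counted as a miss.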
Finally, our models underperformed with respect to those in Hsu et al. [
22] and Tan et al. [
20]. This was expected since their models had a more complex architecture and used more complex features. The model of Hsu et al. [
22] is composed of two trained models: an extractive model to select the sentences that are useful for the summary and an abstractive one to digest them; our models, instead, were jointly trained to compute a score for each sentence and generate a summary. The model of Tan et al. [
20] first generates a sentence-level decoder state and then generates the words of a sentence using the word-level attention and the pointer network. Our models only combine the two attentions and require fewer resources to be trained, whereas theirs is resource-intensive because the hierarchical decoder adds a further million parameters, which has a positive impact on the results.
Figure 4 shows an example of the generated summaries. From the figure, it can be seen that our model produced a good summary. There was only one error, caused by the phrase “
somalia’s internationally recognized government that had been under pressure from al-shabaab” because it was attached to the previous sentence, which was semantically distinct from the latter phrase. The adoption of the coverage vector corrected the error and the use of the coverage loss slightly improved the summary. The use of the
covscore function modified part of the summary, including more details about the “beliefs” of the terrorist group.
Figure 5 shows the average percentage of duplicate n-grams. The measure gives an insight into both the reasons behind the low Rouge F1 scores and the impact of the coverage vector; in detail, a model with duplicate tokens in the summary could have low Rouge F1 scores because those repetitions prevented the generation of relevant terms while stretching the summary until it satisfied the minimum length. We computed the average percentage of duplicate n-grams as follows: for each method, we counted the number of repeated n-grams (from 1-grams to 4-grams) in the generated summaries, and we divided this by the length of the summary. From the figure, it is possible to observe that the coverage vector reduced the repetitions. Such a value was further reduced with the application of the coverage loss but it did not improve the Rouge scores as reported in
Table 2. Finally, the application of the
covscore function produced summaries with an average n-gram repetition that was close to the reference ones.
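The duplicate n-gram measure described above can be sketched as follows. The sketch assumes that every occurrence of an n-gram after the first one counts as a repetition and that the count is divided by the number of tokens in the summary; the function name and the example summary are hypothetical.

```python
from collections import Counter
from typing import List


def duplicate_ngram_percentage(tokens: List[str], n: int) -> float:
    """Percentage of repeated n-grams relative to the summary length in tokens."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not tokens or not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence after the first one counts as a repetition.
    repeated = sum(c - 1 for c in counts.values())
    return 100.0 * repeated / len(tokens)


summary = "the man sells snow and the man sells ice".split()
percentages = {n: duplicate_ngram_percentage(summary, n) for n in range(1, 5)}
```

Averaging these values over all the generated summaries of a model gives one bar per n-gram order.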
Finally, we compared the summaries in terms of the
readability and
relevance scores; the former expresses the quality of the summary in terms of punctuation, syntax, and semantic coherency, whereas the latter expresses whether the summary captured all the salient information from the input document. Both metrics were in the range
,
. Following Paulus et al. [
6], we randomly selected 100 generated summaries by “
our model + coverage + covscore” and we asked three annotators to evaluate them. For a better comparison, we also extracted the same summaries from Hsu et al. [
22]’s model and Tan et al. [
20]’s model.
Table 3 shows the inter-annotator agreement for both
readability and
relevance. From the table, we can see that our model had, in general, very high agreement, i.e., the annotators agreed on the same scores, with a
p-value of
. This was only surpassed by Hsu et al.’s model for the
relevance score; however, in this latter model, the annotators were discordant for the
readability score. Indeed, it had a
p-value of
. Finally, Tan et al.’s model had very low agreement, meaning that the annotators did not agree on the scores, with a
p-value lower than 1
.
Table 4 reports the results of the two metrics. The results show that Hsu et al. [
22]’s model obtained very high scores because it selected the relevant sentences in an offline manner (the sentence extractor model). It is interesting to notice that the model of Tan et al. [
20] had very low readability and relevance scores, despite obtaining very high Rouge scores. This means that the model generated summaries that would not completely satisfy human readers. Our model instead had a very high relevance score since it copied portions of the input document into the summary. However, the readability was not so high, meaning that it mostly truncated the copied sentences without merging them into a coherent summary. Also, our model outperformed Tan et al. [
20]’s model for both the
readability and
relevance scores, despite obtaining lower Rouge scores.
4.2. Attention Scores Analysis
In this section, we evaluate the scores generated by the model at both the sentence level and the word level. Plotting the scores helps to understand the model’s behavior since neural networks are black boxes. For our analysis, we selected a summary generated by “our model + coverage + covscore” from the test set.
Figure 6 shows the sentence-level scores generated by the network; it can be seen that the model mainly focused on the first sentence (dark colors) since it found it relevant for the summary. Thus, the model copied part of this sentence into the summary, as shown in
Figure 7. These figures demonstrate the ability of the network to find the most informative sentence and to select the relevant words from it.
Nevertheless, the model copied a large portion of the sentence rather than just selecting the relevant words. As a result, a summary could reach the maximum permitted length (which was set to 100 tokens) before incorporating all the relevant text and entities, thus missing salient information.
4.3. Abstractiveness of the Models
One important point to evaluate is the capability of the models to generate abstractive summaries. This requires the model to perform complex operations such as using synonyms and periphrasis, which could decrease the likelihood of matching the reference summaries. To evaluate the abstractiveness of the models, we calculated the average percentage of novel n-grams, which measures the capability of a neural network to generate n-grams that are not present in the reference summary. A low percentage of this measure means that the network is more conservative and tends to use words that appear in the document, whereas a high percentage means that the network is abstractive and tends to generate summaries that contain synonyms and periphrasis.
Figure 8 depicts the percentage of novel n-grams occurring in the generated summaries. We computed these percentages in the following way: for each n-gram (1-grams to 4-grams), we counted how many (unique) n-grams in the generated summary were not present in the reference summary. Then, we divided this by the total number of (unique) n-grams in the generated summary. The figure shows that our model had a high percentage of novel n-grams. These values remained stable with the adoption of the coverage vector, meaning that the model was using all the relevant words. Finally, the coverage loss increased the number of novel n-grams due to the generation errors of the model, i.e., the model tended to generate summaries that contained words not related to the topic of the document. The application of the
covscore function slightly increased the number of novel n-grams.
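The computation of the novel n-gram percentage, as described in this section, can be sketched in Python as follows. The comparison text is passed as a generic `reference` argument; the function names and the example sentences are hypothetical.

```python
from typing import List, Set, Tuple


def unique_ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_percentage(generated: List[str], reference: List[str], n: int) -> float:
    """Percentage of unique n-grams of `generated` that do not occur in `reference`."""
    gen = unique_ngrams(generated, n)
    if not gen:
        return 0.0
    ref = unique_ngrams(reference, n)
    return 100.0 * len(gen - ref) / len(gen)


generated = "a man sells snow online".split()
reference = "a man is selling snow online".split()
novel_unigrams = novel_ngram_percentage(generated, reference, 1)
novel_bigrams = novel_ngram_percentage(generated, reference, 2)
```

Here a single substituted word ("sells") makes one unigram and two bigrams novel, which shows why the percentage grows quickly with the n-gram order.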
We also computed the average
value of Equation (
7) to check if the models were more oriented toward the vocabulary or the pointer network.
Table 5 shows the average
value for each model, which was over
. This value shows that the models used the vocabulary to generate the summaries, rarely relying only on the pointer network.
5. Conclusions
In this paper, we presented a novel NN-based model for TS. The model used a hierarchical encoder to obtain both word and sentence representations. Then, it computed a score for each sentence that expressed its importance with respect to the other sentences. These scores were then used to adjust the word-level attention scores, ensuring that the model focused only on the important words. Furthermore, since we used a pointer network model, the model only copied the words coming from the informative sentences into the summary.
The model that used the coverage vector performed better than the models of Nallapati et al. [
4]; however, our models had lower scores (about 2 Rouge points) than See et al. [
7]’s models and did not surpass Hsu et al. [
22]’s model and Tan et al. [
20]’s model.
We found our model had high abstractive power, i.e., the ability to use synonyms and periphrases, as confirmed by both the novelty scores (see
Figure 8) and the
values (see
Table 5). Finally, we outperformed Tan et al. [
20]’s model in terms of the
readability and
relevance scores [
6].
In future work, we are interested in improving the recall of the model, which was its main weakness. To increase the recall, we think that the use of topics generated by an LDA model [
42] could be useful since they capture the semantic content and the domain style of the document, tying the summary to it. We also plan to improve the graph-based sentence-level attention, which showed very good results. We intend to follow Hsu et al. [
22]’s idea, training the sentence-level attention (via loss function) to recognize the best sentences for the summary. However, the method proposed by Hsu et al. [
22] cannot be directly adopted; we have to define a loss that can adapt to the generated summary, identifying the best sentences and penalizing the model in case it does not select them.
Another interesting research direction would be to integrate an external knowledge base (KB) with the model; in this case, we would have to develop a classifier to discern the entities that are relevant to the summary from those that are not. This task would require the annotation of a dataset with the correct KB entities for the summary. We think that reasoning on the graph would reduce the copying behavior demonstrated by the pointer network.
Finally, we plan to study how to improve the covscore function since it slightly increased the performance of the model while reducing the redundancy and attention errors.