Multistage Mixed-Attention Unsupervised Keyword Extraction for Summary Generation

Wu, Di; Cheng, Peng; Zheng, Yuying

doi:10.3390/app14062435

by Di Wu *, Peng Cheng and Yuying Zheng

School of Information and Electronic Engineering, Hebei University of Engineering, No. 19 Taiji Road, Handan 056000, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(6), 2435; https://doi.org/10.3390/app14062435

Submission received: 9 February 2024 / Revised: 1 March 2024 / Accepted: 10 March 2024 / Published: 13 March 2024

Abstract

Summary generation is an important research direction in natural language processing. Existing summary generation models have difficulty processing redundant information and cannot generate high-quality summaries from long text. To address these problems, an N + 1 coarse–fine-grained multistage summary generation framework is constructed with BART as the backbone model, and a multistage mixed-attention unsupervised keyword extraction summary generation model is proposed (multistage mixed-attention unsupervised keyword extraction for summary generation, MSMAUKE-Summ^N). In the N-coarse-grained summary generation stages, a sentence filtering layer (PureText) is constructed to remove redundant information in long text, and a mixed-attention unsupervised approach is used to iteratively extract keywords, assisting summary inference and enriching the global semantic information of coarse-grained summaries. In the 1-fine-grained summary generation stage, a self-attentive keyword selection module (KeywordSelect) is designed to obtain keywords with higher weights and enhance the local semantic representation of fine-grained summaries. The N-coarse-grained and 1-fine-grained summary generation stages are connected in series to obtain long-text summaries through a multistage generation approach. The experimental results show that the model improves the ROUGE-1, ROUGE-2, and ROUGE-L metrics by a minimum of 0.75%, 1.48%, and 1.25%, respectively, over the HMNET, TextRank, HAT-BART, DDAMS, and Summ^N models on summarization datasets such as AMI, ICSI, and QMSum.

Keywords: summary generation; multistage; mixed-attention; unsupervised; keyword extraction

1. Introduction

With the continuous development of large language models, text generation has become a research hotspot in the field of natural language processing. It is widely applied to tasks such as article creation, dialog generation, and conference summarization. Summary generation is a text generation task [1], and researchers use different techniques to generate summaries, including rewriting or extracting various information from documents [2]. Many research results have been achieved in generating summaries for short texts [3,4]. However, it is still difficult to quickly obtain high-quality summaries from long textual data such as long documents, news interviews, and conference content. Therefore, improving the quality of summaries of long texts is of great research importance [5].

In long-text summary generation, current models fail to comprehensively extract the important information of the text, and long text also contains redundant information. To improve the generative ability of the summarization model, sentence filtering is therefore usually used to remove redundant information from long texts during data preprocessing. To address the limited ability of summarization models to capture important information, researchers have designed supervised keyword extraction methods based on deep learning models, which require a large labeled training dataset. Unsupervised methods, on the other hand, do not require labeled datasets and have lower dataset requirements [6]. Unsupervised keyword extraction has therefore been adopted by many scholars, and methods based on text embedding, document embedding, and hybrid statistical embedding have emerged, such as MDERank [7], SIFRank [8], and KeyGames [9]. Although these methods improve the extraction of keywords, lower-weighted keywords may be used first for summarization inference while higher-weighted keywords are left unused, resulting in poor quality and readability of the summaries [10].

Therefore, the self-attention mechanism has been used by many scholars to select keywords with higher weights. For example, when generating summaries for long texts or multiple documents, a self-attention mechanism is employed by the Transformer model [11] to capture the dependencies between keywords in the input texts or documents. It achieves a focus on some of the important keywords. However, when the extracted keywords receive a small attentional weight, it affects the complete encoding of the keywords as well as the representation of important information.

The rest of this paper is organized as follows: Section 2 reviews the work related to this study. In Section 3, the framework of the summary generation model MSMAUKE-Summ^N is given and the components of the framework are described in detail. Experimental results and analysis related to the MSMAUKE-Summ^N model are shown in Section 4. Section 5 summarizes the conclusions of this paper and future research directions.

2. Related Work

Currently, Transformer-based summary generation models have made great progress in text summarization [11]. However, due to the large number of words in long texts and the sparse distribution of important information [12], the quality of summarization in long texts is poor as it is difficult for the summary generation model to capture the dependencies between important information in the text. A common solution is to reduce the input long text to short text; the BART model was proposed by Lewis et al. [13], which uses truncation of the input to reduce long text. Another approach is to improve the attention to cope with long text summarization tasks; Local Sensitive Hashing (LSH) attention was utilized by Kitaev et al. [14] to optimize the attention mechanism in Transformer to reduce the effect of quadratic complexity. In addition, a hierarchical self-attention mechanism was used by HMNet [15] and HAT-BART [16] to improve the input constraints of typical self-attention models, thus adapting the models to longer text inputs. Although the above methods can handle text content of a certain length, they are not flexible enough when the length of the text varies greatly.

The multistage modeling framework is capable of adapting to texts of different lengths by varying the number of stages, which is a more flexible way of processing compared to the truncated input and improved self-attention approaches. Therefore, the multistage summary generation model framework has been adopted by many scholars to adequately process information such as long texts or documents. A multistage summary generation framework structure, Summ^N, was proposed by Zhang et al. [17]. It is used to extract summaries of long texts and documents, and the final summary is obtained by N-coarse-grained and one-fine-grained summary generation stages. A staged long-text summary generation approach (PLSGA) was proposed by Fang et al. [18] to obtain the final summary. It includes key information extraction, extraction model training, generative model training, and candidate summary screening stages. A two-stage automatic summary generation model combining dual topic embedding and sentence absolute position embedding was proposed by Ren et al. [19]. It incorporates semantic features by introducing topic embedding in each of the two stages, extracts important information of the text in the extraction stage, and reduces the redundancy of the summary content. In addition, a sentence filtering layer was constructed by Mei et al. [20] for data preprocessing. It removes redundant sentences in long texts, shortens the text length, and provides the model with more streamlined data samples. The above methods reduce the adverse effects of excessively long text by moderating the number of stages, but these methods do not comprehensively consider the keywords and local semantic information of the text, resulting in summaries that omit important information.

The summary generation task needs to consider the importance of keywords for text summarization in addition to the effect of the length of the text. From past research work, it can be observed that with the continuous development of Pretrained Language Models (PLMs), many text generation studies have been centered around the technique of keyword extraction. Therefore, PLMs are used by many researchers as an embedding layer of keyword extraction models [21,22] to improve the extraction of keywords. Keyword extraction models are mainly classified into two categories, supervised and unsupervised. The use of supervised keyword extraction requires a large amount of labeled training data and the models generalize poorly outside the training data domain. Therefore, unsupervised learning was used by Eirini et al. [23] for keyword extraction, where the adjacency matrix corresponding to the word map of a target document is used as a vector representation of the vocabulary of that document, and distributed modeling is performed. To address the inability of embedding-based keyword extraction models to fully utilize PLMs, an unsupervised keyword extraction method, PromptRank, was proposed by Kong et al. [24]. It calculates the probability of keyword generation by the decoder based on prompts. Interpretable neural networks were used by Joshi et al. [25] for unsupervised keyword extraction, defining keywords as iconic words that are favorable for predicting the topic of a document. The above unsupervised approach simplifies the process of keyword extraction, but does not fully consider the importance of the keywords and the semantic relevance of the keywords to the sentences in the document. A mixed-attention model, AttentionRank, was proposed by Ding et al. [26] to extract keywords from documents in an unsupervised manner. It uses self-attention to determine the importance of keywords in the context of a sentence and cross-attention to calculate the semantic relevance between keywords and sentences in a document. The above methods were able to adequately extract the keywords from the text, but failed to enable the keywords to be used correctly in the inference of the summaries, resulting in the absence of important information in the summaries.

The uneven distribution of the extracted keywords in the original text and the lack of a rational method for their use lead to the fact that some keywords with lower weights may be used first in the inference of the summary, affecting the representation and distribution of important information in the summary. Therefore, in order to increase the detailed information of the summary, the targeted selection of keywords with higher weights to aid in inference is a critical step in summary generation. A method of selecting the best keywords was proposed by Tim et al. [27] to iteratively select keywords through an attention scoring function to obtain the best results overall. Meanwhile, in order to select more representative keywords, a probabilistic method was proposed by Akash et al. [28] for selecting representative keywords in a professional domain. A binary mixture model is used to obtain the distribution of keywords and select a representative subset of keywords. A keyword selection method was proposed by Venkatesh et al. [29]. The self-attentive head of the last layer of the pretraining encoder is clustered to select the most popular keywords from the original article. To address the problem of data sparsity in low-resource complex Named Entity Recognition (NER), an attention graph-aware keyword selection method was proposed by Ghosh et al. [30]. It uses attention graph-assisted selective masking to retain named entities in the input sentence and select contextually relevant keywords. The above methods select the most representative keywords in the text, and although they enhance the local semantic representation of the summary, there is some repetition in the use of keywords.

Aimed at the problems of high redundancy and poor comprehensiveness, accuracy, and readability in existing long text summaries, an N + 1 coarse–fine-grained multistage summary generation framework is constructed and a multistage mixed-attention unsupervised keyword extraction summary generation model (MSMAUKE-Summ^N) is proposed. During data preprocessing, the sentence filtering layer, PureText, is set to remove useless information from the data samples to obtain high-quality data samples. A mixture of self-attention and cross-attention is used to consider the importance and semantic relevance of keywords at the sentence level and document level. Multistage iterative extraction of keywords is used to facilitate summarization inference to enrich the global semantic information of coarse-grained summaries, rationalize the use of keywords using the self-attention mechanism, select more weighted keywords to assist the summary, and increase the detailed information of the summary to enhance the local semantic representation of the fine-grained summary.

3. MSMAUKE-Summ^N Model

Aimed at the problems of information redundancy, poor accuracy, comprehensiveness, and readability in current long-text summaries, an N + 1 coarse–fine-grained multistage summary generation framework is constructed, and a multistage mixed-attention unsupervised keyword extraction summary generation model (MSMAUKE-Summ^N) is proposed. The model framework contains two parts: the N-coarse-grained summary generation stage and the 1-fine-grained summary generation stage. In the N-coarse-grained summary generation stage, the sentence filtering layer, PureText, is constructed to remove redundant information from the original text (D) and the target text (T) to provide high-quality data samples for the summary generation model. An unsupervised approach with mixed attention is used to iteratively extract keywords to assist summary inference and enrich coarse-grained summary (Z) global semantic information. In the 1-fine-grained summary generation stage, self-attention is utilized to select keywords with higher weights to enhance the local semantic representation of the fine-grained summary (F). The framework of the MSMAUKE-Summ^N model is shown in Figure 1.

In Figure 1, an N + 1 coarse–fine-grained multistage summary generation framework is constructed, which connects the N-coarse-grained and 1-fine-grained summary generation stages in series to obtain summaries of long texts in a multistage manner. In the N-coarse-grained summary generation stage, data filtering and segmentation and coarse-grained summary generation are iterated N times. With the keyword-enhanced coarse-grained summary (Z) and the target text (T), the model is fine-tuned in the 1-fine-grained summary generation stage to obtain the final fine-grained summary (F). Specifically, during data filtering and segmentation, the sentence filtering layer, PureText, is used to remove redundant information from the text, and the original text and the target text are segmented and paired using a ROUGE-based greedy algorithm. During coarse-grained summary generation, the initially extracted keywords are screened with mixed attention: the keywords are attended to at both the sentence level and the document level, and the contextual relevance of the keywords to the sentences is considered. The final score of each keyword is calculated by linear combination to address keyword duplication, so that more valuable keywords can assist summary inference and enhance the global semantic representation of the summary. During fine-grained summary generation, self-attention is used to select keywords with higher weights to aid summary inference and enhance the local semantic representation of the summary.

3.1. N-Coarse-Grained Summary Generation Stage

3.1.1. Data Filtering and Segmentation

To remove redundant information in long text, weakly supervised learning is used to train the BERT model in the N-coarse-grained summary generation stage. A versatile and lightweight sentence filtering layer (PureText) is constructed to filter out redundant sentences by utilizing text similarity. The ROUGE score of each sentence is computed against the standard summary sentences, the sentences are ranked by importance, and up to 60% of the sentences in the text are filtered out. The BERT model is fine-tuned so that the ROUGE-1 metric value distinguishes the importance of different sentences. Since sentences are subunits of text with grammatical structure, it is natural to filter at the sentence level to produce a more concise text. It is hypothesized that the ROUGE-1 of a sentence is an important indicator that influences the evaluation of a summary. Accordingly, the ROUGE-1 similarity score between each sentence and the standard summary sentences is calculated; sentences with scores higher than the median are labeled ‘important’ and sentences with scores lower than the median are labeled ‘unimportant’.
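The median-based labeling described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `rouge1_f1` is a simplified unigram-overlap F1 on lowercased whitespace tokens (the ROUGE implementation actually used is not specified here), `label_sentences` is a hypothetical helper name, and sentences whose score equals the median are labeled ‘unimportant’ because the text leaves that case open.

```python
from collections import Counter
from statistics import median


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1: unigram-overlap F1 on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def label_sentences(sentences, reference_summary):
    """Return (sentence, score, label) triples using the median score as threshold.

    Assumption: a score equal to the median is treated as 'unimportant'.
    """
    scores = [rouge1_f1(s, reference_summary) for s in sentences]
    threshold = median(scores)
    return [
        (s, score, "important" if score > threshold else "unimportant")
        for s, score in zip(sentences, scores)
    ]


if __name__ == "__main__":
    demo = [
        "the remote control needs a new battery design",
        "okay so um yeah",
        "we agreed the budget is fixed",
    ]
    ref = "the team agreed on a battery design for the remote control within a fixed budget"
    for sentence, score, label in label_sentences(demo, ref):
        print(f"{label:12s} {score:.3f}  {sentence}")
```

In the described pipeline these labels would serve as weak supervision for fine-tuning a BERT sentence classifier; that training step is outside this sketch.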

After each sentence is labeled by the classifier, the sentences in the text are ranked according to their respective ‘importance’. The reward, $R_{i}$, is defined according to the importance of the sentence, and each sentence is weighted according to its number of tokens. Framed as a 0–1 knapsack problem, the set of sentences with the highest accumulated reward, $\sum R_{i}$, is found, subject to the total length of the set not exceeding the input length limit (L) of the summary generation model. The best set of sentences (the filtered text) is thus obtained. In addition, the filtered original text (D) and target text (T) are segmented and matched to obtain the dataset $(D_{i}, T_{i})$. Fine-tuning summary generation models with high-quality datasets leads to better models. The data filtering and segmentation are shown in Figure 2.

In Figure 2, the sentence filtering layer PureText performs data preprocessing on the original text (D) and the target text (T) to remove redundant sentences. Then, they are segmented to obtain text paragraphs $(D_{1}, D_{2}, \cdots, D_{n})$ and $(T_{1}, T_{2}, \cdots, T_{n})$, respectively.
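The knapsack-style sentence selection used by the filtering layer can be illustrated with a standard dynamic program. The sketch below is an assumption-laden example rather than the PureText code: `select_sentences` is a hypothetical name, token counts are approximated by whitespace splitting, and the rewards are taken as given numbers.

```python
def select_sentences(sentences, rewards, max_tokens):
    """0-1 knapsack: maximize total reward with total token count <= max_tokens.

    Assumption: the weight of a sentence is its whitespace token count.
    Returns the chosen sentences in their original order.
    """
    weights = [len(s.split()) for s in sentences]
    n = len(sentences)
    # best[i][c] = highest reward using the first i sentences within capacity c
    best = [[0.0] * (max_tokens + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, r = weights[i - 1], rewards[i - 1]
        for c in range(max_tokens + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip sentence i
            if w <= c and best[i - 1][c - w] + r > best[i][c]:
                best[i][c] = best[i - 1][c - w] + r  # option 2: keep it
    # Backtrack to recover which sentences were kept
    chosen, c = [], max_tokens
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    demo = ["a b c", "d e", "f g h i", "j"]
    print(select_sentences(demo, [3.0, 2.5, 4.0, 1.0], max_tokens=5))
```

The table has one row per sentence and one column per token of capacity, so with an input limit such as 1024 tokens the cost stays proportional to the number of sentences times the limit.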

3.1.2. Coarse-Grained Summary Generation

During coarse-grained summary generation, the process of keyword extraction and integration is executed iteratively N times. Keywords are utilized to aid the inference process of summarization; a multistage approach is taken to extract coarse-grained level summary information from long texts. There are five main steps in the process, which are keyword extraction, self-attention accumulation calculation, cross-attention calculation, linear combination score calculation, and keyword incorporation into summary inference.

(1): Keyword Extraction

The keyword extraction task is accomplished by using the keyword extraction module implemented in EmbedRank, which performs part-of-speech (POS) tagging of words. Words tagged NN, NNS, NNP, JJ, VB, VBD, etc., are recognized, and the keywords are obtained by phrase extraction using the NLTK package for Python. For example, for a given sentence ‘The process of creating experiments is minimized by the simple control of running the program code, achieved by editing most of the program’s parameters into a text file’, the keywords extracted from it are ‘program’s parameters’, ‘text file’, ‘program code’, ‘running’, ‘control’, ‘minimized’, and ‘process’. The initially extracted keywords must then be scored with attention, both for their own importance and for their relevance to the context of the sentence and the document, in order to obtain keywords that are of practical value for summary generation.
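As a rough illustration of tag-based phrase extraction, the sketch below groups consecutive adjective and noun tokens into candidate phrases. It is a simplification under stated assumptions: the input is already POS-tagged (for instance by NLTK's `pos_tag`), only noun phrases are handled (the verb candidates in the example above are not), and `extract_candidates` is a hypothetical helper, not the EmbedRank module itself.

```python
NOUN_TAGS = {"NN", "NNS", "NNP"}
ADJ_TAGS = {"JJ"}


def extract_candidates(tagged_tokens):
    """Collect maximal runs of adjectives/nouns that end in a noun.

    tagged_tokens: iterable of (word, Penn Treebank tag) pairs.
    """
    candidates, run = [], []
    # A sentinel token flushes the last run at the end of the sentence.
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag in NOUN_TAGS or tag in ADJ_TAGS:
            run.append((word, tag))
            continue
        # The run has ended: drop trailing adjectives so the phrase ends in a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            candidates.append(" ".join(w for w, _ in run))
        run = []
    return candidates


if __name__ == "__main__":
    tagged = [
        ("the", "DT"), ("simple", "JJ"), ("control", "NN"), ("of", "IN"),
        ("running", "VBG"), ("the", "DT"), ("program", "NN"), ("code", "NN"),
    ]
    print(extract_candidates(tagged))
```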

(2): Self-Attention Accumulation Calculation

In the process of calculating self-attention accumulation for keywords, the input text segments $(D_{1}, D_{2}, \cdots, D_{n})$ are lexically annotated and phrase marked to obtain a keyword set, $K$. The pre-trained BERT is utilized to extract $w_{i}^{k}$, which denotes the word self-attention weight of the keyword, $k$, in the $i$-th sentence. The accumulated self-attention weight, $a_{i}^{k}$, for the keyword, $k$, in the sentence, $i$, is then calculated. Finally, the self-attention of the keywords in each sentence is summed to obtain the document-level self-attention value, $a_{k}$, for the keyword, $k$. The self-attention accumulation calculation is shown in Figure 3.

In Figure 3, $a_{w}$ is obtained from the pre-trained BERT and denotes the value of the self-attention weight of a word, $w$, in the sentence; the equation is shown below:

$$a_{w} = \sum_{w' \in s} a_{w' w} \tag{1}$$

where $a_{w' w}$ denotes the attention weight between word $w$ and another word, $w'$, in the same sentence, $s$; summing over all $w'$ gives $a_{w}$.

Let $a_{i}^{k}$ denote the sum of the self-attention weight values of keyword $k$ in sentence $i$. It is obtained by summing the self-attention weight values of the words in $k$; the equation is shown below:

$$a_{i}^{k} = \sum_{w \in k} a_{w} \tag{2}$$

The document-level self-attention weight value, $a_{k}$, of keyword $k$ is the sum of the keyword weight values in all sentences; the equation is shown below:

$$a_{k} = \sum_{i \in d} a_{i}^{k} \tag{3}$$

where $d$ denotes the document and $i$ denotes the $i$-th sentence in the document.

The importance of the keywords is taken into account by calculating the self-attention accumulation of the keywords. However, the evaluation of the keywords is multifaceted and needs to be considered regarding the relevance of the keywords to the context of the sentence and the document.
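Equations (1)–(3) can be traced on a toy example. The sketch below assumes that each sentence comes with a single square attention matrix whose entry `attn[r][c]` is the weight that token `r` assigns to token `c` (for example a BERT attention map averaged over heads, which is an assumption), and that repeated occurrences of a keyword in one sentence are summed. The function names are hypothetical.

```python
def word_attention(attn):
    """Eq. (1): a_w is the column sum, i.e. attention received by each token."""
    n = len(attn)
    return [sum(attn[r][c] for r in range(n)) for c in range(n)]


def keyword_attention_in_sentence(tokens, attn, keyword_tokens):
    """Eq. (2): sum a_w over the tokens of every occurrence of the keyword."""
    a_w = word_attention(attn)
    m = len(keyword_tokens)
    total = 0.0
    for start in range(len(tokens) - m + 1):
        if tokens[start:start + m] == keyword_tokens:
            total += sum(a_w[start:start + m])
    return total


def keyword_attention_in_document(sentences, keyword_tokens):
    """Eq. (3): sum the sentence-level values over all sentences.

    sentences: list of (tokens, attention_matrix) pairs.
    """
    return sum(
        keyword_attention_in_sentence(tokens, attn, keyword_tokens)
        for tokens, attn in sentences
    )


if __name__ == "__main__":
    sentence_1 = (
        ["text", "file", "is", "small"],
        [[0.4, 0.3, 0.2, 0.1],
         [0.2, 0.5, 0.2, 0.1],
         [0.3, 0.3, 0.3, 0.1],
         [0.1, 0.4, 0.1, 0.4]],
    )
    sentence_2 = (["a", "file"], [[0.5, 0.5], [0.2, 0.8]])
    print(keyword_attention_in_document([sentence_1, sentence_2], ["text", "file"]))
```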

(3): Cross-Attention Calculation

To calculate the relevance scores of keywords to sentence context and documents, the network architecture based on the Hierarchical Attention Retrieval (HAR) model and the Bidirectional Attention model is used to design the cross-attention structure. It calculates the relevance between keywords and documents based on the context. A mixture of self-attention and cross-attention is used. The cross-attention calculation is shown in Figure 4.

In Figure 4, keyword $k$ containing $m$ words is denoted as $E^{k} = \{e_{1}^{k}, \dots, e_{m}^{k}\}$, with $e_{i} \in \mathbb{R}^{H}$ denoting the embedding of $w_{i}$. Similarly, sentence $i$ containing $n$ words is denoted as $E^{i} = \{e_{1}^{i}, \dots, e_{n}^{i}\}$. Calculating document embeddings through cross-attention provides a better measure of the contextual relevance between keywords and sentences in a document.

The equation of the similarity matrix, $S \in \mathbb{R}^{n \times m}$, between sentence $i$ and keyword $k$ is shown below:

$$S = E^{i} \cdot (E^{k})^{T} \tag{4}$$

where $E^{i} \in \mathbb{R}^{n \times H}$ denotes sentence $i$ and $E^{k} \in \mathbb{R}^{m \times H}$ denotes keyword $k$.

The word-based sentence-to-keyword similarity, $\bar{S}_{i2k}$, and keyword-to-sentence similarity, $\bar{S}_{k2i}$, are calculated as shown in Equations (5) and (6), respectively:

$$\bar{S}_{i2k} = \mathrm{softmax}_{\mathrm{row}}(S) \tag{5}$$

$$\bar{S}_{k2i} = \mathrm{softmax}_{\mathrm{column}}(S) \tag{6}$$

The word-based cross-attention weights from sentence to keyword, $A_{i2k}$, and from keyword to sentence, $A_{k2i}$, are calculated as shown in Equations (7) and (8), respectively:

$$A_{i2k} = \bar{S}_{i2k} \cdot E^{k} \tag{7}$$

$$A_{k2i} = \bar{S}_{i2k} \cdot \bar{S}_{k2i}^{T} \cdot E^{i} \tag{8}$$

Based on the cross-attention weights, the new sentence representation, $V^{i}$, for sentence $i$ is obtained by averaging; the equation is shown below:

$$V^{i} = \mathrm{AVG}(E^{i}, A_{i2k}, E^{i} \odot A_{i2k}, E^{i} \odot A_{k2i}) \tag{9}$$

where $E^{i}$ denotes the original context of the sentence; $A_{i2k}$, $E^{i} \odot A_{i2k}$, and $E^{i} \odot A_{k2i}$ measure the contextual relevance between the sentence and the keywords; and $\odot$ denotes element-by-element multiplication. The new sentence representation is obtained by averaging the four terms above, incorporating word-based relationships between keywords and sentences.

The new sentence representation containing $n$ words is $V^{i} = \{v_{1}^{i}, \dots, v_{n}^{i}\}$, and self-attention is applied to $V^{i}$ to highlight the importance of the words. The average of all columns is calculated to obtain the final representation of the sentence, $\alpha^{i} \in \mathbb{R}^{H}$. The self-attention and $\alpha^{i}$ of sentence $i$ are shown in Equations (10) and (11), respectively:

$$I = \mathrm{softmax}_{\mathrm{row}}(V^{i} \cdot (V^{i})^{T}) \cdot V^{i} \tag{10}$$

$$\alpha^{i} = \mathrm{AVE}(I[:, i]) \tag{11}$$

A document, $d$, containing a set of sentences is denoted as $E^{d} = \{\alpha^{1}, \dots, \alpha^{i}\}$, and self-attention is applied to $E^{d}$. The average of all columns is calculated as the final document embedding, $p^{d} \in \mathbb{R}^{H}$, and the self-attention and $p^{d}$ of document $d$ are shown in Equations (12) and (13), respectively:

$$P = \mathrm{softmax}_{\mathrm{row}}(E^{d} \cdot (E^{d})^{T}) \cdot E^{d} \tag{12}$$

$$p^{d} = \mathrm{AVE}(P[:, i]) \tag{13}$$

The set of keyword embeddings is denoted as $E^{k} = \{e_{1}^{k}, \dots, e_{m}^{k}\}$, self-attention is applied to $E^{k}$, and the average of all the columns is calculated to obtain the final keyword embedding, $p^{k} \in \mathbb{R}^{H}$. The self-attention and $p^{k}$ for keyword $k$ are shown in Equations (14) and (15), respectively:

$$K = \mathrm{softmax}_{\mathrm{row}}(E^{k} \cdot (E^{k})^{T}) \cdot E^{k} \tag{14}$$

$$p^{k} = \mathrm{AVE}(K[:, i]) \tag{15}$$

The equation of the relevance value, $r_{k}$, between keyword $k$ and document $d$ is shown below:

$$r_{k} = \frac{p^{k} \cdot p^{d}}{\| p^{k} \| \cdot \| p^{d} \|} \tag{16}$$

where $r_{k}$ is the cosine similarity between $p^{k}$ and $p^{d}$. It takes into account both the interrelationship between the keywords and the context of the sentence and the relevance between the keywords and the document, weighing the importance of the keywords from multiple perspectives.
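To make the chain from Equation (4) to Equation (16) concrete, the following sketch runs it on small list-of-lists matrices in plain Python. It is an illustrative reading, not the authors' code: the pooling written as AVE in Equations (11), (13), and (15) is interpreted here as the mean over rows, which yields a vector in $\mathbb{R}^{H}$ (an assumption), and all function names are hypothetical. In practice the embeddings would come from a pre-trained encoder rather than hand-written numbers.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    out = []
    for row in a:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def softmax_cols(a):
    return transpose(softmax_rows(transpose(a)))


def hadamard(a, b):
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def self_attend_and_pool(x):
    """softmax_row(X X^T) X followed by a mean over rows (assumed reading of AVE)."""
    attended = matmul(softmax_rows(matmul(x, transpose(x))), x)
    return [sum(col) / len(attended) for col in zip(*attended)]


def sentence_vector(e_i, e_k):
    """Eqs. (4)-(11): keyword-aware representation of one sentence."""
    s = matmul(e_i, transpose(e_k))                        # Eq. (4), n x m
    s_i2k = softmax_rows(s)                                # Eq. (5)
    s_k2i = softmax_cols(s)                                # Eq. (6)
    a_i2k = matmul(s_i2k, e_k)                             # Eq. (7), n x H
    a_k2i = matmul(matmul(s_i2k, transpose(s_k2i)), e_i)   # Eq. (8), n x H
    terms = [e_i, a_i2k, hadamard(e_i, a_i2k), hadamard(e_i, a_k2i)]
    v = [[sum(vals) / 4 for vals in zip(*rows)] for rows in zip(*terms)]  # Eq. (9)
    return self_attend_and_pool(v)                         # Eqs. (10)-(11)


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def relevance(sentence_embeddings, e_k):
    """Eqs. (12)-(16): cosine similarity of keyword and document embeddings."""
    e_d = [sentence_vector(e_i, e_k) for e_i in sentence_embeddings]
    p_d = self_attend_and_pool(e_d)   # Eqs. (12)-(13)
    p_k = self_attend_and_pool(e_k)   # Eqs. (14)-(15)
    return cosine(p_k, p_d)           # Eq. (16)


if __name__ == "__main__":
    sentences = [
        [[1.0, 0.0, 0.5], [0.2, 0.8, 0.1]],
        [[0.3, 0.3, 0.9], [0.6, 0.1, 0.2], [0.4, 0.4, 0.4]],
    ]
    keyword = [[0.9, 0.1, 0.4], [0.1, 0.7, 0.2]]
    print(relevance(sentences, keyword))
```

The shapes are a useful check: $S$ is $n \times m$, both $A_{i2k}$ and $A_{k2i}$ are $n \times H$, and every pooled vector has length $H$, so the cosine in Equation (16) is well defined.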

(4): Linear Combination Score Calculation

After the self-attention accumulation calculation and the cross-attention calculation for the preliminarily extracted keywords, both the importance of the keywords and the semantic relevance between the keywords and the document sentences are taken into account. The final score, $s_{k}$, of each keyword is calculated by the linear combination method; the equation is shown below:

$$s_{k} = d \cdot a_{k} + (1 - d) \cdot r_{k} \tag{17}$$

where $d \in [0, 1]$ is the combination weight, $a_{k}$ is the normalized accumulated self-attention value, and $r_{k}$ is the cross-attention-based relevance value, each calculated for every keyword. The keywords are then further evaluated based on their scores, and a portion of the useless keywords is eliminated. This integrates the keywords into summary inference more effectively, improving the quality of the summary.
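A small numeric sketch of Equation (17) follows. Two details are assumptions made only for illustration: the self-attention values are normalized with min-max scaling (the normalization method is not stated in the text), and the weight `d=0.5` is an arbitrary example value. `combine_scores` is a hypothetical helper name.

```python
def combine_scores(self_attention, relevance, d=0.5):
    """Eq. (17): s_k = d * a_k + (1 - d) * r_k for every keyword.

    self_attention: {keyword: accumulated self-attention a_k (unnormalized)}
    relevance:      {keyword: cross-attention relevance r_k}
    Assumption: a_k is min-max normalized to [0, 1] before combining.
    """
    lo, hi = min(self_attention.values()), max(self_attention.values())
    span = hi - lo
    scores = {}
    for keyword, a in self_attention.items():
        a_norm = (a - lo) / span if span else 0.0
        scores[keyword] = d * a_norm + (1 - d) * relevance[keyword]
    return scores


if __name__ == "__main__":
    a_k = {"text file": 2.5, "process": 1.0, "control": 1.75}
    r_k = {"text file": 0.9, "process": 0.4, "control": 0.6}
    scores = combine_scores(a_k, r_k, d=0.5)
    for keyword in sorted(scores, key=scores.get, reverse=True):
        print(f"{scores[keyword]:.2f}  {keyword}")
```

Keywords at the bottom of this ranking are the ones that would be dropped before summary inference.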

(5): Keyword Incorporation into Summary Inference

In the N-coarse-grained summary generation stages, redundant text information is removed through sentence filtering and data segmentation to further streamline the text content. The obtained data samples $(D_{i}, T_{i})$ are provided to the summary generation model to obtain a coarse-grained paragraph summary, $Z_{i}^{l}$, where $l \in [1, N]$ denotes the index of the current stage; the equation is shown below:

$$Z_{i}^{l} = \text{MSMAUKE-SUMM}_{l}(D_{i}, T_{i}) \tag{18}$$

The $n$ coarse-grained paragraph summaries $(Z_{1}^{l}, Z_{2}^{l}, \cdots, Z_{n}^{l})$ are concatenated to obtain $Z^{l} = Z_{1}^{l} \oplus Z_{2}^{l} \oplus \cdots \oplus Z_{n}^{l}$, and $Z^{l}$ is used as a new data sample for the next stage.

MSMAUKE-Summ^N can be applied to summary generation models such as BART and T5 [31]. Because the BART model performs poorly on summary generation for long texts, BART is adopted as the backbone model of MSMAUKE-Summ^N in order to demonstrate the framework’s benefit for BART on long-text summarization.

3.2. 1-Fine-Grained Summary Generation Stage

When $Z^{l}$ is used as an input source, if the text length of $Z^{l}$ is less than the maximum input length, $L$, of the summary generation model, it can satisfy the requirements of the summary generation model in the 1-fine-grained stage. In the 1-fine-grained summary generation stage, the summary generation model is trained directly on the dataset $(Z^{N}, T)$ from the last coarse-grained stage.

Keyword Selection: a pre-trained encoder model with $x$ self-attentive heads $h_{1}, h_{2}, \dots, h_{x}$ is used in the 1-fine-grained summary generation stage. Each self-attentive head, $h_{i}$, attends to a different part of the input text [32]. The heads are divided into $y$ clusters, $C = \{c_{1}, c_{2}, \dots, c_{y}\}$, so there are $g = \frac{x}{y}$ heads per cluster. The heads are clustered using a sequential approach, and the $m$ most popular keywords are identified from each head. Since a keyword may receive high attention from more than one head, the sets of keywords obtained by different heads may overlap. To address overlapping keywords, up to $g \cdot m$ keywords are obtained from each cluster. Assume that the total number of keywords used for the coarse-grained summary inference process is $n$. Of these, $r$ keywords are retained as topic keywords, and the remaining $n - r$ keywords are non-overlapping keywords with less attention. $r$ is a hyperparameter, and statistically $r$ is typically 10% of $n$. As with the use of keywords in coarse-grained summarization, a portion of keywords that are more relevant to the topic is further selected for use in the inference process of fine-grained summarization.
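One possible reading of this selection procedure is sketched below. Several points are assumptions introduced only for illustration and are not confirmed by the text: each head is represented by a dictionary of per-keyword attention scores, the number of heads is divisible by the number of clusters, and the $r$ topic keywords are chosen as those selected by the most clusters, with ties broken alphabetically. `select_topic_keywords` is a hypothetical name.

```python
def select_topic_keywords(head_scores, num_clusters, top_m, ratio=0.10):
    """Pick r topic keywords from per-head attention scores.

    head_scores: list with one {keyword: attention score} dict per head.
    Assumptions: heads are clustered sequentially in equal-sized groups, and
    keywords are ranked by how many clusters selected them.
    """
    x = len(head_scores)
    g = x // num_clusters  # heads per cluster (assumes x is divisible by y)
    votes = {}
    for c in range(num_clusters):
        cluster_keywords = set()
        for head in head_scores[c * g:(c + 1) * g]:  # sequential clustering
            top = sorted(head, key=head.get, reverse=True)[:top_m]
            cluster_keywords.update(top)  # at most g * m keywords after deduplication
        for keyword in cluster_keywords:
            votes[keyword] = votes.get(keyword, 0) + 1
    n = len({keyword for head in head_scores for keyword in head})
    r = max(1, round(ratio * n))  # r is a hyperparameter, about 10% of n
    ranked = sorted(votes, key=lambda keyword: (-votes[keyword], keyword))
    return ranked[:r]


if __name__ == "__main__":
    heads = [
        {"battery": 0.9, "budget": 0.5, "remote": 0.3, "menu": 0.1},
        {"battery": 0.7, "remote": 0.6, "budget": 0.2, "menu": 0.1},
        {"menu": 0.8, "battery": 0.6, "budget": 0.3, "remote": 0.1},
        {"budget": 0.9, "menu": 0.5, "battery": 0.2, "remote": 0.1},
    ]
    # ratio=0.5 is used here only because the toy vocabulary has four keywords.
    print(select_topic_keywords(heads, num_clusters=2, top_m=2, ratio=0.5))
```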

Keywords are utilized to assist the inference process of the summary generation model. A fine-grained summary, $F$, is obtained as the final output of MSMAUKE-Summ^N; the equation is shown below:

$$F = \text{MSMAUKE-SUMM}_{N + 1}(Z^{N}, T) \tag{19}$$

where $(Z^{N}, T)$ denotes the dataset of the last coarse-grained stage, which the MSMAUKE-Summ^N framework uses to fine-tune the summary generation model in the (N + 1)-th stage to obtain the final fine-grained summary, $F$.

The pseudo-code for MSMAUKE-Summ^N is shown in Algorithm 1:

Algorithm 1 MSMAUKE-Summ^N pseudo-code

Input: $D$, $T$, $Lr$, Coarse-grained beam width, Fine-grained beam width, Input_max_token.
Output: $F$

for $l$ in range(1, N + 1) do
    PureText($D$, $T$)
    Summary Generation Model $\leftarrow$ ROUGE($D_{i}$, $T_{i}$)
    keywords $\Leftarrow$ NLTK($D$)
    $a_{k}$ $\Leftarrow$ Accumulation(keywords)$_{\text{Self-attention}}$
    $r_{k}$ $\Leftarrow$ Relevance(keywords, sentence, document)$_{\text{Mixed-attention}}$
    $s_{k}$ $\Leftarrow$ Linear_combination_score($a_{k}$, $r_{k}$)
    $Z_{i}^{l}$ $\Leftarrow$ MSMAUKE-SUMM$_{l}$($D_{i}$, $T_{i}$)
    $(Z_{1}, Z_{2}, \cdots, Z_{n})$ = $Z_{i}^{l}$
    $Z$ $\Leftarrow$ Tandem($Z_{1}, Z_{2}, \cdots, Z_{n}$)
    Summary Generation Model $\leftarrow$ ROUGE($Z_{i}$, $T_{i}$)
end for
$F$ $\Leftarrow$ MSMAUKE-SUMM$_{N + 1}$($Z^{N}$, $T$)
return $F$

Streamlined data samples are obtained by the sentence filtering process. Mixed-attention keyword extraction is utilized to enrich the global semantic information of coarse-grained summaries. The self-attentive keyword selection module is designed and keywords are selected to assist summary inference to enhance the local semantic representation of fine-grained summaries. An accurate, comprehensive, and readable fine-grained summary, $F$, is obtained.
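The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the N + 1 stage structure: the function names `split_into_segments`, `multistage_summarize`, and `lead_k` are hypothetical, the stand-in summarizer simply keeps the first k words, and sentence filtering, keyword extraction, and fine-tuning are all omitted.

```python
from typing import Callable, List


def split_into_segments(words: List[str], max_tokens: int) -> List[List[str]]:
    """Cut a token list into consecutive segments of at most max_tokens."""
    return [words[i:i + max_tokens] for i in range(0, len(words), max_tokens)]


def multistage_summarize(
    text: str,
    summarize: Callable[[str], str],
    n_coarse: int,
    max_tokens: int = 1024,
) -> str:
    """Run n_coarse coarse-grained stages followed by one fine-grained stage."""
    current = text
    for _ in range(n_coarse):
        segments = split_into_segments(current.split(), max_tokens)
        # Each segment is summarized on its own (Z_1 ... Z_n in Algorithm 1).
        partial = [summarize(" ".join(segment)) for segment in segments]
        # The partial summaries are concatenated (the "Tandem" step).
        current = " ".join(partial)
    # Stage N + 1: a single pass over the last coarse-grained summary.
    return summarize(current)


def lead_k(k: int) -> Callable[[str], str]:
    """Stand-in summarizer (an assumption, NOT a trained model): first k words."""
    return lambda s: " ".join(s.split()[:k])


document = " ".join(f"w{i}" for i in range(100))
final = multistage_summarize(document, lead_k(5), n_coarse=2, max_tokens=20)
print(final)
```

In a real system, `summarize` would wrap a fine-tuned sequence-to-sequence model, and `n_coarse` would be chosen per dataset so that the input of the last stage fits within the model's input limit.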

4. Experimental Results and Analysis

4.1. Experimental Setup and Dataset

In this study, the experimental environment is configured on a cloud server with a single NVIDIA RTX A5000 (NVIDIA, Santa Clara, CA, USA) with 24 GB of video memory. Python 3.7, PyTorch 1.8.1, CUDA 11.1, and fairseq 0.10.0 are utilized for the experiments.

The experimental hyperparameter settings are shown in Table 1.

In Table 1, Lr denotes the learning rate, which is set to 2 × 10⁻⁵. The coarse-grained beam width and fine-grained beam width denote the beam widths used in the coarse-grained and fine-grained stages, which are set to 2 and 10, respectively. Input_max_token denotes the maximum number of tokens in the original text, which is set to 1024.
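For reproducibility, the settings of Table 1 can be collected in a single configuration object. The sketch below copies the four values from Table 1; the class name, the field names, and the `beam_width` helper are illustrative assumptions and are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Values copied from Table 1; all names here are hypothetical."""
    learning_rate: float = 2e-5
    coarse_beam_width: int = 2
    fine_beam_width: int = 10
    input_max_token: int = 1024

    def beam_width(self, stage: int, n_coarse: int) -> int:
        """Return the beam width for a 1-indexed stage out of n_coarse + 1."""
        if not 1 <= stage <= n_coarse + 1:
            raise ValueError("stage must lie in 1..n_coarse + 1")
        # Stages 1..N are coarse-grained; stage N + 1 is fine-grained.
        if stage <= n_coarse:
            return self.coarse_beam_width
        return self.fine_beam_width


config = ExperimentConfig()
print(config.beam_width(stage=1, n_coarse=2), config.beam_width(stage=3, n_coarse=2))
```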

In addition, the BART-large model pre-trained on the CNN/DM dataset has been reported to have the best summary generation capability [33]. Therefore, in this paper, the parameters of BART-large trained on the CNN/DM dataset are used as initialization parameters.

Both AMI and ICSI consist of meeting transcripts generated by automatic speech recognition systems. AMI is collected from a company’s product design meetings, and ICSI is collected from an academic group’s meetings. QMSum [34] is a query-based meeting summarization dataset in which each query and example are written by an expert. SummScreen [35] is a dataset consisting of transcripts of TV episodes from TVMegaSite (TMS) and ForeverDream (FD); each transcript is paired with a summary from its source, TMS or FD. GovReport [36] is a large-scale long-document summarization dataset containing 19,466 long reports issued by the U.S. Government Accountability Office on national policy issues. These datasets are selected because they all belong to the long-text summarization type. They also cover separate domains, such as company meetings, academic meetings, TV shows, and government reports, which helps to evaluate and improve the generalization ability of the MSMAUKE-Summ^N model.

The statistics of the dataset used for the experiment are shown in Table 2.

The original text length and target text length are averaged over the entire dataset. N denotes the number of coarse-grained summary generation stages, so the N + 1 column gives the total number of stages.

4.2. Evaluation Metrics

ROUGE is used as an evaluation metric in this study, which is mainly categorized into ROUGE-1 (hereinafter referred to as R-1), ROUGE-2 (hereinafter referred to as R-2), and ROUGE-L (hereinafter referred to as R-L). R-1 is usually used as the standard evaluation metric because higher scores on the R-1 metric usually imply higher scores on the R-2 and R-L metrics as well. Meanwhile, R-1 has a lower time complexity compared to other ROUGE metrics.

R-1 and R-2 are the unigram and bigram cases of R-N, which is defined as follows:

$$\text{R-N} = \frac{\sum_{R \in \{\text{ReferenceSummaries}\}} \sum_{gram_{n} \in R} Count_{match}(gram_{n})}{\sum_{R \in \{\text{ReferenceSummaries}\}} \sum_{gram_{n} \in R} Count(gram_{n})} \tag{20}$$

where $n$ represents the length of the $gram_{n}$, $Count_{match}(gram_{n})$ denotes the maximum number of $gram_{n}$ that appear in both the candidate and reference summaries, $Count(gram_{n})$ denotes the number of $gram_{n}$ in the reference summary, and $R$ denotes the reference summary.
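Equation (20) can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and lowercasing, which is a simplification; published ROUGE scores are normally produced with a dedicated toolkit that also handles stemming and other preprocessing. The function names are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, references: List[str], n: int) -> float:
    """Recall-oriented ROUGE-N as in Equation (20)."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for reference in references:
        ref_counts = ngrams(reference.lower().split(), n)
        # Count_match: an n-gram is matched at most as often as it
        # occurs in the candidate (clipped co-occurrence count).
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        # Count: all n-grams of the reference summary.
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


candidate = "the cat sat on the mat"
references = ["the cat lay on the mat"]
r1 = rouge_n(candidate, references, 1)  # 5 of 6 reference unigrams matched
r2 = rouge_n(candidate, references, 2)  # 3 of 5 reference bigrams matched
print(round(r1, 4), round(r2, 4))
```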

For R-L, L stands for the longest common subsequence (LCS). R-L is calculated from $R_{lcs}$, $P_{lcs}$, and $F_{lcs}$, where $R_{lcs}$ and $P_{lcs}$ denote recall and precision, respectively, while $F_{lcs}$ denotes R-L. R-L is defined by Equations (21)–(23) as follows:

$$R_{lcs} = \frac{LCS(X, Y)}{m} \tag{21}$$

$$P_{lcs} = \frac{LCS(X, Y)}{n} \tag{22}$$

$$F_{lcs} = \frac{(1 + \beta^{2}) R_{lcs} P_{lcs}}{R_{lcs} + \beta^{2} P_{lcs}} \tag{23}$$

where $LCS(X, Y)$ is the length of the longest common subsequence of $X$ and $Y$, and $m$ and $n$ denote the lengths of the reference summary $X$ and the automatically generated summary $Y$, respectively. Since $\beta$ is a very large number, R-L considers $R_{lcs}$ almost exclusively.
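Equations (21)–(23) can be checked with a short Python sketch. It assumes whitespace tokenization and computes the LCS length with standard dynamic programming; the function names and the example sentences are illustrative. The last lines show numerically that $F_{lcs}$ approaches $R_{lcs}$ when $\beta$ is large.

```python
from typing import Sequence, Tuple


def lcs_length(x: Sequence[str], y: Sequence[str]) -> int:
    """Length of the longest common subsequence, using O(len(y)) memory."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str, beta: float) -> Tuple[float, float, float]:
    """Return (R_lcs, P_lcs, F_lcs) following Equations (21)-(23)."""
    x, y = reference.lower().split(), candidate.lower().split()
    if not x or not y:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(x, y)
    recall = lcs / len(x)      # Equation (21), m = len(x)
    precision = lcs / len(y)   # Equation (22), n = len(y)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    f_score = ((1 + beta ** 2) * recall * precision) / (recall + beta ** 2 * precision)
    return recall, precision, f_score


reference = "the cat lay on the mat"
candidate = "the cat sat on the mat today"
r, p, f_balanced = rouge_l(reference, candidate, beta=1.0)
_, _, f_large_beta = rouge_l(reference, candidate, beta=1000.0)
print(r, p, f_balanced, f_large_beta)
```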

4.3. Ablation Experiments

To better validate the usefulness of the modules, the following ablation experiments are designed on the relatively long dataset GovReport. Only the sentence filtering layer PureText is added in Experiment 1. Only the mixed-attention keyword extraction module is incorporated to aid in summary inference in Experiment 2. The sentence filtering layer PureText and the mixed-attention keyword extraction module are added in Experiment 3. The mixed-attention keyword extraction module and the self-attention keyword selection module are incorporated in Experiment 4. The sentence filtering layer PureText, the mixed-attention keyword extraction module, and the self-attention keyword selection module are added in Experiment 5. The ablation experiment module selection is shown in Table 3.

The results of ablation Experiments 1, 2, 3, 4, and 5 on the three metrics R-1, R-2, and R-L are shown in Table 4.

As can be seen from Table 4, adding the sentence filtering layer PureText (Experiment 1) or the mixed-attention keyword extraction module (Experiment 2) alone improves the R-1, R-2, and R-L metrics compared to Summ^N. The improvement is more obvious for the latter, with gains of 0.46, 0.92, and 0.81 percentage points, respectively. Sentence filtering provides a streamlined data sample but does not process keywords, so its improvement in the summary generation metrics is relatively small. The mixed-attention mechanism, in contrast, makes self-attention and cross-attention interact and fully attends to keywords at the sentence and document levels. This is more beneficial for capturing the important information of the summary and for improving its readability and conciseness, so the improvement is more obvious.

In addition, Experiment 3 shows that when the sentence filtering layer PureText and the mixed-attention keyword extraction module are included simultaneously, they improve the quality of summary generation in a mutually reinforcing way. Experiment 4 shows that adding the mixed-attention keyword extraction module together with the self-attention keyword selection module improves the three metrics by 0.68, 1.27, and 1.12 percentage points, respectively, relative to Summ^N. The reason is that a large number of keywords are extracted initially; without filtering by methods such as attention, they contain many duplicates and are not directly useful for summarization inference. The mixed-attention weighting of keywords in a document is calculated implicitly and takes repetition into account. The semantic relevance between keywords and sentences in the document is also considered, which enriches the global semantic information of the summary. The keyword selection module, in turn, finds keywords with higher weights and uses them to assist summary inference. This strengthens the ability of the summary generation model to represent detailed information and enhances the local semantic representation of the summary.

Experiment 5 shows that the gains contributed by the three modules to the ROUGE metrics are cumulative. Sentence filtering reduces redundant information in the data samples, which reduces the redundancy of the summary and improves the summary generation model. The attention-based modules help the model obtain important keywords more comprehensively, improving the accuracy, comprehensiveness, and readability of the summary. Therefore, the attention mechanisms lead to the greater enhancement.
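The gains quoted in this subsection can be reproduced directly from the reported scores. The snippet below uses the GovReport scores of Summ^N from Table 7 as the baseline and the ablation scores from Table 4; only the variable and function names are introduced here for illustration.

```python
from typing import Dict, Tuple

Scores = Tuple[float, float, float]  # (R-1, R-2, R-L) in percent

# Summ^N on GovReport, copied from Table 7.
SUMM_N: Scores = (56.77, 23.25, 53.90)

# Ablation results, copied from Table 4.
ABLATION: Dict[int, Scores] = {
    1: (56.87, 23.56, 54.12),
    2: (57.23, 24.17, 54.71),
    3: (57.35, 24.33, 54.85),
    4: (57.45, 24.52, 55.02),
    5: (57.52, 24.73, 55.15),
}


def gains(scores: Scores, baseline: Scores = SUMM_N) -> Scores:
    """Absolute gain over the baseline, in percentage points."""
    return tuple(round(s - b, 2) for s, b in zip(scores, baseline))


deltas = {experiment: gains(scores) for experiment, scores in ABLATION.items()}
for experiment, delta in deltas.items():
    print(experiment, delta)
```

Each experiment's gain is at least as large as that of the previous one on all three metrics, which is what the cumulative effect described above amounts to.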

4.4. Comparison Experiments

On the AMI, ICSI, and QMSum datasets, the improved modeling framework MSMAUKE-Summ^N is compared with the following comparison models in this paper:

HMNet [15]: A hierarchical attention structure and cross-domain pretraining are used by HMNet to extract meeting summaries, and a novel abstract summarization network is constructed to adapt to meeting scenarios. Role vectors are designed to cope with the semantic structure and style of different meeting records, and long meeting records are accommodated through a hierarchical structure.
TextRank [37]: A graph ranking approach is utilized by TextRank to consider the importance of each node’s information. It recursively calculates the importance weight value of each node in the relationship graph from the global semantic information. It counts and ranks important node information for keyword extraction and text summarization.
HAT-BART [16]: Layered attention is applied by HAT-BART to Transformer. Visual Hierarchical Encoder–Decoder Attention is designed to overcome the uneven distribution of important information in long texts. Hierarchical Attention Transformer’s (HAT) architecture is used to adapt to longer text inputs.
DDAMS [38]: Relational maps are used by DDAMS to model different discourse relationships; the interactions between discourses in a meeting are explicitly modeled. A relational graph encoder module is constructed, and graph interaction is used to model the interactions between discourses. The semantic relationships and information structure in the summaries are clarified.
Summ^N [17]: The data samples are segmented by Summ^N. It generates coarse-grained summaries through multiple stages, based on which fine-grained summaries are generated. The framework can handle input text of any length by adjusting the number of stages.
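To make the graph ranking idea behind the TextRank baseline concrete, the sketch below scores words with a PageRank-style iteration over an undirected co-occurrence graph. It is a simplified illustration: the original method also filters candidate words by part of speech and iterates until convergence, both of which are omitted here, and the function name, window size, and fixed iteration count are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set


def textrank_keywords(
    tokens: List[str],
    window: int = 2,
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Score words by iterating PageRank on a co-occurrence graph."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it inside the window.
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each node receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[u] / len(neighbors[u]) for u in neighbors[word])
            for word in neighbors
        }
    return scores


tokens = "meeting summary model meeting keyword model meeting summary".split()
scores = textrank_keywords(tokens)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Words that co-occur with many different neighbors ("meeting" and "model" in this toy input) receive higher scores than words with fewer connections.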

Table 5 shows the ROUGE metric values of the MSMAUKE-Summ^N model and of HMNet, TextRank, HAT-BART, DDAMS, and Summ^N on the AMI, ICSI, and QMSum datasets.

For the SummScreen dataset, the neural model Longformer + ATT and the hybrid model NN + BM25 + Neural proposed by Chen et al. [35] are used as comparison models. For the GovReport dataset, the comparison models fall into two main categories. The first is the BART variants, which are encoder self-attention variants with full encoder–decoder attention. The second is BART HEPOS, which denotes encoder variants combined with head-wise positional strides (HEPOS) encoder–decoder attention. The ROUGE metric values of MSMAUKE-Summ^N and the other models on the SummScreen and GovReport datasets are shown in Table 6 and Table 7, respectively.

As can be seen in Figure 5, the model improves markedly after the sentence filtering layer, the mixed-attention keyword extraction module, and the self-attention keyword selection module are added. The improvement is more obvious in Figure 5b than in Figure 5a because the ICSI transcripts are relatively long (Table 2) and provide more information for summary generation. Meanwhile, MSMAUKE-Summ^N outperforms models such as Summ^N in all categories of ROUGE because sentence filtering of long texts yields high-quality data samples, which helps to improve the quality of summary generation. The mixed-attention keyword extraction module pays full attention to the importance of keywords at the sentence level and document level. The extracted keywords are used in the coarse-grained summary generation stages, which enhances the inference process of summary generation and improves the information extraction ability in the coarse-grained stages. In addition, the self-attentive keyword selection module uses keywords with higher weights in the inference process of fine-grained summary generation, which further improves the ability of the summary generation model to summarize detailed and focused information.

As can be seen in Figure 6, on SummScreen, a larger-volume dataset in the TV series genre, the MSMAUKE-Summ^N model improves on each of the ROUGE metrics over the other comparison models. However, computational resource constraints make parameter optimization more difficult, so it is somewhat harder for the model to improve the metrics on larger datasets. The results still illustrate that MSMAUKE-Summ^N with BART as the backbone model has a better summary generation ability.

As can be seen in Figure 7, on the document dataset GovReport, the semantic relevance between keywords and the context of sentences and documents is fully considered due to the mixed-attention mechanism. The keywords are considered from multiple perspectives, and the importance of the keywords themselves as well as their semantic relevance to the documents are combined to assist summary inference. The MSMAUKE-Summ^N model has a better summary generation ability compared to other comparison models when dealing with document-like datasets.

To further demonstrate the performance of the MSMAUKE-Summ^N model relative to other models at different text lengths, we truncate the text to different lengths and test summary generation on each. The performance difference between the models can be seen clearly from the summary generation time. The results are shown in Figure 8.

In Figure 8, the vertical axis shows time in seconds. Text paragraphs of different lengths, such as 500, 1000, and 1500 words, are truncated from the document dataset GovReport, and summary generation is timed on each of them. For shorter texts of 500 words, the time taken by each model to generate a summary is very close; as the text length grows, the performance difference between the models gradually widens. This is because longer text increasingly restricts a model’s ability to generate summaries, whereas the MSMAUKE-Summ^N model can flexibly adjust the number of stages to accommodate different text lengths and effectively takes keywords into account. Therefore, the MSMAUKE-Summ^N model can outperform other models in terms of performance and the quality of summary generation.
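A timing test of this kind can be set up as in the sketch below. It is a generic measurement harness, not the authors' benchmark code: `time_summarizer` is a hypothetical helper, the summarizer passed to it is a trivial placeholder, and taking the best of several repeats is a design choice assumed here to reduce noise.

```python
import time
from typing import Callable, Dict, Iterable


def time_summarizer(
    summarize: Callable[[str], str],
    text: str,
    lengths: Iterable[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Time summarize() on the first `length` words of text, in seconds."""
    words = text.split()
    timings: Dict[int, float] = {}
    for length in lengths:
        excerpt = " ".join(words[:length])
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            summarize(excerpt)
            # Keep the fastest run to reduce the effect of system noise.
            best = min(best, time.perf_counter() - start)
        timings[length] = best
    return timings


# Placeholder summarizer (an assumption): keeps the first 50 words.
def placeholder(text: str) -> str:
    return " ".join(text.split()[:50])


sample = " ".join(f"word{i}" for i in range(2000))
timings = time_summarizer(placeholder, sample, lengths=(500, 1000, 1500))
for length, seconds in timings.items():
    print(f"{length} words: {seconds:.6f} s")
```

For GPU models, the device would also need to be synchronized before each clock reading so that asynchronous kernels are included in the measurement.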

At different text lengths, the models differ considerably in both the time needed to process the text and the quality of the generated summaries. For text of the same length, the MSMAUKE-Summ^N model takes less time than HAT-BART, DDAMS, and other models, and the gap becomes more obvious as the text grows longer. This demonstrates the advantage of the MSMAUKE-Summ^N model in summarizing long text. Moreover, the MSMAUKE-Summ^N model can adjust the number of generation stages to adapt to different text lengths, whereas models such as HAT-BART have a more complex hierarchical structure that adapts less well to long-text summarization. The MSMAUKE-Summ^N model is therefore simpler and more flexible in structure than the other models. In addition, the MSMAUKE-Summ^N model is trained extensively on various types of long-text summarization datasets, which improves its adaptability to different domains and gives it better generalization than models trained on more homogeneous datasets.

5. Conclusions

Generating summaries for long texts is an important text generation task. In this paper, an N + 1 coarse–fine-grained multistage summary generation framework is constructed and a multistage mixed-attention unsupervised keyword extraction summary generation model (MSMAUKE-Summ^N) is proposed. In the N coarse-grained summary generation stages, the sentence filtering layer, PureText, is constructed to remove redundant information from the text, which improves the summary generation model’s ability to extract important information. Self-attention and cross-attention are mixed so that the importance of keywords at the sentence level and document level is fully considered, and keywords are extracted stage by stage in an unsupervised manner. This aids the inference process of coarse-grained summary generation and enriches the global semantic information of coarse-grained summaries. In the single fine-grained summary generation stage, the self-attentive keyword selection module is designed to improve the controllability of keyword usage: keywords with higher weights are incorporated into the summary inference, which enhances the local semantic representation of the fine-grained summary. The N + 1 coarse–fine-grained multistage framework adapts flexibly to texts of different lengths, and the quality of summaries is considered from both global and local semantic perspectives. The accuracy, comprehensiveness, and readability of summary information are improved, and the redundancy of summary information is reduced. The experimental results show that the MSMAUKE-Summ^N model improves on the HMNET, TextRank, HAT-BART, DDAMS, and Summ^N models in terms of the ROUGE-1, ROUGE-2, and ROUGE-L metrics; on GovReport, for example, the three metrics improve by 0.75, 1.48, and 1.25 percentage points relative to the Summ^N model (Table 7). It is demonstrated that MSMAUKE-Summ^N has a better summary generation ability for long texts or documents.

Future work will investigate how to generate summaries of multimodal data, i.e., data that combines multiple forms such as text and audio, text and image, or text and video, rather than text alone. The fusion of MSMAUKE-Summ^N with various improved convolutional neural networks will be considered in order to improve applicability to multi-domain data. In addition, the accuracy of keyword extraction remains to be improved. In subsequent research, diversified feature weighting will be considered to improve the self-attention mechanism, so that multiple features of keywords are taken into account comprehensively.

Author Contributions

D.W., P.C. and Y.Z.: conceptualization; P.C.: validation; D.W., P.C. and Y.Z. wrote the paper; D.W. and Y.Z. reviewed the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Research Projects of the Nature Science Foundation of Hebei Province (No. F2021402005), as well as the National Natural Science Foundation of China (No. 62101174).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Acknowledgments

The authors look forward to the insightful comments and suggestions of the anonymous reviewers and editors, which will go a long way towards improving the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhang, H.; Liu, X.; Zhang, J. Extractive Summarization via ChatGPT for Faithful Summary Generation. In Findings of the Association for Computational Linguistics: EMNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 3270–3278.
2. Bao, G.; Ou, Z.; Zhang, Y. GEMINI: Controlling The Sentence-Level Summary Style in Abstractive Text Summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 831–842.
3. Chen, J.; Yang, D. Structure-Aware Abstractive Conversation Summarization via Discourse and Action Graphs. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 1380–1391.
4. Yoo, C.; Lee, H. Improving Abstractive Dialogue Summarization Using Keyword Extraction. Appl. Sci. 2023, 13, 9771.
5. Chen, T.; Wang, X.; Yue, T. Enhancing Abstractive Summarization with Extracted Knowledge Graphs and Multi-Source Transformers. Appl. Sci. 2023, 13, 7753.
6. Xia, W.; Huang, H.; Gengzang, C.; Fan, Y. A review of extractive text summarisation based on unsupervised and supervised learning. Comput. Appl. 2023, 1–17.
7. Zhang, L.; Chen, Q.; Wang, W.; Deng, C.; Zhang, S.; Li, B.; Wang, W.; Cao, X. MDERank: A Masked Document Embedding Rank Approach for Unsupervised Keyphrase Extraction. In Findings of the Association for Computational Linguistics: ACL; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 396–409.
8. Sun, Y.; Qiu, H.; Zheng, Y.; Wang, Z.; Zhang, C. SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-Trained Language Model. IEEE Access 2020, 8, 10896–10906.
9. Saxena, A.; Mangal, M.; Jain, G. KeyGames: A Game Theoretic Approach to Automatic Keyphrase Extraction. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 2037–2048.
10. Leonardo, F.R.R.; Mohit, B.; Markus, D. Generating Summaries with Controllable Readability Levels. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 11669–11687.
11. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023.
12. Griffin, A.; Alex, F.; Faisal, L.; Eric, L.; Noémie, E. From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting. In Proceedings of the 4th New Frontiers in Summarization Workshop, Singapore, 6–10 December 2023; pp. 68–74.
13. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880.
14. Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The Efficient Transformer. International Conference on Learning Representations. arXiv 2020, arXiv:2001.04451.
15. Zhu, C.; Xu, R.; Zeng, M.; Huang, X. A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining. In Findings of the Association for Computational Linguistics: EMNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 194–203.
16. Rohde, T.; Wu, X.; Liu, Y. Hierarchical Learning for Generation with Long Source Sequences. arXiv 2021, arXiv:2104.07545.
17. Zhang, Y.; Ni, A.; Mao, Z.; Wu, C.H.; Zhu, C.; Deb, B.; Awadallah, A.; Radev, D.; Zhang, R. SummN: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 1592–1604.
18. Fang, J.; Li, B.; You, X.; Lv, X. PLSGA: A staged approach to long text summary generation. Comput. Eng. Appl. 2023, 1–10.
19. Ren, S.; Zhang, J.; Zhap, Z.; Rao, D. A two-stage text summarization model combining topic and location information. Intell. Comput. Appl. 2023, 13, 158–163.
20. Mei, A.; Kabir, A.; Bapat, R.; Judge, J.; Sun, T.; Wang, W.Y. Learning to Prioritize: Precision-Driven Sentence Filtering for Long Text Summarization. In Proceedings of the 13th Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 313–318.
21. Sun, S.; Liu, Z.; Xiong, C.; Liu, Z.; Bao, J. Capturing Global Informativeness in Open Domain Keyphrase Extraction. In Proceedings of the Natural Language Processing and Chinese Computing: 10th CCF International Conference, NLPCC 2021, Qingdao, China, 13–17 October 2021.
22. Song, M.; Jing, L.; Xiao, L. Importance Estimation from Multiple Perspectives for Keyphrase Extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, 7–11 November 2021; pp. 2726–2736.
23. Papagiannopoulou, E.; Tsoumakas, G.; Papadopoulos, A. Keyword Extraction Using Unsupervised Learning on the Document’s Adjacency Matrix. In Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15), Mexico City, Mexico, 11 June 2021; pp. 94–105.
24. Kong, A.; Zhao, S.; Chen, H.; Li, Q.; Qin, Y.; Sun, R.; Bai, X. PromptRank: Unsupervised Keyphrase Extraction Using Prompt. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 10–12 July 2023; Volume 1, pp. 9788–9801.
25. Joshi, R.; Balachandran, V.; Saldanha, E.; Glenski, M.; Volkova, S.; Tsvetkov, Y. Unsupervised Keyphrase Extraction via Interpretable Neural Networks. In Findings of the Association for Computational Linguistics: EACL; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 1107–1119.
26. Ding, H.; Luo, X. AttentionRank: Unsupervised Keyphrase Extraction using Self and Cross Attentions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Online, 7–11 November 2021; pp. 1919–1928.
27. Kreutz, T.; Daelemans, W. Streaming Language-Specific Twitter Data with Optimal Keywords. In Proceedings of the 12th Web as Corpus Workshop, Marseille, France, 11–16 May 2020; European Language Resources Association: Luxemburg, 2020; pp. 57–64.
28. Akash, P.S.; Huang, J.; Chang, K.; Li, Y.; Popa, L.; Zhai, C. Domain Representative Keywords Selection: A Probabilistic Approach. In Findings of the Association for Computational Linguistics: ACL; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 679–692.
29. Venkatesh, E.; Kaushal, M.; Deepak, K.; Maunendra, S.D. DivHSK: Diverse Headline Generation using Self-Attention based Keyword Selection. In Findings of the Association for Computational Linguistics: ACL; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 1879–1891.
30. Sreyan, G.; Utkarsh, T.; Manan, S.; Sonal, K.; Ramaneswaran, S.; Dinesh, M. ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1, pp. 104–125.
31. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
32. Peng, H.; Schwartz, R.; Li, D.; Smith, N.A. A Mixture of h − 1 Heads is Better than h Heads. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6566–6577.
33. Zhang, Y.; Ni, A.; Yu, T.; Zhang, R.; Zhu, C.; Deb, B.; Celikyilmaz, A.; Awadallah, A.H.; Radev, D. An Exploratory Study on Long Dialogue Summarization: What Works and What’s Next. arXiv 2021, arXiv:2109.04609.
34. Zhong, M.; Yin, D.; Yu, T.; Zaidi, A.; Mutuma, M.; Jha, R.; Awadallah, A.H.; Celikyilmaz, A.; Liu, Y.; Qiu, X.; et al. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5905–5921.
35. Chen, M.; Chu, Z.; Wiseman, S.; Gimpel, K. SummScreen: A Dataset for Abstractive Screenplay Summarization. arXiv 2022, arXiv:2104.07091.
36. Huang, L.; Cao, S.; Parulian, N.; Ji, H.; Wang, L. Efficient Attentions for Long Document Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 1419–1436.
37. Mihalcea, R.; Tarau, P. TextRank: Bringing Order into Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 404–411.
38. Feng, X.; Feng, X.; Qin, B.; Geng, X. Dialogue discourse-aware graph model and data augmentation for meeting summarization. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; pp. 3808–3814.

Figure 1. The framework of the MSMAUKE-Summ^N model.

Figure 2. The data filtering and segmentation.

Figure 3. The self-attention accumulation calculation.

Figure 4. The cross-attention calculation.

Figure 5. ROUGE metric values on AMI (a) and ICSI (b) datasets.

Figure 6. ROUGE metric values on SummScreen-FD (a) and SummScreen-TMS (b) datasets.

Figure 7. ROUGE metric values on the GovReport dataset for comparative models such as LSH (4096) and Sinkhorn (5120).

Figure 8. Performance test of the model with different text lengths.

Table 1. Experimental hyperparameter settings.

Serial Number	Parameters	Parameter Value
1	Lr	2 × 10⁻⁵
2	Coarse-grained beam width	2
3	Fine-grained beam width	10
4	Input_max_token	1024

Table 2. Statistics of the dataset.

Dataset	Type	Domain	Size	Original Text Length	Target Text Length	N + 1
AMI	Dialogue	Meetings	137	6007.7	296.6	2
ICSI	Dialogue	Meetings	59	13,317.3	488.5	3
QMSum	Dialogue	Meetings	1808	9069.8	69.6	2
SummScreen	Dialogue	TV shows	26,851	6612.5	337.4	2
GovReport	Document	Reports	19,466	9409.4	553.4	3

Table 3. Selection of experimental modules for MSMAUKE-Summ^N ablation experiments.

Experiment Number	Sentence Filtering	Mixed-Attention Keyword Extraction	Self-Attention Keyword Selection
1	√
2		√
3	√	√
4		√	√
5	√	√	√

A “√” in the above table indicates that the corresponding module has been selected.

Table 4. Results of MSMAUKE-Summ^N ablation experiments.

Experiment Number	R-1 (%)	R-2 (%)	R-L (%)
1	56.87	23.56	54.12
2	57.23	24.17	54.71
3	57.35	24.33	54.85
4	57.45	24.52	55.02
5	57.52	24.73	55.15

MSMAUKE-Summ^N’s ROUGE metric values on the dataset GovReport are bolded.

Table 5. ROUGE metric values on AMI, ICSI and QMSum datasets.

	AMI			ICSI			QMSum-All			QMSum-Gold
Model	R-1 (%)	R-2 (%)	R-L (%)	R-1 (%)	R-2 (%)	R-L (%)	R-1 (%)	R-2 (%)	R-L (%)	R-1 (%)	R-2 (%)	R-L (%)
HMNET	52.36	18.63	24.00	45.97	10.14	18.54	32.29	8.67	28.17	36.06	11.36	31.27
TextRank	35.19	6.13	16.70	30.72	4.69	12.97	16.27	2.69	15.41	-	-	-
HAT-BART	52.27	20.15	50.57	43.98	10.83	41.36	-	-	-	-	-	-
DDAMS	53.15	22.32	25.67	40.41	11.02	19.18	-	-	-	-	-	-
Summ^N	53.44	20.30	51.39	45.57	11.49	43.32	34.03	9.28	29.48	40.20	15.32	35.62
MSMAUKE-Summ^N	55.19	21.46	53.35	47.54	12.85	45.25	36.29	10.8	31.22	42.2	17.09	38.09

MSMAUKE-Summ^N’s ROUGE metric values on the AMI, ICSI, and QMSum datasets are bolded.

Table 6. ROUGE metric values on the SummScreen dataset.

	SummScreen-FD			SummScreen-TMS
Model	R-1 (%)	R-2 (%)	R-L (%)	R-1 (%)	R-2 (%)	R-L (%)
Longformer + ATT	25.9	4.2	23.8	42.9	11.9	41.6
NN + BM25 + Neural	25.3	3.9	23.1	38.8	10.2	36.9
Summ^N	32.48	5.85	27.55	44.64	11.87	42.53
MSMAUKE-Summ^N	34.34	6.86	28.98	46.38	12.64	44.74

MSMAUKE-Summ^N’s ROUGE metric values on the SummScreen dataset are shown in bold.

Table 7. ROUGE metric values on the GovReport dataset.

	GovReport
Model	R-1 (%)	R-2 (%)	R-L (%)
BART Variants
Full (1024)	52.83	20.5	50.14
Stride (4096)	54.29	20.8	51.35
LIN. (3072)	44.84	13.87	41.94
LSH (4096)	54.75	21.36	51.27
Sinkhorn (5120)	55.45	21.45	52.48
BART HEPOS
LSH (7168)	55	21.13	51.67
Sinkhorn (10,240)	56.86	22.62	53.82
Summ^N	56.77	23.25	53.9
MSMAUKE-Summ^N	57.52	24.73	55.15

MSMAUKE-Summ^N’s ROUGE metric values on the GovReport dataset are shown in bold.
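
As a worked check on Table 7, the short sketch below computes the absolute differences (in percentage points) between MSMAUKE-Summ^N and Summ^N on GovReport. The scores are copied from the table; the variable and function names are illustrative and not part of the original work.

```python
# ROUGE scores (%) copied from Table 7 (GovReport).
summ_n = {"R-1": 56.77, "R-2": 23.25, "R-L": 53.90}
msmauke = {"R-1": 57.52, "R-2": 24.73, "R-L": 55.15}


def absolute_gain(ours, baseline):
    """Per-metric difference in percentage points, rounded to two decimals."""
    return {metric: round(ours[metric] - baseline[metric], 2) for metric in ours}


gains = absolute_gain(msmauke, summ_n)
print(gains)
```

The resulting differences are 0.75, 1.48, and 1.25 points for ROUGE-1, ROUGE-2, and ROUGE-L, which coincide with the improvement figures quoted in the abstract. They are absolute score differences, not relative percentage changes.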

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
