1. Introduction
Generative text summarization aims to create short, concise, and easy-to-read summaries by extracting key information from a text. This approach offers greater flexibility than extractive summarization, allowing models to generate words that do not appear in the source. This capability brings machine-generated summaries closer to manually written ones, leading to a surge in research on generative models [1,2,3].
However, despite significant advancements achieved through neural networks, a critical challenge remains. Maynez et al. [4] demonstrated that sequence-to-sequence generative models often introduce "hallucinated content": factual errors not present in the original text. Research by Falke et al. [5] further highlights this problem, finding that 25% of summaries from state-of-the-art models contain factual errors.
Table 1 showcases examples of these errors in three classical models [5]. As shown, the PGC model replaces the protagonist Jim Jeps with the unrelated Green Party leader Natalie Bainter, the FAS model transforms a viewpoint based on assumptions into definitive statements, and the BUS model incorrectly changes the protagonist of the document from a defender to a club. This issue plagues Chinese text summarization as well, significantly hindering the credibility and usability of generative text summarization models.
This paper proposes the Dependency-Augmented Summarization Model (DASUM) to address hallucination in Chinese text summarization, i.e., summaries that contain factual inconsistencies. DASUM leverages jieba (https://github.com/fxsjy/jieba, accessed on 30 June 2024) for initial segmentation of the source document. It then employs Baidu's DDParser [8] to analyze the dependency relationships between the resulting tokens. Dependency analysis is crucial for capturing the relationships between words in Chinese text; it yields a dependency graph in which each edge represents a relationship between two words (head and dependent). DASUM then employs the Relational Graph Attention Network (RGAT) [9] within its encoder to encode the dependency graph extracted from the source document, obtaining a representation for each node that captures both the node's own information and its connections to other words. This, in turn, helps the model capture the relationships between important words in the sentence, thereby reducing potential distortions during summarization. The node representations are then integrated into a standard Transformer-based [10] encoder–decoder architecture. During decoding, the decoder attends to each graph node within each Transformer block, allowing it to focus on relevant information throughout the summarization process. This approach improves the model's understanding of factual relationships within the source text, leading to summaries with improved factual accuracy.
Experiments conducted on the benchmark LCSTS dataset [11] demonstrate that DASUM effectively improves both the quality and factual correctness of generated summaries. Evaluated using ROUGE [12] metrics, DASUM outperformed the baseline Transformer model, achieving a substantial 7.79-point increase in the ROUGE-1 score and an 8.73-point improvement in the ROUGE-L score on the LCSTS dataset. Additionally, StructBERT [13] evaluation revealed a notable 4.48-point improvement in factual correctness for DASUM-generated summaries compared to the baseline. Finally, these quantitative results are corroborated by human evaluation, confirming the model's effectiveness.
In summary, the contribution of this paper is threefold:
The Dependency-Augmented Summarization Model (DASUM) is proposed to tackle the hallucination problem in Chinese text summarization.
A dependency graph is constructed based on the dependency relationships in the source document to extract the relationships between keywords in the article.
The Relational Graph Attention Network (RGAT) is employed to encode the dependency graph, thereby integrating factual information from the source document into the summary generation model.
The rest of this paper is organized as follows: related works are reviewed in Section 2. Section 3 briefly introduces the system model. Our experimental and evaluation results are provided and analyzed in Section 4. Finally, Section 5 concludes the paper and outlines potential future directions.
3. System Models
Our model employs a Transformer-based architecture without pre-training. To leverage dependency relationships within the source document, we incorporate a Relational Graph Attention Network (R-GAT) [9] on top of the Transformer. Sentence embeddings are initially generated by the sentence encoder. Subsequently, R-GAT produces node embeddings for the dependency graph, capturing inter-word relationships. During decoding, the model attends to both sentence and node embeddings, enabling the generation of informative and factually accurate summaries. The model architecture is detailed in Figure 1.
3.1. Problem Formulation
Given a tokenized input sentence X = (x_1, x_2, ..., x_n), where n is the number of tokens, and a corresponding summary Y = (y_1, y_2, ..., y_m) with m < n, we conduct dependency analysis to construct a dependency graph G = (V, E). In this graph, V represents the set of nodes, E represents the set of edges, and each edge is defined as e = (v_i, v_j, r), where v_i ∈ V, v_j ∈ V, and r ∈ R. R denotes the set of dependency labels. Our objective is to generate the target summary Y given the input sentence X and its dependency graph G. While the Transformer's time complexity is O(n^2 · d), R-GAT's complexity is O(|V| F^2 + |E| F) for a graph with V vertices, E edges, and feature dimension F. Despite this, parallel training maintains an overall O(n^2) complexity. Our proposed method offers enhanced expressive power compared to a baseline Transformer model. Algorithm 1 shows the training process of DASUM.
Algorithm 1 Training Process of the Dependency-Augmented Summarization Model
Require: training dataset {(X, Y)}.
1: Initialize network parameters θ;
2: for each training epoch do
3:   for each (X, Y) in the training dataset do
4:     Dependency relation D = DDParser(X).
5:     G = graph_construct(D).
6:     Compute sentence embeddings H_s = SentenceEncoder(X) and node embeddings H_g = GraphEncoder(G).
7:     Decode with attention over H_s and H_g to obtain the predicted summary.
8:     Compute the generation probability P(Y | X, G).
9:     Compute and accumulate the loss L.
10:  end for
11:  Update network parameters θ.
12: end for
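The training procedure of Algorithm 1 can be sketched as a minimal Python skeleton. Every component below (the parser, graph construction, and the placeholder loss) is an illustrative stub standing in for DDParser, the encoders, and the decoder described in this section, not the paper's implementation:

```python
# Minimal sketch of the DASUM training loop; all components are stubs.

def parse_dependencies(sentence):
    # Stand-in for DDParser: returns (head, dependent, label) triples.
    tokens = sentence.split()
    return [(i - 1, i, "dep") for i in range(1, len(tokens))]

def graph_construct(relations):
    # Build an adjacency list keyed by head index (step 5).
    graph = {}
    for head, dep, label in relations:
        graph.setdefault(head, []).append((dep, label))
    return graph

def train(dataset, epochs=2):
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for source, summary in dataset:
            relations = parse_dependencies(source)   # step 4
            graph = graph_construct(relations)       # step 5
            # Steps 6-9 (encoders, decoder, loss) are abstracted into a
            # placeholder quantity so the skeleton stays runnable.
            epoch_loss += float(len(graph) + len(summary.split()))
        losses.append(epoch_loss)                    # step 11: update params here
    return losses

losses = train([("w1 w2 w3", "w1"), ("a b", "b")])
```

In a real implementation the placeholder loss would be the negative log-likelihood of the reference summary under the decoder, back-propagated through both encoders.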
3.2. Graph Encoder
We employed DDParser, a Chinese dependency analysis tool built on the PaddlePaddle platform, to extract dependencies between words. Directed graphs were then constructed from these dependencies (as shown in Figure 2). Subsequently, we obtained node representations using the Relational Graph Attention Network (R-GAT).
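Assuming DDParser's parse output of parallel word/head/deprel lists, with 1-based head indices and 0 marking the root, the directed, labeled dependency graph can be built as in the following sketch (the toy sentence and labels are invented for illustration):

```python
def build_dependency_graph(parse):
    """Turn a DDParser-style parse into a directed, labeled edge list.

    `parse` holds parallel lists: parse["word"][i] depends on the word at
    1-based index parse["head"][i] with relation parse["deprel"][i];
    a head of 0 denotes the root, which has no incoming edge.
    """
    words = parse["word"]
    edges = []
    for dep_idx, (head, label) in enumerate(zip(parse["head"], parse["deprel"])):
        if head == 0:                 # skip the root
            continue
        edges.append((head - 1, dep_idx, label))   # (head, dependent, relation)
    return words, edges

# Toy parse in DDParser's output shape.
parse = {"word": ["我", "爱", "北京"],
         "head": [2, 0, 2],
         "deprel": ["SBV", "HED", "VOB"]}
nodes, edges = build_dependency_graph(parse)
```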
R-GAT extends the Graph Attention Network (GAT) to incorporate edge labels. In this paper, a dependency tree is modeled as a graph G with n nodes representing words, and edges denoting dependency relations. GAT iteratively updates node representations by aggregating information from neighboring nodes using a multi-head attention mechanism:

h_i^{l+1} = \Vert_{k=1}^{K} \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{lk} W_k^l h_j^l \Big)

where h_i^{l+1} is the representation of node i at layer l+1, \Vert denotes the concatenation of the vectors produced by attention heads 1 to K, \alpha_{ij}^{lk} represents the normalized attention coefficient computed by the k-th attention head at layer l, and W_k^l is the input transformation matrix. This paper employs dot-product attention for \alpha_{ij}^{lk}.
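The update above can be illustrated with a single dot-product attention head over a small graph; this numpy sketch is simplified (one head, one layer) relative to the K-head, multi-layer form used in the model:

```python
import numpy as np

def gat_layer(h, edges, W):
    """One dot-product graph attention head (illustrative, single head).

    h: (n, d) node features; edges: (src, dst) pairs, so node i attends
    over its in-neighbors {j : (j, i) in edges} plus itself (self-loop).
    W: (d, d_out) input transformation matrix.
    """
    n = h.shape[0]
    z = h @ W                                   # transform node features
    out = np.zeros_like(z)
    for i in range(n):
        nbrs = [j for j, dst in edges if dst == i] + [i]   # add self-loop
        scores = np.array([z[i] @ z[j] for j in nbrs])     # dot-product attention
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                # normalized attention coefficients
        out[i] = sum(w * z[j] for w, j in zip(weights, nbrs))
    return out

h = np.eye(3)                                   # 3 nodes, one-hot features
edges = [(0, 1), (2, 1)]                        # node 1 attends to 0, 2, itself
out = gat_layer(h, edges, np.eye(3))
```

With one-hot inputs and an identity transform, a node without in-neighbors keeps its own representation, while node 1's output is a convex combination of its neighbors' features.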
While GAT aggregates information from neighboring nodes, it overlooks dependency relationships, potentially missing crucial information. To solve this problem, we introduce R-GAT, which has relation-specific attention heads, allowing for differentiated information flow based on dependency types. Intuitively, nodes connected by different dependency relations should have varying influences. By capturing and leveraging these fine-grained dependency details, our model excels at comprehending and representing sentence structure. This improved understanding benefits subsequent natural language processing tasks. The overall architecture is depicted in Figure 3.
Specifically, we first map the dependency relations to vector representations and then compute the relational head as

g_{ij}^{lm} = \sigma\big( \mathrm{relu}(r_{ij} W_{m1} + b_{m1}) W_{m2} + b_{m2} \big)

h_{rel,i}^{l+1} = \Vert_{m=1}^{M} \sum_{j \in \mathcal{N}(i)} \hat{g}_{ij}^{lm} W_m^l h_j^l

where r_{ij} denotes the relational embedding between node i and node j, and \hat{g}_{ij}^{lm} is g_{ij}^{lm} normalized over the neighborhood of node i with a softmax. R-GAT contains K attention heads and M relational heads. The final representation of each node is computed by concatenating the outputs of the attention heads and the relational heads:

x_i^{l+1} = h_{att,i}^{l+1} \,\Vert\, h_{rel,i}^{l+1}
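A single relational head can be sketched in numpy as follows. Here the relation embeddings and the two-layer scoring weights (W1, b1, W2, b2) are illustrative stand-ins for learned parameters, and only one of the M heads is shown:

```python
import numpy as np

def relational_head(h, edges, rel_emb, W1, b1, W2, b2, Wv):
    """One relational attention head (illustrative sketch).

    edges: (src, dst, relation_id) triples. The score for each edge is a
    two-layer transform of the relation embedding rel_emb[relation_id],
    softmax-normalized over each node's in-neighborhood.
    """
    n = h.shape[0]
    out = np.zeros((n, Wv.shape[1]))
    for i in range(n):
        nbrs = [(j, r) for j, dst, r in edges if dst == i]
        if not nbrs:
            continue                           # no incoming relations
        scores = np.array([
            float(np.maximum(rel_emb[r] @ W1 + b1, 0.0) @ W2 + b2)
            for _, r in nbrs
        ])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # softmax over the neighborhood
        out[i] = sum(w * (h[j] @ Wv) for w, (j, _) in zip(weights, nbrs))
    return out

h = np.eye(2)
rel_emb = np.array([[1.0], [0.0]])             # two relation types, 1-dim embeddings
out = relational_head(h, [(0, 1, 0)], rel_emb,
                      W1=np.ones((1, 2)), b1=np.zeros(2),
                      W2=np.ones((2, 1)), b2=0.0, Wv=np.eye(2))
```

Because the attention scores depend only on the relation embeddings, two neighbors connected by different dependency types contribute with different weights even when their node features are identical, which is exactly the differentiated information flow R-GAT adds over GAT.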
3.3. Sentence Encoder
The sentence encoder processes the input sentence to generate contextual representations, mirroring the standard Transformer architecture. It comprises multiple layers, each consisting of multi-head attention followed by Add & Norm, and a feed-forward layer followed by another Add & Norm layer. To account for word order, essential for language understanding, positional encoding (PE) is incorporated. PE vectors, identical in dimension to the word embeddings, are added to the word embeddings. These PE vectors are computed using a predetermined formula:

PE(pos, 2i) = \sin\big( pos / 10000^{2i/d_{model}} \big)
PE(pos, 2i+1) = \cos\big( pos / 10000^{2i/d_{model}} \big)

where pos denotes the position of the word in the sentence and i indexes the dimension of the PE.
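The sinusoidal formula can be computed vectorized in numpy; the sketch below produces one d_model-dimensional vector per position (for this paper's configuration, d_model would be 512):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dims: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dims: cosine
    return pe

pe = positional_encoding(512, 8)
```

At position 0 all sine components are 0 and all cosine components are 1, which is a quick sanity check on any implementation.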
In the multi-head self-attention mechanism, attention scores are computed from query (Q), key (K), and value (V) matrices. To mitigate the impact of large inner products between Q and K, the result is scaled by the square root of the key dimension (d_k). Subsequently, a softmax function is applied to obtain attention weights, which are then used to compute a weighted sum of the values:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\big( QK^{T} / \sqrt{d_k} \big) V
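This computation maps directly to a few lines of numpy (single head, no masking, with the usual max-subtraction for numerical stability):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V, weights

Q = np.array([[1.0, 0.0]])
K = V = np.eye(2)
out, w = scaled_dot_product_attention(Q, K, V)
```

The query [1, 0] matches the first key more strongly, so the first attention weight dominates and the output leans toward the first value row.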
Following the multi-head attention mechanism, a feed-forward network is applied. Both components are wrapped in residual connections and Layer Normalization (Add & Norm). The encoder processes the input matrix X through multiple layers of this structure, generating the following output representation:

Z = \mathrm{LayerNorm}\big( X + \mathrm{MultiHead}(X, X, X) \big)
O = \mathrm{LayerNorm}\big( Z + \mathrm{FFN}(Z) \big)

A single encoder layer processes the input matrix and outputs a matrix O of identical dimensions. To enhance representation learning, multiple encoder layers can be stacked sequentially; the output of each layer becomes the input to the next, and the final representation is taken from the last encoder layer.
3.4. Decoder
The decoder combines the outputs of the sentence encoder and the graph encoder into a unified representation:

H = H_s \,\Vert\, H_g

where H_s is the sentence encoder's output, H_g is the graph encoder's output, and the concatenation is along the sequence dimension. Specifically, the decoder employs two multi-head attention layers. The first layer utilizes masked self-attention to prevent the model from attending to future tokens during training, akin to the standard Transformer. The second layer's K and V matrices are computed from H, the concatenation of the sentence encoder's and graph encoder's outputs, while the Q matrix is derived from the previous decoder layer's output. This allows the decoder to access information from both the sentence and the dependency graph for each generated token. Finally, a softmax layer predicts the next word based on the decoder's output.
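The cross-attention step can be sketched as follows; the shapes and the uniform-attention usage example are illustrative, and a real decoder would apply learned Q/K/V projections before this computation:

```python
import numpy as np

def cross_attend(dec_state, sent_enc, graph_enc):
    """Decoder cross-attention over sentence and graph encoder outputs.

    Keys/values come from the concatenation of the two encoders' outputs
    along the sequence axis, so each decoding step can attend to both
    token representations and dependency-graph node representations.
    """
    memory = np.concatenate([sent_enc, graph_enc], axis=0)  # (n + |V|, d)
    d_k = memory.shape[-1]
    scores = dec_state @ memory.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ memory, weights

sent_enc = np.random.default_rng(0).normal(size=(4, 8))   # 4 token embeddings
graph_enc = np.random.default_rng(1).normal(size=(3, 8))  # 3 node embeddings
out, w = cross_attend(np.zeros((1, 8)), sent_enc, graph_enc)
```

With a zero query, the weights are uniform over all 7 memory slots, making it easy to verify that the decoder is attending over both sources at once.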
4. Experimentation and Evaluation
This paper compares our proposed model, DASUM, against a baseline Transformer and a GAT+Transformer model to evaluate the effectiveness of the R-GAT module. While R-GAT and GAT excel at modeling graph-structured data, they are rarely applied to natural language generation tasks on their own; we therefore did not conduct separate summarization experiments using standalone R-GAT or GAT. In future work, additional text summarization models will be incorporated for a more comprehensive comparison.
4.1. Datasets
We train and evaluate our model on the LCSTS [11] dataset, a benchmark for Chinese text summarization characterized by short texts and noise. LCSTS comprises three main parts. The first part, which contains 2,400,591 summary–text pairs across domains such as politics, athletics, military, movies, and games, offers diverse summarization topics and styles and is used as our training set. The second part, 10,666 manually annotated pairs with scores ranging from 1 to 5 indicating the correlation between each text and its summary, is used as our validation set.
The third and final part of the dataset is a test set originally comprising 1106 samples, manually curated based on the relevance between short texts and abstracts. To ensure data quality, duplicates, obvious mismatches, and other inconsistencies were removed, resulting in a final test set of 1012 samples.
Table 2 presents the names and sizes of the three subsets within the LCSTS dataset.
4.2. Model Hyperparameters
Our model leverages the Transformer architecture as its foundation, integrating an R-GAT network for enhanced performance (the complete source code for the experiments described in this paper is available at https://gitee.com/liyandewangzhan/dasum, accessed on 20 July 2024). Based on previous findings [28,29] suggesting that smaller dimensions are effective for low-resource tasks like short-text summarization, we set the word embedding and hidden layer dimensions for both the sentence encoder and decoder to 512. We also limit the maximum sentence length to 512 tokens. To efficiently capture inter-sentence relationships, we adopt a 6-layer, 8-head multi-head attention mechanism as our base model; future experiments will explore larger-scale attention mechanisms. For consistency with the Transformer module, we use the same hyperparameter configuration for the R-GAT network: a 6-layer, 8-head self-attention mechanism with a word embedding dimension and hidden state size of 512. To improve generalization, we apply a dropout rate of 0.1. We use the Adam optimizer with a learning rate of 1 × 10^-3 and momentum parameters (0.9, 0.98). This configuration proved effective and efficient for our text summarization experiments.
4.3. Assessment Criteria
To evaluate the quality of generated summaries, we employed the standard ROUGE-1, ROUGE-2, and ROUGE-L metrics. These metrics correlate well with human judgments and assess the accuracy of words, bi-gram matching, and longest common subsequences, respectively.
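As a concrete illustration of the word-level metric, ROUGE-1 F1 is the harmonic mean of unigram precision and recall between a candidate and a reference summary. The sketch below is a minimal token-level version (no stemming or synonym handling, unlike full ROUGE toolkits):

```python
from collections import Counter

def rouge_1_f1(candidate, reference):
    """ROUGE-1 F1: clipped unigram overlap between two summaries."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())      # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())  # matches / candidate length
    recall = overlap / sum(ref.values())      # matches / reference length
    return 2 * precision * recall / (precision + recall)

score = rouge_1_f1("the cat sat", "the cat sat down")
```

ROUGE-2 replaces unigrams with bigrams, and ROUGE-L scores the longest common subsequence instead of n-gram overlap.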
Table 3 presents the ROUGE scores for all experiments.
Experimental results demonstrate that the Transformer baseline significantly outperformed the RNN model, achieving 12.92, 11.54, and 11.59 points higher F1 scores on ROUGE-1, ROUGE-2, and ROUGE-L, respectively, demonstrating the Transformer’s superiority. Our model further improved upon this foundation, surpassing the Transformer by 7.79, 4.70, and 8.73 points on ROUGE-1, ROUGE-2, and ROUGE-L F1 scores, respectively. These findings underscore the effectiveness of the DASUM model.
Additionally, we evaluated the three models (DASUM, GAT+Transformer, and Transformer) on the NLPCC test dataset (NLPCC 2017 Shared Task Test Data (Task 3), available at http://tcci.ccf.org.cn/conference/2017/taskdata.php, accessed on 20 July 2024), which contains significantly longer texts than the LCSTS dataset. Due to resource limitations, we directly tested the models trained on LCSTS without retraining them on NLPCC. As shown in Table 4, model performance on NLPCC is notably lower than on LCSTS, primarily owing to the discrepancy in data distribution: NLPCC predominantly features longer texts. Nevertheless, the DASUM model consistently outperforms both the GAT+Transformer and Transformer models on the NLPCC dataset.
Meanwhile, we employed the StructBERT Chinese natural language inference model and the ChatGPT-3.5-turbo large language model to evaluate the factual correctness of the generated summaries. StructBERT, trained on the CMNLI and OCNLI datasets, determines the semantic relationship between the source document and the generated summary. Using the dialogue template presented in Table 5, we instructed the ChatGPT-3.5-turbo model to assign each generated summary a factual-correctness score between 0 and 5. Averaging these scores across all summaries in the test set via the ChatGPT API yields each model's overall factual correctness.
A comparative analysis of DASUM, the baseline Transformer, and GAT+Transformer across these evaluation metrics (presented in Table 6) demonstrates the superior factual correctness of our proposed model. Furthermore, manual evaluations of randomly selected samples support these findings.
4.4. Manual Assessment
To further assess our model's ability to enhance semantic relevance and mitigate content bias, we conducted a manual evaluation focusing on fidelity, informativeness, and fluency. Representative examples are presented in Table 7.
5. Conclusions and Future Work
This paper explores the utilization of dependency graphs to enhance Chinese text summarization models. We introduce DASUM, a novel model that effectively integrates original text with dependency information to guide summary generation. Experimental results on the LCSTS dataset demonstrate the model’s effectiveness in improving semantic relevance and overall summary quality. While dependency relations offer valuable insights, challenges such as parsing errors in complex sentences and limited handling of cross-sentence dependencies persist. Future work will attempt to address the aforementioned issues, including the following:
Incorporating multiple dependency parsers to mitigate the impact of errors from individual parsers;
Enhancing dependency graph construction by considering the unique syntactic patterns and idiomatic expressions of the Chinese language;
Leveraging human-in-the-loop reinforcement learning to iteratively refine summary generation quality.
Furthermore, we aim to develop more comprehensive evaluation metrics to assess summary quality from multiple perspectives.