Article

Cross-Domain Document Summarization Model via Two-Stage Curriculum Learning

School of Computing, Gachon University, 1342, Seongnam-daero, Sujeong-gu, Seongnam-si 13120, Republic of Korea
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(17), 3425; https://doi.org/10.3390/electronics13173425
Submission received: 26 June 2024 / Revised: 19 August 2024 / Accepted: 27 August 2024 / Published: 29 August 2024
(This article belongs to the Special Issue Natural Language Processing Method: Deep Learning and Deep Semantics)

Abstract

Generative document summarization is a natural language processing technique that generates short summaries while preserving the content of long texts. Various fine-tuned pre-trained document summarization models have been proposed, each using a specific single text-summarization dataset. However, each text-summarization dataset usually specializes in a particular downstream task, so it is difficult to cover cases involving multiple domains with a single dataset. Accordingly, when a generative document summarization model is fine-tuned on a specific dataset, it performs well on that dataset, whereas its performance degrades by up to 45% on datasets that were not used during training. In short, summarization models perform well in in-domain cases, where the dataset domain is the same during training and evaluation, but perform poorly on out-domain inputs. In this paper, we propose a new curriculum-learning method that uses mixed datasets while training a generative summarization model so that it becomes more robust on out-domain datasets. Compared with a model fine-tuned only on XSum, our method showed 10%, 20%, and 10% lower performance degradation on CNN/DM, one of the two test datasets used, relative to the baseline model performance.

1. Introduction

With the recent development of large-scale artificial intelligence (AI), humans receive more assistance in many areas of daily life. One of these areas is time-saving text processing via document summarization. Efficiently processing the vast amount of text generated in real time has become increasingly important with the growth of social media, and the use of document summarization to shorten long texts with large-scale language models [1,2,3] such as ChatGPT is increasing.
With the increasing demand for document summarization tasks, research on natural language processing (NLP), which utilizes generative language models through various domain document summarization datasets, is actively ongoing. Document summarization is an NLP sub-task that generates natural summary sentences while preserving the key content of the main text. The types of document summarization datasets vary according to the data source and collection method used for the summarization, including news [4,5,6,7], journals [8], and meeting notes [9]. With the development of document generative summarization technology, the latest generative summarization models [2,10,11] have achieved high performance for each of these document summarization datasets.
However, we still need to overcome the performance degradation that summarization models suffer when applied to an unseen domain. As shown in Section 6, a summarization model that achieves the best performance on an individual document summarization dataset degrades significantly on out-domain data that did not appear during training. In actual applications, documents on various topics can be input. Therefore, there is a need for research on document generative summarization models that mitigate performance degradation on out-domain documents that do not appear during training, while not compromising in-domain performance.
This paper presents the first experimental study of an out-domain summarization methodology in a cross-domain setting, in which document summarization samples from various domains are combined to train a model on mixed training data. When training with cross-domain summarization data, the proposed method uses curriculum learning [12] to improve performance on both in- and out-domain data for several document summarization datasets. Our contributions can be summarized as follows: (1) We propose a two-stage abstractive summarization technique that adapts curriculum learning to a cross-domain environment. (2) We develop a simple method to measure the difficulty of each component of the cross-domain dataset and observe that domain similarity affects the degree of performance degradation on out-domain tasks.
The remainder of this paper is organized as follows. Section 2 discusses the foundational technologies needed to understand the training data composition for document summarization and generative language models. Recent research related to generative document summarization and curriculum learning is discussed in Section 3. The proposed method is described in Section 4, and the dataset-level curriculum learning used in the proposed method is detailed in Section 5. Section 6 presents experiments that evaluate summarization performance on both in- and out-domain summarization datasets. Finally, Section 7 concludes the paper.

2. Background

In this paper, a training method that mixes datasets with different domain compositions when training a document generative summarization model is proposed. In this section, we present the background knowledge related to document summarization tasks and curriculum learning.

2.1. Text Summarization

Document summarization is an NLP task that generates a shorter, condensed version of a long text while preserving key content [13,14,15]. Document summarization can be broadly divided into extractive and abstractive summarization, depending on how the problem is defined. Extractive summarization involves selecting important sentences from an original document, while abstractive summarization involves rephrasing the core content of the original text into a shorter version [13,14,16,17].
Abstractive summarization is defined as a natural language generation (NLG) problem in which an input document is summarized by generating new natural language sentences [14]. Owing to the nature of document summarization tasks, the input documents are typically very long; therefore, many training methodologies have been proposed to handle long inputs in summarization models [18,19]. Previously, abstractive summarization did not yield satisfactory results owing to the limitations of NLP technology, so related research did not advance significantly [20].
Abstractive summarization datasets typically consist of the main text and a reference summary created by humans. The goal of generative summarization is to produce a short, coherent summary that captures the essence of the main text [13,14,20]; it offers better readability than extractive summarization [5,6] and can generate abstract sentences that are not present in the original text [13].
Most research on abstractive summarization uses lexical overlap and semantic similarity scores between the reference and generated summaries as evaluation metrics. Lexical overlap measures the surface similarity between input pairs, namely, the reference and generated summaries. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [21] is a lexical metric that calculates the match ratio of word subsequence overlap between them. ROUGE-N measures N-gram overlap, typically using uni-gram or bi-gram metrics such as ROUGE-1 and ROUGE-2 [6,22]. ROUGE-L measures the overlapped Longest Common Subsequence (LCS) to capture the overlap of non-continuous subsequences. Conversely, BERTScore [23] is a semantic similarity score that uses a pre-trained embedding model to measure the pairwise cosine similarity of input pairs. BERTScore uses the BERT [24] embedding score to measure similarity for each token pair.
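As a concrete illustration (not taken from the paper), the snippet below computes ROUGE-1/2/L and BERTScore for a single reference–candidate pair using the public rouge-score and bert-score packages; the example sentences are invented for demonstration.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The cat sat on the mat near the door."
generated = "A cat was sitting on the mat by the door."

# ROUGE-1/2 count uni-/bi-gram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, generated)
print({name: round(result.fmeasure, 3) for name, result in rouge.items()})

# BERTScore: pairwise cosine similarity of contextual token embeddings.
precision, recall, f1 = bert_score([generated], [reference], lang="en")
print("BERTScore F1:", round(f1.item(), 3))
```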

2.2. Abstractive Summarization Methods

Abstractive summarization has employed various methods, such as template-based [25] and graph-based [26] approaches. Because reduction and paraphrasing occur frequently in abstractive summarization, methodologies that use deep-learning methods [27,28] for text summarization have become popular [29]. Since the introduction of Bidirectional Encoder Representations from Transformers (BERT) [24], pre-trained model-based methodologies have shown high performance in various NLP tasks. Following this trend, high-performance pre-trained models [2,10,11] have been fine-tuned to achieve high performance in various summarization tasks.
After the introduction of BERT [24], pre-trained language models have generally been used to perform document generative summarization tasks. In [10,11], performance improvements in summarization tasks were demonstrated by adopting effective pre-training tasks. BART [10] uses the denoising autoencoder (DAE) method during the pre-training phase, where the objective function involves restoring the original text from the input corrupted with token-level noise. It exploits the token generation ability of various denoising techniques such as random token deletion, substitution, and masking. PEGASUS [11] employs the gap sentence generation (GSG) method as a pre-training objective for generative summarization. It expands the BART token denoising task to the sentence level by masking and restoring all key sentences from the input.
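The following is a simplified sketch of the gap-sentence selection idea, written for this text rather than taken from [10,11]: sentences that best summarize the rest of the document (scored here with ROUGE-1 F1) are masked in the input and become the generation target. The [MASK1] token follows the PEGASUS paper, while the selection ratio and the rest of the code are only illustrative.

```python
from rouge_score import rouge_scorer

def gap_sentence_mask(sentences, mask_ratio=0.3):
    """Select the sentences most similar to the rest of the document, mask them
    in the source, and return (corrupted input, target to restore)."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    scored = []
    for i, sent in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scored.append((scorer.score(rest, sent)["rouge1"].fmeasure, i))
    k = max(1, int(len(sentences) * mask_ratio))
    masked = {i for _, i in sorted(scored, reverse=True)[:k]}
    source = " ".join("[MASK1]" if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target

source, target = gap_sentence_mask([
    "The storm hit the coast overnight.",
    "Thousands of homes lost power.",
    "Officials expect repairs to take several days.",
])
print(source)
print(target)
```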

2.3. Curriculum Learning

Curriculum learning [12] mimics the learning process of humans by gradually exposing the model to data samples of increasing difficulty, starting from easy samples and progressing to harder ones. There are two main types of curriculum learning, depending on how difficulty levels are applied during model training: data-level and model-level methods. Data-level curriculum learning provides data samples to the model in an order predefined by their difficulty levels and gradually accumulates them as training data at each step of the learning process. Data-level methods can improve model convergence when the curriculum is well constructed. Model-level curriculum learning adjusts the proportion of the loss values according to the difficulty of the data samples when updating the model. Model-level methods require less effort than manually determining difficulty, but the model may face an overfitting risk.
Self-paced learning (SPL) [30] was proposed for image-segmentation tasks [31]. Self-paced learning is a model-level curriculum-learning method in which only data samples with loss values below a certain threshold are selected, based on the model's loss values, during the learning process. In contrast to data-level curriculum learning, which progressively adds data samples at each step of the learning process, self-paced learning gradually raises the difficulty threshold for model updates, incorporating more challenging samples into training. Self-paced learning may enhance the robustness of the model, but it is challenging to adapt its learning strategy because it was designed for image tasks rather than text.
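A minimal sketch (our illustration, not code from [30,31]) of the core SPL idea: per-sample losses above the current threshold are excluded from the update, and the threshold grows as training proceeds so that harder samples are phased in.

```python
import torch

def self_paced_loss(per_sample_loss: torch.Tensor, threshold: float) -> torch.Tensor:
    """Classic SPL weights: v_i = 1 if loss_i < threshold, else 0."""
    v = (per_sample_loss.detach() < threshold).float()
    # Average only over the selected samples; avoid division by zero.
    return (v * per_sample_loss).sum() / v.sum().clamp(min=1.0)

per_sample_loss = torch.tensor([0.4, 1.2, 0.9, 2.5], requires_grad=True)
for epoch, threshold in enumerate([1.0, 1.5, 3.0]):  # threshold grows each epoch
    print(epoch, self_paced_loss(per_sample_loss, threshold).item())
```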

3. Related Work

3.1. Abstractive Summarization on Various Domains

Recently, contrastive learning methods have been proposed to align the differences between the pre-training objectives of sequence-to-sequence (Seq2Seq) language models [10,32] and the summarization metrics introduced in Section 2. Seq2Seq language models maximize the likelihood of word-level occurrences of the reference sentence during decoding through Maximum Likelihood Estimation (MLE) training. However, summarization performance is measured with sentence-level evaluation metrics that compare the generated sentence with the reference summary. Ref. [33] proposed a method to mitigate this mismatch between the units used in language model training and evaluation via contrastive learning, which ensures that the generated summaries are represented closer to the reference summaries in semantic space. Extending [33], ref. [34] proposes a method that adjusts the loss weight according to the occurrence probability of generated summaries in fine-tuned summarization models.
Despite these approaches to further model training using DAE [10] or contrastive learning [33,34], such studies have difficulty transferring training conducted on a specific dataset to other datasets. However, several studies exist that extend summarization models to more generalized settings. Ref. [35] detects out-domain samples among the inputs by analyzing the distance features of a pre-trained model. Ref. [36] proposed a learning method for preference models in a low-resource summarization environment. Ref. [37] presented an experimental study of various domain adaptation methods for abstractive summarization, showed that the gaps in results depend on resource availability, and suggested the need for higher-level domain adaptation methods for abstractive summarization tasks.

3.2. Curriculum Learning in NLP

Several studies have incorporated curriculum learning into NLP procedures to enhance performance. Ref. [38] applies curriculum learning to question answering, an NLP task that chooses, from a set of candidates, the answer most relevant to a given passage, and ensembles the results of curriculum learning. Ref. [39] performs data-level curriculum learning by choosing one answer among the generation results during training, which can be related to the reference sentence length. Ref. [40] applied data-level curriculum learning by increasing the length of correct answers from the entire training dataset while training machine-translation models.
Conversely, there are approaches that use curriculum learning for training efficiency. Ref. [41] guided Transformer-based models with curriculum learning by weighting candidates to create an efficient learning procedure. Ref. [42] adopted curriculum learning to improve a prompting model's handling of perturbation, with the curriculum strategy following the degree of labeled data.

4. Proposed Method

We propose curriculum learning using cross-domain summarization datasets to alleviate the performance degradation of generative summarization models on out-domain data.

Overall Architecture

In this study, we use a cross-domain summarization dataset as training data for generative summarization models. The cross-domain summarization dataset is composed of mixed datasets from various domains. By combining this cross-domain dataset with two-stage curriculum learning, we expect to avoid the performance drop in out-domain scenarios. Figure 1 illustrates the entire training process of the proposed method, which applies two-stage curriculum learning at both the dataset level and the model level.
In the first stage of curriculum learning on the cross-domain summarization dataset, we use a dataset-level curriculum-learning process that considers the difficulty between domains. Instead of the data-level method of prior works, the dataset-level curriculum determines the difficulty of each domain, i.e., the inter-domain difficulty. Section 5 describes how the cross-domain dataset difficulty is determined using statistical metrics. At each training step of the dataset-level process, the data samples of the current step are accumulated with those of the previous steps and passed to the next step to train the model.
In the second stage of curriculum learning on the cross-domain summarization dataset, we incorporate a model-level curriculum-learning process that considers intra-domain difficulty within each step's accumulated data samples. Although dataset-level curriculum learning can reflect the difficulty level between domains, it may not adequately address the variability in difficulty within each domain at the level of individual data samples. We therefore use model-level curriculum learning to adjust the amount of learning between data samples during the repetitive training process, focusing on easier examples within the accumulated samples. Our method uses SuperLoss [43] as the model-level strategy to prevent edge samples from over-influencing and interfering with the model training process. SuperLoss [43] is a curriculum-learning technique that appends an extra loss term on top of the task loss. We apply SuperLoss as a breakwater for stable learning under the harsh training conditions of a cross-domain dataset containing texts from various sources. In our process, SuperLoss helps the model down-weight difficult samples within the mixed data accumulated via dataset-level curriculum learning.
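For reference, the following sketch shows how a SuperLoss-style wrapper can be placed on top of a per-sample summarization loss, following the closed-form confidence from [43]; it is our own minimal re-implementation, with τ assumed to be a running average of the task loss and λ corresponding to the weight parameter used in Section 6.

```python
import numpy as np
import torch
from scipy.special import lambertw

def superloss(task_loss: torch.Tensor, tau: float, lam: float = 10.0) -> torch.Tensor:
    """task_loss: per-sample losses; tau: difficulty threshold (e.g., a running
    average of the task loss); lam: regularization strength (the λ of Section 6)."""
    beta = (task_loss.detach().cpu().numpy() - tau) / lam
    # Closed-form optimal confidence: sigma* = exp(-W(0.5 * max(-2/e, beta)))
    sigma = np.exp(-lambertw(0.5 * np.maximum(-2.0 / np.e, beta)).real)
    sigma = torch.as_tensor(sigma, dtype=task_loss.dtype, device=task_loss.device)
    # Easy samples (loss below tau) get confidence > 1; hard samples get confidence < 1.
    return ((task_loss - tau) * sigma + lam * torch.log(sigma) ** 2).mean()

per_sample_nll = torch.tensor([2.1, 0.7, 5.3], requires_grad=True)  # toy per-sample losses
print(superloss(per_sample_nll, tau=2.0).item())
```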

5. Dataset-Level Curriculum Learning

In this study, a cross-domain dataset $D = \{d_1, d_2, \ldots, d_N\}$ was sorted based on difficulty level for curriculum learning. Here, $d$ represents a single-domain dataset, and $D$ is assumed to be sorted according to difficulty level; in other words, if $i < j$, $d_i$ is considered easier than $d_j$. The difficulty level of a single-domain summarization dataset $d_i = \{(\mathrm{doc}_1^i, \mathrm{ref}_1^i), (\mathrm{doc}_2^i, \mathrm{ref}_2^i), \ldots, (\mathrm{doc}_n^i, \mathrm{ref}_n^i)\}$ is determined as follows, where $n$ is the number of data points in the single-domain summarization dataset $d_i$, and $\mathrm{doc}$ and $\mathrm{ref}$ denote the body text and the reference summary, respectively.
$$ S(\mathrm{doc}_j^i, \mathrm{ref}_j^i) = \alpha \times \mathrm{ratioLen}(\mathrm{doc}_j^i, \mathrm{ref}_j^i) - \beta \times R(\mathrm{doc}_j^i, \mathrm{ref}_j^i) + \gamma \times \mathrm{stdDev}(d_i) \quad (1) $$
$$ \mathrm{score}(d_i) = \frac{1}{n} \sum_{j=1}^{n} S(\mathrm{doc}_j^i, \mathrm{ref}_j^i) \quad (2) $$
Equation (1) evaluates the difficulty level of the $j$-th data point in $d_i$; in this study, we assume that a smaller value indicates an easier sample. To determine the difficulty level of dataset $d_i$, the difficulty level is measured for each data point in $d_i$, and Equation (2) defines the learning difficulty $\mathrm{score}(d_i)$ of the single-domain dataset $d_i$ as the average of these per-data-point difficulties.
In Equation (1), the difficulty measurement for one data point is based on three scores: $\mathrm{ratioLen}$, $R$, and $\mathrm{stdDev}$. $\mathrm{ratioLen}$ is a score based on the ratio of the length of the reference summary to that of the source document. We assume that summarizing a long document into a much shorter sentence increases the difficulty; therefore, the shorter the reference summary is relative to the source document, the more difficult the sample is considered. $\mathrm{ratioLen}(\mathrm{doc}_j^i, \mathrm{ref}_j^i)$ is calculated by subtracting the token-count ratio $\mathrm{ref}_j^i / \mathrm{doc}_j^i$ from unity, so a value close to zero means that the source document and the reference summary are similar in length and the sample is considered easier. The formula is as follows, where $\sigma_i$ denotes the standard deviation of $\mathrm{ratioLen}$ over the single-domain summarization dataset $d_i$:
$$ \mathrm{ratioLen}(\mathrm{doc}_j^i, \mathrm{ref}_j^i) = \left| 1 - \frac{\mathrm{length}(\mathrm{ref}_j^i)}{\mathrm{length}(\mathrm{doc}_j^i)} \right| \quad (3) $$
$R$ represents the ROUGE score measured between the source document and the reference summary. ROUGE is an evaluation metric that measures the similarity between sentences, and we assume that a lower similarity between the source document and the reference summary indicates a higher difficulty level. Because ROUGE scores increase as sentences become more similar, the ROUGE term enters the weighted sum in Equation (1) with a negative sign, so that a higher $R$ (higher similarity) yields a lower, i.e., easier, difficulty score.
$$ R(\mathrm{doc}_j^i, \mathrm{ref}_j^i) = \mathrm{ROUGE}(\mathrm{doc}_j^i, \mathrm{ref}_j^i) \quad (4) $$
$\mathrm{stdDev}$ is the standard deviation of $\mathrm{ratioLen}$ over the single-domain dataset $d_i$, divided by the number of data points $n$. We regard inconsistent length ratios among the data within a dataset as a lack of coherence; consequently, datasets with lower consistency are presumed to be more difficult to learn, and datasets with smaller standard deviation values are considered easier because the lengths of their data are more consistent. Unlike the previous two scores ($\mathrm{ratioLen}$ and $R$), the standard deviation is a dataset-level statistic; however, for ease of weight adjustment in the weighted sum, it is normalized by the number of data points and applied as a per-data-point term. The formula for this process is as follows:
$$ \mathrm{stdDev}(d_i) = \frac{\sigma_i}{n} \quad (5) $$
To measure the final difficulty of a single dataset, the weighted sum of the three scores is computed for each data point, as shown in Equation (1), and averaged over the dataset, as shown in Equation (2). The weights of the scores are denoted by $\alpha$, $\beta$, and $\gamma$; in the experiments, they were set to $\alpha = 0.5$, $\beta = 0.1$, and $\gamma = 0.4$.
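The sketch below is our illustrative implementation of Equations (1)–(5); it assumes word-level lengths and ROUGE-1 F1 for $R$, which the paper does not pin down, and the weights follow the values stated above.

```python
import statistics
from rouge_score import rouge_scorer

ALPHA, BETA, GAMMA = 0.5, 0.1, 0.4
_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def ratio_len(doc: str, ref: str) -> float:
    # Eq. (3): how far the summary/document length ratio is from unity (word-level lengths).
    return abs(1 - len(ref.split()) / max(len(doc.split()), 1))

def rouge_sim(doc: str, ref: str) -> float:
    # Eq. (4): similarity between the body text and its reference summary.
    return _scorer.score(doc, ref)["rouge1"].fmeasure

def dataset_difficulty(pairs) -> float:
    """pairs: list of (doc, ref) tuples for one single-domain dataset d_i."""
    ratios = [ratio_len(doc, ref) for doc, ref in pairs]
    std_dev = statistics.pstdev(ratios) / len(pairs)                        # Eq. (5): sigma_i / n
    per_sample = [ALPHA * r - BETA * rouge_sim(doc, ref) + GAMMA * std_dev  # Eq. (1)
                  for (doc, ref), r in zip(pairs, ratios)]
    return sum(per_sample) / len(per_sample)                                # Eq. (2): score(d_i)

# Single-domain datasets are then sorted by dataset_difficulty(), from easy (small) to hard (large).
```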
For dataset-level curriculum learning, the datasets were ordered by the measured difficulty, denoted as $D = \{d_1, d_2, \ldots, d_N\}$; if $i < j$, $d_i$ is evaluated as easier than $d_j$. The number of learning steps in curriculum learning equals the number of single-domain datasets, $N$. During the $N$ learning steps, only the $i$ easiest datasets from the entire collection are utilized in the $i$-th learning step; in other words, the $i$-th learning step trains on $D_1^i = \{d_1, \ldots, d_i\}$, as sketched below.
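Building on the scoring sketch above (it reuses the illustrative dataset_difficulty helper), the following snippet shows the dataset-level schedule: at step i, the i easiest single-domain datasets are accumulated into the training mix, and the model-level SuperLoss curriculum of Section 4 is then applied inside each step's training loop (omitted here).

```python
def curriculum_steps(domains):
    """domains: list of single-domain datasets, each a list of (doc, ref) pairs.
    Yields (step index, accumulated training mix D_1..i), ordered easy to hard."""
    ordered = sorted(domains, key=dataset_difficulty)                        # uses Eq. (2) scores
    for i in range(1, len(ordered) + 1):
        accumulated = [pair for dataset in ordered[:i] for pair in dataset]  # D_1^i
        yield i, accumulated  # at step i, fine-tune on this mix with SuperLoss

toy_domains = [
    [("a short document about cats sitting on mats", "cats on mats")],
    [("a much longer and more detailed report describing today's unusual weather", "weather report")],
]
for step, samples in curriculum_steps(toy_domains):
    print(f"step {step}: {len(samples)} training samples")
```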

6. Experimental Section

In this section, we present the experimental setup, results, and an analysis of the experiments conducted to evaluate the performance of the proposed methodology.

6.1. Model

In the experiments, PEGASUS [11] was used as the backbone model for abstractive summarization. PEGASUS [11] is pre-trained on a mix of news and web-crawled data in proportions suited to the data size [2,4]; it has been fine-tuned on all of the document summarization datasets used in our experiments and shows excellent summarization performance. To handle long summarization documents as input context, we use the 'pegasus-large' version of [11] as the experiment baseline and truncate input documents exceeding 1024 tokens, the maximum input length of the baseline model.
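A minimal sketch of this baseline setup with the public Hugging Face checkpoint 'google/pegasus-large'; the truncation length and generation bounds follow the values reported in this section and in Table 2, while everything else is illustrative.

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")

document = "..."  # a long source document from one of the summarization datasets
# Inputs longer than 1024 tokens are truncated to the model's maximum input length.
inputs = tokenizer(document, truncation=True, max_length=1024, return_tensors="pt")
summary_ids = model.generate(**inputs, min_length=11, max_length=128, length_penalty=1.0)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```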

6.2. Datasets

In this study, we selected six representative datasets commonly used in abstractive summarization for experimentation. Table 1 provides the domain and summary relationship information for the selected datasets.
WikiHow [44] is an extractive summarization dataset built from the question-and-answer website WikiHow, in which a response is used as the source text and its topic is marked as the correct answer. ArXiv [8] uses the titles and abstracts of academic papers from the scholarly archive ArXiv as the source for abstractive summarization. CNN/DM [5] is an extractive summarization dataset that uses news articles from CNN and the headlines of each paragraph. XSum [6] is an abstractive summarization dataset of BBC news articles with abstract-like summaries, which are more abstractive than those of CNN/DM. RedditTIFU [45] is a summarization dataset extracted from posts in the English-speaking online community Reddit; as in [46], the community convention of briefly summarizing a post with the keyword 'TLDR' at its end was utilized to obtain abstractive summaries. SAMSUM [9] is a 1:1 messenger-conversation summarization dataset composed of conversation logs and their summaries.
For the experiments in this study, we divided the six datasets [5,6,8,9,44,45] into training and evaluation datasets. Four datasets, namely, WikiHow [44], ArXiv [8], XSum [6], and RedditTIFU [45], were used for training, and a subset of these four single-domain datasets was sampled to construct the proposed cross-domain dataset. CNN/DM [5] and SAMSUM [9] were used for evaluation. By using different datasets for training and evaluation, we demonstrate the effectiveness of the proposed methodology in out-domain scenarios. Following the divide-and-conquer approach described in [47], we constructed the training dataset from the first paragraph of each document and its reference summary across the six summarization datasets.
Owing to the data imbalance across datasets, and to balance efficiency with data diversity, we randomly sampled 10,000 samples from the training split of each training dataset. Similarly, for evaluation, we selected 1000 samples from the test set of each single-domain evaluation dataset.
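The sampling step could look like the following sketch with the Hugging Face datasets library; 'xsum' is shown only as an example dataset id (the test samples actually come from the evaluation datasets), and the random seed is our assumption.

```python
from datasets import load_dataset

xsum = load_dataset("xsum")  # example single-domain dataset
train_subset = xsum["train"].shuffle(seed=42).select(range(10_000))
test_subset = xsum["test"].shuffle(seed=42).select(range(1_000))
print(len(train_subset), len(test_subset))
```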

6.3. Experiment Setting

We measure baseline performance on out-domain data according to the fine-tuning method and training data used. The hyperparameters used during training are listed in Table 2. The experiments use the PEGASUS [11] model through Huggingface, with the 'pegasus-large' checkpoint to allow for the long input contexts of the experimental datasets. The experiment was run on two RTX3090 GPUs for 3 h over 3 epochs, and the model was trained as long as noticeable performance changes were observed. Following the other hyperparameter settings employed in [34], we set the learning rate to 2 × 10⁻², the maximum generation length to 128, the minimum generation length to 11, and the length penalty to 1.0. The SuperLoss weight parameter λ was set to 10.

6.4. Experimental Result and Analysis

In this experiment, performance in out-domain scenarios was examined by comparing the results of the proposed methodology with those of models fine-tuned on a single dataset. Table 3 shows the backbone model [11] performance on the test data according to the fine-tuning process, and Table 4 and Table 5 show the out-domain performance of the model for the various training datasets. A sample output of the model trained using the proposed methodology is available in Table A1 in Appendix A.
In Table 3, the drop in backbone model [11] performance averaged 65%, which shows that the original model is vulnerable to out-domain conditions, that is, testing on data unseen during training. It also shows that the test data used in this paper are harder than the training datasets, as seen from the additional 31% or 23% performance decrease on the unseen test data.
Table 4 and Table 5 both show that out-domain performance increases significantly when model-level curriculum learning is used. However, the RedditTIFU dataset exhibits a smaller performance drop than the other training datasets even without model-level curriculum learning. This smaller drop is likely due to the nature of RedditTIFU [45], which consists of conversational summaries from an online community and is therefore similar to the dialogue summarization dataset SAMSUM [9]; RedditTIFU can thus positively influence inference in similar out-domain settings such as SAMSUM. In summary, the performance degradation observed without model-level curriculum learning indicates that model-level curriculum learning acts as a performance buffer for model inference on unseen domains.
In Table 4, the model trained with the proposed dataset generally exhibits higher performance than the model trained on XSum, which performs worse than the other training datasets even with model-level curriculum learning. Notably, the model trained with the proposed dataset achieves a ROUGE-L score of 22.96, the highest among the models trained with model-level curriculum learning. However, in Table 5, the model trained with the proposed dataset shows slightly lower performance than the other training datasets except XSum. This is remarkable because XSum [6] comes from the same news domain as CNN/DM [5], just as RedditTIFU is conversational like SAMSUM in Table 4. It appears that using only XSum may cause overfitting to the training data, whereas mixed-domain data such as our proposed dataset prevent the model from overfitting.

7. Conclusions

In this study, we conducted performance evaluations on in- and out-domain datasets to identify fine-tuning methods for generative summarization models that are applicable across various domains. Applying the two proposed curriculum-learning methods mitigated the drop in performance on out-domain evaluation data during fine-tuning. We discovered that a conversational training dataset can act as a buffer for conversational test data, even though the test domain is not used during training. Implementing model-level curriculum learning prevented a significant drop in out-domain performance compared to the models that were not fine-tuned. Utilizing mixed samples with data-level curriculum learning ensured smaller performance variation on the out-domain evaluation datasets [5,9] across the different training data samples used in the experiments.
In future work, we plan to refine the criteria for evaluating learning difficulty and expand the variety and quantity of the experimental datasets to verify the generalizability of the proposed methods. In this study, we sampled the training and validation datasets for efficiency in the experimental process; although the sampling was random, sampled data can yield performance effects that differ from using all available data. Therefore, we intend to use the full datasets to directly compare performance with other studies. In addition to ROUGE, we will incorporate semantic similarity-based evaluation metrics and unsupervised evaluation techniques to assess the generalizability of the proposed methods from multiple perspectives.

Author Contributions

Methodology, S.L.; software, S.L.; formal analysis, S.L. and G.K.; validation, S.L.; writing—original draft preparation/review and editing, S.L. and G.K.; supervision/project administration/funding acquisition, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Government of Korea (MSIT) (No. 2022R1A2C1005316) and in part by the Gachon University research fund of 2020 (GCU-202008450001).

Data Availability Statement

A publicly available dataset was utilized in this study. These data can be found here: https://github.com/teddy309/CL_abSum (accessed on 1 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Summarization Result on CNN/DM

In Table A1, the baseline model tends to generate a summary that is too short and extractive compared to the reference summary (Gold), whereas the model trained with our method produces a summary containing roughly as much information as the reference.
Table A1. Example of summary in CNN/DM. Example output of summaries generated by the baseline and the proposed method. The red parts of the generated examples are factually inconsistent. A factually inconsistent summary was generated both by PEGASUS and by the model trained with the cross-domain dataset (Ours). Lorraine Valentine is never literally called a 'real-life vampire', and it is not certain that she is a mother of four children.
Model | Example
PEGASUS (base) | a woman who has a rare medical condition that makes her allergic to the sun has been named as the 'real life vampire'.
Ours | Lorraine Valentine suffers from the rare disease erythropoietic protoporphyria (EPP) which causes her skin to burn and become itchy when exposed to sunlight. The 42-year-old went on a holiday to Lanzarote last June but was in hospital for six days as her skin swelled up. The mother-of-four has to keep herself completely covered up in summer as even a small dose of sun leaves her in pain. She was first diagnosed with the incurable condition when she was nine years old.
Gold | Lorraine Valentine, 42, suffers from erythropoietic protoporphyria (EPP) Rare condition means she burns, itches and swells in sunlight or UV light. Was hospitalised for 6 days with burns after a family holiday in Lanzarote. Has to cover herself completely as small amounts of light leave her in pain.

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Nice, France, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  2. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1532–4435. [Google Scholar]
  3. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  4. Grusky, M.; Naaman, M.; Artzi, Y. Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; (Long Papers). Walker, M., Ji, H., Stent, A., Eds.; Volume 1, pp. 708–719. [Google Scholar] [CrossRef]
  5. Hermann, K.M.; Kočiský, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching Machines to Read and Comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems—Volume 1, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  6. Narayan, S.; Cohen, S.B.; Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Eds.; pp. 1797–1807. [Google Scholar] [CrossRef]
  7. See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Long Papers. Volume 1, pp. 1073–1083. [Google Scholar] [CrossRef]
  8. Cohan, A.; Dernoncourt, F.; Kim, D.S.; Bui, T.; Kim, S.; Chang, W.; Goharian, N. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; (Short Papers). Volume 2. [Google Scholar]
  9. Gliwa, B.; Mochol, I.; Biesek, M.; Wawer, A. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, Hong Kong, China, 4 November 2019. [Google Scholar]
  10. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; pp. 7871–7880. [Google Scholar] [CrossRef]
  11. Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; (Proceedings of Machine Learning Research). Volume 119. [Google Scholar]
  12. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum Learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009. [Google Scholar]
  13. Gu, J.; Lu, Z.; Li, H.; Li, V.O. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Long Papers. Erk, K., Smith, N.A., Eds.; Volume 1, pp. 1631–1640. [Google Scholar] [CrossRef]
  14. Kim, T.Y.; Kim, J.; Kang, H.W.; Kim, S.B.; Kang, P.S. Building an integrated framework of Korean text summarization and voice synthesis. Ind. Eng. Manag. Syst. 2022, 48, 80–90. [Google Scholar]
  15. Berry, M.W.; Dumais, S.T.; O’Brien, G.W. Using Linear Algebra for Intelligent Information Retrieval. SIAM Rev. 1995, 37, 573–595. [Google Scholar] [CrossRef]
  16. Gudivada, V.N. Chapter 12—Natural Language Core Tasks and Applications. In Handbook of Statistics; Gudivada, V.N., Ed.; Elsevier: Amsterdam, The Netherlands, 2018; pp. 403–428. [Google Scholar] [CrossRef]
  17. Zhu, C. Chapter 8—Applications and future of machine reading comprehension. In Machine Reading Comprehension; Zhu, C., Ed.; Elsevier: Amsterdam, The Netherlands, 2021; pp. 185–207. [Google Scholar] [CrossRef]
  18. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
  19. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33 , 17283–17297. [Google Scholar]
  20. Kim, D.H.; Lee, S.W.; Lee, G.G.B. Query-Based Document Summarization using Important Sentence Selection Heuristics and MMR. In Proceedings of the Annual Conference on Human and Language Technology, Human and Language Technology, Cheongju-si, Republic of Korea, 11–12 October 2002; pp. 285–291. [Google Scholar]
  21. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004. [Google Scholar]
  22. Kryscinski, W.; McCann, B.; Xiong, C.; Socher, R. Evaluating the Factual Consistency of Abstractive Text Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 9332–9346. [Google Scholar] [CrossRef]
  23. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; (Long and Short Papers). Volume 1, pp. 4171–4186. [Google Scholar]
  25. Harabagiu, S.M.; Lacatusu, F. Generating single and multi-document summaries with gistexter. In Proceedings of the Document Understanding Conferences, Philadephia, PA, USA, 11–12 July 2002; pp. 11–12. [Google Scholar]
  26. Givchi, A.; Ramezani, R.; Baraani-Dastjerdi, A. Graph-based abstractive biomedical text summarization. J. Biomed. Inform. 2022, 132, 104099. [Google Scholar] [CrossRef] [PubMed]
  27. Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45. [Google Scholar] [CrossRef]
  28. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  29. Zhang, Y.; Li, D.; Wang, Y.; Fang, Y.; Xiao, W. Abstract Text Summarization with a Convolutional Seq2seq Model. Appl. Sci. 2019, 9, 1665. [Google Scholar] [CrossRef]
  30. Jiang, L.; Meng, D.; Zhao, Q.; Shan, S.; Hauptmann, A.G. Self-Paced Curriculum Learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 29. [Google Scholar]
  31. Kumar, M.P.; Turki, H.; Preston, D.; Koller, D. Learning specific-class segmentation from diverse data. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011. [Google Scholar] [CrossRef]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  33. Liu, Y.; Liu, P. SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 2–4 August 2021; Short Papers. Volume 2. [Google Scholar]
  34. Liu, Y.; Liu, P.; Radev, D.; Neubig, G. BRIO: Bringing Order to Abstractive Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Long Papers. Volume 1. [Google Scholar]
  35. Xu, K.; Ren, T.; Zhang, S.; Feng, Y.; Xiong, C. Unsupervised Out-of-Domain Detection via Pre-trained Transformers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; Long Papers. Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Volume 1, pp. 1052–1061. [Google Scholar] [CrossRef]
  36. Chen, Y.S.; Song, Y.Z.; Shuai, H.H. SPEC: Summary Preference Decomposition for Low-Resource Abstractive Summarization. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 603–618. [Google Scholar] [CrossRef]
  37. Yu, T.; Liu, Z.; Fung, P. AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; pp. 5892–5904. [Google Scholar] [CrossRef]
  38. Sachan, M.; Xing, E. Easy Questions First? A Case Study on Curriculum Learning for Question Answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Long Papers. Volume 1, pp. 453–463. [Google Scholar] [CrossRef]
  39. Subramanian, S.; Rajeswar, S.; Dutil, F.; Pal, C.; Courville, A. Adversarial Generation of Natural Language. In Proceedings of the 2nd Workshop on Representation Learning for NLP), Vancouver, BC, Canada, 3 August 2017; pp. 241–251. [Google Scholar] [CrossRef]
  40. Liu, C.; He, S.; Liu, K.; Zhao, J. Curriculum Learning for Natural Answer Generation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, (IJCAI-18)), International Joint Conferences on Artificial Intelligence Organization, Stockholm, Sweden, 13–19 July 2018; pp. 4223–4229. [Google Scholar] [CrossRef]
  41. Sotudeh, S.; Goharian, N.; Deilamsalehy, H.; Dernoncourt, F. Curriculum-guided Abstractive Summarization for Mental Health Online Posts. In Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI), Abu Dhabi, United Arab Emirates, 7 December 2022; Lavelli, A., Holderness, E., Jimeno Yepes, A., Minard, A.L., Pustejovsky, J., Rinaldi, F., Eds.; pp. 148–153. [Google Scholar] [CrossRef]
  42. Li, C.; Wang, L.; Lin, X.; de Melo, G.; He, L. Curriculum Prompt Learning with Self-Training for Abstractive Dialogue Summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; pp. 1096–1106. [Google Scholar] [CrossRef]
  43. Castells, T.; Weinzaepfel, P.; Revaud, J. SuperLoss: A Generic Loss for Robust Curriculum Learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33. [Google Scholar]
  44. Koupaee, M.; Wang, W.Y. WikiHow: A Large Scale Text Summarization Dataset. arXiv 2018, arXiv:1810.09305. [Google Scholar]
  45. Kim, B.; Kim, H.; Kim, G. Abstractive Summarization of Reddit Posts with Multi-level Memory Networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; (Long and Short Papers). Volume 1. [Google Scholar]
  46. Cachola, I.; Lo, K.; Cohan, A.; Weld, D.S. TLDR: Extreme summarization of scientific documents. arXiv 2020, arXiv:2004.15011. [Google Scholar]
  47. Gidiotis, A.; Tsoumakas, G. A Divide-and-Conquer Approach to the Summarization of Long Documents. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 3029–3040. [Google Scholar] [CrossRef]
Figure 1. Overall Architecture of Proposed Method.
Table 1. Abstractive Summarization Dataset Description.

Dataset | Domain | Doc/Ref Policy
WikiHow | Web Question Answering | Answer/Topic
ArXiv | Research Paper (ArXiv) | Abstract/Title
XSum | News Article (BBC News) | Article/Topic
RedditTIFU | Online Community (Reddit) | Article/TLDR
SAMSUM | Dialogue (1:1 Messenger) | History/Summary
CNN/DM | News Article (CNN News) | Article/Headline
Table 2. Hyperparameters of Experiment.

Hyperparameter | Value
Epoch | 3
Batch size | 1
Learning rate | 2 × 10⁻²
SuperLoss weight λ | 10
Generation length (min/max) | 11 / 64
Length penalty | 1.0
Table 3. Model in-domain performance according to fine-tuning. This table shows the model performance transition with in-domain fine-tuning. FT indicates that the model is fine-tuned on the training data. This table shows performance on test data without applying the proposed methodology (model-level and dataset-level curriculum). The performance gap between the fine-tuned and non-fine-tuned (non-FT) models and the original version [11] may be affected by the low-resource setting used during the experiment.

FT | Train -> Test | R-1 | R-2 | R-L
O | ArXiv -> ArXiv | 44.21 | 16.95 | 25.67
O | WikiHow -> WikiHow | 46.39 | 22.12 | 38.41
O | RedditTIFU -> RedditTIFU | 27.99 | 09.81 | 22.94
O | XSum -> XSum | 47.60 | 24.83 | 39.64
X (non-FT) | -> ArXiv | 18.25 | 05.84 | 15.55
X (non-FT) | -> WikiHow | 17.79 | 05.80 | 16.48
X (non-FT) | -> RedditTIFU | 12.18 | 01.49 | 10.45
X (non-FT) | -> XSum | 27.14 | 07.76 | 21.51
Table 4. Results of Experiments on SAMSUM. This table shows model performance on SAMSUM with out-domain test data. DL and SPL indicate whether dataset-level or model-level curriculum is applied at each row.

DL | Train -> Test | SPL | R-1 | R-2 | R-L
X | ArXiv -> SAMSUM | X | 06.51 | 00.61 | 05.80
X | WikiHow -> SAMSUM | X | 12.86 | 02.20 | 11.95
X | RedditTIFU -> SAMSUM | X | 21.02 | 06.32 | 19.15
X | XSum -> SAMSUM | X | 14.37 | 02.25 | 12.39
O | Proposed dataset -> SAMSUM | X | 13.65 | 02.51 | 13.57
X | ArXiv -> SAMSUM | O | 25.09 | 07.32 | 22.79
X | WikiHow -> SAMSUM | O | 25.31 | 07.44 | 22.88
X | RedditTIFU -> SAMSUM | O | 25.33 | 07.34 | 22.90
X | XSum -> SAMSUM | O | 24.79 | 07.46 | 22.38
O | Proposed dataset -> SAMSUM | O | 25.25 | 07.32 | 22.96
Table 5. Result of Experiments on CNN/DM. This table shows model performance on CNN/DM with out-domain test data. DL and SPL indicate whether dataset-level or model-level curriculum is applied at each row.

DL | Train -> Test | SPL | R-1 | R-2 | R-L
X | ArXiv -> CNN/DM | X | 10.16 | 01.98 | 08.87
X | WikiHow -> CNN/DM | X | 13.28 | 02.26 | 11.30
X | RedditTIFU -> CNN/DM | X | 19.76 | 07.99 | 16.91
X | XSum -> CNN/DM | X | 15.74 | 03.27 | 13.03
O | Proposed dataset -> CNN/DM | X | 12.86 | 05.03 | 14.56
X | ArXiv -> CNN/DM | O | 25.24 | 10.65 | 21.22
X | WikiHow -> CNN/DM | O | 25.28 | 10.62 | 21.05
X | RedditTIFU -> CNN/DM | O | 25.27 | 10.66 | 21.07
X | XSum -> CNN/DM | O | 21.44 | 07.57 | 18.06
O | Proposed dataset -> CNN/DM | O | 24.03 | 09.58 | 20.09
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
