1. Introduction
Cross-lingual summarization (CLS) generates a shorter summary in a target language (e.g., Chinese) from a lengthy document in a source language (e.g., English). CLS enables non-native speakers to access summarized information efficiently, facilitating the exchange of information in a globalized world [1]. However, CLS poses significant challenges, as it requires performing translation and summarization simultaneously [2].
Early approaches [3,4] typically employed a pipeline of either “summarize-then-translate” or “translate-then-summarize”, both of which are prone to error propagation. To address these issues, Zhu et al. [2] proposed an end-to-end CLS training method. Additionally, they created two benchmark datasets in which the cross-lingual summaries are translations of standard monolingual summaries, as illustrated in Figure 1.
Several multi-task approaches [1,2,5] leverage the relationship between the cross-lingual summary and its parallel monolingual summary to improve performance. These approaches can be categorized into two groups. The first group [2,5], depicted in Figure 2a, uses two independent decoders responsible for the CLS and monolingual summarization (MS) tasks, respectively. However, the two decoders, with non-shared parameters, limit the model’s ability to effectively align the two tasks. The second group [1], shown in Figure 2b, employs two separate models to handle the MS and CLS tasks, learning the output features of the MS teacher through knowledge distillation. However, its performance is affected by discrepancies between hidden representations in different linguistic vector spaces. Both groups extract the MS training pair (i.e., the source document and monolingual summary) to combine with the CLS data, and thus face obstacles caused by the mismatch between the output languages.
To avoid the issues discussed above, we propose a unified training method for the cross-lingual summarization (CLS) task. As shown in Figure 2c, our method combines the machine translation (MT) training pair (i.e., the monolingual summary and the cross-lingual summary) with the CLS data to ensure consistency between input and output languages. It is inspired by [6], which views the CLS task as a translation sub-task with longer inputs. We therefore unify the two tasks into one task with inputs of different lengths and jointly train them within a single model. This design allows the model parameters and acquired knowledge to be shared. Additionally, the shared linguistic vector space enables the alignment of the MT and CLS outputs without transforming their hidden representations, so we design two alignments, at the probability level and the feature level, to further enhance the interaction between the parallel MT and CLS data. In summary, by unifying the linguistic vector space and enhancing the semantic correlation between the two tasks at two levels, the model’s alignment and compression capabilities for CLS are improved.
To evaluate our method, we conduct extensive experiments on two benchmark datasets under both full-dataset and low-resource scenarios. The results demonstrate that our method outperforms previous CLS methods in most cases without requiring additional data. Additionally, ablation studies validate the effectiveness of each design element. In summary, our key contributions are as follows:
We propose a unified training method for CLS that learns parallel MT and CLS pairs within a single model, which is a novel and efficient integration mode for CLS data.
We design two levels of alignment between the parallel outputs to encourage the model to focus on key information from the lengthy input, thereby improving its summarization capability.
Extensive experimental results conducted in various scenarios demonstrate the superiority of our method. Ablation studies and visualization results further corroborate this conclusion.
2. Related Work
The early methods [3,4] for CLS primarily relied on a pipeline strategy encompassing two phases: translation and summarization. These methods typically followed a sequence of “translate-then-summarize” or “summarize-then-translate”. For instance, the core idea of “translate-then-summarize” was to use existing machine translation systems to convert the document into the target language, and then apply text summarization techniques to generate summaries. As a result, these methods were prone to error propagation, where errors of the first phase could carry over into the second phase. Additionally, they failed to effectively establish semantic connections between languages, as the two phases were optimized independently.
In recent years, Zhu et al. [2] proposed applying end-to-end methods to the CLS task, achieving significant performance improvements. Furthermore, they created two benchmark datasets and introduced two heuristic methods for combining parallel monolingual summaries or additional machine translation data with CLS data, which have promoted subsequent CLS studies [7]. These approaches include multi-task learning [6,8,9,10], knowledge distillation [11,12], resource enhancement [13,14,15,16,17], pre-training frameworks [18,19,20], and multilingual training [21,22,23,24], among others.
In the realm of multi-task learning, Cao et al. [8] focus on jointly learning to align and summarize in cross-lingual summarization. They introduce a multi-task learning framework that integrates monolingual and cross-lingual summarization models into a unified model. This approach involves constructing linear mappings to project context representations from one language to another and designing several specific loss functions to facilitate this learning process. Bai et al. [6] improve upon traditional multi-task learning methods by designing a compression rate model, introducing the compression rate as a new parameter to control the amount of information retained. Takase et al. [10] train neural encoder–decoder models using genuine and pseudo cross-lingual summarization data, as well as monolingual summarization and translation data, employing special tokens attached at the beginning of input sentences to designate the target task. This enables the direct integration of different data types into the training process without additional architectural changes. Their method aims to enhance the quality and effectiveness of cross-lingual summarization by leveraging synergies between translation and summarization tasks, thereby improving performance over methods that use only pseudo data or separate models for each task.
With the development of pre-trained models, numerous works have achieved notable results in cross-lingual and even multilingual scenarios. Xu et al. [18] utilized a Transformer-based encoder–decoder architecture for mixed-lingual pre-training, leveraging a large amount of unlabeled monolingual data to enhance the model’s language understanding capabilities and improve its ability to handle language translation and summarization. Chi et al. [19] proposed the MT6 model, based on a multi-task pre-training approach, using cross-lingual parallel corpora and monolingual data and integrating these resources with various multi-task training objectives.
However, most of the aforementioned methods [1,6,16,21] require additional data or knowledge, while a few studies [5,9,11] focus only on the given CLS data. MCLAS [9] utilized a single decoder to first generate the monolingual summary, and then produce the cross-lingual summary by aligning both the source document and the monolingual summary within the same decoding process. Nguyen et al. [11] combined knowledge distillation with Sinkhorn divergence [25] to improve the model’s performance in handling languages with significant differences in grammatical structure and lexical morphology. Zhang et al. [5] proposed a two-stage framework where the monolingual summary model is pre-trained with the given small-scale monolingual data in the first stage; subsequently, another decoder is introduced to train the CLS and MS tasks simultaneously. However, these methods suffer from a mismatch in output languages, which prevents CLS from effectively aligning the parallel data.
Unlike these methods, our model innovatively unifies the MT and CLS tasks by using the same encoder and decoder to process both input and output texts in the same language. This design ensures efficient parameter sharing and strengthens the model’s translation capability. By comparing and aligning the outputs of the parallel training pairs, our model more accurately captures and retains key source information, significantly enhancing the accuracy of the generated summaries based on the given small-scale CLS data.
4. Methodology
As shown in Figure 4, the parallel MT and CLS training pairs are simultaneously trained within a single model. The unified training allows the model to leverage the strengths of both tasks. In addition to learning from the gold reference, the outputs of the two tasks are aligned at both the probability and feature levels. This alignment ensures that the model learns to map similar inputs to similar outputs, regardless of the task. By doing so, the model can significantly improve its cross-lingual summarization capability by effectively utilizing the parallel translation pairs.
4.1. Unified Training
We unify the MT and CLS tasks into one task and train them in parallel in a single model. Specifically, given the MT sample (S, Y) and the CLS sample (X, Y), where X is the source document, S is the monolingual summary, and Y is the cross-lingual summary, the encoder encodes each input into hidden representations, and the decoder then generates the outputs $\hat{Y}^{mt}$ and $\hat{Y}^{cls}$. As in a standard text generation task, we apply the cross-entropy (CE) loss to these outputs:

$$\mathcal{L}_{CE} = -\sum_{t=1}^{|Y|} y_t^{\top} \log P(\hat{y}_t \mid \hat{y}_{<t}, I), \quad I \in \{X, S\},$$

where $y_t$ is the one-hot embedding of the t-th token of Y.
We use one unified model rather than two separate models to train the MT and CLS tasks simultaneously for two reasons: (1) The parameters can be reduced by half, significantly saving training resources. (2) The two tasks can utilize the knowledge learned from each other without the need for transfer. The significant performance benefits of using one unified model can be verified by the ablation experiments.
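The unified training step can be sketched as follows. This is a toy NumPy illustration, not the actual implementation: `decode` is a hypothetical stand-in for the shared mBART encoder–decoder, assumed to return per-step vocabulary logits for the target sequence.

```python
import numpy as np

def cross_entropy(logits, target_ids):
    # Token-level CE loss: logits has shape (T, V), target_ids has length T.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -np.log(probs[np.arange(len(target_ids)), target_ids]).mean()

def unified_step(decode, params, document, summary, cross_summary):
    # One shared parameter set serves both tasks; only the input differs.
    logits_cls = decode(params, document)  # CLS: long input X
    logits_mt = decode(params, summary)    # MT: short input S
    # Both outputs are supervised by the same cross-lingual reference Y.
    return cross_entropy(logits_cls, cross_summary) + \
           cross_entropy(logits_mt, cross_summary)
```

Because both forward passes go through the same parameters, gradients from the MT and CLS samples update a single model, which is what allows the two tasks to share knowledge without transfer.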
4.2. Alignment between MT and CLS
When unifying the MT and CLS tasks into one task, the MT pair (S, Y) and the CLS pair (X, Y), where S is the monolingual summary of the source document X, can be viewed as a positive sample pair with inputs of different lengths. The final goal of the model is to generate the same output Y whether the input is the longer X or the shorter S; to some extent, the two inputs act as two augmentations of the same input. Therefore, to enhance the relationship within this training pair, we align their outputs at both the probability level and the feature level.
Probability-level Alignment. At each decoding step, we minimize the Kullback–Leibler (KL) divergence between the output probability distributions of the MT and CLS tasks. This loss term pushes the two probability distributions closer, encouraging the model to produce similar outputs when given similar inputs.
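A minimal sketch of this probability-level alignment, assuming a per-step KL-divergence term between the two tasks' output distributions (whether the loss is one-directional or symmetric is an assumption here; one direction is shown):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_alignment(logits_mt, logits_cls, eps=1e-9):
    # KL(P_mt || P_cls), averaged over decoding steps, pulling the CLS
    # distribution toward the MT distribution at each time step.
    p = softmax(np.asarray(logits_mt, dtype=float))
    q = softmax(np.asarray(logits_cls, dtype=float))
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1).mean())
```

The term is zero when the two tasks emit identical distributions and grows as they diverge, which is exactly the pressure the alignment is meant to apply.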
Feature-level Alignment. Although we wish the decoder to generate the same probability distribution at each time step, achieving this is challenging due to the significant difference in input lengths between the two tasks. Therefore, a feature-level alignment is designed to alleviate this issue. Each decoder layer includes a cross-attention sub-layer, allowing the decoder to focus on different key information from the two inputs. Given that the monolingual summary S is a summary of the document X, aligning the hidden representations of CLS with those of MT helps the model extract key information from X that is similar to S, thereby implicitly enhancing the model’s summarization capability. Thus, we add the feature-level alignment to ensure the decoder summarizes the same information from X as it attends to from S.
Concretely, during the decoding process, the hidden states of MT and CLS, $\{h_l^{mt}\}_{l=1}^{L}$ and $\{h_l^{cls}\}_{l=1}^{L}$, can be obtained, where L is the number of decoder layers. Then, we group the L layers’ hidden states into K sub-modules and align the hidden states of the last layer of each sub-module using the mean squared error (MSE) loss. Mathematically, each sub-module consists of L/K layers, and the index of the last layer of the k-th sub-module can be denoted as kL/K. Therefore, the MSE loss is calculated as follows:

$$\mathcal{L}_{MSE} = \frac{1}{K} \sum_{k=1}^{K} \left\| h_{kL/K}^{mt} - h_{kL/K}^{cls} \right\|_2^2.$$
We adopt the module-based feature alignment because it offers more flexibility. In fact, if K = 1, only the hidden state of the last layer is aligned; if K = L, every layer is aligned. Additionally, K can be adjusted to adapt to other models with different numbers of layers to achieve the best performance. In the setting of mBART, L is 12. We empirically set K = 2; hence, each sub-module consists of 6 layers, and the aligned layers are the 6th and 12th.
4.3. Training and Inference
In summary, during training, given a training triple (X, S, Y) consisting of the source document, the monolingual summary, and the cross-lingual summary, the unified model simultaneously encodes X and S, aligns both outputs with the reference Y, and aligns the two outputs with each other through the combination of the above three losses:

$$\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{KL} + \mathcal{L}_{MSE}.$$
The pseudo-code of the unified training process is shown in Algorithm 1.
During inference, given a test sample X, the trained model can be directly used to perform the CLS task without additional architectural changes. Finally, given an input text in the source language A, the model generates a concise summary in the target language B.
Algorithm 1 Unified Training Algorithm
1: Input: Training data D = {(X, S, Y)};
2: Initialize the model with the parameters of mBART;
3: while not converged do
4:    Randomly sample a triple (X, S, Y) from D;
5:    Forward the MT pair (S, Y) and obtain the hidden states and the probability distribution;
6:    Forward the CLS pair (X, Y) and obtain the hidden states and the probability distribution;
7:    Calculate the cross-entropy loss using Equation (7);
8:    Calculate the KL-divergence loss using Equation (8);
9:    Calculate the mean squared error loss using Equation (9);
10:   Update the model parameters by minimizing the loss in Equation (10);
11: end while
5. Experiments
In this section, we first introduce the datasets utilized in this study and outline the baseline methods. Next, we describe the experimental procedures and present the results. Lastly, we perform ablation studies and conduct analyses to evaluate the effectiveness of each component.
5.1. Dataset
We conduct experiments on two benchmark datasets: the Chinese-to-English CLS dataset Zh2EnSum and the English-to-Chinese CLS dataset En2ZhSum. These datasets were generated by Zhu et al. [2] using a round-trip translation strategy on existing monolingual summarization datasets, and have been adopted to validate performance in many previous works [1,11]. Specifically, Zh2EnSum is converted from LCSTS [28], containing 1,693,713 training samples, 3000 validation samples, and 3000 test samples. En2ZhSum is converted from the CNN/DailyMail [29] and MSMO [30] datasets, containing 364,687 training samples, 3000 validation samples, and 3000 test samples. Therefore, each training sample in these two datasets consists of a source document, a monolingual summary, and a cross-lingual summary. The test sets of both Zh2EnSum and En2ZhSum have been manually corrected.
Following the settings of previous works [9,11], we evaluate our method under both full-dataset and low-resource scenarios. For the full-dataset scenario, the entire dataset is used to train the model. For the low-resource scenarios, three different amounts of training samples (minimum, medium, and maximum) are randomly selected from the training set to train the model. The validation and test sets remain the same as in the full-dataset scenario. Detailed numbers for the different scenarios are presented in Table 1.
5.2. Implementation Details
We use a multilingual pre-trained model, mBART [27], to initialize the model; it consists of 12 layers in both the encoder and the decoder. The hyperparameter K is set to 2. The optimizer is AdamW [31] with a learning rate of 5 × 10⁻⁵. The maximum lengths of the input and output texts are set to 768 and 128, respectively. For Zh2EnSum, the batch size is set to 12. For En2ZhSum, due to its longer average text length, the batch size is reduced to 2, with gradient accumulation performed every 6 batches. All models are trained on one Nvidia A100 GPU. During inference, beam search (beam size 4) and trigram blocking are used to avoid repetition.
5.3. Baselines
Under the full-dataset scenario, we compare our method with the following baseline methods.
The results of these two pipeline methods are taken from [2].
Under low-resource scenarios, MCLAS, KD, mBART-CLS, and mBART+MS are chosen as the baselines because they adapt well to limited-resource settings.
5.4. Auto Evaluation Metrics
ROUGE [35] is a standard metric for evaluating automatic summarization. The ROUGE score is computed based on the overlap of n-grams and sub-sequences between the reference summary R and the generated summary G. Following previous works [2,11], we report ROUGE-1, ROUGE-2, and ROUGE-L scores. ROUGE-1 and ROUGE-2 can be computed as follows:

$$\text{ROUGE-N} = \frac{\sum_{gram_N \in R} \text{Count}_{match}(gram_N)}{\sum_{gram_N \in R} \text{Count}(gram_N)},$$

where the denominator represents the number of N-grams in the reference summary R, and the numerator represents the number of N-grams that occur in both the reference summary R and the generated summary G. ROUGE-L can be computed as follows:

$$R_{lcs} = \frac{LCS(R, G)}{len(R)}, \quad P_{lcs} = \frac{LCS(R, G)}{len(G)}, \quad F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}},$$

where $\beta$ is a hyperparameter, $LCS(R, G)$ represents the length of the longest common sub-sequence between the reference summary R and the generated summary G, and $len(R)$ and $len(G)$ denote the lengths of R and G, respectively. All ROUGE scores are reported with a 95% confidence interval measured by the official script (the parameters for the ROUGE script are “-c 95 -r 1000 -n 2 -a”). For brevity, we use R-1, R-2, and R-L to represent ROUGE-1, ROUGE-2, and ROUGE-L, respectively.
5.5. Results
Full-dataset scenario. The results under the full-dataset scenario are shown in Table 2.
Under the full-dataset scenario, our method achieves the best results on both Zh2EnSum and En2ZhSum, demonstrating the superiority of our approach. Additionally, we draw several conclusions based on all results:
The performance of the pipeline methods (GETran and GLTran) is significantly lower than that of the end-to-end methods, even when using Google Translator. For example, GLTran underperforms NCLS by 4.65/4.87/3.77 points on R-1/R-2/R-L, respectively, on En2ZhSum. This demonstrates that the end-to-end method effectively mitigates the error propagation issues inherent in the pipeline methods.
VHM and DKCS combine additional resources, such as large-scale machine translation data or entity association relationships, to achieve performance improvements of approximately 2–3 points compared to NCLS. However, they underperform compared to methods based on multilingual pre-trained models like mBART-CLS, by approximately 4–5 points, indicating that incorporating multilingual pre-trained models is highly beneficial for cross-lingual tasks.
Although NCLS+MS shows performance improvement over NCLS, mBART+MS shows a performance drop compared to mBART-CLS, even though it uses monolingual data and two separate decoders. For instance, on Zh2EnSum, NCLS+MS scores higher than NCLS by 1.49/0.72/1.34 points on R-1/R-2/R-L, respectively, whereas mBART+MS scores lower than mBART-CLS by 0.47/1.13/0.73 points on the same metrics. This indicates that a simple incremental change cannot bring the expected improvement on top of an already strong baseline. In contrast, our method achieves improvements of 0.31/0.44/0.36 points on R-1/R-2/R-L, respectively, compared to mBART-CLS.
Under the full-dataset scenario, MCLAS and KD utilize the same amount of monolingual data as our method. However, they show lower performance compared to our method. For instance, KD underperforms our method by 10.35/9.89/10.68 points on R-1, R-2, and R-L, respectively, on Zh2EnSum, indicating that our approach makes better use of the same amount of data.
Low-resource scenario. The results under the minimum, medium, and maximum scenarios are shown in Table 3.
Our method approaches or surpasses the state-of-the-art performance across various scenarios with limited data. Several conclusions are drawn as follows:
MCLAS and KD, two multi-task frameworks specifically designed for low-resource scenarios, demonstrate strong performance under all low-resource conditions. For example, the gaps between KD and mBART-CLS (KD minus mBART-CLS) in the maximum scenario are −0.35/0.45/0.57 points on R-1/R-2/R-L, respectively, while in the full-dataset scenario, the gaps are −10.04/−9.45/−10.32 points on R-1/R-2/R-L, respectively. MCLAS and KD leverage additional large-scale monolingual data to pre-train the model in low-resource conditions, which contributes to their performance improvement.
Our method demonstrates stronger advantages in cases of limited data. On En2ZhSum, under the full-dataset scenario, our method only outperforms mBART-CLS by 0.3/0.35/0.21 points on R-1/R-2/R-L, respectively. However, under the minimum scenario, it shows significant improvements of 2.24/2.41/2.04 points on R-1/R-2/R-L, respectively. Furthermore, under the maximum scenario, the improvements are 2.84/2.55/2.81 points, respectively.
Our method achieves most (13 out of 18) of the best scores under all low-resource scenarios. Compared to MCLAS and KD, our method only utilizes the limited training triplets without additional monolingual data. This demonstrates that unifying the languages of input and output can better learn features from the parallel training pairs.
5.6. Human Evaluation
Beyond automatic evaluation, we also perform a human evaluation for a more accurate assessment. Specifically, we randomly select 20 samples from each low-resource scenario in the Zh2EnSum test set. Seven graduate students, fluent in both Chinese and English, are asked to independently assess the generated summaries and the gold references on three aspects: informativeness (IF), conciseness (CC), and fluency (FL) [9]. Following the best–worst scaling method [36], each participant identifies the best and worst methods for each aspect. The final scores are determined by subtracting the percentage of times each method is selected as the worst from the percentage of times it is selected as the best, resulting in a range from −1 to 1. The results are displayed in Table 4.
It can be observed that our method outperforms other automatic summarization methods across all scores. Although there is still a gap between our method and the gold reference, this gap decreases as the number of training samples increases. Notably, the conciseness of our method is even better than that of the gold reference under the maximum scenario.
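The best–worst scaling score described above reduces to a one-line computation (a minimal sketch; the per-method counting of best and worst selections is assumed to be done beforehand):

```python
def bws_score(best_count, worst_count, total_judgments):
    # Best-worst scaling: fraction of times a method is chosen as best
    # minus the fraction chosen as worst, giving a score in [-1, 1].
    return (best_count - worst_count) / total_judgments
```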
5.7. Ablation Experiment
To validate the impact of each module, we conduct ablation experiments by removing each module in turn. As a comparison for the unified model trained on both the MT and CLS tasks, we use two separate models responsible for the two tasks, denoted as Our-Two. Additionally, when all modules are removed, the model degrades to mBART-CLS. The results are shown in Table 5. It can be observed that each module contributes to the overall performance. The most notable performance gains derive from the feature alignment (−$\mathcal{L}_{MSE}$) and the unified training (Our-Two). The feature alignment enhances the model’s summarization capability, while the unified training allows the model to improve its translation ability and better leverage features from the parallel MT data.
5.8. Hyperparameter Experiment
In this section, we present the results for different K values on Zh2EnSum under the medium scenario in Table 6. While, in theory, making the hidden states of each layer as similar as possible is preferable, setting K = L resulted in the lowest performance; instead, a smaller K is more appropriate. We hypothesize that this is due to the inconsistent encoding of positional information caused by the varying lengths of the MT and CLS inputs: imposing too strong a penalty may lead to the loss of this information. Therefore, it is advisable to adjust K based on different scenarios and models to maximize performance.
5.9. Length Error Analysis
The output of the MT task is comparable in length to the input, whereas the output of the CLS task is considerably shorter than the input. When unifying the MT and CLS tasks into a single model for training, a potential concern is whether the CLS task will be affected by the MT task, resulting in excessively long outputs. Therefore, we conducted a length error analysis, quantifying the absolute length errors between the outputs of our method and the gold references, as well as those of mBART-CLS (trained directly on the CLS task). The results are presented in Table 7.
It can be observed that our method does not produce overly long summaries; on the contrary, it generates summaries that are closer to the gold references compared to mBART-CLS across various scenarios. The length metrics indicate that, compared to a single-task model, the proposed unified model generates more precise summaries while preserving the critical information, resulting in enhanced summarization performance.
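The length-error metric used above can be sketched as a mean absolute difference in token counts (a minimal sketch; how summaries are tokenized for counting is an assumption):

```python
def mean_abs_length_error(generated, references):
    # Mean absolute difference in token counts between each generated
    # summary and its gold reference.
    assert len(generated) == len(references)
    return sum(abs(len(g) - len(r))
               for g, r in zip(generated, references)) / len(generated)
```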
5.10. Cross-Attention Analysis
To verify that the model correctly focuses on important information in the input during decoding, we visualize the cross-attention maps of mBART-CLS and our method, as shown in Figure 5. The tokens in the source document that correspond to the monolingual summary are highlighted in red. It can be observed that our method pays more pronounced and focused attention to these highlighted tokens. For instance, when decoding “maryland” and “usa”, our method pays more attention to the corresponding geographical names in the input (i.e., columns 1–4). This suggests that aligning the features of the MT and CLS outputs effectively enhances the model’s summarization capability.