1. Introduction
Cross-lingual summarization (CLS) generates a shorter summary in a target language (e.g., Chinese) from a lengthy document in a source language (e.g., English). CLS enables non-native speakers to access summarized information efficiently, facilitating the exchange of information in a globalized world [1]. However, CLS poses significant challenges, as it requires performing translation and summarization simultaneously [2].
Early approaches [3,4] typically employed a pipeline of either “summarize-then-translate” or “translate-then-summarize”, both of which are prone to error propagation. To address these issues, Zhu et al. [2] proposed an end-to-end CLS training method. Additionally, they created two benchmark datasets in which the cross-lingual summaries are translations of standard monolingual summaries, as illustrated in Figure 1.
Several multi-task approaches [1,2,5] leverage the relationship between the cross-lingual summary and its parallel monolingual summary to improve performance. These approaches can be categorized into two groups. The first group [2,5], depicted in Figure 2a, uses two independent decoders responsible for the CLS and monolingual summarization (MS) tasks, respectively. However, the two decoders, with non-shared parameters, limit the model’s ability to effectively align the two tasks. The second group [1], shown in Figure 2b, employs two separate models to handle the MS and CLS tasks, learning the output features of the MS teacher through knowledge distillation. However, its performance is affected by discrepancies between hidden representations in different linguistic vector spaces. Both groups extract the MS training pair (i.e., the source document and monolingual summary) to combine with the CLS data, and thus face obstacles caused by the mismatch between the output languages.
To avoid the issues discussed above, we propose a unified training method for the cross-lingual summarization (CLS) task. As shown in Figure 2c, our method combines the machine translation (MT) training pair (i.e., the monolingual summary and the cross-lingual summary) with the CLS data to ensure consistency between input and output languages. It is inspired by [6], which views the CLS task as a translation sub-task with longer inputs. We therefore unify the two tasks into one task with inputs of different lengths and jointly train them within a single model. This design allows the model parameters and acquired knowledge to be shared. Additionally, the shared linguistic vector space enables the alignment of the MT and CLS outputs without transforming their hidden representations, so we design two alignments, at the probability level and the feature level, to further enhance the interaction between the parallel MT and CLS data. In summary, by unifying the linguistic vector space and enhancing the semantic correlation between the two tasks at two levels, the model’s alignment and compression capabilities for CLS are improved.
To evaluate our method, we conduct extensive experiments on two benchmark datasets under both full-dataset and low-resource scenarios. The results demonstrate that our method outperforms previous CLS methods in most cases without requiring additional data. Additionally, ablation studies validate the effectiveness of each design element. In summary, our key contributions are as follows:
We propose a unified training method for CLS that learns parallel MT and CLS pairs within a single model, which is a novel and efficient integration mode for CLS data.
We design two levels of alignment between the parallel outputs to encourage the model to focus on key information from the lengthy input, thereby improving its summarization capability.
Extensive experimental results conducted in various scenarios demonstrate the superiority of our method. Ablation studies and visualization results further corroborate this conclusion.
2. Related Work
The early methods [3,4] for CLS primarily relied on a pipeline strategy encompassing two phases: translation and summarization. These methods typically followed a sequence of “translate-then-summarize” or “summarize-then-translate”. For instance, the core idea of “translate-then-summarize” was to use existing machine translation systems to convert the document into the target language, and then apply text summarization techniques to generate summaries. As a result, these methods were prone to error propagation, where errors of the first phase could carry over into the second phase. Additionally, they failed to effectively establish semantic connections between languages, as the two phases were optimized independently.
In recent years, Zhu et al. [2] proposed applying end-to-end methods to the CLS task, achieving significant performance improvements. Furthermore, they created two benchmark datasets and introduced two heuristic methods for combining parallel monolingual summaries or additional machine translation data with CLS data, which have promoted subsequent CLS studies [7]. These approaches include multi-task learning [6,8,9,10], knowledge distillation [11,12], resource enhancement [13,14,15,16,17], pre-training frameworks [18,19,20], and multilingual training [21,22,23,24], among others.
In the realm of multi-task learning, Cao et al. [8] focus on jointly learning to align and summarize in cross-lingual summarization. They introduce a multi-task learning framework that integrates monolingual and cross-lingual summarization models into a unified model. This approach involves constructing linear mappings to project context representations from one language to another and designing several specific loss functions to facilitate this learning process. Bai et al. [6] improve upon traditional multi-task learning methods by designing a compression rate model, introducing the compression rate as a new parameter to control the amount of information retained. Takase et al. [10] train neural encoder–decoder models using genuine and pseudo cross-lingual summarization data, as well as monolingual summarization and translation data, employing special tokens attached at the beginning of input sentences to designate the target task. This enables the direct integration of different data types into the training process without additional architectural changes. Their method aims to enhance the quality and effectiveness of cross-lingual summarization by leveraging synergies between translation and summarization tasks, thereby improving performance over methods that use only pseudo data or separate models for each task.
With the development of pre-trained models, numerous works have achieved notable results in cross-lingual and even multilingual scenarios. Xu et al. [18] utilized a Transformer-based encoder–decoder architecture for mixed-lingual pre-training, leveraging a large amount of unlabeled monolingual data to enhance the model’s language understanding capabilities and improve its ability to handle language translation and summarization. Chi et al. [19] proposed the MT6 model, based on a multi-task pre-training approach, using cross-lingual parallel corpora and monolingual data and integrating these resources with various multi-task training objectives.
However, most of the aforementioned methods [1,6,16,21] require additional data or knowledge, while a few studies [5,9,11] focus only on the given CLS data. MCLAS [9] utilized a single decoder to first generate the monolingual summary, and then produce the cross-lingual summary by aligning both the source document and the monolingual summary within the same decoding process. Nguyen et al. [11] combined knowledge distillation with Sinkhorn divergence [25] to improve the model’s performance in handling languages with significant differences in grammatical structure and lexical morphology. Zhang et al. [5] proposed a two-stage framework where the monolingual summary model is pre-trained with the given small-scale monolingual data in the first stage; subsequently, another decoder is introduced to train the CLS and MS tasks simultaneously. However, these methods suffer from a mismatch in output languages, which prevents CLS from effectively aligning the parallel data.
Unlike these methods, our model innovatively unifies the MT and CLS tasks by using the same encoder and decoder to process both input and output texts in the same language. This design ensures efficient parameter sharing and strengthens the model’s translation capability. By comparing and aligning the outputs of the parallel training pairs, our model more accurately captures and retains key source information, significantly enhancing the accuracy of the generated summaries based on the given small-scale CLS data.
4. Methodology
As shown in Figure 4, the parallel MT and CLS training pairs are simultaneously trained within a single model. The unified training allows the model to leverage the strengths of both tasks. In addition to learning from the gold reference, the outputs of the two tasks are aligned at both the probability and feature levels. This alignment ensures that the model learns to map similar inputs to similar outputs, regardless of the task. By doing so, the model can significantly improve its cross-lingual summarization capability by effectively utilizing the parallel translation pairs.
4.1. Unified Training
We unify the MT and CLS tasks into one task and train them in parallel in a single model. Specifically, given the MT sample (S, Y) and the CLS sample (X, Y), where X is the source document, S is the monolingual summary, and Y is the cross-lingual summary, the encoder encodes each input into hidden representations, and the decoder then generates the outputs $\hat{Y}^{mt}$ and $\hat{Y}^{cls}$. As in a standard text generation task, we apply the cross-entropy (CE) loss to these outputs:

$$\mathcal{L}_{CE} = -\sum_{t=1}^{|Y|} y_t^{\top} \log P(\hat{y}_t \mid \hat{y}_{<t}, I), \quad I \in \{X, S\},$$

where $y_t$ is the one-hot embedding of the t-th token of Y.
We use one unified model rather than two separate models to train the MT and CLS tasks simultaneously for two reasons: (1) The parameters can be reduced by half, significantly saving training resources. (2) The two tasks can utilize the knowledge learned from each other without the need for transfer. The significant performance benefits of using one unified model can be verified by the ablation experiments.
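The unified training step can be sketched as follows. This is a toy NumPy illustration, not the actual implementation: `decode` is a hypothetical stand-in for the shared mBART encoder–decoder, assumed to return per-step vocabulary logits for the target sequence.

```python
import numpy as np

def cross_entropy(logits, target_ids):
    # Token-level CE loss: logits has shape (T, V), target_ids has length T.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -np.log(probs[np.arange(len(target_ids)), target_ids]).mean()

def unified_step(decode, params, document, summary, cross_summary):
    # One shared parameter set serves both tasks; only the input differs.
    logits_cls = decode(params, document)  # CLS: long input X
    logits_mt = decode(params, summary)    # MT: short input S
    # Both outputs are supervised by the same cross-lingual reference Y.
    return cross_entropy(logits_cls, cross_summary) + \
           cross_entropy(logits_mt, cross_summary)
```

Because both forward passes go through the same parameters, gradients from the MT and CLS samples update a single model, which is what allows the two tasks to share knowledge without transfer.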
4.2. Alignment between MT and CLS
When unifying the MT and CLS tasks into one task, the MT pair (S, Y) and the CLS pair (X, Y), where S is the monolingual summary of the source document X, can be viewed as a positive sample pair with inputs of different lengths. The final goal of the model is to generate the same output Y whether the input is the longer X or the shorter S; to some extent, the two inputs act as two augmentations of the same input. Therefore, to enhance the relationship within this training pair, we align their outputs at both the probability level and the feature level.
Probability-level Alignment. At each decoding step, we minimize the Kullback–Leibler (KL) divergence between the output probability distributions of the MT and CLS tasks. This loss term pushes the two probability distributions closer, encouraging the model to produce similar outputs when given similar inputs.
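A minimal sketch of this probability-level alignment, assuming a per-step KL-divergence term between the two tasks' output distributions (whether the loss is one-directional or symmetric is an assumption here; one direction is shown):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_alignment(logits_mt, logits_cls, eps=1e-9):
    # KL(P_mt || P_cls), averaged over decoding steps, pulling the CLS
    # distribution toward the MT distribution at each time step.
    p = softmax(np.asarray(logits_mt, dtype=float))
    q = softmax(np.asarray(logits_cls, dtype=float))
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1).mean())
```

The term is zero when the two tasks emit identical distributions and grows as they diverge, which is exactly the pressure the alignment is meant to apply.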
Feature-level Alignment. Although we wish the decoder to generate the same probability distribution at each time step, achieving this is challenging due to the significant difference in input lengths between the two tasks. Therefore, a feature-level alignment is designed to alleviate this issue. Each decoder layer includes a cross-attention sub-layer, allowing the decoder to focus on different key information from the two inputs. Given that the monolingual summary S is a summary of the document X, aligning the hidden representations of CLS with those of MT helps the model extract key information from X that is similar to S, thereby implicitly enhancing the model’s summarization capability. Thus, we add the feature-level alignment to ensure the decoder summarizes the same information from X as it attends to from S.
Concretely, during the decoding process, the hidden states of MT and CLS, $\{h_l^{mt}\}_{l=1}^{L}$ and $\{h_l^{cls}\}_{l=1}^{L}$, can be obtained, where L is the number of decoder layers. Then, we group the L layers’ hidden states into K sub-modules and align the hidden states of the last layer of each sub-module using the mean squared error (MSE) loss. Mathematically, each sub-module consists of L/K layers, and the index of the last layer of the k-th sub-module can be denoted as kL/K. Therefore, the MSE loss is calculated as follows:

$$\mathcal{L}_{MSE} = \frac{1}{K} \sum_{k=1}^{K} \left\| h_{kL/K}^{mt} - h_{kL/K}^{cls} \right\|_2^2.$$
We adopt the module-based feature alignment because it offers more flexibility. In fact, if K = 1, only the hidden state of the last layer is aligned; if K = L, every layer is aligned. Additionally, K can be adjusted to adapt to other models with different numbers of layers to achieve the best performance. In the setting of mBART, L is 12. We empirically set K = 2; hence, each sub-module consists of 6 layers, and the aligned layers are the 6th and 12th.
4.3. Training and Inference
In summary, during training, given a training triple (X, S, Y) consisting of the source document, the monolingual summary, and the cross-lingual summary, the unified model simultaneously encodes X and S, aligns both outputs with the reference Y, and aligns the two outputs with each other through the combination of the above three losses:

$$\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{KL} + \mathcal{L}_{MSE}.$$
The pseudo-code of the unified training process is shown in Algorithm 1.
During inference, given a test sample X, the trained model can be directly used to perform the CLS task without additional architectural changes. Finally, given an input text in the source language A, the model generates a concise summary in the target language B.
Algorithm 1 Unified Training Algorithm
1: Input: Training data D = {(X, S, Y)};
2: Initialize the model with the parameters of mBART;
3: while not converged do
4:    Randomly sample a triple (X, S, Y) from D;
5:    Forward the MT pair (S, Y) and obtain the hidden states and the probability distribution;
6:    Forward the CLS pair (X, Y) and obtain the hidden states and the probability distribution;
7:    Calculate the cross-entropy loss using Equation (7);
8:    Calculate the KL-divergence loss using Equation (8);
9:    Calculate the mean squared error loss using Equation (9);
10:   Update the model parameters by minimizing the loss in Equation (10);
11: end while
5. Experiments
In this section, we first introduce the datasets utilized in this study and outline the baseline methods. Next, we describe the experimental procedures and present the results. Lastly, we perform ablation studies and conduct analyses to evaluate the effectiveness of each component.
5.1. Dataset
We conduct experiments on two benchmark datasets: the Chinese-to-English CLS dataset Zh2EnSum and the English-to-Chinese CLS dataset En2ZhSum. These datasets were generated by Zhu et al. [2] using a round-trip translation strategy on existing monolingual summarization datasets, and have been adopted to validate performance in many previous works [1,11]. Specifically, Zh2EnSum is converted from LCSTS [28], containing 1,693,713 training samples, 3000 validation samples, and 3000 test samples. En2ZhSum is converted from the CNN/DailyMail [29] and MSMO [30] datasets, containing 364,687 training samples, 3000 validation samples, and 3000 test samples. Therefore, each training sample in these two datasets consists of a source document, a monolingual summary, and a cross-lingual summary. The test sets of both Zh2EnSum and En2ZhSum have been manually corrected.
Following the settings of previous works [9,11], we evaluate our method under both full-dataset and low-resource scenarios. For the full-dataset scenario, the entire dataset is used to train the model. For the low-resource scenarios, three different amounts of training samples (minimum, medium, and maximum) are randomly selected from the training set to train the model. The validation and test sets remain the same as in the full-dataset scenario. Detailed numbers for the different scenarios are presented in Table 1.
5.2. Implementation Details
We use a multilingual pre-trained model, mBART [27], to initialize the model; it consists of 12 layers in both the encoder and the decoder. The hyperparameter K is set to 2. The optimizer is AdamW [31] with a learning rate of 5 × 10⁻⁵. The maximum lengths of the input and output texts are set to 768 and 128, respectively. For Zh2EnSum, the batch size is set to 12. For En2ZhSum, due to its longer average text length, the batch size is reduced to 2, with gradient accumulation performed every 6 batches. All models are trained on one Nvidia A100 GPU. During inference, beam search (beam size 4) and trigram blocking are used to avoid repetition.
5.3. Baselines
Under the full-dataset scenario, we compare our method with the following baseline methods.
The results of these two pipeline methods are taken from [2].
Under low-resource scenarios, MCLAS, KD, mBART-CLS, and mBART+MS are chosen as the baselines because they adapt well to limited-resource settings.
5.4. Auto Evaluation Metrics
ROUGE [35] is a standard metric for evaluating automatic summarization. The ROUGE score is computed based on the overlap of n-grams and sub-sequences between the reference summary R and the generated summary G. Following previous works [2,11], we report ROUGE-1, ROUGE-2, and ROUGE-L scores. ROUGE-1 and ROUGE-2 can be computed as follows:

$$\text{ROUGE-N} = \frac{\sum_{gram_N \in R} \text{Count}_{match}(gram_N)}{\sum_{gram_N \in R} \text{Count}(gram_N)},$$

where the denominator represents the number of N-grams in the reference summary R, and the numerator represents the number of N-grams that occur in both the reference summary R and the generated summary G. ROUGE-L can be computed as follows:

$$R_{lcs} = \frac{LCS(R, G)}{len(R)}, \quad P_{lcs} = \frac{LCS(R, G)}{len(G)}, \quad F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}},$$

where $\beta$ is a hyperparameter, $LCS(R, G)$ represents the length of the longest common sub-sequence between the reference summary R and the generated summary G, and $len(R)$ and $len(G)$ denote the lengths of R and G, respectively. All ROUGE scores are reported with a 95% confidence interval measured by the official script (the parameters for the ROUGE script are “-c 95 -r 1000 -n 2 -a”). For brevity, we use R-1, R-2, and R-L to represent ROUGE-1, ROUGE-2, and ROUGE-L, respectively.
5.5. Results
Full-dataset scenario. The results under the full-dataset scenario are shown in Table 2.
Under the full-dataset scenario, our method achieves the best results on both Zh2EnSum and En2ZhSum, demonstrating the superiority of our approach. Additionally, we draw several conclusions based on all results:
The performance of the pipeline methods (GETran and GLTran) is significantly lower than that of the end-to-end methods, even when using Google Translator. For example, GLTran underperforms NCLS by 4.65/4.87/3.77 points on R-1/R-2/R-L, respectively, on En2ZhSum. This demonstrates that the end-to-end method effectively mitigates the error propagation issues inherent in the pipeline methods.
VHM and DKCS combine additional resources, such as large-scale machine translation data or entity association relationships, to achieve performance improvements of approximately 2–3 points compared to NCLS. However, they underperform compared to methods based on multilingual pre-trained models like mBART-CLS, by approximately 4–5 points, indicating that incorporating multilingual pre-trained models is highly beneficial for cross-lingual tasks.
Although NCLS+MS shows performance improvement over NCLS, mBART+MS shows a performance drop compared to mBART-CLS, even though it uses monolingual data and two separate decoders. For instance, on Zh2EnSum, NCLS+MS scores higher than NCLS by 1.49/0.72/1.34 points on R-1/R-2/R-L, respectively, whereas mBART+MS scores lower than mBART-CLS by 0.47/1.13/0.73 points on the same metrics. This indicates that a simple incremental change cannot bring the expected improvement on top of an already strong baseline. In contrast, our method achieves improvements of 0.31/0.44/0.36 points on R-1/R-2/R-L, respectively, compared to mBART-CLS.
Under the full-dataset scenario, MCLAS and KD utilize the same amount of monolingual data as our method. However, they show lower performance compared to our method. For instance, KD underperforms our method by 10.35/9.89/10.68 points on R-1, R-2, and R-L, respectively, on Zh2EnSum, indicating that our approach makes better use of the same amount of data.
Low-resource scenario. The results under the minimum, medium, and maximum scenarios are shown in Table 3.
Our method approaches or surpasses the state-of-the-art performance across various scenarios with limited data. Several conclusions are drawn as follows:
MCLAS and KD, two multi-task frameworks specifically designed for low-resource scenarios, demonstrate strong performance under all low-resource conditions. For example, the gaps between KD and mBART-CLS (KD minus mBART-CLS) in the maximum scenario are −0.35/0.45/0.57 points on R-1/R-2/R-L, respectively, while in the full-dataset scenario, the gaps are −10.04/−9.45/−10.32 points on R-1/R-2/R-L, respectively. MCLAS and KD leverage additional large-scale monolingual data to pre-train the model in low-resource conditions, which contributes to their performance improvement.
Our method demonstrates stronger advantages in cases of limited data. On En2ZhSum, under the full-dataset scenario, our method only outperforms mBART-CLS by 0.3/0.35/0.21 points on R-1/R-2/R-L, respectively. However, under the minimum scenario, it shows significant improvements of 2.24/2.41/2.04 points on R-1/R-2/R-L, respectively. Furthermore, under the maximum scenario, the improvements are 2.84/2.55/2.81 points, respectively.
Our method achieves most (13 out of 18) of the best scores under all low-resource scenarios. Compared to MCLAS and KD, our method only utilizes the limited training triplets without additional monolingual data. This demonstrates that unifying the languages of input and output can better learn features from the parallel training pairs.
5.6. Human Evaluation
Beyond automatic evaluation, we also perform a human evaluation for a more accurate assessment. Specifically, we randomly select 20 samples from each low-resource scenario in the Zh2EnSum test set. Seven graduate students, fluent in both Chinese and English, are asked to independently assess the generated summaries and the gold references on three aspects: informativeness (IF), conciseness (CC), and fluency (FL) [9]. Following the best–worst scaling method [36], each participant identifies the best and worst methods for each aspect. The final scores are determined by subtracting the percentage of times each method is selected as the worst from the percentage of times it is selected as the best, resulting in a range from −1 to 1. The results are displayed in Table 4.
It can be observed that our method outperforms other automatic summarization methods across all scores. Although there is still a gap between our method and the gold reference, this gap decreases as the number of training samples increases. Notably, the conciseness of our method is even better than that of the gold reference under the maximum scenario.
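The best–worst scaling score described above reduces to a one-line computation (a minimal sketch; the per-method counting of best and worst selections is assumed to be done beforehand):

```python
def bws_score(best_count, worst_count, total_judgments):
    # Best-worst scaling: fraction of times a method is chosen as best
    # minus the fraction chosen as worst, giving a score in [-1, 1].
    return (best_count - worst_count) / total_judgments
```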
5.7. Ablation Experiment
To validate the impact of each module, we conduct ablation experiments by removing each module in turn. As a comparison for the unified model trained on both the MT and CLS tasks, we use two separate models responsible for the two tasks, denoted as Our-Two. Additionally, when all modules are removed, the model degrades to mBART-CLS. The results are shown in Table 5. It can be observed that each module contributes to the overall performance. The most notable performance gains derive from the feature alignment (−$\mathcal{L}_{MSE}$) and the unified training (Our-Two). The feature alignment enhances the model’s summarization capability, while the unified training allows the model to improve its translation ability and better leverage features from the parallel MT data.
5.8. Hyperparameter Experiment
In this section, we present the results for different K values on Zh2EnSum under the medium scenario in Table 6. While, in theory, making the hidden states of each layer as similar as possible is preferable, setting K = L resulted in the lowest performance; instead, a smaller K is more appropriate. We hypothesize that this is due to the inconsistent encoding of positional information caused by the varying lengths of the MT and CLS inputs: imposing too strong a penalty may lead to the loss of this information. Therefore, it is advisable to adjust K based on different scenarios and models to maximize performance.
5.9. Length Error Analysis
The output of the MT task is comparable in length to the input, whereas the output of the CLS task is considerably shorter than the input. When unifying the MT and CLS tasks into a single model for training, a potential concern is whether the CLS task will be affected by the MT task, resulting in excessively long outputs. Therefore, we conducted a length error analysis, quantifying the absolute length errors between the outputs of our method and the gold references, as well as those of mBART-CLS (trained directly on the CLS task). The results are presented in Table 7.
It can be observed that our method does not produce overly long summaries; on the contrary, it generates summaries that are closer to the gold references compared to mBART-CLS across various scenarios. The length metrics indicate that, compared to a single-task model, the proposed unified model generates more precise summaries while preserving the critical information, resulting in enhanced summarization performance.
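The length-error metric used above can be sketched as a mean absolute difference in token counts (a minimal sketch; how summaries are tokenized for counting is an assumption):

```python
def mean_abs_length_error(generated, references):
    # Mean absolute difference in token counts between each generated
    # summary and its gold reference.
    assert len(generated) == len(references)
    return sum(abs(len(g) - len(r))
               for g, r in zip(generated, references)) / len(generated)
```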
5.10. Cross-Attention Analysis
To verify that the model correctly focuses on important information in the input during decoding, we visualize the cross-attention maps of mBART-CLS and our method, as shown in Figure 5. The tokens in the source document that correspond to the monolingual summary are highlighted in red. It can be observed that our method pays more pronounced and focused attention to these highlighted tokens. For instance, when decoding “maryland” and “usa”, our method pays more attention to the corresponding geographical names in the input (i.e., columns 1–4). This suggests that aligning the features of the MT and CLS outputs effectively enhances the model’s summarization capability.