1. Introduction
Automatic Text Summarization (ATS) is the process of generating a coherent, fluent and meaningful summary that covers the most important information in a given text [1], and it is one of the fastest-growing fields in Artificial Intelligence (AI), Machine Learning (ML) and Natural Language Processing (NLP). Interest in ATS is growing rapidly because of the vast amount of textual data produced daily on the internet, for example through the fast-growing use of social networks, online newspapers, and user reviews in online stores. Alongside these rich sources, essential textual data are also available in electronic books and novels, legal and biomedical documents, and scientific papers, amongst many others. As an illustration of this growth, 90% of the data on the internet has been created in the last couple of years [2]. Moreover, more than two billion websites are currently active and hosted somewhere on the internet.
Manually summarizing a text is costly in terms of time and effort. Therefore, ATS is considered one of the essential fields in AI, ML and NLP. ATS automatically generates a summary of any text, reducing its size. ATS systems were developed as a time-saving way to avoid having to read lengthy texts on the same subject in order to understand the main point [3]. They also cost less than hiring a qualified human summarizer. Hence, the need for ATS systems has arisen, which encourages researchers and scientific communities to conduct research in the field [4,5]. Examples of ATS applications include search engine snippets produced when a document is searched and news websites that condense news into headlines to help with browsing [6]. Further applications include the summarization of clinical and biomedical texts and of lawsuits [4].
The methods for ATS are broadly categorized into extractive, abstractive, or hybrid [7]. Extractive methods select the text’s most crucial passages (usually sentences). Typically, the length of the final summary is fixed either explicitly or implicitly. An extractive algorithm can therefore, for example, select 10 to 15 essential sentences from a document that contains around 50 sentences [8].
Abstractive summarization works more like a human summarizer. The algorithm reads the text, determines what it says, and then describes the material in its own words. In theory, this approach can offer a better, more condensed summary. In practice, it is challenging because it requires both an understanding of the content at the level of an educated human reader and a correct rendering of it [9].
In reality, most of the available ATS systems are designed to summarize texts written in English, with relatively little work done on other natural languages. There have been fewer attempts at Arabic ATS, despite the fact that Arabic is among the top five most spoken languages in the world, with more than 20 nations using it as their official language and more than 400 million native and non-native speakers [10]. This is owing to the difficulty of Arabic structure, syntax and morphology, as well as the compression ratio seen when summarizing numerous texts as opposed to a single document.
Extractive summarization methods are the most common among the limited attempts at Arabic ATS. Such methods produce factual, comprehensible summaries, but the summaries often lack flow and are overly verbose [11]. To address this issue, abstractive models, which are flexible in their word selection, use generalization and paraphrasing to produce more fluid and cohesive summaries. For Arabic abstractive models, which are the main focus of this paper, the dominant architecture is sequence-to-sequence (seq2seq) [12]. For example, Al-Maleh and Desouki [13] use the pointer-generator network [14]. Similarly, Wazery et al. [15] suggest a more general RNN-based approach.
Most recently, and with the development of Transformer Language Models (TLMs) such as Bidirectional Encoder Representations from Transformers (BERT) [16], Bidirectional and Auto-Regressive Transformers (BART) [17], XLNet [18], Robustly Optimized BERT (RoBERTa) [19], Generative Pre-trained Transformer (GPT-3) [20], and Text-To-Text Transfer Transformer (T5) [21], NLP has experienced unprecedented advancements. TLMs can be described as pre-trained contextual language models with multilayer bidirectional self-attention mechanisms. For transformer encoders, pre-training and fine-tuning are the two key processes.
State-of-the-art results for a wide range of NLP tasks, including abstractive ATS [22], are now being achieved thanks to TLMs [16,19,23,24].
Taking advantage of the breakthrough of TLMs, the literature has seen recent attempts at developing TLM-based abstractive ATS systems, either as multilingual systems that work on various natural languages or as monolingual systems (e.g., for Arabic). For example, Kamal Eddine et al. [11] presented AraBART, the first Arabic model based on BART, in which both the encoder and the decoder are pre-trained end-to-end. Similarly, Kahla et al. [25] fine-tuned pre-trained language models such as multilingual BERT, AraBERT, and multilingual BART to build a variety of neural abstractive ATS systems for Arabic.
However, the literature still lacks a comprehensive comparison among Arabic ATS systems, which we aim to address in this paper. In particular, the contribution of this work is four-fold:
A thorough comparison of all existing abstractive TLM-based Arabic and Arabic-supported multilingual ATS systems with various evaluation metrics.
The use of various existing Arabic datasets for abstractive ATS, including Arabic Headline Summaries (AHS) [13] and Arabic News Articles (ANA) [26], to conduct a thorough comparison.
An empirical study of the impact of fine-tuning the TLMs for Arabic ATS on the resulting output summary.
An empirical study of the performance of TLMs and deep-learning-based Arabic ATS systems.
The remainder of the paper proceeds as follows: the related work is presented in Section 2, the text summarization methodology is covered in Section 3, and the experiments and results are presented in Section 4 and Section 5, respectively. Section 6 discusses the findings. Finally, in Section 7, we give our conclusions and some recommendations for the future.
2. Background and Related Work
As early as the late 1950s, text summarization attracted the attention of the scientific community [1]. At the time, there was a particular focus on generating abstracts of technical documentation. In later years, interest in ATS declined until the renaissance of AI and its technologies.
The early approaches to ATS mainly used statistical models simply to select and copy the essential parts of the original text [4]. For example, Edmundson [27] proposed a method that adopts statistical techniques. Such statistical methods principally use information about the frequency and distribution of words to calculate the relative significance of each sentence. The summary is then produced from the most significant sentences. However, these early approaches could not generate abstractive summaries because they lacked an understanding of the original text. As such, there was a need for more intelligent systems able to understand and analyze the semantics of natural languages in order to address the limitations of the early statistical approaches.
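The frequency-based sentence scoring described above can be illustrated with a minimal sketch. It is not the method of any cited work: the stop-word list, the naive sentence splitter, and the average-frequency score are all assumptions made for illustration, and the example uses English text for readability.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use
# larger, language-specific lists).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOP_WORDS]

def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return [sentences[i] for i in keep]

doc = ("Summarization systems shorten long documents. "
       "Extractive summarization systems copy important sentences from documents. "
       "The weather was pleasant yesterday.")
print(summarize(doc, num_sentences=1))
```

Because the output only ever contains sentences copied from the input, a scorer of this kind is extractive by construction, which is the limitation that motivated the abstractive approaches discussed next.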
As previously mentioned, ATS techniques can be broadly divided into two basic categories: extractive and abstractive. Early research on ATS focused mainly on extractive methods. More recently, however, the focus has shifted toward abstractive approaches. Given that the aim of this paper is a comparative study of abstractive Arabic ATS, the related work discussed in this section is limited to abstractive approaches.
Abstractive ATS systems require a deeper understanding and analysis of the original text [28]. They generate a summary after understanding the main ideas of the original text, using NLP methods to create the summary without copying sentences from the input. Abstractive ATS approaches are generally divided into three main categories: structure-based, semantic-based and deep learning-based approaches [29]. Structure-based approaches use pre-defined structures such as graphs and ontologies, whereas semantic-based methods mainly rely on natural language generation systems and semantic representations of the text to generate the summary.
Deep learning-based approaches use deep neural networks to build ATS systems and tend to report encouraging results. In particular, the sequence-to-sequence (seq2seq) model has shown impressive results in abstractive ATS for the English language [30]. For such approaches, a Recurrent Neural Network (RNN) [31] encoder–decoder with attention is utilized. For example, Hou et al. [30] proposed a seq2seq model for ATS with several phases: converting the dataset to plain text, storing the original texts (news articles) and the summaries separately, segmenting words, and representing the words with pre-trained vectors. The output of these steps is then used to train the model, with a bidirectional Long Short-Term Memory (LSTM) encoder and a unidirectional LSTM decoder. The experiments were conducted with a Chinese public dataset made available by the NLPCC2017 shared task 3 (http://tcci.ccf.org.cn/conference/2017/taskdata.php, accessed on 2 November 2022). The dataset consists of around 40K document-summary pairs for training and 2K texts without matching summaries for testing. The reported results were 0.34, 0.21 and 0.30 on ROUGE-1, ROUGE-2 and ROUGE-L, respectively. Chen et al. [32] have also proposed a method using the attention mechanism, in which a bidirectional gated recurrent unit architecture performs the encoding and decoding. Additionally, Gu et al. [33] added a copying mechanism to the neural encoder–decoder model to aid sequence learning. In this approach, the copying mechanism determines which portion of the input sequence should be copied to the appropriate location in the output sequence. The approach was evaluated on the LCSTS [34] dataset, a sizable dataset for short-text ATS, and reported a slight improvement of 2–4% on average in ROUGE scores over models without a copying mechanism.
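Since the results above are reported as ROUGE-1, ROUGE-2 and ROUGE-L scores, a small sketch of how such scores can be computed may help. It assumes whitespace tokenization and reports the F1 variant; published toolkits differ in tokenization, stemming and other details, so the numbers from this sketch are not comparable with the figures cited in the text.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: clipped n-gram overlap between candidate and reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # '&' keeps the minimum count per n-gram
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall, f1(precision, recall)

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L: precision, recall and F1 based on the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    precision = lcs / max(len(cand), 1)
    recall = lcs / max(len(ref), 1)
    return precision, recall, f1(precision, recall)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
for name, scores in [("ROUGE-1", rouge_n(candidate, reference, 1)),
                     ("ROUGE-2", rouge_n(candidate, reference, 2)),
                     ("ROUGE-L", rouge_l(candidate, reference))]:
    print(name, [round(s, 3) for s in scores])
```

In the toy pair, five of the six unigrams and three of the five bigrams match, so ROUGE-2 is noticeably lower than ROUGE-1; the same pattern, where bigram overlap is harder to achieve, is visible in the scores quoted above.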
Following the direction of using attention mechanisms, Vaswani et al. [35] proposed the novel and now well-known “transformer” architecture. This architecture computes its input and output representations without relying on sequence recurrence or convolution. It is also known for its efficiency in terms of training time and performance compared with standard deep learning approaches. Most recently, following the BERT breakthrough, pre-trained TLMs have gained a great deal of popularity in the fields of AI, ML and NLP, achieving state-of-the-art results in a variety of tasks, including ATS in general and abstractive Arabic ATS in particular [11].
Several review and survey articles summarizing the efforts on Arabic ATS have been published recently. For example, Elsaid et al. [9] provide an overview of recent research on the Arabic language with a particular focus on deep learning ATS approaches, as well as an explanation of the general architecture, advantages, and disadvantages of Arabic ATS approaches. They also shed some light on two initial extractive BERT-based approaches for Arabic ATS, namely those of Elmadani et al. [36] and Abu Nada et al. [37], which use a multipurpose Arabic dataset (KALIMAT [38]) with slightly more than 20K articles paired with their extractive summaries.
Nevertheless, to date there is no comprehensive comparison of all existing deep TLM-based Arabic ATS systems that obtain state-of-the-art (SOTA) results on various dedicated datasets. The goal of this paper is to address this gap.
3. Text Summarization Methodology
Text summarization is the act of condensing long documents into meaningful passages or sentences. The technique extracts essential information while ensuring that the meaning of the text is preserved. This shortens the time it takes to understand long materials, such as research articles, without losing essential information. In other words, text summarization is the process of producing a brief, coherent, and fluent summary of a longer text document that highlights the text’s main points.
Text summarization presents several challenges, including content identification, interpretation, summary generation, and evaluation of the resulting summary. Identifying significant phrases in the document and exploiting them to uncover relevant information to add to the summary are fundamental tasks in extraction-based summarization. As highlighted earlier, there are several main types of text summarization, as shown in Figure 1. In this study, we focus on abstractive text summarization for the Arabic language with a single-document input. In particular, the sole focus is on TLM-based approaches.
Abstractive ATS approaches are classified as structure-based, semantic-based, discourse-structure-based and deep learning-based techniques. They require more analysis of the input source text and are mostly founded on understanding the semantics of a given article, restructuring sentences at the word level, and, lastly, producing summaries with fewer and clearer words [39]. Summary generation can produce new sentences instead of just replicating sentences from the source document [40]. Vaswani et al. [35] shifted the direction by introducing a new deep learning-based model, called the transformer, which makes use of several methods and mechanisms.
A transformer model is a neural network that learns context, and consequently meaning, by tracking relationships in sequential data, such as the words in this sentence. Transformer models apply an evolving set of mathematical techniques, called attention or self-attention, to detect the subtle ways in which even distant data elements in a series influence and depend on one another. Transformers [35] are among the newest and most powerful classes of models designed to date.
They are driving a wave of advances in AI, ML and NLP that some have dubbed transformer AI or transformer NLP. The transformer model consists of encoder and decoder layers, which are coupled to one another through feed-forward network layers and multi-head attention. Positional encodings, produced with sine and cosine functions, help the model keep track of the order and position of words. Self-attention is the method used by the multi-head attention layers of the encoder and decoder (see Figure 2).
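The two mechanisms just mentioned, sinusoidal positional encoding and self-attention, can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions rather than a full transformer layer: it uses a single attention head, and the learned query, key and value projections are replaced by the identity (Q = K = V = input), which is an assumption made only to keep the sketch short.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each (sin, cos) pair shares one frequency: 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with identity projections (Q = K = V = x)."""
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        row = softmax(scores)
        weights.append(row)
        # Each output vector is a weighted average of all input vectors.
        outputs.append([sum(w * value[c] for w, value in zip(row, x)) for c in range(d)])
    return outputs, weights

# Three toy token embeddings; tokens 0 and 2 are identical.
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
encodings = positional_encoding(len(tokens), 4)
inputs = [[t + p for t, p in zip(tok, enc)] for tok, enc in zip(tokens, encodings)]
outputs, weights = self_attention(inputs)
for row in weights:
    print([round(w, 3) for w in row])
```

Without the positional encoding, tokens 0 and 2 would be indistinguishable to the attention step because every token is compared with every other token at once; adding the encoding gives each position a distinct input vector.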
The revolution of TLMs emerged from transformer-based models. For example, Bidirectional Encoder Representations from Transformers (BERT) [16], an encoder-based TLM that is trained bidirectionally, was introduced by Google AI. The BERT model’s inputs are encoded using a specific format that consists of three pieces: WordPiece token embeddings, segment embeddings, and position embeddings. It should be noted that every sequence starts with the special “CLS” token.
Typically employed for classification tasks, this token can be seen as a representation of the whole input sequence. Additionally, each sentence ends with the special separator token “SEP”. There are various versions of BERT for different languages, such as CamemBERT for French [41] and ArabicBERT [42], AraBERT [43] and CAMeLBERT [44] for Arabic. Likewise, Radford et al. [45] presented the Generative Pre-training Transformer (GPT) model, which is built from a stack of 12 decoder layers. Byte Pair Encoding (BPE), a data compression algorithm suited to word segmentation that can encode rare and out-of-vocabulary (OOV) terms, is used to encode the input sequences. Position information must also be encoded, since transformers (in contrast to RNNs) process all the input tokens at once and hence have no inherent notion of token order. One limitation of this model is its unidirectional nature: it was designed only to predict the next word from the preceding words, not the other way around. It was later followed by GPT-2 [46] and GPT-3 [20].
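To make the role of Byte Pair Encoding concrete, the following minimal sketch learns BPE merges from a toy word-frequency table. It is an illustration of the general technique, not the tokenizer of any model named above: the character-level starting vocabulary, the `</w>` end-of-word marker, and the toy corpus are assumptions chosen for readability.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Repeatedly merge the most frequent adjacent pair of symbols."""
    # Start from single characters plus an end-of-word marker (an assumption).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Hypothetical toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(corpus, num_merges=3)
print(merges)
print(vocab)
```

Frequent character sequences such as a shared suffix become single symbols after a few merges, while a rare or unseen word can still be represented as a sequence of smaller learned pieces, which is how BPE handles OOV terms.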
The primary contribution of TLMs was to pre-train one general TLM and fine-tune it directly for different tasks. For instance, without significant task-specific architecture modifications, the pre-trained BERT model can be fine-tuned with just one additional output layer to produce state-of-the-art models for a variety of applications, including ATS. In particular, we simply plug the task-specific inputs and outputs (see Figure 2) into BERT and fine-tune all the parameters end-to-end for each task (the ATS task in our case). Consequently, several pre-trained models have been proposed and fine-tuned mainly for ATS tasks in different natural languages, including Arabic, and give fairly good summaries. Among them, multilingual Bidirectional and Auto-Regressive Transformers (mBART) [47], Pre-training with Extracted Gap-sentences for Abstractive Summarization (PEGASUS) [48], and mT5 [49] are the models targeted in our study and are discussed in further detail in the experiments (Section 4.4). The overall methodology for TLM-based ATS systems is summarized in Figure 3, which is also the methodology we followed in this comparative study.
6. Discussion
For the ANA dataset, shown in Table 2, PEGASUS-XSum and PEGASUS-Large, the two PEGASUS versions used, report the best results on ROUGE-1 and ROUGE-2 by a good margin but are slightly beaten by AraBART on ROUGE-L and ROUGE-LSUM. Even though AraBART has a quarter of the parameters of the other models, it still reports the best or comparable results on all metrics on the ANA dataset because it is pre-trained and fine-tuned solely for Arabic ATS. mBART struggles irrespective of the metric used, which is also the case with the other two datasets, as we see below.
Table 3 presents the results obtained with the AHS dataset. In this comparison, the PEGASUS models report the top two results, with PEGASUS-XSum demonstrating superior performance. In contrast to the ANA results, AraBART struggles with the AHS dataset, managing to score only half of what PEGASUS-XSum achieves. Similar results were obtained with the WikiHow dataset (Table 4). In particular, the PEGASUS family tends to outperform the other models. PEGASUS-Large reports the best performance, scoring 95% in most metrics. Both mT5 and AraBART perform relatively well on some metrics but are unable to achieve good results on ROUGE-2. It is also worth noting that mBART continues to struggle.
The TLM-based ATS model PEGASUS surpasses the baseline model by a large margin regardless of the dataset or evaluation metric used. These results justify the rapidly growing use of TLMs for ATS systems.
Overall, according to the results detailed above, the PEGASUS models in the two versions used (PEGASUS-Large and PEGASUS-XSum) obtain the best results, because of their nature and because they are dedicated to the very type of task in question, abstractive text summarization. The results of mBART, the multilingual version of BART, do not yet match those of the stronger models. However, the Arabic version, AraBART, shows many improvements on all datasets, especially ANA. The highest results among the compared models were obtained on the WikiHow dataset with the PEGASUS family. This might be explained by the nature of the models, as well as by the length and nature (e.g., title, highlight) of the summaries used as input at training time.
7. Conclusions
This paper offers a thorough comparative analysis of state-of-the-art TLM-based Arabic ATS models (e.g., mBART, mT5, PEGASUS, and AraBART) on various text summarization datasets, including Arabic News Articles (ANA), WikiHow, and Arabic Headline Summary (AHS). Specifically, the work presented in this paper makes the following main contributions. A complete comparison of all Arabic and Arabic-supported multilingual ATS systems that are based on abstractive TLMs was provided with multiple evaluation metrics.
It also used various Arabic datasets currently available for abstractive ATS, including Arabic Headline Summary (AHS) and Arabic News Articles (ANA), to carry out a full comparison. Moreover, we conducted an empirical analysis of the effect of fine-tuning the TLMs for Arabic ATS on the output summary, along with a comparison against deep-learning-based baseline approaches. The experimental results revealed that the PEGASUS family of models outperforms the other TLMs studied and is superior to the baseline deep-learning approach. The PEGASUS models in the two versions employed (PEGASUS-Large and PEGASUS-XSum) obtained the best results because of their nature and the fact that they are dedicated to the same kind of task as the one in question, abstractive text summarization. As part of our future work, we plan to focus our efforts on multimodal ATS, as it has been shown that multimodal summarization, which uses information from the visual modality, can raise the quality of the resulting summary.