Exploring GPT-4 Capabilities in Generating Paraphrased Sentences for the Arabic Language

Alsulami, Haya Rabih; Almansour, Amal Abdullah

doi:10.3390/app15084139

Open AccessArticle

Exploring GPT-4 Capabilities in Generating Paraphrased Sentences for the Arabic Language

by

Haya Rabih Alsulami

^1,2,*

and

Amal Abdullah Almansour

¹

Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

²

Department of Computer Science, College of Computing and Information Technology, Jeddah University, Al Kamil 25341, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(8), 4139; https://doi.org/10.3390/app15084139

Submission received: 16 February 2025 / Revised: 1 April 2025 / Accepted: 2 April 2025 / Published: 9 April 2025

Download

Browse Figure

Versions Notes

Abstract

:

Paraphrasing means expressing the semantic meaning of a text using different words. Paraphrasing has a significant impact on numerous Natural Language Processing (NLP) applications, such as Machine Translation (MT) and Question Answering (QA). Machine Learning (ML) methods are frequently employed to generate new paraphrased text, and the generative method is commonly used for text generation. Generative Pre-trained Transformer (GPT) models have demonstrated effectiveness in various text generation tasks, including summarization, proofreading, and rephrasing of English texts. However, GPT-4’s capabilities in Arabic paraphrase generation have not been extensively studied despite Arabic being one of the most widely spoken languages. In this paper, the researchers evaluate the capabilities of GPT-4 in text paraphrasing for Arabic. Furthermore, the paper presents a comprehensive evaluation method for paraphrase quality and developing a detailed framework for evaluation. The framework comprises Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Lexical Diversity (LD), Jaccard similarity, and word embedding using the Arabic Bi-directional Encoder Representation from Transformers (AraBERT) model with cosine and Euclidean similarity. This paper illustrates that GPT-4 can effectively produce a new paraphrased sentence that is semantically equivalent to the original sentence, and the quality framework efficiently ranks paraphrased pairs according to quality criteria.

Keywords:

Arabic; NLP; GPT-4; AraBERT; semantic similarity; paraphrasing; AI; LLM

1. Introduction

NLP focuses on the interaction between computers and human (natural) languages. It is a field considered to be at the intersection of computer science, Artificial Intelligence (AI), and computational linguistics [1]. AI is defined as “the automation of activities that we associate with human thinking, activities such as decision-making, problem-solving, and learning” [2]. NLP has numerous applications based on the textual form, for instance, a chatbot (chatter robot) [3], autocomplete suggestions [4], MT, QA [5], and paraphrase generation [6]. Paraphrasing transforms a given text into another text while keeping its semantic meaning [7]. Paraphrase generation significantly impacts several NLP applications because it allows the use of different word expressions to create similar text. Therefore, generating an alternative text for the same meaning enhances the performance of other NLP applications such as semantic parsing, machine translation, query formulation in web searching, question answering, and data augmentation [6,8]. ML is a branch of AI that investigates computer learnability based on data [9]. Recent advancements in ML have significantly impacted NLP [10], and there are many ML approaches to generating paraphrasing.

ML approaches include rule-based [11], sequence-to-sequence [12], reinforcement learning [6], transformer model [12], and generative models [13]. GPT is a generative model commonly used for text-generation purposes [14]. However, GPT could be used for purposes exceeding text generation. For instance, Refai et al. (2023) [10] use Arabic GPT-2 (AraGPT-2) for data augmentation, Yang et al. (2022) [15] utilize GPT-3 to improve Knowledge-based Visual Question Answering (VQA), and Sahib et al. (2023) [16] investigate the performance of ChatGPT-3.5 and ChatGPT-4.0 in proofreading and rephrasing a given paragraph in English. To our knowledge, GPT-4’s capability in Arabic paraphrase generation remains underexplored, although Arabic is one of the most widely spoken languages globally. This gap shows a significant research opportunity to evaluate GPT-4 capabilities in producing high-quality, lexically diverse, and semantically equivalent paraphrased sentences for Arabic text, which can be used across various downstream applications. Additionally, our proposed methodologies and approaches can be adapted to languages with similar morphological and syntactic structures.

This research explores and evaluates GPT’s capabilities in paraphrasing Arabic text. To optimize paraphrasing performance, this study identifies effective prompt engineering and inquiries about similarity scores for paraphrased and non-paraphrased pairs. In addition, paraphrase generation is covered for text generally in different domains but without studying the impact of domains [17]; thus, this study investigates the impact of text categories on the quality of generated paraphrased sentences. This research proposes a detailed evaluation framework for assessing paraphrasing quality, comprising lexical diversity, content retention, and semantic similarity. Furthermore, this work includes releasing a new Arabic paraphrasing corpus containing approximately 200 K pairs, which is designed to support future research and advancements in AI and Arabic NLP research. For instance, the corpus could be used to build a new paraphrasing model for text in specific domains, and the corpus might combine with datasets from other languages to generate a multilingual paraphrasing model. On the other hand, the corpus could be used to analyze AI-generated text from linguistic perspectives, such as investigating word selection and sentence structure. This paper is organized as follows. Section 2 reviews related work and fundamental concepts, while Section 3 outlines the research methodology. Section 4 then presents the results and discussion, focusing on the expected corpus statistics and their evaluation.

2. Paraphrasing Theory and Models

This section provides essential definitions and concepts, such as paraphrasing types and evaluation metrics. It also reviews related works by highlighting their research aims and illustrating their methodologies. Furthermore, it gives examples of datasets used in the relevant research papers.

2.1. Fundamental Concepts and Theoretical Foundations of Paraphrasing

In linguistics, there are many techniques for paraphrasing a given text. The most common technique is using synonyms. The synonym approach depends on identifying keywords within a given text and finding appropriate synonyms to substitute for them. The second technique is using a different form of a word. This approach is based on using another form of the word within a given text, such as changing an adjective to a noun, a noun to a verb, etc. The third technique is modifying the used tense from active tense to passive tense, and this approach depends on grammar structures for the target language. The fourth technique is altering the word order; simply, it keeps the words used but modifies the word order. The last technique combines all the previous methods by combining two or three techniques [18]. Table 1 illustrates paraphrasing methods by mentioning examples. Paraphrasing can be expressed on many levels: word level, phrase level, sentence level, paragraph level, or document level. LD is a primary principle in paraphrase evaluation [19], which is a measure to assess how diverse words or vocabulary are within a text [20]. Evaluating the quality of paraphrasing requires language experts to perform human evaluations.

Human evaluation is a critical component in numerous NLP tasks [7]. However, it remains a challenging stage across various tasks, which include machine summarization [21], machine translation [22], and text paraphrasing [19]. Human evaluation is both time-consuming and costly because it often requires human labor (a language expert) [22]. Additionally, human experts might evaluate a given text differently, leading to score variability. Furthermore, human resources are not reusable, meaning each evaluation cycle requires additional time, effort, and cost. Therefore, Papineni et al. (2002) [22] considered finding an automatic approach to evaluating textual data by a machine rather than a human. Next, other researchers conducted new methods for assessing textual data using a machine; these evaluators are called automatic metrics. Automatic metrics in NLP are essential tools for evaluating the quality of generated text, particularly in tasks such as machine translation and summarization. These metrics include widely recognized ones like BLEU, which is used for machine translation, and ROUGE, which is developed for machine summarization [21,22].

BLEU measures how close the given machine-generated translation (candidate) is to human translation (reference), which could be one, two, or three references. BLEU depends on calculating precision based on n-grams by finding the total number of overlapped n-grams between the machine-generated text and reference texts [23]. BLEU computes the geometric mean of the precision scores for many n-grams from unigram (1-gram) to four-gram (4-gram). Additionally, BLEU considers sentence length by applying a brevity penalty to overcome the issue of low short translation scores [24]. The metric has been widely adopted in various NLP tasks beyond translation, including text summarization, image captioning, and text paraphrasing [12,25]. ROUGE could be calculated in multiple n-grams, such as unigrams and bigrams, and includes many variants, such as ROUGE-L and ROUGE-N. ROUGE-L calculates the longest common subsequence between two given sequences (text and summary are in token format), which is used to capture the structure of sentences. Furthermore, ROUGE-N calculates the overlap of n-grams, while ROUGE-1 evaluates unigrams and ROUGE-2 assesses bigrams [26]. The score of ROUGE involves calculating recall, precision, and F1 scores based on n-gram matches [27]. ROUGE-Precision counts the numbers of n-grams in the candidate summary that appear in the reference text, while ROUGE-Recall calculates the number of n-grams in the reference text that are in the candidate. ROUGE-F1 is the harmonic mean of ROUGE-Precision and ROUGE-Recall [26]. Table 2 displays the formulas of both BLEU and ROUGE. However, BLEU and ROUGE cannot capture the semantic meaning of two given texts; therefore, other metrics measure how the meanings of two given texts are closely related [28]. BLEU and ROUGE range from zero to one [0–1], and higher values indicates better quality.

Semantic similarity is a crucial consideration in NLP, and it is used for various NLP applications, including text paraphrasing [19], data augmentation [10], text summarization [29], and machine translation [30]. Many metrics are used for semantic similarity, such as cosine similarity, Euclidean distance, and the Jaccard score. Table 3 provides the formula for each similarity metric. Cosine similarity depends on finding the angle between two given vectors regarding their direction. Its value ranges from −1 to 1, where 1 means identical meaning, 0 means dissimilar, and −1 means the opposite [31,32]. Euclidean distance calculates the distance between two vectors, where zero means identical vectors and a smaller distance means more semantic similarity [8]. At the same time, Jaccard similarity computes the intersections between two given sets of tokens, which are primarily words [33]. Cosine similarity and Euclidean distance are commonly used with word embedding.

Word embedding is an NLP method representing vocabulary words and phrases into numerical vectors [30,34]. For text embedding, the Bi-directional Encoder Representation from the Transformers (BERT) model [13] and the AraBERT model [35] are commonly used in a variety of NLP applications. The BERT model can understand the context of the words in a given sentence [36]. Moreover, it demonstrates a powerful language representation due to its ability to capture semantic and contextual meaning. BERT has led to significant advancements in multiple NLP tasks, such as sentiment analysis [37], sarcasm identification [38], information retrieval [39], question answering [36], machine translation [40], and classifying paraphrased text [41]. BERT utilizes the concept of deep bidirectional training of transformer architectures. BERT aims to capture the context of words in a sentence by moving in both directions (left and right) simultaneously [13]. BERT is generated using English text [42,43]. Other researchers depend on the BERT architecture to develop different versions of BERT using various datasets in different languages.

Antoun et al. (2020) [35] pre-trained the BERT model for three Natural Language Understanding (NLU) tasks, which are Sentiment Analysis (SA), Name Entity Recognition (NER), and QA. Moreover, they cover both Modern Standard Arabic (MSA) (formal language) and Dialectal Arabic (DA) (informal language). The AraBERT model is publicly available and could be used as a baseline for other NLP tasks because it is trained using 24 GB data with 64 K vocabulary tokens [35]. In addition, AraBERT is developed to capture the structural aspects of Arabic by utilizing the text tokenization stage. AraBERT shows its outperformance in various Arabic NLP tasks [35], for example, toxic tweet classification [44], emotion analysis [45], fake news detection [46], textual similarity [47], sentiment analysis [35], named entity recognition [35], and question-answering [35]. Therefore, AraBERT can understand and produce Arabic text more effectively [48], and its performance is the most advanced for many NLP tasks [34]. AraBERT is used in word embedding to represent each word into a vector of size 768 [30,34]. Similarity metrics and AraBERT embedding could be used to evaluate the performance of text paraphrasing models. Moreover, several approaches exist to generate new paraphrased text using AI methods.

These AI approaches include rule-based, sequence-to-sequence, reinforcement learning, and deep generative models. In the rule-based method, the AI model depends on evolving a system that could eliminate the number of possible alternative texts using defined constraints. The model focuses on the text that matches the defined constraints, and the model must identify which word is suitable for synonym substitution [11,49]. In a sequence-to-sequence (seq2seq) method, the AI model is developed based on the encoder–decoder paradigm [6]. This model contains three main parts: encoder, decoder, and context layer. A given input goes into the encoder to result in a vector, which is then used as input for the context layer. The context layer is a context vector of function h, which forwards the essence of the input to the decoder. Finally, the decoder creates a task-specific output as a sequence [50]. More complex additional steps are taken in the reinforcement learning and deep generative model.

The reinforcement learning model depends on two main mechanisms: the statistical mechanism and the dynamic programming mechanism. Both mechanisms enhance decision-making worldwide by assessing the benefit of taking action. The model works based on reward feeding after it chooses an action and then informs the agent of the reward and the updated state [51]. Its major aim is to find the ideal reward function. More specifically, paraphrasing is built based on the sequence-to-sequence model and adds the optimal reward function, which learns from a given pair of paraphrased texts [6]. The deep generative model is built based on the sequence-to-sequence model, and the model evolves into additional layers. The intermediate layers differ from one research to another. For illustration, Iyyer et al. (2018) [52] inserted a parser generator within the encoder–decoder. Mahmoud and Zrigui [53] preferred a hybrid neural network comprising the Convolutional Neural Network (CNN) with an attention model for the Arabic text. GPT is one of the trendiest generative models in AI.

GPT was introduced in 2018 by OpenAI, which shows a significant shift in language model development. GPT is built based on transformer architecture using the attention technique to understand and generate human-like text. GPT comes in many versions, such as GPT-1 [54,55], GPT-2 [55], GPT-3 [56], and GPT-4 [57]. Its main idea is to utilize a vast and diverse textual corpus to train the model using an unsupervised learning approach (unlabeled data), followed by fine-tuning downstream tasks using supervised learning (labeled data) [54]. A clear and specific prompt must be provided to guide the GPT behavior and ensure its generated outcomes meet the wanted requirements. The prompt means the instructions are provided to the GPT, and the prompt significantly enriches the model’s capabilities. Besides, each task requires a different prompt based on its requirements and context [58]. The next part mentions various works using multiple AI approaches for text generation. In addition, it spotlights different uses of GPT in numerous applications and cites some examples of GPT prompts. Furthermore, the next part covers some common datasets used in building AI models for text generation aims.

2.2. AI-Based Methods for Paraphrasing

A range of studies have employed diverse AI methodologies for text generation. In the context of a rule-based approach, McKeown (1983) [11] developed a rule-based paraphrase system for English text, Salloum and Habash (2011) [49] worked on improving the quality of Arabic–English statistical machine translation of dialectal Arabic text. Salloum and Habash [49] depend on morphological knowledge to define a lightweight rule-based approach to producing MSA paraphrasing of dialectal Arabic words. Regarding the seq2seq approach, Nagoudi et al. (2022) [17] introduced pre-trained transformer-based seq2seq models. They also investigate a variety of tasks, such as news title generation, question answering, machine translation from Arabic to a foreign language, and paraphrase generation. Their model is called AraT5, which is highly significant in Arabic NLP due to its effectiveness in tasks such as MT. Ormazabal et al. (2022) [59] focused on paraphrasing generation tasks using round-trip machine translation. More specifically, in their system, the encoder eliminates the unrelated information from a given translation sentence, and then the decoder uses the result to reconstruct the paraphrased sentence. Gudkov et al. (2020) [60] produced the first Russian paraphrase corpus and developed a paraphrasing model. Fu et al. (2019) [61] evolved a paraphrase generation system for the latent bag-of-words in English text.

In the context of the reinforcement approach, Sancheti et al. (2022) [62] developed a reinforcement learning-based paraphrase generation task using English text. Li et al. (2017) [6] enhanced paraphrase generation by evolving a novel deep reinforcement learning using an English textual dataset. In the context of the deep generative approach, Mahmoud and Zrigui (2021) [53] worked on detecting paraphrasing in Arabic, and their model is based on a CNN with attention model. Li et al. (2021) [27] employed a deep learning model for paraphrase detection from an English text. Iyyer et al. (2018) [52] evolved a model for English that introduces a novel approach to generating a paraphrase from a given sentence by defining a particular syntactic form. Table 4 outlines a comparison of the AI approaches used in paraphrasing. All GPT versions can produce paraphrasing but with various capabilities, and GPT can be employed in different applications, which will be explained in the following section.

GPT has been applied across many fields for different purposes. For example, Ding et al. (2023) [63] found that GPT-3 could be used as a general-purpose data annotator in NLP tasks. Surameey and Shakor (2023) [64] used GPT-3.5 as a debugging tool, and they concluded their experiment by asking to add GPT as part of a comprehensive debugging tool. Goyal et al. (2022) [65] used GPT-3 as a summarization tool, and Gutierrez et al. (2022) [66] investigated GPT-3’s ability in Information Extraction (IE) and Relation Extraction (RE) for bioinformatics datasets. Yang et al. (2022) [15] examined GPT-3’s ability to answer visual questions by converting each question into an equivalent caption, facilitating the model’s comprehension. Each piece of research defines a different prompt based on research objectives. For illustration, Ding et al. (2022) [63] defined three different prompts: (1) they provide GPT with a sentence and ask GPT to give a sentiment label (positive or negative), (2) they ask GPT to create a new sentence based on a required sentiment label, and (3) they give GPT a title and the required sentiment label then ask GPT to produce a new sentence in that title with that sentiment label. Furthermore, Goyal et al. (2022) [65] posted an entire article to the GPT and asked it to summarize the article into one, two, or three sentences. However, the GPT could produce a new dataset or expand an existing one. ML could generally be used to create new data items, which serve in further modeling stages.

For instance, Rafai et al. (2023) [10] used AraGPT-2 to expand the dataset and release a new version of the corpus; the produced corpus is then employed in sentiment text classification. Their experiments show the effectiveness of using AraGPT-2, which improves modeling performance by 7% to 13% for different datasets. Moreover, for the paraphrasing task by Nagoudi et al. (2021) [17] in their common AraT5 model, the MT model produces a new paraphrased sentence for a given text in English. This releases a new corpus of paraphrased pairs where the original text is the Arabic text that is human-generated, and its equivalent paraphrased text is MT-generated. The produced corpus is used to build a model for text paraphrasing, but they found that using MT to make a new paraphrased pair can capture the meaning slightly. Other researchers follow several methods and use multiple sources to collect paraphrasing pairs. The following part highlights some of the commonly used corpora.

Paraphrasing research suffers from the lack of parallel corpora [67]. Due to the complex morphology of the Arabic language, finding high-quality and well-structured data for paraphrasing is challenging [68]. Therefore, researchers utilize an open dataset in developing paraphrasing models. For instance, Arabic Language Generation (ARGEN) is an Arabic paraphrasing dataset containing 123.6 K pairs provided by adopting semantic similarity datasets and MT datasets [17]. Mahmoud and Zrigui (2021) [53] used Open-Source Arabic Corpora (OSAC) and the King Saud University Corpus of Classical Arabic (KSUCCA) to produce paraphrased pairs by substituting synonyms, which releases 1 K sentence pairs. BAR and Dershowitz (2021) [69] created a corpus of similar documents by extracting text from published news from several agencies. The extraction process is based on picking one published document and then finding one that achieves the highest similarity score, which launches 100 correctly paired documents. There are various datasets for other languages, such as English and Spanish.

Quora Question Pair (QQP) is an English dataset created from Quora questions, where a pair is annotated as a paraphrased question pair or not. The researchers depend on identifying the duplicated questions to be labeled as a paraphrased pair [70], and this delivers 150 K paraphrase pairs [7]. Twitter Uniform Resource Locator (TwitterURL) is a sentential paraphrase dataset in English extracted from Twitter. It groups tweets as paraphrased sentence pairs when tweets refer to the same URL to produce a corpus with 51 K sentence pairs [68]. Para-Phrase DataBase (PPDB) is a paraphrase dataset that includes phrasal and lexical paraphrases but does not contain sentence paraphrases [71]. It comprises 220 M paraphrase pairs, which are 8 M lexical paraphrases and 73 M phrasal paraphrases [7]. This dataset covers the English language but has an extended version for the Spanish language. The Spanish version contains 196 M paraphrase pairs, which are 33 M lexical paraphrases [71]. Table 5 summarizes dataset details; it is noticeable that there are limited corpora with tiny sizes. In summary, AI approaches are used to produce paraphrasing pairs, but paraphrasing suffers from a lack of paraphrasing corpora. The following section argues for the methodology used to investigate GPT capabilities in producing Arabic paraphrasing sentences and evaluate paraphrase quality using many metrics to release a high-quality paraphrasing corpus.

3. Methodology

This study investigates the capabilities of GPT in paraphrasing, with a focus on Arabic language text. A structured methodology was designed to achieve the research aim, comprising the following essential phases: (1) data selection, (2) data processing, (3) modeling, and (4) evaluation. Each phase plays a critical role in ensuring the reliability and validity of the generated paraphrasing. The following subsections illustrate each phase intensively, and Figure 1 shows an overview of the research methodology.

3.1. Data Selection

This phase requires an exploration of the open Arabic dataset and then the identification of a suitable dataset with the research aim. This phase includes (1) an exploration of open Arabic datasets, (2) an exploration of dataset format and statistics, (3) an investigation of text quality, and (4) a selection of the data. Intensive exploration of the dataset found several challenges in open datasets. For instance, (1) the text file is a mixture of HTML tags and Arabic sentences, (2) the dataset is formed of words rather than complete sentences or articles, (3) the dataset contains informal language such as dialectical phrases found in tweets, and (4) the corpus comprises very short sentences, some of which do not exceed three words. This research utilizes the Single-Labelled Arabic News Articles Dataset (SANAD). SANAD is an MSA textual data for classification purposes created by Einea et al. (2019) [72], and it is freely available. It contains 200 K articles from three newspapers (website-based): AlKhaleej, AlArabiya, and Akhbarona. Additionally, articles of the entire corpus are classified into seven categories: Culture, Finance, Medical, Politics, Religion, Sports, and Technology. This research selects articles from AlKhaleej only because its contents are manually explored and comprise an equal number of articles, which is 6500. Furthermore, this research scope includes culture, technology, and sports articles due to the limitation of research time. The dataset is cleaned, and there are no HTML tags, URLs, etc.; each article contains plain text without a title. Therefore, the dataset is ready for the next phase, which is the data processing phase.

3.2. Data Processing

The data processing phase covers (1) reading the article, (2) tokenization into sentences, and (3) saving each sentence separately. White space removal is needed before sentence tokenization. Tokenization means splitting a text into tokens, such as words, phrases, sentences, numbers, or punctuation [2,73]. Punctuation marks are used to identify the edges of sentences, which could be terminated by a full stop (.), question mark (؟), or exclamation mark (!), depending on the context of the sentence [74]. SANAD [72,75] data are collected in the document format. Thus, the data are tokenized into sentences by determining the edge of each sentence. For this task, we first depended on using NLTK [75] library sentence tokenization for the culture and technology categories. However, NLTK could not cover all possible cases in the dataset, such as the number of goals between circle brackets for the sports category. Later, tokenization uses regular expressions to catch up with Arabic sentence endings and consider the category’s nature. Additionally, the text requires cleaning steps, which include removing extra white spaces between the article’s sentences and spaces before and after punctuation. The phase produces a pool of sentences; each sentence is saved separately in a text file format. Every sentence is used individually in the modeling phase.

3.3. Modeling

Modeling is the process of using GPT-4 to produce a new paraphrased sentence. The modeling phase involves (1) reading a sentence, (2) asking GPT-4 to paraphrase it, and (3) saving the sentence and its paraphrasing as a pair. Prompt engineering plays a vital role in GPT results because the different prompts lead to differences in the performance of the Large Language Model (LLM) [76]. Additionally, the prompt must identify the needed outcome clearly and precisely. For instance, Goyal et al. (2022) [65] use three prompts of summarization for a given article by asking GPT to summarize a given article into a summary of one, two, and three sentences. This research conducts many primary experiments to identify a suitable prompt for this research. The experiments aim to answer the following questions: (1) Which language, Arabic or English, is most appropriate for writing a prompt? (2) What Arabic prompt is suitable for producing a new paraphrased sentence? (3) Can GPT paraphrase an entire article in a single API call? (4) If a sentence is given to the GPT and requests to paraphrase, does the GPT produce the same paraphrased sentence for 10 API calls? (5) Can GPT understand the different types of lexical paraphrasing methods, such as using different word forms or synonyms, or changing sentence tense?

The experiments found the following. (1) GPT could interpret paraphrasing commands and responses even if we use English or Arabic prompts to rephrase Arabic sentences. (2) GPT can understand many Arabic prompts to paraphrase a sentence, such as “أعد الصياغة” (rephrase it) and “اكتبها بطريقة أخرى” (rewrite it uses another way). However, GPT misunderstands “أعد الكتابة” (rewrite it), and it repeats the sentence without rephrasing it. (3) GPT can paraphrase an entire article once its length is within its range of capabilities, which means GPT cannot paraphrase a long article using an API call (meaning without splitting it into parts). (4) Calling GPT to paraphrase a sentence rapidly leads to creating a new paraphrased sentence for each call using different synonyms and sentence structures. However, GPT lost its ability to paraphrase and copy the original sentence as a paraphrased sentence after many calls. Thus, calling GPT several times to paraphrase a sentence does not lead to generating the best-paraphrased sentence. Sometimes, GPT produces more than one paraphrased sentence for one GPT API call, which releases a single original sentence and multiple paraphrased sentences, and this means there is no unification in results. Accordingly, an attached prompt is essential to force GPT to produce only one paraphrased sentence for each call.

After the primary experiment to explore the GPT prompt, this research selects a single prompt, which is, in Arabic “أعد صياغة التالي في جملة واحدة”, and its translation in English “rephrase the following in one sentence”. After exploring the Google Search engine and Google Schooler engine for paraphrasing in Arabic, it is noticeable that “أعد الصياغة” (rephrase it) and “أعد الكتابة” (rewrite it) are the most commonly used prompts, but “أعد الصياغة” (rephrase it) is the most used in academic research; thus, “أعد الصياغة” (rephrase it) is chosen as GPT prompt. Moreover, specifying the number of sentences needed, “جملة واحدة” (one sentence) is appended to the prompt. Table 6 demonstrates an example of a prompt in Arabic and the equivalent translation in English. The modeling stage is done by calling GPT API for each sentence separately, which means a single API call for each sentence is run in isolation from other related sentences (sentences from the same article). This research uses the gpt-4o-2024-05-13 version to have fixed results for all API calls. In addition, this research uses a Google Colab environment [77] with RAM [1.21–12.67] GB and T4 GPU for Python programming language (version 3.11.11). This research requires splitting the corpus into smaller parts and processing 200–500 articles daily using a Python script due to GPT API limitations. This phase releases an Arabic paraphrasing corpus, where the original sentence is human-generated text, and its paraphrased sentence is model-generated text. Exhaustive evaluation steps are taken to ensure the quality of the paraphrased pair.

3.4. Evaluation

This research depends on regular NLP metrics, LD, semantic similarity, and manual evaluation by Arabic language experts to evaluate the quality of paraphrased sentences and examine GPT-4 performance. More specifically, the evaluation phase comprises (1) calculating BELU and ROUGE, (2) finding similarity, (3) comparing sentence scores with thresholds, and (4) setting level. Furthermore, investigation by this research leads to defining the quality ranking framework for paraphrasing. For primary judgments, Arabic experts employ manual evaluation in this research to assess the quality of paraphrasing. A random corpus subset is chosen, and the research selects 120 pairs from every category, meaning 420 paraphrased pairs. The evaluators are chosen based on their academic qualifications, with a minimum requirement of a master’s degree or higher in Arabic linguistics.

Furthermore, each evaluator must have had work experience in Arabic for at least two years. The research implements many steps to ensure reliability and minimize potential bias or tampering. These steps are (1) multiple evaluations, (2) independent evaluation, (3) clear evaluation criteria, and (4) anonymous review. Three Arabic language experts were chosen for evaluation, and each expert evaluated the sample separately without communicating with other experts or seeing other evocators’ results. The evaluation criteria are defined clearly and in isolation from each other. In addition, an orientation session is done for each evaluator individually to ensure their understanding of linguistic evaluation criteria. An anonymous review is essential to reduce personal bias in evaluating AI-generated sentences. Therefore, the source of the original and paraphrased sentences is not provided, and the paraphrased sentence is not explicitly defined as a paraphrased sentence of the original sentence. Instead, the evaluation survey defines the sentences as sentence A and sentence B, then asks the expert to identify their semantic meaning and determine if they are equivalent. As a result, the research maintains objectivity and eliminates subjective opinions in human evaluation tasks.

The human evaluation combines multiple paraphrasing criteria from a linguistic perspective. Evaluation criteria for paraphrasing pairs include semantic similarity, fluency, syntax, structure, synonym substitution, and changing word forms. In addition, the evaluators are asked to rank the quality of the paraphrasing based on the overall criteria. For all criteria, the evaluator has to choose from four options: (1) excellent or applicable, (2) good, (3) poor, and (4) very poor or not applicable. Human evaluation for 420 pairs indicates the overall performance of GPT, but using other NLP metrics is an essential stage in evaluating the entire corpus. Appendix A shows some samples of paraphrased sentences in the human evaluation survey.

The evaluation of the whole corpus covers several steps: (1) finding the average score for all regular metrics and LD, (2) measuring similarity, and (3) defining the evaluation framework. The research utilizes the most prevalent NLP, BLEU, and ROUGE for word unigram. Additionally, LD is measured in two ways: calculating the average LD score of the original text and paraphrased text and the difference between their LD scores. The following explains semantic similarity and the paraphrase quality ranking framework.

3.5. Semantic Similarity

Regarding similarity, the semantic meaning of pairs is studied using word embedding with cosine similarity and Euclidean distance. Additionally, this research calculates the similarity score using a common metric, the Jaccard similarity. AraBERT is chosen for word embedding due to its efficiency in understanding context. More specifically, the “AraBERTv0.2-base” version is chosen for this research because it employs a special pre-segmentation approach that can interpret the complexity of Arabic morphology. This research adopts semantic similarity with word embedding to explore the following: (1) measuring semantics for each of the original and paraphrased sentences, and (2) studying similarity scores for paraphrased and non-paraphrased pairs. This research computes the similarity scores among sentences to discover the impact of paraphrase pairing on similarity scores. It considers three cases: (1) each sentence in the category with its equivalent paraphrased pair, (2) two sentences from the same category but not a paraphrased pair, and (3) two sentences from different categories that are not paraphrased pair. The three cases provide a full examination view for paraphrased pairs and non-pairs regarding semantic similarity. Therefore, 40 sentences are selected randomly with their equivalent paraphrased pairs, and then finding different combinations leads to 360 sentence pairs for all three cases.

3.6. Quality Ranking Framework for Paraphrasing

This research develops a quality ranking framework to evaluate paraphrasing pairs by finding the rounded average score for each metric. More specifically, these scores serve as thresholds to rank the quality of paraphrasing corresponding to the achievement of thresholds. The threshold value for each metric is presented in the Section 4. The quality ranking contains four levels: level-0, level-1, level-2, and level-3. Level-3 is for pairs with eight, seven, or six thresholds. Level-2 is assigned to pairs that have achieved five, four, or three of the thresholds. Level-1 is for pairs that have achieved two thresholds or one threshold. Lastly, level-0 is given when no thresholds are achieved. The quality ranking framework is shown in Algorithm 1. This framework is used to study the effect of the categories on text paraphrasing quality.

Algorithm 1: A quality ranking algorithm for each paraphrasing pair.
1:	//Calculate the score of paraphrasing quality
2:	//Input: two sentences that are original text and paraphrased text
3:	//Output: the final score of paraphrasing quality level
4:	Start,
5:	Ori_tokens ← word tokenization of original sentence
6:	Par_tokens ← word tokenization of paraphrased sentence
7:	Ori_vector ← embedding using AraBERT of original sentence
8:	Par_vector ← embedding using AraBERT of paraphrased sentence
9:	BLEU_1 ← calculate BLEU for Ori_tokens and Par_tokens
10:	ROUGE_1_recall ← calculate ROUGE-Recall for Ori_tokens and Par_tokens
11:	ROUGE_1_ precision ← calculate ROUGE-precision for Ori_tokens and Par_tokens
12:	ROUGE_1_ F1 ← calculate ROUGE-F1 for Ori_tokens and Par_tokens
13:	LD_avg ← calculate the average score of LD for Ori_tokens and Par_tokens
14:	Jaccard_sim ← calculate Jaccard similarity for Ori_tokens and Par_tokens
15:	Cosine_sim ← calculate cosine similarity between Ori_vector and Par_vector
16:	Euclidean_sim ← calculate Euclidean similarity between Ori_vector and Par_vector
17:	BELU_threshold ← values range [0.4, 1) \\ 1 is excluded
18:	ROUGE_recall_threshold ← values range [0.6, 1) \\ 1 is excluded
19:	ROUGE_precision_threshold ← values range [0.5, 1) \\ 1 is excluded
20:	ROUGE_F1_threshold ← values range [0.5, 1) \\ 1 is excluded
21:	Jaccard_threshold ← values range [0.5, 1) \\ 1 is excluded
22:	Cosine_threshold ← values range [0.8, 1) \\ 1 is excluded
23:	Euclidean_threshold ← values range (0, 3.5] \\ 0 is excluded
24:	Checking threshold for all metrics:
25:	If eight, seven, or six of them are achieved ← set quality level for the pair is Level-3
26:	If five, four, or three of them are achieved ← set quality level for the pair is Level-2
27:	If one or two of them are achieved ← set quality level for the pair is Level-1
28:	Else ← set quality level for the pair is Level-0
29:	Return quality level
30:	Stop.

4. Results and Discussion

This section shows the results of the modeling and evaluation stages. The modeling stage produces a paraphrased pair: the sentence from the article is the original sentence (Ori_sentence) (human-generated text), and the other is the newly generated sentence that is generated by GPT (Par_sentence). This stage produces 48,486 pairs for the culture category while releasing 74,323 for the technology category. Additionally, the sports category comes with 67,360 paraphrased pairs. Therefore, the total number of pairs is 190 K sentences, and Table 7 displays corpus statistics. The next part illustrates evaluation methods.

4.1. Human Evaluation

All criteria for human evaluation results are presented in Table 8, Table 9, Table 10, Table 11, Table 12, Table 13 and Table 14. Each table gives a descriptive analysis of evaluators’ responses for every evaluation criterion individually. In addition, the columns labeled 0, 1, 2, and 3 correspond to distinct quality levels, where 0 represents the lowest quality and 3 represents the highest. Specifically, each cell within these columns shows the average number of pairs assessed and classified under the respective quality level during the evaluation process. For illustration, Table 8 shows that Evaluator #1 assessed 95.71% of the sentence pairs as high quality, whereas Evaluator #2 and Evaluator #3 assessed 85.71% of the pairs as high quality within the culture category.

Regarding semantic similarity, all three evaluators highly rated paraphrased sentences for the different categories. We conclude that GBT 4 produces paraphrased sentences that are semantically equivalent, as presented in Table 9. Regarding fluency, most sentences are rated as excellent or good for all categories based on Table 10; thus, GBT 4 can generate sentences that look fluent and natural. The evaluators evaluate sentences as grammatically correct, as shown in Table 11. Evaluators find that GPT-4 follows Arabic grammar and rules because the sentence components are correctly formed. In Table 12, the first and third evaluators find the paraphrased sentences are not much restructured compared to the original text’s structure. More specifically, GPT does not change sentence structure much and preserves its original structure. In contrast, the second evaluator believes there are many changes in sentence structure and rates most of them as excellent. Based on the first and third evaluators’ opinions in Table 13, most sentences do not use different forms of words and paraphrasing. Thus, GPT does not change word forms to create paraphrased sentences. Still, GPT sometimes uses synonym substitution, as illustrated in Table 14, which reveals synonyms appearing in half of the pairs and rated as excellent or good. The overall quality of paraphrased pairs is ranked as excellent for all categories, as provided in Table 15.

4.2. Regular NLP Metrics

The newly generated paraphrased text is evaluated for the entire corpus using BLEU and ROUGE. Table 16 shows the BLEU, ROUGE, and LD scores. The average BLEU score is 0.5, ROUGE-1 precision is 0.7, ROUGE-1 recall is 0.6, and ROUGE-1 F1 is 0.6. The overall performance of the metric shows a high degree of overlap between the Ori_sentence and Par_sentence, with scores of 0.5 or higher. That performance means the quality of paraphrased sentences is satisfactory. In other words, for the three categories, BLEU and ROUGE scores achieved a moderate level of word unigram overlap between the Ori_sentence and Par_sentence, which is considered a sensible degree of similarity due to the use of synonyms. On the other hand, the average LD score is almost 0.8 for all three categories, which means that both Ori_sentence and Par_sentence are rich in vocabulary and phrases. The difference between their LDs is extremely low, which means one sentence has a slightly more diverse vocabulary than the other. After intensively investigating Ori_sentence and Par_sentence content, sentence length is the main cause of the difference in LD because Par_sentence length is often shorter than Ori_sentence. The metric results are explored by comparing research result ranges with those reported in previous studies for text paraphrasing in Arabic and other languages. Results of BLEU range from 0.5 to 0.6, which is similar to the Fu et al. (2019) paper [61], and higher than Li et al. (2017) [6] and Engomwan and Chali (2019) [12]. However, Nagoudi et al. (2021) [17] show results lower than other research due to the dependence on MT to produce the pair. ROUGE-1-recall scores for this research are close to the results of Li et al. (2017) [6] and Fu et al. (2019) [61], both of whom use the Quora dataset. Table 3 shows metric results, and Table 17 presents the investigation results.

4.3. Corpus Semantic Similarity

Examination of semantic similarities between each pair is summarized in Table 17. The average cosine similarity score is 0.89, which means that Ori_sentence and Par_sentence are similar in meaning and context. In other words, the vector of Ori_sentence and the vector of Par_sentence point in almost the same direction. The Euclidean distance is 3.16, which shows an overlap between the pair, but it is not identical due to the use of a variety of vocabulary and phrases. The Jaccard score is 0.35, which presents a small ratio of overlap, which is considered a good quality of paraphrasing because GPT-4 uses different terms to express the same meaning.

Table 17. Similarity scores for the three categories.

Category	Cosine Similarity	Euclidean Distance	Jaccard Similarity
Culture	0.89	3.21	0.33
Sport	0.89	3.1	0.35
Technology	0.89	3.16	0.36
Average	0.89	3.16	0.35

Bold means the best value for each similarity metric.

4.4. Semantic Similarity for Paraphrase Pairing and Non-Pairing

Three cases are considered to explore the similarities between paraphrased pairs and non-pairs and investigate their impact on semantic measurement scores, as demonstrated in Table 18. The cosine similarity can capture the paraphrased pair by giving high and moderate scores for sentences from the same categories but low scores for sentences from different categories. The Euclidean distance is 2.5 for the paraphrased pair, and the distance increases for sentences from the same category, while it reaches 5.6 for sentences from different categories. Jaccard similarity shows very low scores for unpaired sentences either from the same category or other categories, which are 0.04 and 0.03, respectively. This experiment concludes that the similarity metrics scores can capture each paraphrased pair’s semantic similarity.

4.5. Results of Quality Ranking Framework for Paraphrasing

This research produces a new corpus for Arabic paraphrasing, which contains 190 K pairs. After employing the quality ranking algorithm, the entire corpus is ranked into four levels based on their quality. The corpus statistics are shown in Table 19 below. Most corpus pairs are labeled Level-3, representing 52% of culture, 68% of technology, and 65% of sport. The content of each category and level is explored to identify its effects on the paraphrasing quality. The low level for culture is due to the inclusion of some poetry texts, in which each line of poetry is tokenized individually, and GPT produces a very short sentence that rewrites a line in a prose sentence. Additionally, the NLTK sentence tokenizer could not handle some cases in which an acronym contains many single letters, each followed by a dot, for instance, “م.ق” (M.K). Thus, the culture category includes some sentences that contain only one letter, which explains its low scores. The technology text is a mixture of English and Arabic words with some computing terminology. GPT attempts to rephrase a given sentence with limited structures or phrases in computing topics because the computing words appear in both Ori_sentence and Par_sentence. Subsequently, the similarity scores are increased for Cosine and Jaccard similarity, and Euclidean distance decreases; the technology category dominated the highest scores. However, including English and abbreviations in technology leads to inaccurate tokenization and modeling issues. The content of the text is manually filtered to remove unexpected tokens, and Table 20 shows its statistics. The sports category achieves fewer pairs in Level-0 due to an accurate tokenization method specifically defined for sports. GPT can analyze a given sentence in sports, generating a new sentence summarizing the match result. However, GPT misunderstands some sports texts that include the players’ numbers by treating player numbers as the number of goals in a match. Some examples of illustrations are presented in Table 21.

5. Conclusions and Future Work

Paraphrasing has a significant impact on other NLP applications, and GPT could be used to produce new paraphrased text. This paper explores GPT-4 performance in producing new paraphrased text at sentence level. Prompt engineering plays a prominent role in optimizing GPT performance. Thus, this research discovers multiple prompts for Arabic paraphrasing and identifies a suitable prompt that is both simple and understandable by GPT. The generated paraphrased sentences have been evaluated through numerous stages and for various purposes. This research investigates similarity scores for paraphrased and non-paraphrased pairs and finds that similarity metrics with AraBERT embedding can capture the semantics for paraphrased and non-paraphrased pairs. The regular NLP metrics find paraphrased pairs that contain a balanced scale of word unigram overlaps, with an average score of 0.5. In addition, the impact of text categories on paraphrasing has been studied, and the research finds that some of the technology and culture texts challenge GPT abilities due to the use of computing terminologies in English and inclusion of poetry text. This finding encourages a future investigation of other categories, such as finance and political articles. In addition, the developed quality framework effectively ranks paraphrasing into levels to produce a high-quality paraphrased corpus containing 190 K pairs. The new Arabic paraphrasing corpus could evolve a new paraphrasing model for Arabic or combine it with other languages’ corpora to generate a multilingual paraphrasing model. Furthermore, the corpus can be employed to study AI-generated text from linguistic perspectives, including word selection and sentence structure. This framework inspires the potential development of a new paraphrased quality algorithm that considers linguistic aspects such as synonym usage and sentence restructure between pairs.

Author Contributions

Conceptualization H.R.A. and A.A.A.; methodology, H.R.A. and A.A.A.; software, H.R.A.; validation, H.R.A.; formal analysis, H.R.A.; investigation, H.R.A. and A.A.A.; resources, H.R.A.; data curation, H.R.A.; writing—original draft preparation, H.R.A.; writing—review and editing, H.R.A.; visualization, H.R.A.; supervision, A.A.A.; project administration, A.A.A.; funding acquisition, H.R.A. and A.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

The use of GPT-4 in this research was partially supported by OpenAI (33.4%) and the University of Jeddah (66.6%). The Article Processing Charge (APC) was funded by the authors.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are not publicly available at this time as they are part of ongoing research and will be used in further investigations and deeper analysis required for academic degree completion.

Acknowledgments

We thank King Abdulaziz University, Jeddah University, and OpenAI. We sincerely thank King Abdulaziz University for its unlimited support and distinguished research environment. We appreciate Jeddah University’s efforts in encouraging their academic lecturers and funding part of this research budget by covering the cost of GPT-4 for the technology and culture categories. We extend our warm gratitude to OpenAI, which facilitates the creation of a researcher account and funds using GPT-4 for the sports category. We acknowledge the OpenAI ChatGPT model for machine translation from Arabic to English, which only translates text in Appendix A.

Conflicts of Interest

The authors declare that there are no conflicts of interest associated with this publication.

Appendix A

This is the human evaluation survey example in Arabic, which is translated into English.

Text Category	Sentence A	Sentence B	How Far Does Sentence B Maintain the Meaning of Sentence A?	How Far Does Sentence B Look Natural and Fluent in Terms of Language and Sentence Structure?	Does Sentence B Follow Standard Grammatical Arabic Rules?	How Far Has Sentence B Been Restructured Compared to Sentence A	Have Some of the Words in Sentence B Been Replaced With Different Forms of the Word in Sentence A?	Have Some Words in Sentence B Been Replaced With Synonyms in Sentence A?	Considering All the Previous Criteria, How Would You Assess the Overall Quality of Sentence B?
ثقافة	وأضافت سمو الشيخة بدور أن التجربة لقيت إقبالاً كبيراً من طرف الجمهور الذي تلقف كتب الدار بحفاوة، وقد نتج عن ذلك أن حصدت الدار مجموعة من الجوائز ووقعت عقوداً لترجمة عدد من إصداراتها إلى اللغات الأخرى.	أكدت سمو الشيخة بدور أن التجربة شهدت إقبالاً كبيراً من الجمهور، مما أسفر عن حصد الدار لعدد من الجوائز وتوقيع عقود لترجمة إصداراتها إلى لغات أخرى.	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable
تقنية	ويعمل البرنامجان الجديدان بخاصية المبادرة إلى توفير الحماية وبشكل مستمر على تأمين ومراقبة عمل الحاسوب لاكتشاف أي نوع من أنواع التهديدات الأمنية لمنعها من أي نشاط تخريبي.	يعمل البرنامجان الجديدان بشكل مستمر لتوفير الحماية وتأمين ومراقبة عمل الحاسوب لاكتشاف ومنع أي تهديدات أمنية.	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable	3: Excellent or applicable 2: Good 1: Poor 0: Very poor or not applicable

References

Thanaki, J. Python Natural Language Processing: Explore NLP with Machine Learning and Deep Learning Techniques for Natural Language Processing; Packt Publishing: Birmingham, UK, 2017. [Google Scholar]
Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, Global ed; Pearson: London, UK, 2021. [Google Scholar]
Ayanouz, S.; Abdelhakim, B.A.; Benhmed, M. A smart chatbot architecture based NLP and machine learning for health care assistance. In Proceedings of the 3rd International Conference on Networking, Information Systems & Security, Marrakech, Morocco, 31 March–2 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–6. [Google Scholar]
Burgueño, L.; Clarisó, R.; Gérard, S.; Li, S.; Cabot, J. An NLP-Based Architecture for the Autocompletion of Partial Domain Models. In Advanced Information System Engineering; Springer: Berlin/Heidelberg, Germany, 2021; pp. 91–106. [Google Scholar]
Lende, S.P.; Raghuwanshi, M.M. Question answering system on education acts using NLP techniques. In Proceedings of the 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare, Coimbatore, India, 29 February–1 March 2016. [Google Scholar] [CrossRef]
Li, Z.; Jiang, X.; Shang, L.; Li, H. Paraphrase generation with deep reinforcement learning. arXiv 2017, arXiv:1711.00279. [Google Scholar]
Zhou, J.; Bhat, S. Paraphrase generation: A survey of the state of the art. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 5075–5086. [Google Scholar]
Singh, R.; Singh, S. Text Similarity Measures in News Articles by Vector Space Model Using NLP. J. Inst. Eng. Ser. B 2021, 102, 329–338. [Google Scholar] [CrossRef]
Kotu, V.; Deshpande, B. Data Science Concepts and Practice, 2nd ed.; Morgan Kaufmann: Cambridge, UK, 2019. [Google Scholar]
Refai, D.; Abo-Soud, S.; Abdel-Rahman, M. Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification. IEEE Access 2023, 11, 132516–132531. [Google Scholar] [CrossRef]
McKeown, K. Focus constraints on language generation. In Proceedings of the Eighth International Joint Conference on Artificial Intelligence—Volume 1, Karlsruhe, Germany, 8–12 August 1983; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1983. [Google Scholar]
Egonmwan, E.; Chali, Y. Transformer and seq2seq model for Paraphrase Generation. In Proceedings of the 3rd Workshop on Neural Generation and Translation; Birch, A., Finch, A., Hayashi, H., Konstas, I., Luong, T., Neubig, G., Oda, Y., Sudoh, K., Eds.; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 249–255. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
Caldarini, G.; Jaf, S.; McGarry, K. A Literature Survey of Recent Advances in Chatbots. Information 2022, 13, 41. [Google Scholar] [CrossRef]
Yang, Z.; Gan, Z.; Wang, J.; Hu, X.; Lu, Y.; Liu, Z.; Wang, L. An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA. Proc. AAAI Conf. Artif. Intell. 2022, 36, 3081–3089. [Google Scholar] [CrossRef]
Sahib, T.M.; Alyasiri, O.M.; Younis, H.A.; Akhtom, D.; Hayder, I.M. A comparison between ChatGPT-3.5 and ChatGPT-4.0 as a tool for paraphrasing English Paragraphs. In Proceedings of the International Applied Social Sciences Congress, Valletta, Malta, 13–15 November 2023; pp. 471–480. [Google Scholar]
Nagoudi, E.M.B.; Elmadany, A.; Abdul-Mageed, M. AraT5: Text-to-Text Transformers for Arabic Language Understanding and Generation. arXiv 2021, arXiv:2109.12068. [Google Scholar]
Betti, M.J. Paraphrase in Linguistics. ResearchGate. Available online: https://www.researchgate.net/publication/357661190_Paraphrase_in_Linguistics/citation/download (accessed on 4 November 2024).
Shen, L.; Liu, L.; Jiang, H.; Shi, S. On the Evaluation Metrics for Paraphrase Generation. arXiv 2022, arXiv:2202.08479. [Google Scholar]
Rahayu, F.E.S.; Utomo, A.; Setyowati, R. Investigating Lexical Diversity of Children Narratives: A Case Study of L1 Speaking. Regist. J. 2020, 13, 371–388. [Google Scholar] [CrossRef]
Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; Association for Computational Linguistics: Philadelphia, PA, USA, 2002; pp. 311–318. [Google Scholar]
Sun, S.; Sia, S.; Duh, K. CLIReval: Evaluating Machine Translation as a Cross-Lingual Information Retrieval Task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Philadelphia, PA, USA, 5 July 2002; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 134–141. [Google Scholar] [CrossRef]
Koehn, P.; Monz, C. Manual and Automatic Evaluation of Machine Translation between European Languages. In Proceedings on the Workshop on Statistical Machine Translation; Koehn, P., Monz, C., Eds.; Association for Computational Linguistics: New York, NY, USA, 2006; pp. 102–121. Available online: https://aclanthology.org/W06-3114 (accessed on 12 September 2024).
Mulimani, D.; Patil, P.; Chaklabbi, N. Image Captioning using CNN and Attention Based Transformer. In Data Science and Intelligent Computing Techniques; Soft Computing Research Society: New Delhi, India, 2023; pp. 157–166. [Google Scholar] [CrossRef]
Zieve, M.; Gregor, A.; Stokbaek, F.J.; Lewis, H.; Mendoza, E.M.; Ahmadnia, B. Systematic TextRank Optimization in Extractive Summarization. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing; Mitkov, R., Angelova, G., Eds.; INCOMA Ltd.: Varna/Shoumen, Bulgaria, 2023; pp. 1274–1281. Available online: https://aclanthology.org/2023.ranlp-1.135 (accessed on 23 November 2024).
Li, B.; Liu, T.; Wang, B.; Wang, L. Enhancing Deep Paraphrase Identification via Leveraging Word Alignment Information. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7843–7847. [Google Scholar] [CrossRef]
Chandrasekaran, D.; Mago, V. Evolution of Semantic Similarity—A Survey. Acm Comput. Surv. 2020, 54, 41. [Google Scholar] [CrossRef]
Mohamed, M.; Oussalah, M. SRL-ESA-TextSum: A text summarization approach based on semantic role labeling and explicit semantic analysis. Inf. Process. Manag. 2019, 56, 1356–1372. [Google Scholar] [CrossRef]
Zou, W.; Socher, R.; Cer, D.; Manning, C. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Association for Computational Linguistics: Florence, Italy, 2013; pp. 1393–1398. [Google Scholar]
Rahutomo, F.; Kitasuka, T.; Aritsugi, M. Semantic Cosine Similarity. In Proceedings of the 7th International Student Conference on Advanced Science and Technology ICAST 2012, Seoul, Republic of Korea, 29–30 October 2012; Volume 4, pp. 1–2. [Google Scholar]
Timkey, W.; van Schijndel, M. All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality. arXiv 2021, arXiv:2109.04404. [Google Scholar] [CrossRef]
Rieger, J.; Koppers, L.; Jentsch, C.; Rahnenführer, J. Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability Using Clustering Techniques on Replicated Runs. arXiv 2020, arXiv:2003.04980. [Google Scholar]
Habbat, N.; Anoun, H.; Hassouni, L. AraBERTopic: A Neural Topic Modeling Approach for News Extraction from Arabic Facebook Pages using Pre-trained BERT Transformer Model. Int. J. Comput. Digit. Syst. 2023, 14, 1–8. [Google Scholar] [CrossRef] [PubMed]
Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based Model for Arabic Language Understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar]
Zheng, W.; Lu, S.; Cai, Z.; Wang, R.; Wang, L.; Yin, L. PAL-BERT: An Improved Question Answering Model. Comput. Model. Eng. Sci. 2024, 139, 2729–2745. [Google Scholar] [CrossRef]
Bello, A.; Ng, S.-C.; Leung, M.-F. A BERT Framework to Sentiment Analysis of Tweets. Sensors 2023, 23, 506. [Google Scholar] [CrossRef]
Eke, C.I.; Norman, A.A.; Shuib, L. Context-Based Feature Technique for Sarcasm Identification in Benchmark Datasets Using Deep Learning and BERT Model. IEEE Access 2021, 9, 48501–48518. [Google Scholar] [CrossRef]
Wang, J.; Huang, J.X.; Tu, X.; Wang, J.; Huang, A.J.; Laskar, M.T.R.; Bhuiyan, A. Utilizing BERT for Information Retrieval: Survey, Applications, Resources, and Challenges. ACM Comput. Surv. 2024, 56, 185. [Google Scholar] [CrossRef]
Wu, X.; Xia, Y.; Zhu, J.; Wu, L.; Xie, S.; Qin, T. A study of BERT for context-aware neural machine translation. Mach. Learn. 2022, 111, 917–935. [Google Scholar] [CrossRef]
Wahle, J.P.; Ruas, T.; Meuschke, N.; Gipp, B. Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection. In Proceedings of the 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Champaign, IL, USA, 27–30 September 2021; pp. 226–229. [Google Scholar] [CrossRef]
Chelba, C.; Mikolov, T.; Schuster, M.; Ge, Q.; Brants, T.; Koehn, P.; Robinson, T. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arxiv 2014, arXiv:1312.3005. [Google Scholar]
Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 19–27. [Google Scholar]
El Koshiry, A.M.; Eliwa, E.H.I.; El-Hafeez, T.A.; Omar, A. Arabic Toxic Tweet Classification: Leveraging the AraBERT Model. Big Data Cogn. Comput. 2023, 7, 170. [Google Scholar] [CrossRef]
Al-Twairesh, N. The Evolution of Language Models Applied to Emotion Analysis of Arabic Tweets. Information 2021, 12, 84. [Google Scholar] [CrossRef]
Al-Yahya, M.; Al-Khalifa, H.; Al-Baity, H.; AlSaeed, D.; Essam, A. Arabic Fake News Detection: Comparative Study of Neural Networks and Transformer-Based Approaches. Complexity 2021, 2021, 5516945. [Google Scholar] [CrossRef]
Abo-Elghit, A.H.; Hamza, T.; Al-Zoghby, A. Embedding Extraction for Arabic Text Using the AraBERT Model. Comput. Mater. Contin. 2022, 72, 1967–1994. [Google Scholar] [CrossRef]
Mohdeb, D.; Laifa, M.; Zerargui, F.; Benzaoui, O. Evaluating transfer learning approach for detecting Arabic anti-refugee/migrant speech on social media. Aslib J. Inf. Manag. 2022, 74, 1070–1088. [Google Scholar] [CrossRef]
Salloum, W.; Habash, N. Dialectal to standard Arabic paraphrasing to improve Arabic-English statistical machine translation. In Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties; Association for Computational Linguistics: Edinburgh, Scotland, 2011; pp. 10–21. [Google Scholar]
Jurafsky, D.; Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition; Prentice Hall: Saddle River, NJ, USA, 2020. [Google Scholar]
Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
Iyyer, M.; Wieting, J.; Gimpel, K.; Zettlemoyer, L. Adversarial example generation with syntactically controlled paraphrase networks. arXiv 2018, arXiv:1804.06059. [Google Scholar]
Mahmoud, A.; Zrigui, M. Semantic similarity analysis for corpus development and paraphrase detection in Arabic. Int. Arab J. Inf. Technol. 2021, 18, 1–7. [Google Scholar] [CrossRef]
Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf?utm_source=chatgpt.com (accessed on 1 January 2025).
Radford, A.; Jeffrey, W.; Rewon, C.; David, L.; Dario, A.; Ilya, S. Language models are unsupervised multitask learners. OpenAI blog 2019, 1, 9. [Google Scholar]
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D.C. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv 2023, arXiv:2302.11382. [Google Scholar]
Ormazabal, A.; Artetxe, M.; Soroa, A.; Labaka, G.; Agirre, E. Principled Paraphrase Generation with Parallel Corpora. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Dublin, Ireland, 2022. [Google Scholar] [CrossRef]
Gudkov, V.; Mitrofanova, O.; Filippskikh, E. Automatically ranked Russian paraphrase corpus for text generation. arXiv 2020, arXiv:2006.09719. [Google Scholar]
Fu, Y.; Feng, Y.; Cunningham, J.P. Paraphrase generation with latent bag of words. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
Sancheti, A.; Srinivasan, B.V.; Rudinger, R. Entailment relation aware paraphrase generation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 11258–11266. [Google Scholar]
Ding, B.; Qin, C.; Liu, L.; Chia, Y.K.; Li, B.; Joty, S.; Bing, L. Is GPT-3 a Good Data Annotator? In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Toronto, ON, Canada, 2023. [Google Scholar] [CrossRef]
Surameery, N.M.S.; Shakor, M.Y. Use Chat GPT to Solve Programming Bugs. Int. J. Inf. Technol. Comput. Eng. 2023, 31, 17–22. [Google Scholar] [CrossRef]
Goyal, T.; Li, J.J.; Durrett, G. News summarization and evaluation in the era of gpt-3. arXiv 2022, arXiv:2209.12356. [Google Scholar]
Gutierrez, B.J.; McNeal, N.; Washington, C.; Chen, Y.; Li, L.; Sun, H.; Su, Y. Thinking about GPT-3 In-Context Learning for Biomedical IE? Think Again. In Findings of the Association for Computational Linguistics: EMNLP 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022. [Google Scholar] [CrossRef]
Lan, W.; Qiu, S.; He, H.; Xu, W. A continuously growing dataset of sentential paraphrases. arXiv 2017, arXiv:1708.00391. [Google Scholar]
Saad, M.; Ashour, W. OSAC: Open Source Arabic Corpora. In Proceedings of the 6th International Conference on Electrical and Computer Systems (EECS’10), Lefke, Cyprus, 25–26 November 2010. [Google Scholar]
Bar, K.; Dershowitz, N. Deriving paraphrases for highly inflected languages from comparable documents. In Proceedings of the 24th International Conference on Computational Linguistics—Proceedings of COLING 2012: Technical Papers, Mumbai, India, 8–15 December 2012. [Google Scholar]
Wang, Z.; Hamza, W.; Florian, R. Quora. Question Pairs Dataset. Kaggle 2018. Available online: https://www.kaggle.com/datasets/quora/question-pairs-dataset (accessed on 14 November 2024).
Ganitkevitch, J.; Van Durme, B.; Callison-Burch, C. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June; Association for Computational Linguistics: Florence, Italy, 2013; pp. 758–764. [Google Scholar]
Einea, O.; Elnagar, A.; Al Debsi, R. SANAD: Single-label Arabic News Articles Dataset for automatic text categorization. Data Brief 2019, 25, 104076. [Google Scholar] [CrossRef]
Nabankema, H. Evaluation of Natural Language Processing Techniques for Information Retrieval. Eur. J. Inf. Knowl. Manag. 2024, 3, 38–49. [Google Scholar] [CrossRef]
Yagi, S.; Elnagar, A.; Yaghi, E. Arabic punctuation dataset. Data Brief 2024, 53, 110118. [Google Scholar] [CrossRef]
Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O’Reilly Media: Sebastopol, CA, USA, 2009; Available online: http://www.nltk.org/book/ (accessed on 5 February 2025).
Wang, L.; Chen, X.; Deng, X.; Wen, H.; You, M.; Liu, W.; Li, Q.; Li, J. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit. Med. 2024, 7, 41. [Google Scholar] [CrossRef] [PubMed]
Research Google. Google Colaboratory. 2024. Available online: https://colab.research.google.com/ (accessed on 3 March 2025).

Figure 1. An overview of the research methodology.

Table 1. Examples of paraphrasing types from a linguistic perspective.

Paraphrasing Methods	Language	Original Sentence	Paraphrased Sentence
Synonyms substitution	Arabic	قرأ الطفل القصة بمتعة	قرأ الطفل القصة بسرور
Synonyms substitution	English	The child reads the story with enjoyment	The child reads the story happily
Different forms of a word	Arabic	قرأ الطفل القصة بمتعة	تمتع الطفل بقراءة القصة
Different forms of a word	English	The child reads the story with enjoyment	The child enjoyed by reading the story
Changing tense	Arabic	قرأ الطفل القصة بمتعة	قرأت القصة بمتعة
Changing tense	English	The child reads the story with enjoyment	The story was read with enjoyment
Altering the word order	Arabic	قرأ الطفل القصة بمتعة	بمتعة قرأ الطفل القصة
Altering the word order	English	The child reads the story with joy.	With joy, the child reads the story.

Table 2. Formulas of BLEU and ROUGE.

Metric	Formula
BLEU-Precision	$P r e c i s i o n = \frac{C o u n t o f m a t c h i n g n - g r a m s}{T o t a l n - g r a m s i n c a n d i d a t e}$
ROUGE-Recall	$R e c a l l = \frac{C o u n t o f m a t c h i n g n - g r a m s}{T o t a l n - g r a m s i n r e f e r e n c e}$
ROUGE-Precision	$P r e c i s i o n = \frac{C o u n t o f m a t c h i n g n - g r a m s}{T o t a l n - g r a m s i n c a n d i d a t e}$
ROUGE-F1 Score	$F 1 s c o r e = \frac{2 * R O U G E - P r e c i s i o n * R O U G E - R e c a l l}{R O U G E - P r e c i s i o n + R O U G E - R e c a l l}$

Table 3. Formula of similarity metrics.

Metric	Formula	Description
Cosine Similarity	$\frac{\sum_{i = 1}^{n} A_{i} * B_{i}}{\sqrt{\sum_{i = 1}^{n} {A_{i}}^{2}} * \sqrt{\sum_{i = 1}^{n} {B_{i}}^{2}}}$	The cosine score is calculated by finding the angle between two vectors (A and B) and focusing on their orientation rather than magnitude.
Euclidean Distance	$\sqrt{\sum_{i = 1}^{n} {(A_{i} - B_{i})}^{2}}$	Euclidean distance is computing the straight-line distance between two points (vectors) on a plane.
Jaccard Score	$J (A, B) = \frac{\| A \cap B \|}{\| A \cup B \|}$	The Jaccard score is calculated by finding the intersection between two given sets of tokens.

Table 4. A comparison summary of the AI approaches used in paraphrasing.

Paper	AI Approach	Language	Aim
McKeown [11]	Rule-based	English	He develops a rule-based paraphrase system by using pre-defined constraints.
Salloum and Habash [50]	Rule-based	Dialectal Arabic text	They aim to improve the quality of Arabic–English statistical machine translation of the dialectal Arabic text.
Nagoudi et al. [17]	Seq2seq	Arabic	They generate a novel Arabic benchmark for Arabic paraphrase generation.
Ormazabal et al. [60]	Seq2seq	English and France	They focus on paraphrasing generation tasks by using round-trip machine translation.
Gudkov et al. [60]	Seq2seq	Russian	They produce the first Russian corpus for paraphrase generation
Fu et al. [61]	Seq2seq	English	They evolve a paraphrase generation system for the latent bag-of-words in English text.
Sancheti et al. [62]	Reinforcement learning	English	They aim to initiate a new task of entailment relation-aware paraphrase generation.
Li et al. [6]	Reinforcement learning	English	They enhance paraphrasing generation by evolving a novel deep-reinforcement
Mahmoud and Zrigui [53]	Deep generative	Arabic	They work on detecting paraphrases
Li et al. [27]	Deep generative	English	They work on paraphrase detection
Iyyer et al. [52]	Deep generative	English	They develop a model for paraphrase generation

Table 5. Summarization of dataset details.

Dataset	Language	Size
ARGEN [17]	- Arabic	123.6 K paraphrase pairs
Merged OSAC and KSUCCA [54]	- Arabic	1 K sentence pairs
Arabic corpus BD [70]	- Arabic	100 documents correctly pairing
QQP [71]	- English	150 K paraphrase pairs
TwitterURL [68]	- English	51 K sentence pairs
PPDB [72]	- English - Spanish	220 M phrasal and lexical paraphrases

Table 6. An example of prompt engineering.

Language	Prompt	Example
Arabic	أعد صياغة التالي في جملة واحدة	أعد صياغة التالي في جملة واحدة: “علم البيانات هو مجموعة من التقنيات التي تستخدم لاستخراج قيمة من البيانات.”
English	Rephrase the following in one sentence.	Rephrase the following in one sentence: “Data science is a collection of techniques used to extract value from data” [9].

Table 7. Corpus statistics for the three categories.

Category	Original Sentence	Paraphrase Sentence
Culture	48,486	48,486
Technology	74,323	74,323
Sport	67,360	67,360
Total	190,169	190,169

Table 8. Criterion of semantic similarity.

	Evaluator #1				Evaluator #2				Evaluator #3
	0 *	1	2	3	0	1	2	3	0	1	2	3
Culture	0.71	1.43	2.14	95.71	1.43	7.86	5.00	85.71	2.14	1.43	10.71	85.71
Sport	0.00	0.00	1.43	98.57	0.00	6.43	5.00	88.57	2.86	2.86	7.14	87.14
Technology	4.29	5.00	2.86	87.86	8.57	5.00	10.00	76.43	6.43	5.71	3.57	84.29