Article

The Detection of Fake News in Arabic Tweets Using Deep Learning

1 Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh 21911, Saudi Arabia
2 Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(14), 8209; https://doi.org/10.3390/app13148209
Submission received: 15 May 2023 / Revised: 28 June 2023 / Accepted: 10 July 2023 / Published: 14 July 2023
(This article belongs to the Special Issue Applied Intelligence in Natural Language Processing)

Abstract

Fake news has been around for a long time, but the rise of social networking applications in recent years has rapidly accelerated its spread. The absence of adequate procedures to combat fake news has aggravated the problem. Consequently, fake news negatively impacts various aspects of life (economic, social, and political). Many individuals rely on Twitter as a news source, especially in the Arab region, and most read and share news regardless of its truthfulness. Identifying fake news manually on these open platforms is challenging because they allow anyone to build networks and publish news in real time. Therefore, creating automatic systems for recognizing news credibility on social networks using artificial intelligence techniques, including machine learning and deep learning, has attracted the attention of researchers. Deep learning methods have shown promising results in recognizing fake news written in English, but limited work has addressed news credibility recognition for the Arabic language. This work proposes a deep learning-based model to detect fake news on Twitter. The proposed model utilizes the news content and the social context of the users who participated in the news dissemination. In seeking an effective detection model for fake news, we performed extensive experiments using two deep learning algorithms with varying word embedding models. The experiments were evaluated using a self-created dataset. The experimental results revealed that MARBERT combined with a convolutional neural network (CNN) achieves superior performance in terms of accuracy, with an F1-score of 0.956. This finding shows that the proposed model accurately detects fake news in Arabic Tweets relating to various topics.

1. Introduction

In recent years, the rapid growth of social networks has facilitated the exchange of news among users. Social networks can be used to inform society of the latest news, but they can also be a source of fake news. Twitter is considered one of the most widespread social networks in the Arab region [1]. Posting news on Twitter costs less, in both money and time, than any other medium. Its simplicity and lack of content monitoring enable fake news to reach a wide range of users rapidly [2]. Fake news refers to false, misleading, and fabricated news delivered intentionally [3]. Fake news is disseminated to deceive the audience for political, social, or financial gain. Consequently, fake news imposes significant risks on individuals, organizations, and governments [4]. Thus, there is an urgent need for efficient techniques to detect and eliminate fake news on social networks to prevent its negative impacts.
Many fact-checking websites, such as Anti-Rumors Authority and Misbar, have been created to check the veracity of news propagated on the internet, in an early attempt to reduce the impact of fabricated news. These websites depend on human experts to manually confirm or reject the validity of the news [2]. Dealing manually with a large volume of news consumes time and effort and does not scale. More recently, machine learning and deep learning methods have become popular for tackling anomaly detection problems such as fake news detection [5]. Two types of learning techniques are currently adopted in these methods to construct automated systems for false news detection on social networks: news content-based learning and social context-based learning. The news content-based methodology mainly analyses the writing style of the news text to discover syntactic or semantic patterns for classifying news. The social context-based methodology primarily analyses user behaviour and engagement on social media. Social context features can be extracted from user profiles, discussions, and the connected networks among users [4].
Informal writing styles, spelling errors, the use of diverse dialects, etc., make processing Arabic social media content difficult. Other challenges that aggravate the processing complexity are the massive vocabulary and complex morphological patterns of the Arabic language, as well as the limited availability of Arabic datasets. These difficulties have resulted in little research focused on detecting and eliminating fake news in Arabic. A few studies have proposed models using deep learning algorithms to identify fabricated news posted on the Twitter platform [3,5,6]. In general, these models were trained to target a specific topic of fake news posts and relied only on the textual content of the Tweets to produce a classification. Furthermore, the detection performance of the current models still requires improvement. To address the aforementioned problems, we propose a deep learning model that employs news content (text features) and social context (specifically, user profile features) to identify the truthfulness of news in Arabic Tweets. To find a suitable model for detecting fake news in Arabic Tweets, we conducted experiments using two deep learning techniques, convolutional neural networks (CNN) and bidirectional long short-term memory (BiLSTM), and explored five word embedding models to extract textual representations. Broadly speaking, word embeddings are classified into classic and context-based word embeddings. Classic word embeddings learn the embeddings while disregarding the contextual information and meaning within the sentence; hence, each word is represented by a distinct and unique vector. Context-based embeddings, in contrast, learn diverse text representations for the same word depending on the context within the text [4]. In this study, we empirically evaluated both classic word embeddings (the Keras embedding layer, word2vec [7], and fastText [8]) and context-based embeddings (ARBERT and MARBERT [9]) with deep learning models. As far as we know, no previous study has investigated ARBERT and MARBERT as textual feature extractors in the area of news credibility detection.
This study proposed a model for detecting fake news that differs from other detection systems that used a deep learning approach and were trained on the Arabic dataset. The proposed model benefited from both the text of the news Tweet and the news publisher’s characteristics, whereas previous works utilized only news text. Furthermore, the model was trained to detect fake news related to diverse domains instead of a specific topic, such as COVID-19. The main contributions of this study are as follows:
  • Construction of a comprehensive, manually labelled dataset containing Arabic fake and real news Tweets.
  • Proposal of a deep learning model for eliminating and detecting fake news on different topics.
  • Exploitation of the news content and social context for more accurate detection of news credibility.
  • Investigation of the influence of contextual embeddings compared to classic word embeddings.
  • Assessment of the generalization capability of the proposed model using a publicly available dataset.
  • Verification of the efficiency of the proposed model architecture by comparing it with current models.
The remainder of this work is arranged as follows: Section 2 reviews recent studies on credibility detection in Arabic news. Section 3 illustrates the proposed methodology in detail. Section 4 shows a series of conducted experiments, obtained results, and a discussion. Finally, the conclusion of this work and directions for further research in this field are summarized in Section 5.

2. Related Work

This section discusses the most recent work related to fake news detection using machine learning and deep learning. Detecting fake news in Arabic is still in its infancy compared with other languages, such as English. The models are discussed below and summarized in Table 1.
Detection models based on news content features make up the majority of current studies in news truth verification. Helwe et al. [10] developed an approach using two CNN models to assess the credibility of Arabic weblog posts. Both models have the same layers, except for the embedding layer: the first model used pre-trained word-level embeddings, while the second used character-level embeddings. Each model was trained on a labelled dataset in the first iteration, and then the predictions of each model on unlabelled data were selected to re-train the other model. In their experiments, they compared the proposed approach to a support vector machine (SVM) trained with a TF-IDF feature representation, a CNN trained with character-level vector representations, a CNN trained with word-level vector representations, and a combined model based on Word-CNN and Char-CNN. Their proposed model scored the highest F1-score of 0.63. The fundamental limitation of this work is the small amount of labelled data.
The previous study [10] focused on examining the credibility of news blogs. Other works [11,12,13] built news verification models using fabricated news generated from real news stories by modifying their semantics. Jude Khouja [11] proposed a system for claim verification based on the textual information of news. This work introduced the publicly available Arabic News Stance (ANS) dataset for determining the veracity of claims. The author acquired a subset of news titles from the Arabic news texts (ANT) dataset and modified them to generate fake claims. Two approaches were used to train and test the generated dataset for claim classification: long short-term memory (LSTM) and a pre-trained BERT model. This study reported that LSTM achieved the best result for false claims recognition with an F1-score of 0.643. Nagoudi et al. [12] developed a method for automatically manipulating real multi-topic news to generate a fake news dataset, AraNews. They used transformer-based pre-trained models to detect manipulated Arabic news. Furthermore, they experimented with various modelling settings to examine the impact of their generated data on fake news verification models compared to a human-created fake news dataset. The authors reported that automatically generated news positively affects the fake news detection task: compared to the previous work [11], it improved the F1-score by 0.0576. The ANS and AraNews datasets were also utilized by another work [14], which examined the performance of language models such as AraBERT, QARiB, and AraGPT2 when applied to the Arabic fake news detection task. Each model was trained and evaluated using the ANS and AraNews datasets. The results showed that AraBERT and QARiB revealed some ability to identify false news, with a similar accuracy of 0.80; in both experiments, AraGPT2 achieved the lowest accuracy. Himdi et al. [13] proposed a machine learning model to assess the veracity of Arabic news articles. They gathered factual news articles related to a single domain, the Hajj, and then used the acquired dataset to construct fake news articles through crowdsourcing. They extracted a set of linguistic features, including emotional, syntactical, polarity, and part-of-speech features. The extracted features were used to train three classifiers, Naïve Bayes (NB), Random Forest (RF), and SVM, to detect Arabic false news. The results demonstrated that the extracted linguistic features could effectively detect fake news, and the best classifier was RF, with an accuracy of 0.79.
Another approach [15] exploited textual features to automatically identify satirical news. The authors released a dataset collected from a variety of news websites. They analysed the linguistic properties of news and concluded that false news contains highly positive and negative keywords and tends to be written in a more subjective tone. Machine learning and deep learning models were trained to identify satirical news; a CNN with pre-trained word embeddings achieved the highest performance, with an accuracy of 0.98.
During the COVID-19 pandemic, a lot of false information was disseminated through various social networking applications. The effect of misinformation is not confined to individual lives but also includes society and the economy. Several studies [3,6,14,16,17,18] have concentrated on assessing the credibility of information related to the spread of COVID-19 in Arabic communities via social media platforms such as Twitter.
Alqurashi et al. [3] constructed a large COVID-19 dataset and made it available to the public. The dataset is split into two sets: the first set was used to construct two Arabic word embedding models based on the word2vec and fastText tools, while the other set was annotated manually by two volunteering native Arabic speakers. They then assembled a feature vector for each Tweet using TF-IDF and the word embedding models. Machine learning and deep learning models were implemented to discover misinformation. Generally, the accuracy of all models improved when pre-trained word embedding models were utilized as feature extractors. They reported that the XGBoost classifier outperformed all other models. The AUC optimization technique was used to deal with the issue of imbalanced training data; unfortunately, it failed to improve the performance of any of the classifiers.
Ameur and Aliane [6] claimed that the performance of the fake news detection model is strongly correlated with the dataset size that the model will be trained on. Therefore, they created a large manually labelled dataset of fake Arabic news regarding COVID-19. To assess the accuracy of their dataset, they used it to train and evaluate three pre-trained transformer-based models as baseline models. Additionally, two transformer models were further pre-trained using COVID-19 data. Experimental results demonstrated that the additional training improved the results compared to baseline models for fake news verification.
Al-Yahya et al. [14] conducted comprehensive work investigating several neural network and language models for recognizing and eliminating fake Arabic news on social networks. The models were trained and evaluated on ArCOV19-Rumors and tested on COVID-19-Fakes. The findings showed that transformer-based models outperformed the neural network-based approaches in classifying COVID-19 news as fake or real. QARiB achieved the best result, with a reported accuracy of 0.95; nevertheless, it achieved low accuracy when generalizing to a new dataset.
Mahlous and Al-Laith [16] suggested an automatic system for identifying false Arabic news relating to the COVID-19 crisis on Twitter. They constructed both manually and automatically annotated datasets. Four phases were performed to prepare the Arabic datasets associated with COVID-19. Firstly, they collected 5.5 million Tweets using a collection of hashtags that existed during the COVID-19 pandemic. Secondly, they used two fact-checkers to generate a list of terms relevant to COVID-19 rumours. Thirdly, they retained only Tweets containing at least one of these keywords. Lastly, a small sample of Tweets was manually annotated. For the remaining Tweets, they selected the best-performing classifier trained on the manually annotated dataset to automatically classify unlabelled Tweets as fake or real. Their experiments employed various supervised machine learning classifiers, trained and validated on the manually and automatically annotated datasets. They used four feature extraction techniques, together with stemming and rooting techniques, to perform further preprocessing on the collected Tweets. The findings revealed that the Logistic Regression (LR) classifier scored a high performance, with an F1-score of 0.933 on both the manually and automatically annotated datasets. Their research failed to enhance the classification results using stemming and rooting techniques. A significant drawback is that the list of fake keywords regarding COVID-19 is limited; thus, their model is not appropriate for use on other datasets.
Qasem et al. [17] proposed a comprehensive approach comprising two models: the first assesses the reliability of Twitter posts relating to COVID-19, while the second tracks and identifies the source user who released the misleading news. They performed extensive experiments to detect fake news, including standalone and ensemble-based machine learning algorithms. The fake news detection approach comprises a stacking-based ensemble model applied to the Tweet's text and a Genetic Algorithm-based SVM (GA-SVM) model that works on the content- and user-based features. The feature maps of both models were combined to construct a new training set, which was then fed into the GA-SVM classifier to predict whether a Tweet is a rumour or a non-rumour. The best F1-score achieved by the detection model was 0.935.
Amoudi et al. [18] presented a comparative study investigating the performance of state-of-the-art machine learning and deep learning models for rumour detection. They relied on the ArCOV19-Rumors dataset to train and evaluate their models. In the first experiment, they investigated the effect of n-grams, TF-IDF, and word2vec on traditional machine learning classifiers. They then used two ensemble learning techniques, voting and stacking, that combined the best classifiers to enhance the overall performance. The results showed that stacking classical classifiers performed better than single models, with an accuracy of 0.81. In the second experiment, they explored the influence of different optimizers on deep learning models. The LSTM and BiLSTM with the RMSprop optimizer earned the highest accuracy among the neural networks, with a score of 0.80.
Combining information from both news content and social context sources may result in a better detection rate. Incorporating social context features, such as user behaviour, user profile, etc., with other news content features is uncommon in deep learning-based studies. Several efforts have been made to propose models for detecting misinformation using traditional machine learning algorithms utilizing both news content and social context aspects [1,19,20,21,22].
Thaher et al. [1] developed a machine learning-based model to detect Arabic fake news on Twitter. They examined eight well-known machine learning methods with multiple feature combinations. These features can relate to the news content itself or to the user who posted the news Tweet. The highest F1-score reached was 0.839, with the LR algorithm utilizing user profile and content-based features along with the TF-IDF representations of the textual data. The experimental results showed that the Harris Hawks Optimization (HHO), used as a feature selection approach to filter out irrelevant and redundant features, successfully improved the model's performance by approximately 3.0%.
Sabbeh and Baatwah [19] developed a model to assess the authenticity of news propagated by Twitter users. The proposed model comprised four modules. The first module is responsible for extracting content and social context features. The second module verified news against popular and reliable news sites, such as al-Arabiya (https://www.alarabiya.net). The third module was employed to evaluate the polarity of users’ replies. Three traditional machine learning algorithms, Decision Tree (DT), NB, and SVM, have been trained on extracted features. The DT classifier obtained the best accuracy at 0.899. From the findings, they implemented the DT model in the last module for the classification task.
Alzanin and Azmi [20] started by collecting Tweets associated with real or fake news, utilizing the Anti-Rumors Authority to obtain fake news. They aimed to construct a system able to detect rumours early, before anyone issues a clarification; therefore, they removed any Tweet containing a comment denying a rumour. A set of previously proposed features was implemented for fake news classification, in addition to two new features suggested by the authors: the first investigates whether the Tweets contain a link to sensitive content, and the second checks whether credible accounts follow the user. The k-best method was used to assess whether features are useful for fake news detection. Generally, user-based features appeared more useful and achieved the best classification result with the expectation-maximization (EM) algorithm. Semi-supervised and unsupervised learning models were used to train the proposed system. The semi-supervised EM achieved a better result than NB when the training size exceeded 60%. Experiments with unsupervised EM produced varied results, depending on the initial values of the model parameters. The limitations of this work are its reliance on a single fact-checking site as a source of rumours and the unavailability of the dataset.
Mouty and Gazdar [21] introduced a machine learning model to evaluate the credibility of Arabic Tweets. This work focused on using the most effective content features and news publisher features suggested and investigated in prior works. Moreover, they proposed a new feature that measures the similarity score between the two Twitter user names (screen name and user name). Based on the obtained results, the new feature did not significantly improve the classification. They reported that RF performed the best, with an F1-score of 0.776.
Jardaneh et al. [22] designed a Tweet classification model that relied on the textual content, the user profile, and the sentiment of positive and negative words. They exploited four supervised machine learning classifiers to generate a fake news classification model: RF, DT, AdaBoost, and LR. The experimental evaluation showed that their model's performance increased after applying sentiment features. The shortcomings of this work include considering solely the emotion of users who shared the news Tweets and the small size of the dataset.
After reviewing the recent work on false news detection, we note the following gaps. The detection of Arabic Tweets containing fabricated news is still in its early stages. As a result, few Arabic datasets for detecting fake news Tweets are publicly available to the research community. Few studies have addressed the veracity of Arabic Tweets using deep learning techniques, and the majority of them focus on detecting fake news relating to a certain topic, such as COVID-19. Additionally, they relied only on the Tweet text to produce a classification. A major challenge for these identification systems is that the underlying textual characteristics vary across different types of fake news. For this reason, models that use only the textual content of news Tweets may have a generalizability issue. In contrast, machine learning models commonly use news content and social context features to more accurately identify various types of fake news. However, the detection performance of the existing models still needs to be improved.
Table 1. A summary of existing detection approaches for fake news in Arabic.
| Ref | Year | Dataset | Topic | Classification Approach | News Content | Social Context | Textual Feature Representations | Result |
|---|---|---|---|---|---|---|---|---|
| [19] | 2018 | 800 Tweets | General | NB, SVM, DT | ✓ | ✓ | – | Accuracy 0.899 |
| [20] | 2019 | 177 Tweets | General | EM | ✓ | ✓ | – | F1-score 0.80 |
| [10] | 2019 | 268 labeled blog posts, 20,392 unlabeled blog posts | General | CNN | ✓ | | Word2vec (CBOW), char-level embeddings | F1-score 0.63 |
| [21] | 2019 | 9000 Tweets | General | RF, SVM, DT, NB | ✓ | ✓ | – | F1-score 0.776 |
| [22] | 2019 | 1862 Tweets | Syrian crisis | LR, RF, DT, AdaBoost | ✓ | ✓ | – | Accuracy 0.76 |
| [11] | 2020 | 4547 news | General | LSTM, mBERT | ✓ | | Word-level embeddings, char-level embeddings, mBERT | F1-score 0.643 |
| [12] | 2020 | AraNews (97,310 news), ATB (48,655 news), ANS (4547 news) | General | mBERT, AraBERT, XLM-RBase, XLM-RLarge | ✓ | | mBERT, AraBERT, XLM-RBase, XLM-RLarge | F1-score 0.70 |
| [15] | 2020 | 6895 news articles | Political | NB, XGBoost, CNN | ✓ | | BOW, TF-IDF, fastText | F1-score 0.984 |
| [1] | 2021 | 1862 Tweets | Syrian crisis | KNN, DT, NB, LR, LDA, SVM, RF, XGBoost | ✓ | ✓ | TF, TF-IDF, BoW | Accuracy 0.82 |
| [16] | 2021 | 37,000 Tweets | COVID-19 | NB, LR, SVM, MLP, RF, XGB | ✓ | | BOW, TF-IDF | F1-score 0.933 |
| [6] | 2021 | 10,828 Tweets | COVID-19 | AraBERT, mBERT, distilBERT-multi, mBERT COV19, AraBERT COV19 | ✓ | | AraBERT, mBERT, distilBERT-multi, mBERT COV19, AraBERT COV19 | F1-score 0.9578 |
| [14] | 2021 | COVID-19-Fakes (70,959 Tweets), ArCOV19-Rumors (3032 Tweets), ANS (4091 news), AraNews (108,194 news) | COVID-19, general | CNN, RNN, GRU, AraBERT v1, AraBERT v2, AraBERT v02, QARiB, Ar-Electra, MARBERT, ARBERT | ✓ | | Word2vec, fastText, doc2vec, GloVe, AraBERT v1, AraBERT v2, AraBERT v02, QARiB, Ar-Electra, MARBERT, ARBERT | F1-score 0.95 |
| [3] | 2021 | 8786 Tweets | COVID-19 | XGB, RF, NB, SVM, SGD, CNN, RNN, CRNN | ✓ | | TF-IDF, word2vec, fastText | F1-score 0.54 |
| [17] | 2022 | 3157 Tweets | COVID-19 | LR, KNN, CART, SVM, NB, RF, AdaBoost, Bagging, ExtraTree | ✓ | ✓ | TF-IDF, GloVe | F1-score 0.935 |
| [18] | 2022 | 4299 Tweets | COVID-19 | RF, DT, XGBoost, SVM, KNN, NB, SGD, LR, RNN, BiRNN, GRU, BiGRU, LSTM, BiLSTM | ✓ | | N-gram, TF-IDF, word2vec | Accuracy 0.81 |
| [13] | 2022 | 1098 news articles | Hajj | SVM, RF, NB | ✓ | | – | F1-score 0.79 |

3. Methodology

This section explains the methodology adopted to develop a classification model for detecting fake news on Twitter. This work seeks to find suitable detection model configurations for the fake news problem by performing experiments with various word embeddings and deep learning algorithms. Figure 1 depicts the detailed phases of the proposed methodology. First, we collected data from the Twitter platform using the Twitter API and then performed preprocessing, including data cleaning and normalization, on the collected Tweets. The next step transforms the Tweet's text into a representative multidimensional vector by applying one of the textual feature extraction methods, which are divided into classic and contextualized methods. Among classic methods, we investigated the Keras embedding layer and two pre-trained embedding models, word2vec and fastText. We also considered ARBERT and MARBERT as contextualized feature extraction methods. Then, either a CNN or a BiLSTM is used to learn from the embedded vectors and extract deep semantic features. After this, the textual features are concatenated with user features extracted from the user profile. Finally, the output feature vector representing the news is forwarded to the fully connected layers to learn features and classify Tweets as fake or real. Each of these phases is discussed in detail in the subsections that follow.

3.1. Tweet Collection

Twitter allows researchers, developers, and businesses to take advantage of its massive data repository through the Twitter API [23]. Collecting data, monitoring customer satisfaction, controlling business accounts, improving the interaction experience, etc., all become easy with the Twitter API. Twitter has several endpoints for collecting data from its platform. It recently released a new endpoint for academic research to facilitate the study of recurring concerns in the research community. It grants access to the full archive of all historic Tweets, with more features that support collecting more unbiased, precise, and complete datasets, whereas other Twitter endpoints can retrieve Tweets from the last seven days only.
In this work, we strove to build a dataset containing both fake and real news Tweets on a variety of topics (politics, sports, health, etc.). In addition, we needed to gather as many fake news Tweets as possible. Therefore, the full-archive search endpoint was the best choice for obtaining the right data. We followed these steps to gain access to the full-archive search endpoint. Initially, we applied for a developer account on the academic research track. After the developer account had been reviewed and approved, we created an application inside the account that specified the goals and a description of our project. Once the application was created, the platform returned the access keys and a bearer token, one of which is used to connect to the Twitter API and retrieve Tweets.
To specify which Tweets we were looking for, we prepared a list of news items, both real and fake. For fake news Tweets, we extracted a list of keywords relevant to the fake news from fact-checking websites, such as Anti-Rumors Authority (http://norumors.net/ accessed on 12 February 2022) and Misbar (https://misbar.com/ accessed on 25 February 2022), which examine the truthfulness of news stories propagated on social networks based on human experts and then make the results publicly available. For real news Tweets, we relied on several credible Twitter accounts, such as Al-Riyadh (https://www.alriyadh.com/ accessed on 18 March 2022) and Al-Ekhbariya (https://www.alekhbariya.net accessed on 21 March 2022). After this, we sent GET requests written in Python to the Twitter API to retrieve Tweets. A GET request usually carries three values: an authorization header containing the bearer token, the endpoint's URL, and the query parameters. The query parameters filter the Tweets based on the study case and limit irrelevant Tweets; an example is retrieving Tweets that contain a certain keyword. The academic search endpoint is accessed through Twitter API v2.0, which delivers only the ID and text fields of a Tweet by default; all other fields need to be requested through query parameters, in contrast to API v1.0, which delivers all other Tweet fields automatically. If the request is successful, the response retrieved from the Twitter API is returned in JSON (JavaScript Object Notation) format. The JSON response is broken down into separate columns, and the results are saved in CSV format. The retrieved Tweets were filtered as follows: Tweets with fewer than three Arabic words were removed; Retweets were discarded; and irrelevant Tweets, duplicated Tweets, and Tweets containing a truth clarification for specific fake news were removed. We ended up with a total of 5000 Tweets after the filtering step, of which 4000 were real news Tweets and 1000 were fake news Tweets. A sample of the collected Tweets is shown in Table 2.
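To illustrate this retrieval flow, the following minimal Python sketch sends a GET request to the full-archive search endpoint with a bearer token and query parameters. The token, the keyword, and the selected tweet.fields/user.fields are illustrative assumptions rather than the exact parameters used in this study.

```python
# Sketch of the full-archive search request (Twitter API v2).
import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"
BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # issued with the academic research track

def search_tweets(keyword: str, max_results: int = 100) -> dict:
    """Retrieve Arabic, non-Retweet Tweets matching a keyword."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {
        # Query operators filter the Tweets; the exact query is an assumption.
        "query": f"{keyword} lang:ar -is:retweet",
        "max_results": max_results,
        # v2 returns only id and text by default; request extra fields here.
        "tweet.fields": "created_at,author_id",
        "expansions": "author_id",
        "user.fields": "username,verified,protected,public_metrics,"
                       "created_at,description,url,location",
    }
    response = requests.get(SEARCH_URL, headers=headers, params=params)
    response.raise_for_status()
    return response.json()  # JSON payload, later flattened into CSV columns

if __name__ == "__main__":
    result = search_tweets("example rumour keyword")
    for tweet in result.get("data", []):
        print(tweet["id"], tweet["text"][:60])
```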

3.2. Tweets Preprocessing

The absence of writing rules in social media applications leads some, if not most, Twitter users to adopt an informal writing style. Thus, Twitter data become very noisy and contain many spelling mistakes. Text preprocessing is essential to prepare the Tweets before feeding them into the feature extraction and classification stages. Although preprocessing can enhance model performance, it can also increase the model's processing time and computational complexity [24]. Therefore, we performed light preprocessing. In this work, the preprocessing stage includes two common techniques, data cleaning and normalization (see Table 3 for an example, and the code sketch after the list below):
  • The cleaning technique filters out noisy data and irrelevant text. Cleaning the Tweets included the following steps:
    - Removing user mentions (@), hashtag signs (#), URLs, punctuation marks, extra white space, and emojis.
    - Removing non-Arabic words and characters.
    - Reducing letters repeated more than twice.
  • The normalization technique unifies the disparate shapes of characters into a single canonical form. This was performed by carrying out the following steps:
    - Replacing non-Arabic numerals with Arabic numbers.
    - Unifying variants of the Arabic letters (alif, waw, ya, and ha) into a single form.
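The sketch below shows one possible regex implementation of these cleaning and normalization steps; the exact rules, the kept character ranges, and the direction of the numeral unification are assumptions, since the study does not publish its expressions.

```python
# Minimal regex-based sketch of the Tweet cleaning and normalization steps.
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"@\w+|#|https?://\S+", " ", text)  # mentions, hashtag signs, URLs
    # Keep only Arabic letters, Arabic-Indic digits, ASCII digits, and spaces
    # (this also drops punctuation, emojis, and non-Arabic words).
    text = re.sub(r"[^\u0621-\u064A\u0660-\u06690-9\s]", " ", text)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)        # letters repeated > twice
    return re.sub(r"\s+", " ", text).strip()          # collapse white space

def normalize_tweet(text: str) -> str:
    # Unify numerals (direction assumed: Arabic-Indic digits -> 0-9).
    text = text.translate(str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789"))
    text = re.sub("[إأآا]", "ا", text)  # alif variants
    text = re.sub("ى", "ي", text)       # ya variant
    text = re.sub("ؤ", "و", text)       # waw variant
    text = re.sub("ة", "ه", text)       # ha variant
    return text

print(normalize_tweet(clean_tweet("RT @user خبر عاااجل!! http://t.co/x ١٢٣")))
```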

3.3. Features Extraction

At this stage, we aimed to extract features from the news text and from the users who propagate the news on the Twitter platform.

3.3.1. Textual Features Extraction

Neural network models cannot deal with raw text directly. Instead, the feature extraction method is required to map words from their unstructured forms to fixed-length vectors of continuous real numbers [25]. Various methods of feature extraction exist in natural language processing (NLP). Pre-trained word embeddings have received considerable interest in recent years, displaying superior performance in different text classification domains [24,26,27,28]. The Keras embedding layer and various pre-trained word embeddings are considered in this study. We explored pre-trained models extracting static textual representations, where embeddings are not allowed to update during the training process of the model. The benefit of utilizing this approach for a task-specific model is that it does not need parameter training, and therefore it consumes less memory and time.
  • Keras Embedding Layer
    In the Keras embedding layer, each word is represented by an embedding vector of a fixed length that is defined as part of the model, often 100, 200, or 300 dimensions. The embedding vectors are learned while training the neural network on a specific NLP task. The layer requires each word to be encoded as an integer index, with sequences shorter than the maximum length padded with zeros. The embedding layer is initialized with small random numbers and learns the optimal values of the embeddings for all words using the backpropagation algorithm [29].
  • Word2vec
    Word2vec is one of the most famous pre-trained word embeddings that is based on continuous bag-of-words (CBOW) or skip-gram (SG) learning models for obtaining vector representations for different words. These models use neural networks to learn word representation but differ in their input and output variables. CBOW learns word vectors by training the model to predict a target word given its context, while in the SG model, the input word is used to predict the surrounding context words [30]. In [31], the word2vec model is used to release pre-trained word vectors for the Arabic language called AraVec, which has been trained on two different Arabic content sources, Twitter and Wikipedia. In our experiment, we chose the version of AraVec built using Twitter data by applying CBOW for 100 and 300 dimensions with a window size of 3.
  • FastText
    This is essentially an extension of the word2vec model. FastText treats each word as a bag of character n-grams instead of as a whole word [29]. A vector representation is generated for each character n-gram, and the sum of these representations gives the final word vector. This enables it to find vector representations for rare, misspelled, and out-of-vocabulary words [32]. FastText provides pre-trained word vectors for 157 different languages, including Arabic, trained on datasets composed of a mixture of Common Crawl and Wikipedia [8]. These models learned dense vectors using CBOW with a window size of 5 and character n-grams of length 5. In our experiment, we selected the 100–300-dimension Arabic versions of the pre-trained FastText embeddings.
  • ARBERT and MARBERT
    BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional transformer model that was pre-trained on a massive unlabelled English dataset. The bidirectional capability of BERT means that the model reads each word from both the right and the left side of a text statement simultaneously; this helps the model truly and deeply perceive the meaning of a word through all of its surrounding words [33]. BERT has many pre-trained variations trained on source datasets from different domains or languages. These models can be used either to extract numerical representations of text data or to be fine-tuned on a diverse range of NLP tasks, possibly unrelated to the task the model was originally trained on, to provide state-of-the-art predictions [32]. A variety of pre-trained BERT-based language models are available for Arabic, namely AraBERT [34], QARiB [35], ARBERT, and MARBERT [9]. This work focuses on ARBERT and MARBERT to extract contextualized embeddings. ARBERT is trained on Modern Standard Arabic (MSA) text extracted from six resources covering different Arabic countries. MARBERT is a large-scale model trained on 6 billion Arabic Tweets; therefore, it has shown promising results on social media tasks. The following paragraph explains the steps for extracting contextualized embeddings using BERT-based models.
    First, the input text is passed to the tokenizer, which truncates it into tokens. The tokenizer uses a word-piece vocabulary of around 100K entries. The word-piece tokenizer starts by searching for the whole word in the vocabulary; if it does not exist, it iteratively splits the word into sub-words until all pieces are found in the vocabulary. The advantage of this is that the model can represent any word, hence overcoming the OOV (out-of-vocabulary) issue [33]. After decomposing the text into tokens, the tokenizer transforms the list of tokens into numerical IDs corresponding to vocabulary indices, referred to as input IDs. To give all input sequences a fixed length, padding is performed, and to distinguish between tokens and padding elements in the input sequence, an attention mask is created. The input IDs and attention mask are fed into the model, which outputs a 768-dimensional embedding vector for each token. Two different approaches were considered here for using these embeddings: word-level embedding, which encodes each token as a 768-dimensional vector of numerical values, and sentence-level embedding, which averages all the token vectors to create a single 768-dimensional vector for the entire sentence.
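As a concrete illustration of this pipeline, the sketch below extracts static word- and sentence-level embeddings via the Hugging Face transformers library; the checkpoint name "UBC-NLP/MARBERT" and the maximum sequence length are assumptions about the exact setup.

```python
# Sketch of static embedding extraction with MARBERT (no fine-tuning).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERT")
model = AutoModel.from_pretrained("UBC-NLP/MARBERT")
model.eval()  # embeddings are kept static during downstream training

def embed(texts, max_length=64, word_level=True):
    # Word-piece tokenization, zero padding, and the attention mask.
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    hidden = out.last_hidden_state              # (batch, tokens, 768)
    if word_level:
        return hidden                           # one 768-d vector per token
    # Sentence level: average the token vectors, ignoring padding positions.
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embed(["خبر تجريبي"], word_level=False).shape)  # torch.Size([1, 768])
```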

3.3.2. User Features Extraction

User profiles are readily available on social networks; therefore, investigating the characteristics of news spreaders (i.e., those who posted the news on the Twitter platform) can help enhance the early detection of news credibility. The use of user features in this work was encouraged by Liu and Wu [2], who conducted an extensive user feature analysis and found that fake news publishers behave significantly differently from normal users. Several user-based features on social media have been used in previous work [2] to identify fake news spreaders, such as the number of favourites, the number of accounts they follow, and the number of followers. In this work, we adopted twelve user-based features presented in [2] to characterize user behaviour. These features are: profile description length, screen-name length, user-name length, whether the account is verified, whether the account is protected, follower count, following count, Tweet count, listed count, account age, presence of a profile URL, and presence of location information.
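A hypothetical helper for assembling these twelve features from a Twitter API v2 user object is sketched below; the field names follow the v2 payload, while the numeric encoding of each feature is an assumption.

```python
# Sketch: build the 12-dimensional user feature vector from a v2 user object.
from datetime import datetime, timezone

def extract_user_features(user: dict) -> list:
    metrics = user.get("public_metrics", {})
    created = datetime.fromisoformat(user["created_at"].replace("Z", "+00:00"))
    age_days = (datetime.now(timezone.utc) - created).days
    return [
        len(user.get("description", "")),   # profile description length
        len(user.get("name", "")),          # screen-name length
        len(user.get("username", "")),      # user-name length
        int(user.get("verified", False)),   # verified account
        int(user.get("protected", False)),  # protected account
        metrics.get("followers_count", 0),  # followers
        metrics.get("following_count", 0),  # accounts followed
        metrics.get("tweet_count", 0),      # Tweets posted
        metrics.get("listed_count", 0),     # list memberships
        age_days,                           # account age in days
        int(bool(user.get("url"))),         # profile URL present
        int(bool(user.get("location"))),    # location provided
    ]
```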

3.4. Proposed Model Architecture

Deep learning techniques have been demonstrated to be effective in classification tasks highly related to fake news detection, such as sentiment analysis and spam detection [24]. An essential advantage of deep learning methods is that they can effectively extract deep features of the input data during the training process. The powerful representation learning ability of deep learning algorithms reduces the feature dimensionality, and hence the computational cost [24]. CNNs and RNNs (LSTM and BiLSTM) are the deep learning algorithms most commonly used to identify news veracity in other languages, particularly English. They have attracted much interest because their architectures yield accurate and efficient performance [36]. To find the best classification performance, two deep learning models, CNN and BiLSTM, are tested in this study for detecting fake news.

3.4.1. CNN Architecture

A CNN learns local patterns from the given training data using the convolution operation. It is known that the CNN architecture significantly impacts the overall performance of the model [37]. In this work, the layered architecture of the CNN-based model has three parallel convolutional layers with different kernel sizes. Each convolutional layer is followed by a max pooling layer to reduce the dimensionality of the feature maps without affecting the network's efficiency. Then, the concatenation layer takes the output features from the pooling layers and combines them into a single matrix. Finally, a flatten layer reshapes the concatenated features into a one-dimensional vector.
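A Keras sketch of this parallel branch is given below, using the tuned values reported in Section 4.3 (64 filters, kernel sizes 1–3); the input sequence length and embedding dimension are illustrative assumptions.

```python
# Sketch of the three-branch CNN text feature extractor.
from tensorflow.keras import Input, layers

def build_cnn_text_branch(seq_len=64, emb_dim=768):
    text_in = Input(shape=(seq_len, emb_dim), name="token_embeddings")
    branches = []
    for kernel_size in (1, 2, 3):               # three parallel conv layers
        conv = layers.Conv1D(64, kernel_size, activation="relu")(text_in)
        branches.append(layers.MaxPooling1D(pool_size=2)(conv))
    merged = layers.Concatenate(axis=1)(branches)  # combine feature maps
    return text_in, layers.Flatten()(merged)       # one-dimensional vector
```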

3.4.2. BiLSTM Architecture

LSTM tackles the problem of limited memory concerning previous data in traditional neural networks by having neurons called memory cells to keep track of the most relevant information. The LSTM networks can capture long-term dependencies in the given data [38]. We preferred BiLSTM instead of LSTM because it can understand context-dependent sequences by scanning the data both forwards and backwards simultaneously [30]. In this work, the architecture of the BiLSTM-based model contains a single BiLSTM layer followed by a one-dimensional global max pooling layer.
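The corresponding BiLSTM branch can be sketched in the same style, with the 128 units reported in Section 4.3; the input shape again follows the assumptions used in the CNN sketch above.

```python
# Sketch of the BiLSTM text feature extractor.
from tensorflow.keras import Input, layers

def build_bilstm_text_branch(seq_len=64, emb_dim=768):
    text_in = Input(shape=(seq_len, emb_dim), name="token_embeddings")
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(text_in)
    return text_in, layers.GlobalMaxPooling1D()(x)
```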

3.4.3. Fully Connected Layers

The text feature vector produced by the BiLSTM- or CNN-based model is concatenated with the user feature vector. The combined vector is then passed through two fully connected layers with the ReLU activation function for learning. Dropout is applied to the dense layers in the network to avoid model overfitting. Finally, we set a classification layer with a softmax activation function to yield the probability distribution across the labels.
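Putting the pieces together, the sketch below concatenates either text branch with the twelve user features and adds the dense, dropout, and softmax layers described above; the layer sizes follow the tuned values in Section 4.3, and the dropout rate is passed as a parameter since it differs between the two variants.

```python
# Sketch of the complete classifier head (uses a branch sketched above).
from tensorflow.keras import Input, Model, layers

def build_classifier(text_in, text_features, n_user_features=12, dropout=0.5):
    user_in = Input(shape=(n_user_features,), name="user_features")
    x = layers.Concatenate()([text_features, user_in])  # text + user features
    for units in (512, 256):                            # two dense layers
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(dropout)(x)
    out = layers.Dense(2, activation="softmax")(x)      # fake vs. real
    return Model(inputs=[text_in, user_in], outputs=out)

# Example: CNN variant, built from the branch in Section 3.4.1.
model = build_classifier(*build_cnn_text_branch(), dropout=0.5)
model.summary()
```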

4. Experiments and Results

This section briefly describes the setup and results of the experiments that were conducted to evaluate the performance of the proposed deep learning model on the collected dataset.

4.1. Experimental Setup

Several experiments were carried out to enhance the proposed deep learning model's effectiveness. All experiments were performed on Google Colab, a web-based environment based on Jupyter Notebook that runs Python source code entirely on Google's cloud servers and provides popular Python libraries for implementing machine learning algorithms [2]. Keras, a Python wrapper for TensorFlow, was used to implement the proposed model. The dataset was split into five equally sized sets using sklearn's KFold module. In each round of cross-validation, we kept one fold as the testing set, and the remaining folds were utilized for training. Training was performed for 40 epochs with a batch size of 64. An early stopping technique was employed during model training to obtain a more generalized model and reduce overfitting. The parameters for early stopping are as follows: the validation loss is the monitored value, and the patience (the number of epochs with no improvement) is set to 10. When the validation loss no longer improves, the training process is terminated, and the model's weights from the epoch with the smallest validation loss are restored. Table 4 shows a summary of the common parameters.
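A sketch of this evaluation protocol follows, building on the model sketches from Section 3.4; X_text, X_user, and the one-hot labels y are assumed to be preloaded NumPy arrays, and drawing the early-stopping validation split from each training fold is an assumption.

```python
# Sketch: five-fold cross-validation with early stopping on validation loss.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []
for train_idx, test_idx in kf.split(X_text):
    model = build_classifier(*build_cnn_text_branch(), dropout=0.5)
    model.compile(optimizer="adamax", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    stop = EarlyStopping(monitor="val_loss", patience=10,
                         restore_best_weights=True)
    model.fit([X_text[train_idx], X_user[train_idx]], y[train_idx],
              validation_split=0.1,   # held-out part of the training fold
              epochs=40, batch_size=64, callbacks=[stop], verbose=0)
    _, acc = model.evaluate([X_text[test_idx], X_user[test_idx]],
                            y[test_idx], verbose=0)
    fold_accuracies.append(acc)
print("mean test accuracy:", np.mean(fold_accuracies))
```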
To discover the best configuration of hyperparameters for the proposed models, randomly sampled experiments were conducted over crucial hyperparameters with different values. Each experiment was run with the two deep learning models, CNN and BiLSTM, combined with one of the five word embedding approaches. Table 5 presents a list of hyperparameters and their experimental ranges for each of the deep learning architectures. In our proposed models, three types of layers make up the deep learning models: convolutional or BiLSTM layers, pooling layers, and dense layers. The hyperparameter tuning process for the convolutional layer involves evaluating two parameters, the kernel size and the number of filters. The convolutional layer takes a contiguous sequence of features determined by the kernel size as input and performs the convolution operation to produce a feature map; the number of feature maps is specified by the number of filters. In the BiLSTM layer, we investigated the effect of the number of neural units. In the dense layers, the number of neurons was also explored. Dropout was employed as a regularizer to keep the model from overfitting on the training data and to make it generic by reducing the number of trainable parameters active in each iteration. The performance of the model was estimated using various dropout rates. In addition, various optimizers and learning rates were evaluated through the classification accuracy. The final configuration of the model's hyperparameters is the subset that achieved the best performance on the testing set.
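A minimal manual random-search loop over such ranges is sketched below; the ranges shown are illustrative placeholders and do not reproduce Table 5.

```python
# Sketch: random sampling over a hyperparameter search space.
import random

search_space = {
    "filters": [32, 64, 128],
    "kernel_sizes": [(1, 2, 3), (2, 3, 4)],
    "lstm_units": [64, 128, 256],
    "dense_units": [(512, 256), (256, 128)],
    "dropout": [0.3, 0.5],
    "optimizer": ["adam", "adamax", "rmsprop"],
    "learning_rate": [0.001, 0.002, 0.004],
}

def sample_config() -> dict:
    return {name: random.choice(values) for name, values in search_space.items()}

for trial in range(20):          # number of random trials (assumed)
    config = sample_config()
    # Build, train, and score a model with this configuration here,
    # keeping the configuration that performs best on the testing set.
    print(trial, config)
```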

4.2. Evaluation Metrics

The performance of our model is reported using weighted standard classification metrics, including accuracy, recall, precision, and F1-score, to account for the class imbalance. These metrics are calculated from the confusion matrix, which, for binary classification, is expressed in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
  • Accuracy (A) is the ratio of the number of correctly classified samples to the overall number of samples:
    A = (TP + TN) / (TP + TN + FP + FN)
  • Precision (P) is the ratio of the number of correctly classified positive samples to the overall number of samples classified as positive:
    P = TP / (TP + FP)
  • Recall (R) is the ratio of the number of correctly classified positive samples to the overall number of samples that should have been classified as positive:
    R = TP / (TP + FN)
  • F1-score (F1) is the harmonic mean of precision and recall:
    F1 = (2 × P × R) / (P + R)
For each experiment, we calculated the average for each metric obtained from the five-fold test sets.
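The weighted metrics can be computed with scikit-learn as sketched below; the label arrays are illustrative stand-ins for the predictions pooled from one test fold.

```python
# Sketch: weighted classification metrics for one test fold.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1]  # 1 = fake, 0 = real (illustrative labels)
y_pred = [1, 0, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="weighted")
rec = recall_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")
print(f"A={acc:.4f}  P={prec:.4f}  R={rec:.4f}  F1={f1:.4f}")
```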

4.3. Experimental Results

We performed the classification experiments with different parameters and selected the final configuration that achieved the best performance on the testing set. The optimal configurations for the CNN- and BiLSTM-based models are described in detail below and summarized in Table 6 and Table 7.
The architecture of the CNN-based model comprises three parallel convolution layers and three max pooling layers, followed by two fully connected layers. A fixed number of filters (64) is used for the CNN layers, with kernels of sizes 1, 2, and 3. The two dense layers contain 512 and 256 neurons, respectively, and ReLU is used as the activation function. A dropout of 0.5 is applied to the output of each dense layer. The Adamax optimization technique was used with a learning rate of 0.004, and categorical_crossentropy was used as the loss function. The design of the BiLSTM-based model consists of a BiLSTM layer and a global max pooling layer, followed by two fully connected layers. The first layer is a BiLSTM layer, with forward and backward LSTMs of 128 neural units each. The output features are passed through two dense layers of 512 and 256 neurons, respectively, with the ReLU activation function. Dropout layers with a probability of 0.3 are used after each dense layer. The Adamax optimizer with a learning rate of 0.002 is used as the optimization function, while the loss function is categorical_crossentropy.
After tuning the models’ hyperparameters, the two models were trained with five different feature extraction methods. In this work, four famous word embedding models (word2vec, fastText, ARBERT, and MARBERT) and the Keras embedding layer were used. We relied on using published pre-trained word embedding models instead of training models to generate word embeddings. Training the models from scratch would require a large volume of data and is often time consuming. Table 8 and Table 9 show the evaluation results using CNN- and BiLSTM-based models. Both models were evaluated by applying a five-fold cross-validation, and the final results are reported by averaging the performance metrics yielded from all the testing sets. The performance metrics include accuracy, precision, recall, and F1-score.
Based on the obtained results, all results ranged between 0.9300 and 0.9563 in terms of the F1-score, demonstrating that our approach is highly effective at distinguishing fake news from real news Tweets. As seen in Figure 2 and Figure 3, the best classification results were obtained when the models used the pre-trained context-based models (ARBERT and MARBERT) to extract embeddings at the word level. MARBERT provides a slight improvement over the results achieved by ARBERT. Using MARBERT, we achieved F1-scores of 0.9563 with CNN and 0.9545 with BiLSTM, while we attained F1-scores of 0.9531 with CNN and 0.9542 with BiLSTM using ARBERT. To evaluate the effect of the embedding dimension on model performance, we set the Keras embedding layer and fastText to 100–300 dimensions, and word2vec to 100 and 300 dimensions. Overall, the Keras embedding layer outperformed the fastText and word2vec models, obtaining an F1-score of 0.9497 with an embedding size of 200. Further analysis was performed on the model architecture to investigate the effect of integrating user features with textual features. We experimented with the best-performing model from Table 8, MARBERT-CNN, using only the textual features of the news Tweets. Figure 4 presents the results obtained for the combination of textual and user features and for textual features alone. When only textual features were used, the evaluation result for the CNN model decreased by approximately 3.0%. This result shows that incorporating user features with textual features improves the prediction performance of the proposed model.

4.4. Impact of Oversampling and Undersampling

To overcome the issue of class imbalance and show the proposed model's performance on a balanced dataset, oversampling and undersampling techniques were applied. Oversampling balances the majority and minority classes by increasing the number of samples in the minority class, either by randomly replicating original samples or by generating synthetic samples. Replicating samples, however, may increase the likelihood of overfitting; introducing synthetic samples into the data space is a more effective approach. An example of a synthetic oversampling method is SMOTE (synthetic minority oversampling technique), whose idea is to generate synthetic samples in the space between randomly selected minority samples and their k-nearest neighbours of the same class. Undersampling balances the majority and minority classes by randomly discarding samples from the majority class [39]. In our experiments, we examined the oversampled and undersampled datasets with the best-performing model based on the results reported in Table 8. Figure 5 provides the model results when applying the oversampling and undersampling techniques to the dataset. As depicted in Figure 5, neither technique enhanced the model's performance.
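The two balancing strategies can be applied with the imbalanced-learn library as sketched below; the synthetic feature matrix mimics the 4000/1000 class split of our dataset, and SMOTE's default five nearest neighbours are assumed.

```python
# Sketch: SMOTE oversampling vs. random undersampling on imbalanced data.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy data standing in for the 4000 real / 1000 fake split.
X, y = make_classification(n_samples=5000, weights=[0.8, 0.2],
                           flip_y=0, random_state=42)

X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(len(y), len(y_over), len(y_under))  # 5000, 8000, 2000
```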

4.5. Discussion

Based on the experimental results, the two context-based word embeddings (ARBERT and MARBERT) improve upon the classic word embeddings. Context-based word embedding has the advantage of learning word representations based on context: it generates the embedding for a word after acquiring the contextual meaning of the word in the sentence, and therefore gives a different embedding for the same word depending on the context in which it appears, whereas classic word embeddings represent each word by a unique and distinct vector without considering the word's context. However, the ARBERT and MARBERT models failed to improve the model's performance when used for sentence-level embedding generation. The experimental findings for the classic word embeddings indicated that the models performed well when using Keras-generated word embeddings. The Keras embedding layer was configured to learn word embeddings during the process of training the model on the task-specific corpus. Although fastText and word2vec were pre-trained on huge general corpora, the Keras embedding showed a better performance with CNN and BiLSTM. Each word embedding model was fed to the CNN- and BiLSTM-based models. In most situations, CNN exhibited a slight improvement over BiLSTM; moreover, it is computationally less expensive. ARBERT and word2vec achieved a better performance when BiLSTM was applied to learn the embedded vectors.
Lastly, to assess the effectiveness of the best performing model, MARBERT-CNN, we further tested it on the dataset introduced by [40]. It is important to mention that the original dataset comprises 2708 Tweets, but we were only able to retrieve 1642 Tweets. This may be because the Tweet’s author has protected or deleted their Tweets, and thus could not be reached via the Twitter API. It is also possible that the Twitter API has restricted access to those Tweets due to the privacy policy. The model obtained 0.7467, 0.7490, 0.7467, and 0.7400 for accuracy, precision, recall, and F1-score, respectively. Despite the dataset containing Tweets related to a specific topic, i.e., the Syrian crisis, the proposed model achieved an acceptable accuracy. This result proves that the proposed model accurately detects fake news related to various topics.

4.6. Comparison with Models from the Literature

Further experiments were carried out to evaluate the effectiveness of the architecture of the proposed MARBERT-CNN model. For this, we used two public Arabic Twitter datasets covering the topic of COVID-19. The first dataset, ArCOV19-Rumors [41], contains 3584 Tweets, of which 1753 are false news and 1831 are true news. The second dataset, COVID-19 misinformation, was introduced by [3]; it comprises a total of 8783 collected Tweets, consisting of 1311 false news and 7475 true news. We used the Twitter API to retrieve the Tweets of the two datasets. It is worth noting that the Twitter API was unable to provide access to 803 Tweets from ArCOV19-Rumors and 2820 Tweets from COVID-19 misinformation.
Table 10 presents the comparison results of our model, MARBERT-CNN, against previous models on the ArCOV19-Rumors and COVID-19 misinformation datasets. Based on reported results, the proposed MARBERT-CNN model architecture achieved the best result among the other deep learning-based models for the two datasets.

5. Conclusions and Future Work

Combating and eliminating fake news across social networking platforms is challenging. Few studies have been performed to mitigate fake news in the Arabic language, and their outcomes still need improvement. In this study, we presented a deep learning approach to filter fake news in Arabic Tweets. To estimate the effectiveness of the proposed approach, we created a novel dataset containing news Tweets labelled as real or fake. We explored the efficiency of the Keras embedding layer and of well-known pre-trained word embeddings (word2vec, fastText, ARBERT, and MARBERT) using two deep learning algorithms (CNN and BiLSTM) to find an efficient model for the false news detection problem. According to the findings of this study, the contextual embeddings, ARBERT and MARBERT, produced the best word representations, which was reflected in the model performance. Regarding the deep learning algorithms, the model architecture that relied on CNN performed better than BiLSTM in most situations. The best-performing model was obtained by combining MARBERT and CNN, with a high accuracy and an F1-score of 0.956. Incorporating user features with textual features was shown to aid the performance of false news detection. Compared to existing deep learning models for false news detection, one advantage of our work is that it can be applied to news on different topics. Another advantage is detecting news before it spreads and infects society and news sites; this is possible because the model utilizes both the news content and the profile of the user who posted the news, all of which are readily available at an early stage of news propagation.
Although this work achieved promising results, some limitations and challenges still need to be addressed in the future. First, the collected dataset contained few fake news Tweets compared to real news Tweets; hence, further training on a larger, balanced dataset is required to enhance the model’s generalization. Second, incorporating user characteristics improved model performance by 3%; we intend to explore more effective user features to better distinguish fake news publishers from legitimate users. Additionally, this work could be improved by employing fine-tuned Arabic BERT-based models. Lastly, recent fake news Tweets often contain deceptive images and videos that resemble true news to attract more attention from readers. Future work should concentrate on developing a model that recognizes both fake textual and visual content.
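As a pointer toward the first item, class balancing can be prototyped in a few lines; the sketch below uses random oversampling from the imbalanced-learn library, with placeholder feature and label arrays.

```python
# A minimal balancing sketch using imbalanced-learn; X_features and y_labels
# are placeholder arrays for the extracted features and the real/fake labels.
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)   # duplicates minority-class samples
X_balanced, y_balanced = ros.fit_resample(X_features, y_labels)
```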

Author Contributions

Conceptualization, S.A.; data curation, S.A. and M.K.; methodology, S.A.; software, S.A.; supervision, M.K. and F.A.; writing—original draft, S.A.; writing—review and editing, M.K. and F.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset generated during this study can be found here: https://github.com/ShathaAbdulrhaim/Real-and-Fake-News-Tweets-Dataset.git.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Thaher, T.; Saheb, M.; Turabieh, H.; Chantar, H. Intelligent detection of false information in Arabic tweets utilizing hybrid Harris Hawks based feature selection and machine learning models. Symmetry 2021, 13, 556.
2. Liu, Y.; Wu, Y.F.B. FNED: A deep network for fake news early detection on social media. ACM Trans. Inf. Syst. (TOIS) 2020, 38, 1–33.
3. Alqurashi, S.; Hamoui, B.; Alashaikh, A.; Alhindi, A.; Alanazi, E. Eating garlic prevents COVID-19 infection: Detecting misinformation on the Arabic content of Twitter. arXiv 2021, arXiv:2101.05626.
4. Kaliyar, R.K.; Goswami, A.; Narang, P. DeepFakE: Improving fake news detection using tensor decomposition-based deep neural network. J. Supercomput. 2021, 77, 1015–1037.
5. Al-Sarem, M.; Alsaeedi, A.; Saeed, F.; Boulila, W.; AmeerBakhsh, O. A novel hybrid deep learning model for detecting COVID-19-related rumors on social media based on LSTM and concatenated parallel CNNs. Appl. Sci. 2021, 11, 7940.
6. Ameur, M.S.H.; Aliane, H. AraCOVID19-MFH: Arabic COVID-19 multi-label fake news & hate speech detection dataset. Procedia Comput. Sci. 2021, 189, 232–241.
7. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26.
8. Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning word vectors for 157 languages. arXiv 2018, arXiv:1802.06893.
9. Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E.M.B. ARBERT & MARBERT: Deep bidirectional transformers for Arabic. arXiv 2020, arXiv:2101.01785.
10. Helwe, C.; Elbassuoni, S.; Al Zaatari, A.; El-Hajj, W. Assessing Arabic weblog credibility via deep co-learning. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, Florence, Italy, August 2019; pp. 130–136.
11. Khouja, J. Stance prediction and claim verification: An Arabic perspective. arXiv 2020, arXiv:2005.10410.
12. Nagoudi, E.M.B.; Elmadany, A.; Abdul-Mageed, M.; Alhindi, T.; Cavusoglu, H. Machine generation and detection of Arabic manipulated and fake news. arXiv 2020, arXiv:2011.03092.
13. Himdi, H.; Weir, G.; Assiri, F.; Al-Barhamtoshy, H. Arabic fake news detection based on textual analysis. Arab. J. Sci. Eng. 2022, 47, 10453–10469.
14. Al-Yahya, M.; Al-Khalifa, H.; Al-Baity, H.; AlSaeed, D.; Essam, A. Arabic fake news detection: Comparative study of neural networks and transformer-based approaches. Complexity 2021, 2021, 5516945.
15. Saadany, H.; Mohamed, E.; Orasan, C. Fake or real? A study of Arabic satirical fake news. arXiv 2020, arXiv:2011.00452.
16. Mahlous, A.R.; Al-Laith, A. Fake news detection in Arabic tweets during the COVID-19 pandemic. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 778–788.
17. Qasem, S.N.; Al-Sarem, M.; Saeed, F. An ensemble learning based approach for detecting and tracking COVID-19 rumors. Comput. Mater. Contin. 2021, 70, 1721–1747.
18. Amoudi, G.; Albalawi, R.; Baothman, F.; Jamal, A.; Alghamdi, H.; Alhothali, A. Arabic rumor detection: A comparative study. Alex. Eng. J. 2022, 61, 12511–12523.
19. Sabbeh, S.F.; Baatwah, S.Y. Arabic news credibility on Twitter: An enhanced model using hybrid features. J. Theor. Appl. Inf. Technol. 2018, 96, 2327–2338.
20. Alzanin, S.M.; Azmi, A.M. Rumor detection in Arabic tweets using semi-supervised and unsupervised expectation–maximization. Knowl.-Based Syst. 2019, 185, 104945.
21. Mouty, R.; Gazdar, A. The effect of the similarity between the two names of Twitter users on the credibility of their publications. In Proceedings of the 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Spokane, WA, USA, 30 May–2 June 2019; pp. 196–201.
22. Jardaneh, G.; Abdelhaq, H.; Buzz, M.; Johnson, D. Classifying Arabic tweets based on credibility using content and user features. In Proceedings of the 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 9–11 April 2019; pp. 596–601.
23. Twitter API Documentation. Available online: https://developer.twitter.com/en/docs (accessed on 18 January 2022).
24. Alharthi, R.; Alhothali, A.; Moria, K. A real-time deep-learning approach for filtering Arabic low-quality content and accounts on Twitter. Inf. Syst. 2021, 99, 101740.
25. Hegazi, M.O.; Al-Dossari, Y.; Al-Yahy, A.; Al-Sumari, A.; Hilal, A. Preprocessing Arabic text on social media. Heliyon 2021, 7, e06191.
26. Alwehaibi, A.; Roy, K. Comparison of pre-trained word vectors for Arabic text classification using deep learning approach. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 1471–1474.
27. El-Alami, F.z.; El Alaoui, S.O.; Nahnahi, N.E. Contextual semantic embeddings based on fine-tuned AraBERT model for Arabic text multi-class categorization. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 8422–8428.
28. Ashi, M.M.; Siddiqui, M.A.; Nadeem, F. Pre-trained word embeddings for Arabic aspect-based sentiment analysis of airline tweets. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018, Cairo, Egypt, 1–3 September 2018; Springer: Cham, Switzerland, 2019; pp. 241–251.
29. Bogale Gereme, F.; Zhu, W. Fighting fake news using deep learning: Pre-trained word embeddings and the embedding layer investigated. In Proceedings of the 2020 3rd International Conference on Computational Intelligence and Intelligent Systems, Tokyo, Japan, 13–15 November 2020; pp. 24–29.
30. Saleh, H.; Alhothali, A.; Moria, K. Detection of hate speech using BERT and hate speech word embedding with deep model. Appl. Artif. Intell. 2023, 37, 2166719.
31. Soliman, A.B.; Eissa, K.; El-Beltagy, S.R. AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Comput. Sci. 2017, 117, 256–265.
32. d’Sa, A.G.; Illina, I.; Fohr, D. BERT and fastText embeddings for automatic detection of toxic speech. In Proceedings of the 2020 International Multi-Conference on: “Organization of Knowledge and Advanced Technologies” (OCTA), Tunis, Tunisia, 6–8 February 2020; pp. 1–5.
33. Alammary, A.S. BERT models for Arabic text classification: A systematic review. Appl. Sci. 2022, 12, 5720.
34. Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based model for Arabic language understanding. arXiv 2020, arXiv:2003.00104.
35. Abdelali, A.; Hassan, S.; Mubarak, H.; Darwish, K.; Samih, Y. Pre-training BERT on Arabic tweets: Practical considerations. arXiv 2021, arXiv:2102.10684.
36. Thompson, R.C.; Joseph, S.; Adeliyi, T.T. A systematic literature review and meta-analysis of studies on online fake news detection. Information 2022, 13, 527.
37. Kaliyar, R.K.; Goswami, A.; Narang, P. FakeBERT: Fake news detection in social media with a BERT-based deep learning approach. Multimed. Tools Appl. 2021, 80, 11765–11788.
38. Rai, N.; Kumar, D.; Kaushik, N.; Raj, C.; Ali, A. Fake news classification using transformer based enhanced LSTM and BERT. Int. J. Cogn. Comput. Eng. 2022, 3, 98–105.
39. Mohammed, R.; Rawashdeh, J.; Abdullah, M. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020; pp. 243–248.
40. Al Zaatari, A.; El Ballouli, R.; Elbassuoni, S.; El-Hajj, W.; Hajj, H.; Shaban, K.; Habash, N.; Yahya, E. Arabic corpora for credibility analysis. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 4396–4401.
41. Haouari, F.; Hasanain, M.; Suwaileh, R.; Elsayed, T. ArCOV19-Rumors: Arabic COVID-19 Twitter dataset for misinformation detection. arXiv 2020, arXiv:2010.08768.
Figure 1. The architecture of the proposed methodology.
Figure 2. The experimental results of the CNN-based model.
Figure 3. The experimental results of the BiLSTM-based model.
Figure 4. The effect of utilizing user features on the performance of the MARBERT-CNN model.
Figure 5. The experimental results of the oversampling and undersampling techniques.
Table 2. An example of some collected Tweets with their translation.

Tweet Label | Tweet Text
Real | القوات الروسية تسيطر على مفاعل #تشيرنوبل النووي. #عاجل #اوكرانيا #روسيا
(Translation: Russian forces take control of the #Chornobyl nuclear reactor. #Urgent #Ukraine #Russia)
Fake | خبر عاجل جداً #على حسب المرصد الإخباري، إصطدام قمر صناعي هندي بمحطة الفضاء الدولية أثناء صيانتها. astronaut Michael Collins https://t.co/6ucb9alyWO
(Translation: Very urgent news #according to the news observatory, an Indian satellite collided with the International Space Station during its maintenance. astronaut Michael Collins https://t.co/6ucb9alyWO)
Real | لاعبو منتخب #إنجلترا يثيرون حالة من الجدل خلال مواجهة #سويسرا.. بسبب خوض نجوم ”الأسود الثلاثة“ لجزء من المباراة بدون أسماء على قمصانهم #عينك_على_العالم https://t.co/EOPcdUhz1X
(Translation: The players of the #England national team raise a state of controversy during the confrontation with #Switzerland, because the “Three Lions” stars played part of the match without names on their shirts #YourEye_On_theWorld https://t.co/EOPcdUhz1X)
Fake | الجزائر تفرض حظر جوي وبحري على المملكة المغربية، بداية من الغد، ممنوع مرور السفن والطائرات النقل المغربية بالمجال الجوي والبحري الجزائري.
(Translation: Algeria has imposed an air and sea embargo on the Kingdom of Morocco; starting tomorrow, it is forbidden for Moroccan ships and planes to pass through Algerian air and sea space.)
Table 3. A sample Tweet before and after the preprocessing steps.

Raw Tweet | Processed Tweet
[The raw and processed Tweet are shown as images in the original article and are not reproducible here.]
Table 4. Common parameters of the proposed model.

Parameter | Value
Epoch | 40
Batch size | 64
Data split | Cross-validation (five folds)
Validation split | 0.1
Early stopping | Monitor = validation loss, patience = 10
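The settings in Table 4 map directly onto standard Keras and scikit-learn facilities; a minimal sketch of the training protocol is given below, where build_model() and the arrays X and y are placeholders and restore_best_weights is an added assumption.

```python
# A minimal sketch of the Table 4 protocol (build_model(), X, and y are
# placeholders; restore_best_weights is an assumption).
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)

fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=42).split(X):
    model = build_model()                     # fresh model per fold
    model.fit(X[train_idx], y[train_idx],
              epochs=40, batch_size=64,       # per Table 4
              validation_split=0.1,           # 10% of the fold for validation
              callbacks=[early_stop])
    fold_scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0))

print("Mean [loss, accuracy] across folds:", np.mean(fold_scores, axis=0))
```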
Table 5. Hyperparameters of the proposed model.

Processing Layer | Parameter | Experimental Range
CNN | Kernel size | [1, 2, 3, 4, 5, 6, 7]
 | No. of filters | [16, 32, 64, 128]
BiLSTM | No. of LSTM units | [16, 32, 64, 128]
Dense | No. of neurons in the dense layer | [128, 256, 512, 1024]
 | Dropout rate | [0.1–0.5]
 | Optimizer | [Adamax, Adam, SGD]
 | Learning rate | [0.001, 0.002, 0.003, 0.004]
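As an illustration of how these ranges can be searched, the sketch below wires them into KerasTuner; the choice of random search, the trial budget, and the input shape are assumptions rather than the authors’ actual tuning setup.

```python
# A minimal KerasTuner sketch over the ranges in Table 5; the tuner type,
# trial budget, and input shape are assumptions.
import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(hp):
    m = models.Sequential([
        layers.Conv1D(hp.Choice("filters", [16, 32, 64, 128]),
                      hp.Choice("kernel_size", [1, 2, 3, 4, 5, 6, 7]),
                      activation="relu", input_shape=(64, 768)),
        layers.GlobalMaxPooling1D(),
        layers.Dense(hp.Choice("dense_units", [128, 256, 512, 1024]),
                     activation="relu"),
        layers.Dropout(hp.Float("dropout", 0.1, 0.5, step=0.1)),
        layers.Dense(1, activation="sigmoid"),
    ])
    optimizer = tf.keras.optimizers.get({
        "class_name": hp.Choice("optimizer", ["Adamax", "Adam", "SGD"]),
        "config": {"learning_rate": hp.Choice("lr", [0.001, 0.002, 0.003, 0.004])},
    })
    m.compile(optimizer=optimizer, loss="binary_crossentropy",
              metrics=["accuracy"])
    return m

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=20)
# tuner.search(X_train, y_train, validation_split=0.1)  # placeholder arrays
```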
Table 6. The CNN-based model’s hyperparameters.

Parameter | Value
Kernel size | 1, 2, 3
No. of filters | 64
No. of neurons in the dense layer | 512, 256
Dropout rate | 0.5
Optimizer | Adamax
Learning rate | 0.004
Table 7. The BiLSTM-based model’s hyperparameters.

Parameter | Value
No. of LSTM units | 128
No. of neurons in the dense layer | 512, 256
Dropout rate | 0.3
Optimizer | Adamax
Learning rate | 0.002
Table 8. The experimental results of the CNN-based model.

Embedding Model | Embedding Level | Dim | Accuracy | Precision | Recall | F1-Score
Keras Embedding Layer | Word_level | 100 | 0.9458 | 0.9462 | 0.9458 | 0.9459
Keras Embedding Layer | Word_level | 200 | 0.9502 | 0.9500 | 0.9502 | 0.9497
Keras Embedding Layer | Word_level | 300 | 0.9498 | 0.9500 | 0.9498 | 0.9496
Word2vec | Word_level | 100 | 0.9330 | 0.9324 | 0.9330 | 0.9315
Word2vec | Word_level | 300 | 0.9322 | 0.9311 | 0.9322 | 0.9307
FastText | Word_level | 100 | 0.9410 | 0.9407 | 0.9410 | 0.9405
FastText | Word_level | 200 | 0.9376 | 0.9331 | 0.9376 | 0.9370
FastText | Word_level | 300 | 0.9484 | 0.9482 | 0.9484 | 0.9481
ARBERT | Word_level | 768 | 0.9538 | 0.9534 | 0.9538 | 0.9531
ARBERT | Sentence_level | 768 | 0.9380 | 0.9375 | 0.9380 | 0.9368
MARBERT | Word_level | 768 | 0.9564 | 0.9564 | 0.9564 | 0.9563
MARBERT | Sentence_level | 768 | 0.9446 | 0.9462 | 0.9446 | 0.9448
Table 9. The experimental results of the BiLSTM-based model.

Embedding Model | Embedding Level | Dim | Accuracy | Precision | Recall | F1-Score
Keras Embedding Layer | Word_level | 100 | 0.9478 | 0.9472 | 0.9478 | 0.9474
Keras Embedding Layer | Word_level | 200 | 0.9444 | 0.9435 | 0.9444 | 0.9437
Keras Embedding Layer | Word_level | 300 | 0.9454 | 0.9457 | 0.9454 | 0.9454
Word2vec | Word_level | 100 | 0.9470 | 0.9464 | 0.9470 | 0.9464
Word2vec | Word_level | 300 | 0.9460 | 0.9455 | 0.9460 | 0.9456
FastText | Word_level | 100 | 0.9300 | 0.9305 | 0.9300 | 0.9300
FastText | Word_level | 200 | 0.9336 | 0.9336 | 0.9336 | 0.9335
FastText | Word_level | 300 | 0.9356 | 0.9354 | 0.9356 | 0.9353
ARBERT | Word_level | 768 | 0.9548 | 0.9543 | 0.9548 | 0.9542
ARBERT | Sentence_level | 768 | 0.9476 | 0.9470 | 0.9476 | 0.9470
MARBERT | Word_level | 768 | 0.9548 | 0.9546 | 0.9548 | 0.9545
MARBERT | Sentence_level | 768 | 0.9426 | 0.9437 | 0.9426 | 0.9428
Table 10. Comparison of the proposed model with the results obtained by previous models on publicly available datasets.

Dataset | Ref. | Approach Used | Accuracy | F1-Score
ArCOV19-Rumors | [18] | BiLSTM | 0.8000 | 0.7900
ArCOV19-Rumors | Proposed | MARBERT-CNN | 0.8630 | 0.8603
COVID-19 misinformation | [3] | CNN-RNN | 0.7000 | 0.5400
COVID-19 misinformation | Proposed | MARBERT-CNN | 0.8707 | 0.8377