1. Introduction
Over the past decade, large volumes of online information have spread rapidly among a growing number of social network users. This phenomenon has often been exploited by malicious users and entities, which forge, distribute, and reproduce fake news and propaganda [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]. Fake news is intentionally forged information, which is distributed either to deceive and make false information believable, or to make verifiable facts doubtful [2,5,7,8,9,10,11,12,15,19,20,21]. Propaganda is a related term for information which promotes specific political motives and other agendas [1,8,9,10,11,12,16,18,21,22].
The language used in forging fake news is deceptive, in the sense that it is intended to provoke and aggravate the users emotionally and lead them to spread the fake news [5,11,12,15,16,17,19,20,23] (e.g., “You thought this is on behalf of the people in Hong Kong. On the contrary, it is a rascality of putting the “false freedom” label on the will of most of Hong Kong people.”). Another common indicator of deceptive language is the promotion of only one viewpoint, which makes it highly subjective [12,16,20,22] (e.g., “@feituji1994 I think we should supporting the Hong Kong Government.”). Additionally, grammatical and spelling mistakes, as well as the use of the same limited set of words, are characteristic properties of deceptive language [7,11,12,16]. The recent development of natural language processing (NLP), data mining, and machine learning tools has led to a more qualitative understanding of the features of deceptive language (linguistic features), as well as of the features of malicious users and entities (network account features) [1,2,4,5,7,8,11,12,14,15,16,17,18,19,22].
Fake news detection is the ability to determine the truthfulness of information by analyzing its contents and related features [7,11]. Due to the unstructured and noisy data, the dynamic nature of news, and the increasing number of users, automated solutions for fake news and deception detection in social networks are required [1,2,6,8,10,12,14,15,16,17,18,19,21,22]. Consequently, fake news and deception detection on social networks presents unique challenges and has become an emerging research field, with future directions in data-oriented, feature-oriented, model-oriented, and application-oriented issues [1,2,3,4,5,8,11,12,15,16,18,22,24,25].
Unlike previous works, our work presents the following novelty and contribution:
While the problem of fake news detection has been tackled in the past in a number of ways, most reported approaches rely on a limited set of existing, widely accepted and validated real/fake news data. The present work builds the pathway towards developing a new Twitter data set with real/fake news regarding a particular incident, namely the Hong Kong protests of the summer of 2019. The process of exploiting the fake tweets provided by Twitter itself, as well as the process of collecting and validating real tweet news pertaining to the particular event, are described in detail and generate a best practice setting for developing fake/real news data sets with significant derived findings.
Another novelty of the proposed work is the form of the input to the learning schema. More specifically, tweet vectors are used, in a pairwise setting. One of the vectors in every pair is real and the other may be real or fake. The correct classification of the latter relies on the similarity/diversity it presents when compared to the former.
The high performance of fake news detection in the literature relies to a large extent on the exploitation of exclusively account-based features, or on the exploitation of exclusively linguistic features. Unlike related work, the present work places high emphasis on the use of multimodal input that varies from word embeddings derived automatically from unstructured text to string-based and morphological features (number of syllables, number of long sentences, etc.), and from higher-level linguistic features (like the Flesch-Kincaid level, the adverbs-adjectives rate, etc.) to network account-related features.
The proposed deep learning architecture is designed in an innovative way and is used for the first time for fake news detection. The deep learning network exploits all aforementioned input types in various combinations. Input is fused into the network at various layers, with high flexibility, in order to achieve optimal classification accuracy.
The input tweet may constitute the news text or the news header (defined in detail in Section 4). Previous works have used news article headers and text as the two inputs for pairwise settings. However, this is the first time that tweets are categorized into headers and text based on their linguistic structure. This distinction in Twitter data for fake news detection is made for the first time herein, accompanied by an extensive experimental setup that aims to compare the classification performance depending on the input type.
Our work provides a detailed comparison of the proposed model with commonly used classification models according to related work. Additionally, experiments with these models are conducted on the same input, in order to directly assess and compare their performance with that of the proposed pairwise schema.
Finally, an extensive review of the recent literature in fake news detection with machine learning is provided in the proposed work. Previous works with various types of data (news articles, tweets, etc.), different categories of features (network account, linguistic, etc.), and the most efficient network architectures and classification models are described thoroughly.
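To illustrate the kind of string-based and higher-level linguistic features mentioned above, the following is a minimal sketch of the Flesch-Kincaid grade level computed from syllable, word, and sentence counts. The syllable counter is a rough vowel-group heuristic and the sentence splitter is a simple punctuation rule; both are assumptions made for illustration and are not necessarily the tools used to extract the features of this work. The function names are hypothetical.

```python
import re

def count_syllables(word):
    """Approximate the syllable count as the number of vowel groups.
    Assumption: a heuristic, not a pronunciation-dictionary lookup."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0  # undefined for empty input; 0.0 is an arbitrary fallback
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Hypothetical example sentence, not taken from the data set.
grade = flesch_kincaid_grade("The cat sat.")
```

Short, simple sentences yield low (even negative) grade values, whereas long sentences with polysyllabic words yield high ones.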
The rest of this paper is structured as follows:
Section 2 discusses the recent related work regarding fake news detection from social networks, including the most common types of data and efficient machine learning techniques.
Section 3 describes the creation and preprocessing of the data sets used in our experiments.
Section 4 outlines the methodology regarding the feature set (Section 4.1), the embedding (Section 4.2), and the network architecture (Section 4.3).
Section 5 presents the experiments’ implementation, both for real header and real text input.
Section 6 discusses the experiments’ results, and compares them to recent related work.
Section 7 discusses the findings, concludes the paper, and presents some guidelines for future work.
2. Related Work
The spread of fake news has caused severe issues, having a great impact on major social events. Consequently, the recent related work regarding fake news detection from social networks is vast and several researchers have attempted to organize it and identify the most common types of data and machine learning techniques. Vishwakarma and Jain [
8] listed the recent methods and data sets for fake news detection based on the content type of news they are applied to—the input data being either text or images. The review of Perera [
22] offered an overview of the deep learning techniques for both manual and automatic fake news detection, identified 7 different levels of fake news based on the context, as well as on the motive for their creation and diffusion, and analyzed their processing by algorithms implemented for social media. Alam and Ravshanbekov [
12] provided a definition for fake news and discussed the positive impact of combining NLP and deep learning techniques in automatic fake news detection. In a survey by Merryton and Augasta [
4], baseline classifiers and deep learning techniques for fake and spam message detection were overviewed, and the most common NLP preprocessing methods and tools, as well as the most commonly used linguistic feature sets and data sets, were discussed. Han and Mehta [
13] identified several fake news types and linguistic features, evaluated the performance of baseline classifiers and the performance of deep learning techniques regarding fake news detection, and compared them on the basis of balancing accuracy and lightweightness. Shu et al. [
2] collected the existing definitions of fake news in the recent related work, identified the differences among the features, and the impact of fake news on social and on traditional media, and discussed the recent fake news detection approaches.
Regarding ensemble learning and reinforcement learning, there are certain works achieving high performance. Agarwal and Dixit [
5] used the LIAR data set for the fake news class, and a data set from Kaggle, consisting of 20,801 news reports from the USA, for the real news class, resulting in a binary classification framework. They extracted credibility scores and other linguistic features from the text, and both data sets were normalized and tokenized. Python-based tools and libraries (Scikit-Learn, pandas, numpy, Keras, NLTK) were used for data preprocessing and the experiments. They created an ensemble, consisting of a Support Vector Machine (SVM), a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), a k-Nearest Neighbor (KNN), and a Naive Bayes classifier, that used Bag of Words, Term Frequency–Inverse Document Frequency (TF-IDF), and n-grams. Their model achieved up to 97% accuracy with the LSTM. Wang et al. [
6] developed the WeFEND framework for automatic annotation of news articles, which used user reports from WeChat as a form of weak supervision for fake news detection. They extracted textual and linguistic features from the data and conducted experiments with reinforcement learning, using the Linguistic Inquiry and Word Count (LIWC) and LSTM, reaching an accuracy value of up to 82%.
There are several approaches that explore the significance of textual and linguistic features for fake news detection. Nikiforos et al. [
1] created a novel data set, consisting of 2366 tweets in English, regarding the Hong Kong protests of August 2019. Both network account and linguistic features were extracted from the tweets, while several features were identified as determinant for fake news detection. Their approach considered binary classification, and SMOTE over-sampling was applied to address class imbalance. The feature extraction, the SMOTE over-sampling and the experiments were conducted in the RapidMiner Studio. The performance of baseline classifiers, i.e., Naive Bayes and Random Forest, was evaluated, the final model achieving up to 99% accuracy. Zervopoulos et al. [
18] also created a data set regarding the same events. It consisted of 3908 tweets in English, and Chinese translated into English (fake news class), and 5388 tweets in English from news agencies and journalists (real news class). They used exclusively linguistic features, translated Chinese tweets into English with Google’s Translation API, and identified linguistically relevant tweets. Python, Scikit-Learn, and NLTK were used for the preprocessing and the experiments. They evaluated the performance of Naive Bayes, SVM, C4.5, and Random Forest classifiers, achieving an average of 92.1% F1 score, and the best results were obtained with Random Forest. Jeronimo et al. [
20] used a data set consisting of 207,914 news articles of 2 major mainstream media platforms in Brazil, collected from 2014 to 2017 (domains: Politics, Sports, Economy, and Culture), (real news class), and 95 news of 2 fact-checking services in Brazil (fake news class), collected from 2010 to 2017. The features were extracted by calculating the semantic distance between the data and 5 subjectivity lexicons (argumentation, presupposition, sentiment, valuation, and modalization) with Scikit-Learn. They conducted experiments with XGBoost, Random Forest (using Bag of Words and TF-IDF modeling), obtaining higher performance for inter domain scenarios. Mahyoob et al. [
11] used 20 posts from PolitiFact as real news and 20 posts from Facebook as fake news, deriving 6 classes in total. They performed a qualitative and a quantitative data analysis with the QDA tool, comparing the posts on the basis of their linguistic features. Wang et al. [
26] created LIAR, a new, publicly available data set for fake news detection. It consisted of approximately 12,800 manually labeled short statements of various topics from Politifact. Surface-level linguistic patterns were used for the experiments with hybrid CNNs, setting a benchmark for fake news detection on the novel data set. Shu et al. [
27] presented a novel fake news data repository, FakeNewsNet. It contained 2 data sets with various features, including news content, social context, and spatiotemporal information. They also discussed the potential use of the FakeNewsNet on fake news and deception detection in social media. Ruchansky et al. [
28] proposed a hybrid deep learning model for fake news and deception detection, by using features that included information regarding text and user behavior. They achieved up to 82.9% accuracy with experiments with a data set consisting of 992 tweets, 233,719 users, and 592,391 interactions.
Regarding deep learning, there are certain works achieving high performance. Sansonetti et al. [
19] created a novel data set, consisting of 568,315 tweets that reference news indexed on PolitiFact, 62,367 news (34,429 fake news, 29,938 real news) referenced by tweets, and 4022 user profiles (2013 who publish mostly fake news, 2008 who publish mostly real news). They used both network account and linguistic features, and conducted experiments for offline and online analysis with CNN, LSTM, dense layer, and baseline classifiers (SVM, kNN), achieving up to 92% accuracy. Kumar et al. [
16] compared different ensembles for binary classification on 1356 news from Twitter and 1056 real and fake news from PolitiFact. They created a data set per topic, and then they tokenized and encoded them. They used BeautifulSoup, Python, GloVe, and GPy. They conducted experiments with embeddings, CNN, and LSTM (ensemble and bidirectional). The CNN and bidirectional LSTM ensembled network with attention mechanism achieved the highest accuracy (88.78%). Alves et al. [
21] created a novel, binary class data set, consisting of 2996 articles written in Brazilian Portuguese, collected from May to September 2018. The data set was normalized and tokenized, and Keras and TensorFlow were used. The experiments were conducted with a bidirectional and a regular LSTM and a dense layer. The 3-layer deep bidirectional LSTM with trainable word embeddings achieved accuracy up to 80%. Victor [
3] used the PHEME data set and the LIAR data set, and conducted experiments with a deep two-path CNN and a bi-directional Recurrent Neural Network (RNN) for supervised and unsupervised learning, achieving up to 83% accuracy. Koirala [
10] created a novel data set of 4072 news articles from Webhose.io, regarding fake news about COVID-19. They used linguistic features and conducted experiments with baseline classifiers, LSTM and dense layer, achieving an accuracy value between 70% and 80%.
Pairwise learning schemata are very popular in machine learning. The training data consist of lists of items that are specifically ordered within each list. Koppel et al. [
29] presented a simple pairwise learning model for ranking. Experiments with the LETOR MSLR-WEB10K, MQ2007, and MQ2008 data sets were performed by using the Tensorflow library and its implementation of the Adam-Optimizer. Dong et al. [
7] used the PHEME data set for semi-supervised, binary classification with baseline classifiers, LSTM, and a deep two-path learning model containing 3 CNNs; both labeled and unlabeled data were used to train the model. Their performance was better than supervised learning models in the case where the distribution between the training and test data sets differed, and it proved to be more resistant to overfitting. Agrawal et al. [
14] used tweets containing multimedia content; the training set consisted of approximately 5000 real news and approximately 7000 fake news, and the test set consisted of approximately 1200 real news and approximately 2500 fake news. They fused a pairwise ranking approach and a classification system, using image-based features, Twitter user-based features, and tweet-based features. For the classification a deep neural network, logistic regression, and SVM were used, along with n-grams and doc2vec vectors. The ranking was derived from the calculation of the distance between the features (contextual comparison) of tweets of the same topic (by hashtag). The ranking system outputs were incorporated within the classification system. They achieved accuracy up to 89% for real news and 78% for fake news. Bahad et al. [
17] used 2 unstructured news data sets from the open machine learning repository (Kaggle) for binary classification. The experiments were conducted with LSTM, RNN, and CNN, using Python and TensorFlow. The highest accuracy, up to 98%, was achieved by the bi-directional LSTM-RNN. Abdullah et al. [
15] used tokenized news from 12 distinct categories, and the prediction of the category determines the fake from the real news (12 classes). The experiments were conducted on Kaggle’s cloud, with CNN, LSTM, and dense layer, achieving up to 97.5% accuracy. In a machine learning setting, Mouratidis et al. [
30] presented a general deep learning architecture for learning to classify parallel translations, using linguistic information, of 2 machine translation model outputs and 1 human (reference) translation. They showed that the learning schema achieves the best score when information from embeddings and simple features are used for small data sets. Augenstein et al. [
31] used a framework that combines information from embeddings in a multi-task learning experiment. They evaluated their approach on a variety of parallel classification tasks for sentiment analysis, and showed that, when the learning framework utilizes the ranker scores, the classification system outperforms a simple classification system.
More specifically, in this work, the learning schema is inspired by the architecture proposed for machine translation evaluation by Mouratidis et al. [30], and transferred to the domain of fake news detection, as described in Section 4. We define the input for this architecture based on the data set of [1] and according to the work of Augenstein et al. [31], who used news article headers and text as the two inputs for pairwise settings. However, this is the first time that tweets are categorized into headers and text based on their linguistic structure, as described in Section 3. The aim of this work was to identify the best practice setting for fake news detection. The proposed model exploits different input types (e.g., word embeddings, morphological and higher-level linguistic features) in various combinations. Input is fused into the model at various layers, with high flexibility, in order to achieve optimal classification accuracy. A detailed comparison of the proposed model with commonly used classification models according to related work is also presented.
3. Data
The data set used in our work is that of Nikiforos et al. [1]. It consists of 2363 tweets in English, regarding the Hong Kong protests of August 2019, and 23 features (described in Section 4 and Section 4.1). The fake news tweets (fake tweets) of the data set (272 in total) were collected from 936 Twitter accounts that originated from the People’s Republic of China and were suspended in August 2019 due to violations of Twitter’s manipulation policies; these accounts aimed to thwart the political convictions and notions of the Hong Kong protest movement. The real news tweets (real tweets) of the data set (2092 in total) were collected from 9 Twitter accounts of renowned news agencies: BBC Asia, BBC News (World), CCTV, China Daily, China Xinhua News, China.org.cn, Global Times, People’s Daily (China), and SHINE. The aim was to include and represent true and valid information in the data set. The real tweets were originally 2133, posted from August 2019 to December 2019. Preprocessing was considered necessary, in order to ensure that the tweets (a) contain text, (b) are written in English, and (c) are relevant to the Hong Kong political movement of August 2019. After preprocessing, 2092 real tweets remained.
The tweet text is used as input to the proposed neural network (described in Section 4). To this end, the tweets were divided into 4 distinct categories, depending on the class (real/fake) and the type of the tweet text (header/text). Therefore, the resulting categories are (a) real header, (b) real text, (c) fake header, and (d) fake text. As headers (real or fake), we consider the tweets that make a single-sentence statement (e.g., “Black terror: The real threat to freedom in Hong Kong”), in a form similar to newspaper headlines. Tweets that are longer than one sentence (e.g., “People with ulterior motives attempt to make waves in Hong Kong through the “color revolution”, inciting student groups and Hong Kong citizens who do not know the truth, besieging the police headquarters and intending to undermine Hong Kong’s stability”) are considered as text (real or fake). There are two tweet inputs for the pairwise setting per experiment, T1 and T2. For the first experiment, T1 is a real header and T2 can be a real text, a fake text, or a fake header. For the second experiment, T1 is a real text and T2 can be a real header, a fake text, or a fake header.
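The categorization and pairing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the header/text split uses a naive punctuation-based sentence count, every T1 is paired with every eligible T2, and each pair carries the class of T2. None of these choices is confirmed by the text as the exact procedure followed; the function names and the sample tweets are hypothetical.

```python
import re
from itertools import product

def tweet_type(text):
    """Return 'header' for a single-sentence tweet and 'text' otherwise.
    Assumption: sentences are split naively on '.', '!' or '?' followed
    by whitespace."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return "header" if len(sentences) <= 1 else "text"

def build_pairs(tweets, t1_type):
    """Pair every real tweet of type t1_type (T1) with every tweet of the
    other three categories (T2). Each pair is labelled with the class of T2.
    Assumption: all T1 x T2 combinations are used."""
    typed = [(text, label, tweet_type(text)) for text, label in tweets]
    is_t1 = lambda t: t[1] == "real" and t[2] == t1_type
    t1_pool = [t for t in typed if is_t1(t)]
    t2_pool = [t for t in typed if not is_t1(t)]
    return [(t1[0], t2[0], t2[1]) for t1, t2 in product(t1_pool, t2_pool)]

# Hypothetical placeholder tweets, not taken from the data set.
tweets = [
    ("Police clear protest site", "real"),
    ("Protesters gathered downtown. Traffic was halted.", "real"),
    ("They are lying to you", "fake"),
    ("It is all staged. Do not believe it.", "fake"),
]
experiment_1 = build_pairs(tweets, "header")  # T1 is a real header
experiment_2 = build_pairs(tweets, "text")    # T1 is a real text
```

In this sketch each experiment yields three pairs, one per remaining category. A full cross product grows quickly with the size of the data set, so a sampled subset of pairs may be preferable in practice.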
Table 1 presents more details about the corpora. Imbalance between the two classes was observed, the fake tweet class being the minority class. Consequently, we applied the SMOTE filter to the minority class. Using SMOTE over-sampling [32], the total number of tweets increased from 2363 to 3766.
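The idea behind SMOTE over-sampling can be sketched in a few lines: synthetic minority samples are created by interpolating between a minority sample and one of its nearest minority-class neighbours. The code below is a simplified, standard-library illustration and not the filter implementation applied in this work; the function name `smote_sketch`, the parameter `k`, and the toy feature vectors are assumptions.

```python
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic feature vectors from the minority class.
    Simplification: the base sample is drawn at random for every new point,
    rather than iterating over all minority samples in turn.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy two-dimensional minority class (hypothetical values).
fake_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_vectors = smote_sketch(fake_vectors, n_synthetic=4, k=2)
```

Because every synthetic vector lies on a segment between two existing minority vectors, the new points stay inside the region already covered by the minority class.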
7. Conclusions
Unlike previous works, our work presents the following novelty and contribution. While the problem of fake news detection has been tackled in the past in a number of ways, most reported approaches rely on a limited set of existing, widely accepted, and validated real/fake news data. The present work builds the pathway towards developing a new Twitter data set with real/fake news regarding a particular incident, namely the Hong Kong protests of the summer of 2019. The process of exploiting the fake tweets provided by Twitter itself, as well as the process of collecting and validating real tweet news pertaining to the particular event, are described in detail and generate a best practice setting for developing fake/real news data sets with significant derived findings.
Another novelty of the proposed work is the form of the input to the learning schema. More specifically, tweet vectors are used, in a pairwise setting. One of the vectors in every pair is real and the other may be real or fake. The correct classification of the latter relies on the similarity/diversity it presents when compared to the former. The high performance of fake news detection in the literature relies to a large extent on the exploitation of exclusively account-based features, or on the exploitation of exclusively linguistic features. Unlike related work, the present work places high emphasis on the use of multimodal input that varies from word embeddings derived automatically from unstructured text to string-based and morphological features (number of syllables, number of long sentences, etc.), and from higher-level linguistic features (like the Flesch-Kincaid level, the adverbs-adjectives rate, etc.) to network account-related features.
The proposed deep learning architecture is designed in an innovative way and is used for the first time for fake news detection. The deep learning network exploits all aforementioned input types in various combinations. Input is fused into the network at various layers, with high flexibility, in order to achieve optimal classification accuracy. The input tweet may constitute the news text or the news header (defined in detail in Section 4). Previous works have used news article headers and text as the two inputs for pairwise settings. However, this is the first time that tweets are categorized into headers and text based on their linguistic structure. This distinction in Twitter data for fake news detection is made for the first time herein, accompanied by an extensive experimental setup that aims to compare the classification performance depending on the input type.
Our work provides a detailed comparison of the proposed model with commonly used classification models according to related work. Additionally, experiments with these models are conducted on the same input, in order to directly assess and compare their performance with that of the proposed pairwise schema. Finally, an extensive review of the recent literature in fake news detection with machine learning is provided in the proposed work. Previous works with various types of data (news articles, tweets, etc.), different categories of features (network account, linguistic, etc.), and the most efficient network architectures and classification models are described thoroughly.
More specifically, the deep learning architecture by Mouratidis et al. [30] is used as a basis for fake news detection, whereas the input for this architecture is based on the data set of [1], and defined according to the work of Augenstein et al. [31], who compared news headers and text through their pairwise framework to detect fake news text.
Our main results show high overall accuracy of the proposed deep learning architecture in fake news detection. For both experiments, the performance is increased after SMOTE over-sampling. For Experiment 2, where T1 is real text, the performance is better than that of Experiment 1 prior to SMOTE. Consequently, the Experiment 2 setting is the most efficient for fake news detection, as it does not require SMOTE over-sampling to achieve better results. This also indicates that the correlation of the real text with text in general is greater than that of the real header. The correlation of the real header with the rest of the data is increased after SMOTE over-sampling, and thus for the framework of Experiment 1 the amount of data affects the performance.
Experiment 2 results also indicate that the real text (as T1) is highly correlated with data (T2, either real header, fake header, or fake text), compared to the respective correlation of the real header (as T1) with data (T2, either real text, fake header, or fake text) of Experiment 1. The latter correlation is slightly improved after SMOTE over-sampling, leading to the conclusion that the amount of data affects the performance of the Experiment 1 framework. Additional experiments with Naive Bayes, Random Forest, and SVM were also run, using the WEKA framework as backend [37], in order to compare our experimental results directly with earlier work. More specifically, we achieved up to 99% accuracy with Naive Bayes [1], 92.1% average F1 score with Random Forest [18], and up to 92% accuracy with CNN [19]. The proposed deep learning architecture outperforms the state-of-the-art classifiers, while achieving a high F1 score for both fake and real tweet detection. The Random Forest classifier successfully detected all of the real tweets and detected the fake tweets quite well. The Naive Bayes and SVM classifiers had difficulty distinguishing the real tweets from the fake ones. In conclusion, the proposed deep learning architecture, using 18 features and information (embeddings) from the tweet text, achieves the best accuracy results.
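Since the comparison above is reported in terms of accuracy and of the F1 score for each class, the following minimal sketch shows how these metrics are computed from true and predicted labels. The label values and the toy predictions are hypothetical and do not reproduce any result of this work.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, positive):
    """F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels for five tweets.
y_true = ["real", "real", "real", "fake", "fake"]
y_pred = ["real", "real", "fake", "fake", "real"]

acc = accuracy(y_true, y_pred)                  # 3 of 5 correct
f1_fake = f1_for_class(y_true, y_pred, "fake")  # precision 1/2, recall 1/2
f1_real = f1_for_class(y_true, y_pred, "real")  # precision 2/3, recall 2/3
```

Reporting the F1 score separately for the fake and the real class matters on imbalanced data, where a classifier can reach high accuracy while performing poorly on the minority class.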
In future work, we aim to test different model configurations (e.g., different kinds of neural network layers). Apart from the pairwise classification schema that is used in this paper, we will test other classification schemata for identifying fake content. In addition, the proposed model will be tested on a wider range of fake content detection problems, e.g., spam. Finally, it is worth exploring further data sets and other content formats (e.g., multimedia content, photos, videos) with the proposed model.