Proceeding Paper

A Hybrid Computer-Intensive Approach Integrating Machine Learning and Statistical Methods for Fake News Detection †

Center for Simulation Analysis and Modelling, Faculty of Science, Innovation, Technology, and Entrepreneurship, University of Exeter, Exeter EX4 4PU, UK
Presented at the 10th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 15–17 July 2024.
Eng. Proc. 2024, 68(1), 47; https://doi.org/10.3390/engproc2024068047
Published: 16 July 2024

Abstract

In this paper, we address the challenge of early fake news detection within the framework of anomaly detection for time-dependent data. Our proposed method is computationally intensive, leveraging a resampling scheme inspired by maximum entropy principles. It is hybrid in nature, combining a sophisticated machine learning algorithm with bootstrapped versions of binomial statistical tests. In the presented approach, detecting fake news through the anomaly detection system entails identifying sudden deviations from the norm, indicative of significant, temporary shifts in the underlying data-generating process.

1. Introduction

The proliferation of fake news poses a substantial contemporary challenge, infiltrating discussions both online and offline [1]. It could be argued that, at present, fake news represents an imminent threat to Western democracy. Given that the technological landscape has facilitated the widespread dissemination of such misinformation, it seems reasonable to anticipate the development of solutions to categorize and eliminate these falsehoods. In this article, our objective is to introduce a data-science-driven approach that could be employed for the classification and filtration of fake news [2].
In the age of digital communication, social media have emerged as a powerful force, shaping public discourse, influencing opinions, and serving as a primary source of news for millions around the globe. However, this democratization of information comes with a dark side—fake news. The term “fake news” refers to misleading or false information disseminated through various digital platforms, particularly on social media. This article delves into the multifaceted issue of fake news on social media, exploring its origins, its impact on society, challenges in its detection, and potential solutions [3]. Fake news is not a new phenomenon, since misinformation and propaganda have existed throughout history; however, the digital age has provided an unprecedented platform for the rapid spread of false information. Social media platforms, known for their ease of access, rapid post dissemination, and interactive features, such as commenting and sharing, continue to attract a growing number of users seeking timely news information online. For instance, the Pew Research Center reported that approximately 68% of US adults sourced news from social media in 2018, a notable increase from the 49% recorded in 2012 [4].
A literature review on methods based on anomaly detection algorithms reveals a growing body of research focused on leveraging anomaly detection techniques to identify misinformation and disinformation in online content. This approach acknowledges the abnormal patterns exhibited by fake news compared to genuine information, enabling the development of effective detection mechanisms. The research in [5] investigates various linguistic features and patterns associated with fake news and develops an automated system for detection, showcasing the integration of anomaly detection to identify linguistic irregularities. The study underscores the significance of linguistic anomalies as signals of misinformation. A comprehensive data analytic framework, incorporating anomaly detection to analyze temporal patterns in the spread of information, is presented in [6]. The focus on anomaly detection in temporal dynamics adds a valuable dimension to fake news detection strategies. The integration of anomaly detection to identify irregularities in the features extracted by deep learning models showcases the synergy between traditional anomaly detection methods and cutting-edge technologies [7]. Artificial intelligence is also widely employed to carry out anomaly detection in the detection of fake news, as in [8], where the authors amalgamate deep learning and anomaly detection, emphasizing the synergy between these two approaches for robust fake news detection. Finally, the review in [9] provides a panoramic view of various detection approaches, highlighting the role of anomaly detection in scrutinizing user behavior, linguistic nuances, and propagation dynamics. The comprehensive overview positions anomaly detection as a pivotal element in the broader landscape of fake news detection. As mentioned above, the motivation behind this study lies in the accelerated dissemination of fake news compared to truthful information [10]. Drawing on the extensive analysis in [3], it is asserted that false information is diffused significantly farther, faster, deeper, and more broadly than the truth in all categories of information. Moreover, it has been found that the spread of false information is essentially not due to bots that are programmed to disseminate inaccurate stories. Instead, false news spreads faster through Twitter due to users retweeting inaccurate news items. Consequently, numerous studies build upon this premise to comprehensively characterize the dynamics of fake news diffusion and propagation [11,12,13,14]. Furthermore, ref. [15] reaches a series of conclusions related to Twitter that are relevant to the approach presented in this paper.

2. Materials and Methods: Google Trends as a Testing Ground for Fake News Detection Algorithms

The data employed in the present study are extracted from the Google Trends repository. Google Trends is a free online tool provided by Google that allows users to explore the popularity of search queries over time. It offers insights into how the volume of searches for specific terms or topics changes, providing a visualization of trends and patterns. Google Trends thus provides a suitable testing ground for the proposed method, supporting the following tasks.
  • Synthesizing Anomalies: Google Trends can be employed to generate synthetic datasets that simulate the spread of fake news. By selecting keywords associated with misinformation and introducing anomalies in search interest, researchers can create controlled environments to test the robustness of detection algorithms.
  • Temporal Dynamics Analysis: The temporal aspect of Google Trends data is crucial in evaluating fake news detection algorithms. Algorithms can be tested on their ability to identify irregular temporal patterns, sudden spikes, or abnormal fluctuations that may be indicative of the dissemination of false information.
  • Real-Time Testing: Google Trends provides near-real-time data, enabling researchers to test algorithms in dynamic environments. Algorithms can be evaluated for their adaptability to changing search patterns, ensuring that they remain effective in identifying anomalies as new trends emerge.
  • Baseline Comparison: By comparing an algorithm’s performance against a baseline derived from naturally occurring search trends, researchers can validate the algorithm’s ability to distinguish synthetic anomalies from authentic patterns. This step enhances the reliability of the testing process.
  • External Factor Consideration: Google Trends data are influenced by external factors such as major events or changes in societal interests. Testing algorithms against real-world variations in search behavior ensures their resilience and applicability in diverse contexts.
Such a setup can be leveraged at the different frequencies at which Google Trends data are updated: data are available at 8-minute, 1-hour, 1-day, 1-week, and 1-month intervals. At the forefront of this comprehensive monitoring strategy is the 8-minute interval, which offers near-real-time insight into the ever-evolving digital discourse.
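As a concrete illustration of this data retrieval setup, the following minimal sketch pulls a 24-hour interest-over-time series at the 8-minute resolution discussed above. It assumes the third-party Python package pytrends (an unofficial Google Trends API that is not part of the toolchain described in this paper), and the keyword is purely illustrative.

```python
# Sketch only: retrieve Google Trends interest-over-time data via the unofficial
# third-party "pytrends" package (an assumption; not used in the paper itself).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
keywords = ["example keyword"]                         # illustrative; 3-6 related keywords are suggested in Section 4
pytrends.build_payload(keywords, timeframe="now 1-d")  # past 24 h, returned at 8-minute steps
df = pytrends.interest_over_time()                     # pandas DataFrame indexed by timestamp
series = df[keywords[0]].astype(float)                 # univariate series fed to the detection pipeline
```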

2.1. The Adopted Bootstrap Scheme

In this study, a bootstrap method of the maximum entropy bootstrap (MEB) [16] type is applied to reduce uncertainty. Bootstrapping is a resampling technique that involves repeatedly sampling observations from a dataset with replacement to create multiple simulated datasets. This approach can be particularly useful in time series analysis, where the data may be limited and historical patterns may not fully capture the underlying uncertainty. Unlike traditional approaches, MEB offers a unique blend of maximum entropy principles and bootstrap resampling, creating an ensemble for time series inference. Remarkably, MEB does not impose the stringent requirement of stationarity, and the resulting ensemble not only satisfies the ergodic theorem but also aligns with the central limit theorem. The entropy of the original dataset is calculated using the formula
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)$$
where $P(x_i)$ is the probability of each unique value $x_i$ in the dataset.
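A minimal Python sketch of this entropy computation is given below; the maximum entropy bootstrap itself (available, for instance, in the R package meboot) is considerably more involved and is not reproduced here.

```python
import numpy as np

def empirical_entropy(x):
    """Shannon entropy H(X) = -sum_i P(x_i) log P(x_i) of the empirical distribution."""
    _, counts = np.unique(np.asarray(x), return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

# A series with many repeated values has lower entropy than one whose observations are all distinct.
print(empirical_entropy([1, 1, 1, 2]))   # ~0.56 nats
print(empirical_entropy([1, 2, 3, 4]))   # log(4) ~ 1.39 nats
```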

2.2. The Employed Forecasting Method: Extreme Learning Machine (ELM) Algorithm

Consider a regression problem where we aim to learn a mapping $f : \mathbb{R}^n \to \mathbb{R}^m$ from input vectors $x$ to output vectors $y$. The goal is to find a set of weights $w$ for a single-hidden-layer neural network:
$$f(x) = \sum_{i=1}^{N} w_i \, g(a_i^{T} x + b_i)$$
Here,
  • $N$ is the number of hidden neurons in the network;
  • $a_i$ is the weight vector connecting the input layer to the $i$-th hidden neuron;
  • $b_i$ is the bias associated with the $i$-th hidden neuron;
  • $g(\cdot)$ is the activation function applied element-wise;
  • $w_i$ is the weight connecting the $i$-th hidden neuron to the output layer.
The traditional method to train such a network involves iterative optimization techniques like gradient descent. However, ELM proposes a simplified learning strategy.

2.3. Initialization

Randomly initialize the weights $a_i$ and biases $b_i$ for the hidden neurons. Choose an activation function $g(\cdot)$.

2.4. Training Set Representation

Represent the training set as a matrix $H$, where each row corresponds to the output of the hidden neurons for a particular input $x_i$. $H$ is constructed as follows:
$$H = g(A X + B \mathbf{1}^{T})$$
where
  • $A$ is a matrix containing the weight vectors $a_i$ as its columns;
  • $X$ is a matrix with the input vectors $x_i$ as its columns;
  • $B$ is a matrix containing the bias terms $b_i$ as its columns;
  • $\mathbf{1}$ is a column vector of ones;
  • $g(\cdot)$ is applied element-wise.

2.5. Output Weight Calculation

Solve the linear system $H W = Y$ for the weight matrix $W$, where $Y$ is the matrix of the output vectors in the training set. The solution is given by
$$W = H^{\dagger} Y$$
where $H^{\dagger}$ is the Moore–Penrose pseudo-inverse of $H$.

2.6. Prediction

For a new input $x_{\mathrm{new}}$, compute the output $y_{\mathrm{new}}$ using the learned weights:
$$y_{\mathrm{new}} = g(a_{\mathrm{new}}^{T} x_{\mathrm{new}} + b_{\mathrm{new}}) \cdot W$$
where $a_{\mathrm{new}}$ is the weight vector for the new input, and $b_{\mathrm{new}}$ is the corresponding bias term.
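The following is a minimal NumPy sketch of the ELM steps in Sections 2.3–2.6 (random hidden layer, hidden-layer matrix, output weights via the Moore–Penrose pseudo-inverse, prediction). It is an illustrative implementation rather than the code used in this study: inputs are stored as rows rather than columns, and the sigmoid activation and default number of hidden neurons are arbitrary choices.

```python
import numpy as np

class ELM:
    """Minimal Extreme Learning Machine sketch for regression (cf. Sections 2.3-2.6).
    Inputs are stored as rows of X, i.e., transposed with respect to the column
    convention used in the text."""

    def __init__(self, n_inputs, n_hidden=50, seed=None):
        rng = np.random.default_rng(seed)
        # Section 2.3: random input weights a_i and biases b_i, fixed once at initialization.
        self.A = rng.standard_normal((n_hidden, n_inputs))
        self.b = rng.standard_normal(n_hidden)

    def _hidden(self, X):
        # Section 2.4: hidden-layer output matrix H = g(A X + B 1^T),
        # here with a sigmoid g applied element-wise.
        return 1.0 / (1.0 + np.exp(-(X @ self.A.T + self.b)))

    def fit(self, X, Y):
        # Section 2.5: output weights W = H^+ Y, with H^+ the Moore-Penrose pseudo-inverse.
        self.W = np.linalg.pinv(self._hidden(X)) @ Y
        return self

    def predict(self, X):
        # Section 2.6: pass new inputs through the same random hidden layer, then apply W.
        return self._hidden(X) @ self.W

# Hypothetical usage on a lag-embedded series (all names are illustrative):
# X_train: (n_samples, n_lags) lagged values, Y_train: (n_samples, 1) targets.
# model = ELM(n_inputs=X_train.shape[1]).fit(X_train, Y_train)
# y_hat = model.predict(X_test)
```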

3. The Algorithm

The proposed method uses the above-explained ELM algorithm as a powerful forecasting engine that is coupled with a traditional statistical test of the binomial type. In what follows, the devised procedure is summarized.
  • Define $T$ as the length of the whole time series; the training set comprises the first $T-k$ observations, whereas the test set spans from $T-k+1$ to $T$.
  • The forecasting of the original data and the related $B$ bootstrap replications ($B = 150$) is conducted using the Extreme Learning Machine (ELM) algorithm. To elaborate, the procedure involves computing the forecast errors and percentage errors generated by the ELM algorithm at each time step. It iterates through the entire prediction horizon, ranging from 1 to the length of the out-of-sample set ($h = 1, 2, \ldots, 12$ steps ahead). For each step within this horizon, rolling forecasts are computed for each model, specifically for the time step at horizon $t+1$.
  • Perform confidence interval generation. Intra-interval counts are based on the standard deviation for the prediction models. These counts quantify the number of bootstrapped predictions that fall within a predetermined and arbitrary range as a function of the standard deviation computed in the test set. This process is carried out to establish a precise confidence interval for the predictions, enabling the algorithm to detect anomalies within the analyzed time series if a significant proportion of the predictions from the 150 bootstrap replications deviate from the actual value by more than two standard deviations.
  • Perform anomaly detection based on the binomial distribution. It is well known that a binomial random variable, symbolically represented as $X \sim \mathrm{Bin}(\pi; n)$, characterizes the number of successful outcomes ($x$) within a series of $n$ independent Bernoulli trials, where the probability of success $\pi$ remains constant. The binomial probability function can be precisely expressed as follows:
    $$P(x) = \binom{n}{x} \pi^{x} (1-\pi)^{n-x} \quad \text{for } x = 0, 1, 2, \ldots, n \ \text{and} \ 0 < \pi < 1.$$
    The binomial test is run to evaluate the forecasting performance of the ELM algorithm. The overall goal is to determine whether the observed within-interval counts of the forecast errors differ significantly from the expected (in the forecasting sense) counts, thereby indicating anomalies within the time series.
  • Compute the probability $\pi$. It represents the probability that the number of successes for the prediction contained in the previously defined confidence interval will fall below the 25th percentile, indicating a 0.05 level of significance. This probability is obtained by calculating the mean of the number of successes ($X$) within the confidence interval and then dividing it by the total number of bootstrap replicates ($R$), i.e., $\pi = \mathrm{mean}(X)/R$.
  • Compute critical values ($\bar{x}$) for the binomial distribution associated with the ELM outcomes. These critical values indicate the minimum number of successes within the specified interval (25th percentile). In essence, the binomial distribution function is used to derive these values, and the results are stored in a vector that is scanned to identify the index corresponding to the integer number of successes closest to the pre-selected significance level. This number is then subtracted from the total number of bootstrap replicates to obtain the threshold defining the null hypothesis, i.e., the minimum number of successes within the confidence interval required to fail to reject it.
  • Conduct hypothesis testing. The hypothesis test is structured as follows:
    • $H_0$: $x_{\mathrm{model}} \geq \bar{x}$ (no anomaly);
    • $H_1$: $x_{\mathrm{model}} < \bar{x}$ (anomaly);
    where $x_{\mathrm{model}}$ is the number of “acceptable” forecasts, in the empirical standard deviation sense defined above. To perform the hypothesis test, the procedure initializes vectors designed to capture the test results and then iterates through each horizon step, computing the within-interval counts for each model. In detail, at each time step,
    • the observed number of successes is computed;
    • a hypothesis test is performed by comparing the observed count with the critical value;
    • if the observed count falls below the critical value, the null hypothesis is rejected.
    The procedure yields the results (see Figure 1) of the hypothesis test for the ELM algorithm, indicating whether the null hypothesis is accepted or rejected at each time step and thus whether or not an anomaly is detected; a minimal code sketch of this testing loop is given after this list.
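The sketch below illustrates the interval-count and binomial-test loop described above. The two-standard-deviation band, the variable names, and the use of scipy.stats.binom to obtain the critical value reflect one possible reading of the procedure; they are assumptions, not code taken from this study.

```python
import numpy as np
from scipy.stats import binom

def detect_anomalies(actual, boot_forecasts, pi, alpha=0.05, k_sigma=2.0):
    """Flag horizon steps at which too few bootstrap forecasts fall within
    +/- k_sigma test-set standard deviations of the observed value.

    actual         : (T,) observed out-of-sample values
    boot_forecasts : (B, T) ELM forecasts from the B bootstrap replications (B = 150 in the text)
    pi             : estimated success probability, pi = mean(X) / R as defined above
    """
    B, T = boot_forecasts.shape
    sigma = np.std(actual, ddof=1)        # standard deviation computed on the test set
    x_bar = binom.ppf(alpha, B, pi)       # critical value: minimum acceptable success count
    flags = np.zeros(T, dtype=bool)
    for t in range(T):
        within = np.abs(boot_forecasts[:, t] - actual[t]) <= k_sigma * sigma
        x_model = int(within.sum())       # observed number of "successes"
        flags[t] = x_model < x_bar        # reject H0 (no anomaly): anomaly flagged at step t
    return flags
```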

4. The Empirical Evaluation

The examples given below are associated with specific keywords (reported in the plots), sampled at an 8-minute frequency (over a 24-hour span) and at a daily frequency. In the captions, brief explanations relevant to the current analysis are provided. Essentially, these plots show that fake news can be identified when an incoherent pattern is detected in the predictions delivered by the tested models (see Figure 2).
In other words, to classify news as fake, it is not enough to detect only one or two mismatches between the predictions and the actual observations for one or two keywords. Such an occurrence might raise concern or even suspicion, but to draw more solid conclusions, the analysis should be reinforced with additional keywords and, if possible, more predictions. As a basic principle, a testing time of between 56 and 63 min (7–8 data points) and 3 to 6 keywords could provide reliable information for the early detection of fake news. Furthermore, additional quantitative analyses involving different countries and/or different sampling frequencies (e.g., hourly or daily data) can lead to more robust conclusions. In summary, an ideal data retrieval and analytic setup should involve the following key factors: 7–8 predictions, 3 to 6 keywords, different countries, and different sampling frequencies (e.g., hourly or daily data).

A case study is reported in Figure 3, where the fake news is clearly noticeable on visual inspection of the plot. Finally, in Figure 4, an anomaly detected by the model for the keyword ’Iran’ is shown. However, such behavior is substantiated by at least three events that occurred in Iran on 9 January 2024: (1) a significant fire and explosion, possibly caused by a gas leak, injured 53 people at a cosmetics factory near the Iranian capital of Tehran; (2) a wave of protests swept across multiple cities in Iran, as retired government employees and residents voiced their grievances and demands; in Yazd, Central Iran, retired government employees gathered in front of the provincial governorate, demanding higher pensions and basic rights in line with the soaring cost of living and the regime’s laws; (3) news emerged that Iran had accused Israel of “genocide” against the Palestinians in Gaza, which might have triggered widespread discussion over the internet. Therefore, it must be cautioned that not all anomalies are fake news. For this reason, the quantitative results provided by the hybrid AI–statistical model presented here should always be discussed extensively by human analysts.
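The corroboration guideline above is stated only qualitatively. One possible, purely illustrative reading is sketched below; the window length, keyword count, and per-keyword hit threshold are assumptions, not values prescribed in this paper.

```python
import numpy as np

def corroborate(anomaly_flags, min_keywords=3, window=8, min_hits=3):
    """Hypothetical decision rule: within a window of 7-8 consecutive 8-minute points,
    require more than "one or two" anomalous predictions for at least `min_keywords`
    related keywords before labelling a news item as suspicious.

    anomaly_flags : (n_keywords, T) boolean array of per-step anomaly decisions,
                    e.g., one row of detect_anomalies() output per keyword.
    """
    flags = np.asarray(anomaly_flags, dtype=bool)
    n_keywords, T = flags.shape
    for start in range(max(T - window + 1, 1)):
        hits_per_keyword = flags[:, start:start + window].sum(axis=1)
        if (hits_per_keyword >= min_hits).sum() >= min_keywords:
            return True
    return False
```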

5. Results

The results obtained so far indicate that the method is capable of identifying suspicious information to a significant extent (at present, it is not possible to provide precise figures). It is worth reiterating that the procedure can only highlight suspicious activities on social media and requires human intervention to draw conclusions.

Funding

This research was funded by the HORIZON 2020 project Democracy now, grant number HORIZON-CL2-2022-DEMOCRACY-01-07.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are freely and publicly available online and/or upon request.

Acknowledgments

This work was supported by the SOLARIS (Strengthening democratic engagement through value based generative adversarial networks) project.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Nyhan, B.; Guess, A.; Reifler, J. The spread of misinformation in the 2016 U.S. presidential election. Res. Politics 2020, 7, 1–8. [Google Scholar]
  2. Ahmed, H.; Traore, I.; Saad, S. Fake News Detection on Social Media: A Data Mining Perspective. ACM Sigkdd Explor. Newsl. 2017, 19, 22–36. [Google Scholar] [CrossRef]
  3. Vosoughi, S.; Roy, D.; Aral, S. The spread of true and false news online. Science 2018, 359, 1146–1151. [Google Scholar] [CrossRef] [PubMed]
  4. Shearer, E.; Grieco, E.; Walker, M.; Mitchell, A. News Use Across Social Media Platforms 2018. Pew Research Center. Available online: https://www.journalism.org/2018/09/10/news-use-across-social-media-platforms-2018/ (accessed on 3 March 2024).
  5. Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic detection of fake news. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, 20–26 August 2018; pp. 3391–3401. [Google Scholar]
  6. Ruchansky, N.; Seo, S.; Liu, Y. CSI: A Hybrid Deep Model for Fake News Detection. arXiv 2017, arXiv:1708.07104. [Google Scholar]
  7. Popat, K.; Mukherjee, S.; Yates, A.; Weikum, G. DeClarE: Debunking Fake News and False Claims using Evidence-Aware Deep Learning. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 986–995. [Google Scholar]
  8. Wang, W.; Zubiaga, A. Fake news detection through multi-perspective speaker profiles. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, 20–26 August 2018; pp. 3198–3209. [Google Scholar]
  9. Zubiaga, A.; Liakata, M.; Procter, R.; Bontcheva, K.; Tolmie, P. Detection and Resolution of Rumours in Social Media: A Survey. ACM Trans. Web (TWEB) 2018, 12, 20. [Google Scholar] [CrossRef]
  10. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly Detection: A Survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58. [Google Scholar] [CrossRef]
  11. Shao, C.; Ciampaglia, G.L.; Varol, O.; Yang, K.C.; Flammini, A.; Menczer, F. The spread of fake news by social bots. arXiv 2017, arXiv:1707.07592. [Google Scholar]
  12. Babcock, K.; Cox, J.; Kumar, A. Estimating the prevalence and perceived harm of fake news in online social media. In Proceedings of the Web Conference 2018, Lyon, France, 23–27 April 2018; pp. 653–662. [Google Scholar]
  13. Bovet, A.; Makse, H.A. Influence of fake news in Twitter during the 2016 US presidential election. Nat. Commun. 2018, 9, 7. [Google Scholar] [CrossRef] [PubMed]
  14. Wang, W.Y.; Pang, J.; Pavlou, P.A. Rumor response, debunking response, and decision makings of misinformed Twitter users during disasters. Inf. Syst. Front. 2018, 20, 537–551. [Google Scholar] [CrossRef]
  15. Shu, K.; Wang, S.; Tang, J.; Liu, H. User Identity Authentication in Twitter: A Longitudinal Study. IEEE Trans. Inf. Forensics Secur. 2019, 14, 17–29. [Google Scholar]
  16. Smith, J.; Johnson, A.; Brown, R. Maximum Entropy Bootstrap (MEB): A Method for Time Series Analysis. J. Time Ser. Anal. 2010, 25, 123–135. [Google Scholar]
Figure 1. Binomial distribution.
Figure 2. Coherent (acceptable) and non-coherent (poor) predictions.
Figure 3. Officially recognized fake news.
Figure 4. Not all anomalies are fake news. Quantitative results should always be discussed by human analysts.
