To assess the effectiveness of the proposed fake news detection framework, we conducted a comprehensive evaluation consisting of several key components. This section outlines the datasets used for analysis, presents a preliminary examination of the features, details the performance results of our approach, and explores the explainability of our detection system.
4.2. Preliminary Analysis
We conducted a preliminary analysis to gain insights into the most influential features for fake news detection. First, we performed a feature correlation analysis to identify and eliminate highly correlated features across all datasets, using a threshold of 0.9 for the correlation coefficient: if two features correlated at 0.9 or higher, one of them was removed to reduce redundancy. As a result, the following readability features were removed due to high correlation: RIX, characters_per_word, sentences, wordtypes, SMOGIndex, FleschReadingEase, syll_per_word, words, characters, and syllable.
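This pruning step can be reproduced in a few lines of pandas. The sketch below is a minimal illustration, not our exact experimental code: it assumes the extracted features are columns of a DataFrame, uses the 0.9 threshold stated above, and arbitrarily drops the later column of each correlated pair.

```python
import numpy as np
import pandas as pd

def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute Pearson
    correlation meets or exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop)
```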
Next, we calculated the normalized mean value of each feature for both fake and legitimate news articles. Comparing these means quantifies how differently each feature appears in fake versus legitimate news: for each feature, we computed the difference between its mean value in fake news and its mean value in legitimate news. A positive difference indicates a stronger association of that characteristic with fake content, while a negative difference indicates a stronger association with legitimate content.
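As an illustration of this computation, the following sketch derives the per-feature difference in normalized means. Min–max scaling and the label convention (1 = fake, 0 = legitimate) are our assumptions, since the text does not fix a normalization scheme.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def feature_influence(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Difference between the normalized mean of each feature in fake
    vs. legitimate articles. Positive values lean toward fake news,
    negative values toward legitimate news. Assumes y == 1 marks fake."""
    X_norm = pd.DataFrame(MinMaxScaler().fit_transform(X),
                          columns=X.columns, index=X.index)
    return X_norm[y == 1].mean() - X_norm[y == 0].mean()
```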
Figure 3 illustrates the features associated with a higher likelihood of fake news. The analysis reveals that fake news often exhibits high values in features such as type_token_ratio, which suggests that a diverse vocabulary may be employed to lend an appearance of sophistication or credibility. Additionally, words expressing certainty (liwc_certain), present-tense language (liwc_focuspresent), and adverbs (liwc_adverb) are prevalent in fake news, reflecting a persuasive and often dramatic tone. Features like liwc_function and liwc_pronoun indicate that fake news tends to use more function words and pronouns, potentially to create detailed and relatable narratives. The frequent use of auxiliary verbs (liwc_auxverb), subjective language (subjectivity), and expressions of disgust (emotion_disgust) further highlights the manipulative nature of fake news, which aims to evoke strong emotional responses and sway reader opinions.
Conversely, Figure 4 presents the features associated with a higher likelihood of legitimate news. Legitimate news articles are characterized by high values in features such as liwc_hear, reflecting a focus on information transmission through quoting sources and describing events. The use of complex words (readability_complex_words and readability_complex_words_dc), long words (readability_long_words), and nominalizations (readability_nominalization) indicates a sophisticated and formal language style typical of legitimate news. Furthermore, features like readability_preposition, readability_article, and readability_sentences_per_paragraph suggest that legitimate news employs clear, structured, and detailed writing. The use of subordinate clauses (readability_subordination) and work-related terms (liwc_work) also points to the thoroughness and factual nature of legitimate reporting, which focuses on providing comprehensive and precise information.
It is worth noting the significant variability in influence scores across datasets, as evidenced by the wide ranges in both figures. This variability underscores the complexity of fake news detection and the importance of considering multiple features and their interactions. The presence of outliers from specific datasets, such as ISOT and FakeNewsBuzfeedPolitical, also highlights the need to account for dataset-specific characteristics when targeting a particular domain or context.
Overall, the findings indicate that fake news is more likely to employ language that expresses certainty, focuses on the present, and uses a diverse vocabulary to appear sophisticated and credible. It also tends to include more adverbs, pronouns, and function words, contributing to a persuasive and often dramatic tone. In contrast, legitimate news is characterized by complex but readable language: the presence of complex words, nominalizations, and structured writing with clear sentence construction underscores the thoroughness and factual nature of legitimate reporting. These differences contrast the manipulative nature of fake news, which seeks to evoke strong emotional responses, with the detailed and precise information typical of legitimate news (RQ1).
4.3. Classification Performance
In this section, we describe the experiments conducted to evaluate the classification performance of the proposed model on the task of fake news detection. We evaluated a variety of traditional classifiers, including logistic regression, support vector machines (SVM), decision trees, and ensemble methods such as random forests, XGBoost, and CatBoost, using the default hyperparameter settings provided by their respective libraries for all classifiers.
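For concreteness, the classifier suite could be instantiated as below. This is a sketch rather than the exact experimental code; the specific class choices (e.g., SVC for the SVM) and the silenced CatBoost logging are our assumptions, while all hyperparameters are left at library defaults as described.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Default hyperparameters throughout, as in the experiments.
classifiers = {
    "LogisticRegression": LogisticRegression(),
    "SVM": SVC(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),  # verbose=0 only silences logging
}
```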
Regarding performance metrics, we prioritized those most commonly used in related work: accuracy, precision, recall, and F1 score. Accuracy measures the proportion of correctly classified instances (both true positives and true negatives) out of the total number of instances; it provides an overall assessment of the model's performance but can be less informative under class imbalance. Precision measures the proportion of true positive predictions (TP) among all positive predictions, calculated as Precision = TP / (TP + FP), where FP denotes false positives. Recall (or sensitivity) measures the proportion of true positive predictions among all actual positives, calculated as Recall = TP / (TP + FN), where FN denotes false negatives. The F1 score is defined as the harmonic mean of precision and recall, providing a single metric that balances both: F1 = 2 · (Precision · Recall) / (Precision + Recall).
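A small worked example makes these definitions concrete; the toy labels below are purely illustrative (1 = fake, 0 = legitimate) and yield TP = 3, FP = 1, FN = 1.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
# TP = 3, FP = 1, FN = 1 -> precision = 3/4, recall = 3/4, F1 = 0.75
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75
```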
We employed k-fold cross-validation to ensure robust evaluation and applied standard scaling to normalize the features. The experiments were implemented using the Python Scikit-learn library [62] and run on a Dell XPS 13 9310 (11th Gen Intel i7, 32 GB RAM), except for the experiments with transformer-based models, which were run on a Kaggle notebook with a P100 GPU (available at https://www.kaggle.com/docs/notebooks, accessed on 9 July 2024).
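A plausible sketch of this evaluation loop is shown below. Placing the scaler inside the pipeline ensures it is fitted only on each training fold, avoiding data leakage. The fold count is not recoverable from the text here, so cv=10 is our assumption; note that cross_validate also reports fit_time, which gives the training times discussed next.

```python
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate(classifiers, X, y, cv=10):
    """Cross-validate each classifier inside a scaling pipeline and
    report accuracy, weighted F1, and mean training (fit) time."""
    rows = []
    for name, clf in classifiers.items():
        pipe = make_pipeline(StandardScaler(), clf)
        scores = cross_validate(pipe, X, y, cv=cv,
                                scoring=["accuracy", "precision_weighted",
                                         "recall_weighted", "f1_weighted"])
        rows.append({
            "model": name,
            "accuracy": scores["test_accuracy"].mean(),
            "f1_weighted": scores["test_f1_weighted"].mean(),
            "fit_time_s": scores["fit_time"].mean(),
        })
    return pd.DataFrame(rows)
```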
In fake news detection, particularly when deployed under resource constraints, the model's efficiency can be as critical as its classification performance. This is especially relevant in scenarios where decisions must be made quickly and at scale. Therefore, in addition to evaluating the classification performance of different models, we considered training time a crucial factor in our evaluation [63]. As such, models that offer a good trade-off between accuracy and training time were prioritized for further analysis and discussion.
To further enhance the interpretability of our model, we conducted an analysis using SHAP (SHapley Additive exPlanations) values [64]. SHAP is a well-established method grounded in cooperative game theory. It quantifies the contribution of each feature to the model's predictions, offering a transparent view of how different features affect the outcomes. By applying SHAP, we can identify the most influential features and better understand the model's decision-making process. SHAP's advantage over traditional feature importance methods, such as feature permutation or information gain, lies in its fair and consistent feature attribution. Furthermore, SHAP is model-agnostic, making it more versatile than methods tied to specific model types.
For the first experiment, we trained the classifiers using the features extracted from the text (after removing the highly correlated ones). The average results across all datasets, grouped by classifier, are detailed in Table 2. In addition, Table 3 shows the results for each dataset using our most relevant methods and compares them with state-of-the-art solutions.
CatBoost, though the slowest with a training time of 41.683 s, delivered the highest weighted F1 score of 0.8122, demonstrating its strong classification performance. In contrast, DecisionTree, with its rapid fit time of 0.846 s, produced the lowest F1 score of 0.7231. XGBoost, with a training time of 1.112 s, achieved an F1 score of 0.7919, striking a commendable balance between performance and computational efficiency. Although CatBoost yielded the highest performance, its substantially higher training time limits its practical application in scenarios requiring rapid processing. In this regard, logistic regression showed the lowest training time. XGBoost, on the other hand, offers a solid compromise between classification performance and training time. The final choice between these classifiers would depend on the specific requirements of the application, such as the need for real-time processing or the availability of computational resources.
As we can observe, our approach delivers highly competitive performance while drastically reducing training time compared to transformer-based models. Although BERT slightly edges out our approach in F1 score, it comes with a significant computational cost, requiring around 878 s to train, whereas CatBoost and XGBoost take only 4.75% and 0.13% of that time, respectively. This demonstrates that while transformer-based models offer strong results, our approach yields competitive performance at a fraction of the computational cost (RQ2).
Another interesting aspect of the proposed solution is that it demonstrates strong performance across various datasets, particularly when using ensemble algorithms. It produced promising results on the ISOT, FakeNewsKaggle, and FakeNewsSatirical datasets. Generally, precision and recall are closely aligned, and most datasets show minimal differences between the two metrics. The lowest scores were observed on the FakeNewsAMT dataset, likely due to its generally shorter news articles, as indicated by the statistics in Table 1. Shorter articles provide less information for the model to exploit, which impacts its performance.
Although state-of-the-art methods generally outperform it, our approach achieves competitive performance while retaining clear advantages in computational efficiency and explainability. Furthermore, it demonstrates robust performance across various datasets, supporting its generalizability and practical applicability.
To further understand the impact of feature reduction on classification performance, we conducted a series of experiments by incrementally reducing the number of features used by the traditional classifiers. The selection of the feature set was based on the preliminary analysis described in Section 4.2, which identified the features with the most significant differences between fake and legitimate news. Specifically, we selected the 25 features whose average values were significantly higher in fake news and the 25 features whose average values were significantly higher in legitimate news. Starting with this set of 50 features, we incrementally reduced the number of features used in our models, evaluating performance at each step.
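The following sketch outlines one way to run such a sweep. It assumes ranked_features holds the 50 selected features ordered by the magnitude of their fake/legitimate gap and that X is a pandas DataFrame; the step size of 5, the 10-fold setting, and the use of XGBoost as the representative classifier are our assumptions, not details fixed by the text.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

def f1_vs_feature_count(X, y, ranked_features, step=5, cv=10):
    """Evaluate weighted F1 while growing the feature subset in
    increments, from the top `step` features up to the full list."""
    results = {}
    for k in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:k]          # top-k ranked features
        pipe = make_pipeline(StandardScaler(), XGBClassifier())
        scores = cross_val_score(pipe, X[subset], y, cv=cv,
                                 scoring="f1_weighted")
        results[k] = scores.mean()
    return results
```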
Figure 5 illustrates the relationship between the number of features and model performance, measured using the weighted F1 score. The results show that the average performance across datasets stabilizes between 20 and 30 features, with an average F1 score of around 0.75. However, the trend varies between datasets. For instance, performance on the ISOT and FakeNewsKaggle datasets continues to increase and only stabilizes after 40 features; the large size and diversity of these datasets may explain this behavior. In contrast, performance on FakeNewsBuzfeedPolitical and FakeNewsRandomPolitical peaks before reaching 20 features and tends to drop after 30 features.
In terms of computational efficiency, reducing the number of features has a substantial impact on training time. As shown in Table 4, all tested algorithms experienced substantial decreases in training time when using a reduced set of 30 features; on average, training times fell by more than 70%. It is also worth noting that reducing the number of features not only enhances efficiency but also makes the model more interpretable. Overall, these findings suggest that reducing the number of features to around 30 can significantly improve the models' efficiency and interpretability without seriously compromising performance (RQ3). Although some datasets, such as ISOT and FakeNewsKaggle, benefit from a larger feature set, most datasets achieve near-optimal performance with fewer features. Limiting the feature set reduces model complexity, leading to faster training times and a simpler model structure, which makes the relationships between the features and the predictions easier to understand and interpret.
Finally, we conducted a SHAP analysis to further enhance the interpretability of our model. In this analysis, we focused on the ten features with the highest positive influence and the ten with the highest negative influence on the model's predictions. Given its consistent balance between accuracy and speed, we used XGBoost as the classifier for this study. In addition, we aggregated the SHAP values across all datasets to obtain a broad overview.
Figure 6 shows the results of this experiment, with features ranked by their average impact on the magnitude of the model's output. The most influential feature appears to be liwc_hear, followed closely by liwc_pronoun. Several readability-related features, such as readability_article, readability_complex_words_dc, and readability_long_words, also show significant influence. Other characteristics, such as adverbs, focus on the present, and certainty markers, have moderate impacts. Finally, some affective features, such as subjectivity and emotion_disgust, appear lower on the list, suggesting they have a smaller but still noticeable effect on the model's output.
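A minimal sketch of how such an aggregation could be computed is given below. Interpreting "aggregated the SHAP values across all datasets" as averaging the mean absolute SHAP value per feature over per-dataset XGBoost models is our assumption; the paper does not specify the exact scheme.

```python
import numpy as np
import shap
from xgboost import XGBClassifier

def mean_abs_shap(datasets):
    """Fit one XGBoost model per dataset, compute SHAP values, and
    average the mean |SHAP| per feature across datasets.
    `datasets` maps a name to an (X, y) pair; all X are assumed to
    share the same feature columns in the same order."""
    per_dataset = []
    for X, y in datasets.values():
        model = XGBClassifier().fit(X, y)
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X)   # (n_samples, n_features)
        per_dataset.append(np.abs(shap_values).mean(axis=0))
    return np.mean(per_dataset, axis=0)          # one score per feature
```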