Stylometric Fake News Detection Based on Natural Language Processing Using Named Entity Recognition: In-Domain and Cross-Domain Analysis

Tsai, Chih-Ming

doi:10.3390/electronics12173676


by

Chih-Ming Tsai

Department of Industrial Engineering and Management, National Chin-Yi University of Technology, No.57, Sec. 2, Zhongshan Rd., Taiping District, Taichung 411030, Taiwan

Electronics 2023, 12(17), 3676; https://doi.org/10.3390/electronics12173676

Submission received: 25 July 2023 / Revised: 20 August 2023 / Accepted: 28 August 2023 / Published: 31 August 2023

(This article belongs to the Special Issue Data Push and Data Mining in the Age of Artificial Intelligence)


Abstract

Nowadays, the dissemination of news information has become more rapid, liberal, and open to the public. People can find what they want to know more and more easily from a variety of sources, including traditional news outlets and new social media platforms. However, at a time when our lives are glutted with all kinds of news, we cannot help but doubt the veracity and legitimacy of these news sources; meanwhile, we also need to guard against the possible impact of various forms of fake news. To combat the spread of misinformation, more and more researchers have turned to natural language processing (NLP) approaches for effective fake news detection. However, in the face of increasingly serious fake news events, existing detection methods still need to be continuously improved. This study proposes a modified proof-of-concept model named NER-SA, which integrates NLP and named entity recognition (NER) to conduct in-domain and cross-domain analysis of fake news detection on three existing datasets. The named entities associated with any particular news event exist in a finite and available evidence pool. Therefore, the entities in any authentic news article must be mentioned and recognized in this entity bank. A piece of fake news inevitably includes only some of the entities in the entity bank. The false information is deliberately fabricated with fictitious, imaginary, and even unreasonable sentences and content. As a result, there must be differences in statements, writing logic, and style between legitimate news and fake news, meaning that it is possible to successfully detect fake news. We developed a mathematical model and used the simulated annealing algorithm to find the optimal legitimate area. Comparing the detection performance of the NER-SA model with current state-of-the-art models proposed in other studies, we found that the NER-SA model indeed has superior performance in detecting fake news. 
For in-domain analysis, the accuracy increased by an average of 8.94% on the LIAR dataset and 19.36% on the fake or real news dataset, while the F1-score increased by an average of 24.04% on the LIAR dataset and 19.36% on the fake or real news dataset. In cross-domain analysis, the accuracy and F1-score for the NER-SA model increased by an average of 28.51% and 24.54%, respectively, across six domains in the FakeNews AMT dataset. The findings and implications of this study are further discussed with regard to their significance for improving accuracy, understanding context, and addressing adversarial attacks. The development of stylometric detection based on NLP approaches using NER techniques can improve the effectiveness and applicability of fake news detection.

Keywords:

fake news; stylometric detection; natural language processing (NLP); named entity recognition (NER)

1. Introduction

Online news media have recently become the most important news source for many people, with 53% of Americans getting news via social media “often” or “sometimes” [1] and 72% of Internet users in the European Union reading news online [2]. People around the world are increasingly accessing their news through social media rather than websites. Although the majority of people use social networks to stay informed about news and current affairs, they are less likely to trust news on social media than news in general [3]. Evidently, concerns about fake news or media-framing propaganda on social media have not stopped billions of people from receiving their news from their favorite social media in their daily lives. Furthermore, the lack of access to fact-checked, trusted information creates a conducive environment for fake news and misinformation [4]. Undoubtedly, a considerable amount of fake news and misinformation has been disseminated through illegal online news sources. This makes the distinction between legitimate and fake news an urgent topic [5].

There are three main concepts, including knowledge-based fact-checking, context-based identification, and stylometric detection, used for fake news detection [6]. Knowledge-based fact-checking typically involves cross-checking with numerous trusted official sources to identify any incongruities in the logic or description of the article [5]. Context-based identification involves analyzing the content of the news or information, as well as the social context in which it is being shared, to determine whether it is likely to be true or false [6]. Stylometric detection, which is based on the Undeutsch Hypothesis of Potthast et al. [6], aims to detect fraudulent writing and its manifestations to identify the inherent style of logic and patterns found in fake news [5]. The Undeutsch Hypothesis considers that real statements and claims are more unique and specific than false statements. It is reasonable to expect that legitimate articles on the same subject will possess certain similarities to one another and will definitely be different from fake articles in terms of their content and structure. These similarities may include the use of specific vocabulary, the overall style of the writing, and the presentation of evidence that has only a limited number of existing facts for any real event. The difficulties of detecting fake news are as follows: (1) fake news is often mixed with a dash of true events or some fragments of credible sources to deceive the public; and (2) the rate and volume of fake news outpace the possibility of verifying its authenticity and reliability by experts [7]. Since the above situations make knowledge-based fact-checking and context-based identification more and more difficult to implement, the use of stylometric detection is a viable approach for fake news detection [7,8].

Recently, fake news detection using natural language processing (NLP) approaches has played a crucial role in ensuring the reliability of information sources by combating the spread of misinformation. The role of the NLP approach in analyzing linguistic features, understanding context, handling large-scale data, and adapting to diverse datasets has contributed to the development of fake news detection [9]. Torabi Asr and Taboada [10] argued that the NLP approach can identify suspicious language cues and manipulative language patterns or biases employed in fake news content. These features provide valuable insights into the language patterns used in fake news [11]. In addition, the NLP approach can better assess the credibility and veracity of news articles since it allows for a deeper understanding of the context surrounding them [10]. The relationships between entities, events, and topics, enabling the detection of inconsistencies, contradictions, or misleading claims within a broader context, can be effectively captured [12]. Shu et al. [13] considered that the NLP approach can simultaneously examine both textual content and accompanying media, such as text, images, and videos, to identify the manipulated or misleading information presented in fake news articles. Moreover, NLP approaches can also successfully deal with the heterogeneity of various terms coming from different domains [14,15]. In summary, the NLP approach, which can conduct text classification, linguistic feature extraction, sentiment analysis, contextual emotional tone identification, cross-domain/cross-text analysis, and inconsistency analysis between different modalities, has been regarded as an effective technique for fake news detection [11,13,16,17,18,19,20].

In order to consider temporal dependencies, cultural references, and the broader context when detecting fake news, Tsai and Xu [5] presented an exploratory model that integrates both knowledge-based fact-checking and stylometric detection to differentiate between legitimate news and fake news. It is feasible to classify the descriptions derived from real-life experiences or events since the articles that report on the same topic have a high degree of fundamental similarity in content and writing style [5]. However, legitimate articles may differ from one another in some ways, such as the specific evidence presented or the perspective on a particular subject. Since the NLP approach can identify the abundance and diversity of named entities, including people, organizations, locations, dates, times, and other specific entities that are mentioned in real articles on the same subject, the fraudulent patterns or anomalies found in fake news or disinformation can be successfully recognized. To implement such a stylometric fake news detection mechanism, a modified proof-of-concept model named NER-SA, based on the NLP approach and named entity recognition (NER), is designed to distinguish between legitimate and false news articles. Differing from the study of Tsai and Xu [5], this study develops a mathematical model and involves a penalty function to achieve a balance between maximizing the number of legitimate news articles and minimizing the number of fake news articles for better detection performance. Furthermore, we also employ three existing datasets to carry out in-domain and cross-domain fake news detection. Two datasets (LIAR and fake or real news) are adopted to implement in-domain analysis, and the FakeNews AMT dataset is adopted to implement cross-domain analysis. 
We use Support Vector Machine (SVM, lexical feature), Naive Bayes (bigram feature), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) in in-domain analysis [21], and model 1 with readability features and model 2 with all features in cross-domain analysis [22], respectively, to compare the fake news detection performance. Since the stylometric detection proposed in the NER-SA model can highlight the advantages of the NLP approach combined with NER, our study offers three innovative contributions:

(1) The novelty of the integration of NLP and NER can emphasize the relationship between entities within a text, where NER identifies named entities and NLP captures the relationships between these entities. This approach can better grasp the thematic context of the text through the enhancement of the discriminatory model.
(2) A mathematical model is developed to make the above concept enforceable. We can find the optimal solution to represent a legitimate visual area, which forms the fundamental basis for stylometric detection.
(3) In comparison to machine learning or deep learning methods shown in other studies [21,22], this fusion of NLP and NER in stylometric detection can lead to better detection performance.

Finally, we also discussed the experimental results and practical implications to establish the conclusions and limitations of this study.

2. Related Works

Recently, numerous studies have made significant contributions to fake news detection by leveraging NLP techniques. Wang et al. [23] stated that using NLP to detect fake news involves identifying several features that are extracted from news text, images, and social contexts. For textual-based features, linguistic-based features were most commonly extracted from the text content [24]. Pérez-Rosas et al. [22] used various linguistic-based features, including n-grams, linguistic inquiry, word count, grammar, and readability to train a linear SVM classifier and found that readability, consisting of vocabulary, grammar, and semantic information, can be a powerful feature to distinguish real news from fake news. The use of term frequency (TF), term frequency-inverse document frequency (TF-IDF), and word embedding can capture vital information in the text, aiding in effectively distinguishing between real and fake news [19]. Text-based detection methods commonly use statistical and semantic features of text to detect the credibility of individual news stories [23]. Accordingly, NLP is also considered a powerful tool for extracting clues from a wide range of sentiment analysis and emotional term extraction (especially negative feelings, inflammatory language, and sarcastic insults) from text, documents, and images from numerous sources [18,25]. Since the development of an effective NLP model needs comprehensive datasets that include a diverse range of legitimate and fake news articles [21], the combination of linguistic features and the contextual structure can enhance the interpretability of NLP approaches for fake news detection [10,11,17,24]. In addition, latent textual features are used for news text representations, which can be applied as input into advanced machine learning and deep learning models. Kishwar and Zafar [26] argued that a deep learning model, LSTM initializing with pre-trained GloVe word embeddings, was the best-performing model. Song et al. 
[27] proposed a novel CNN-based detection model, named Credible Early Detection (CED), to improve the accuracy of social media rumor detection. Shu et al. [13] proposed a content-based semi-supervised deep learning LSTM model to facilitate fake news detection with diverse features in news content, social context, and spatiotemporal information on social media.

The existing fake news detection methods focus on social context-based detection limited by a single modality; therefore, numerous integrated multi-modal detection methods were designed to achieve better performance. Multi-modal detection methods have already become a popular research field. According to the studies of Wang et al. [9] and Zhang et al. [28], multi-modal networks can provide a better understanding of the semantics of text and images. Overall, the NLP approach has demonstrated the significance of considering linguistic structures, contextual information, and theory in achieving better performance for fake news detection. Table 1 summarizes the recent studies in the field of fake news detection mentioned above.

Stylometry has recently become a compelling tool for fake news detection. Nadeem et al. [7] proposed the stylometric and semantic similarity-oriented multimodal (SSM) approach, which contains five distinct modules to determine whether the news is fake through the consideration of textual style, content consistency, semantic similarity, and image manipulation or forgery. The stylometric approach based on the Undeutsch Hypothesis has also been successfully adopted in fake news detection [5,6]. Since fake news detection requires a rigorous and evidence-based mechanism, the Undeutsch Hypothesis proposes that a limited amount of evidence should be available for any given real news article when considering the credibility of the source, the accuracy of the information, and the context presented in the content. Legitimate news articles are typically presented with a finite evidence pool from different credible sources, providing a balanced representation of this evidence to support the author’s claims and allowing readers to frame their own opinions based on a well-rounded understanding of the event. On the contrary, biased or deceptive news articles selectively reference a portion of the available evidence to present information in a manipulative way. Since the Undeutsch Hypothesis provides a useful framework for understanding the differences between legitimate and fake news articles, the validity of a written statement of legitimacy can be successfully detected [5]. In some cases, even trusted news sources may inadvertently present biased or incomplete information due to limited access to the sources or other factors. Readers must approach these news articles carefully and seek out multiple credible sources of information before making a judgment on a particular subject.

This study adopted the same insights from the Undeutsch Hypothesis to propose a modified proof-of-concept model based on NLP using NER to detect fake news articles. NLP obtains a deeper understanding of the textual content, such as sentence structure, word meanings, context, and emotions, while NER captures more entity information by identifying the relationships between entities within the text. This integration enables comprehensive stylometric analysis to enhance the efficiency and accuracy of fake news detection.

3. Data and Methods

3.1. Dataset and Corpus

Multiple open-source datasets were adopted in this study. We used the publicly available fake news dataset, LIAR (https://www.cs.ucsb.edu/~william/data/liar_dataset.zip (accessed on 27 February 2023)), which contains 12.8 K short news statements labeled into six categories: pants-fire, false, barely-true, half-true, mostly true, and true [12]. The LIAR dataset contains 12,791 pieces of data, consisting of 10,240 pieces of training data, 1284 pieces of validation data, and 1267 pieces of testing data. Another dataset named fake or real news (https://www.kaggle.com/mrisdal/fake-news (accessed on 27 February 2023)), including legitimate news (3171 articles) and fake news (3164 articles), was included in this study [21]. The selected topics of LIAR and fake or real news are related to politics. In addition, the FakeNews AMT dataset (https://web.eecs.umich.edu/~mihalcea/downloads/fakeNewsDatasets.zip (accessed on 27 February 2023)), presented in the study of Pérez-Rosas et al. [22], was applied in this study. The FakeNews AMT dataset comprises six different sections: business, education, entertainment, politics, sports, and technology. Each section contains forty pairs of articles that focus on a single, specific event. Each article in the dataset consists of a headline and its content, and is labeled as legitimate or fake through binary classification [16,22]. All legitimate news articles are sourced from reputable and credible news sources, including the New York Times, the Wall Street Journal, Bloomberg, CNN, etc. The fake news articles were produced through crowdsourcing using Amazon Mechanical Turk or collected from the Kaggle platform, which offers datasets related to fake news. In this study, our proposed model was a pre-trained model integrating the stylometric detection method based on the Undeutsch Hypothesis. The penalty for the misclassification of news articles was simultaneously considered in this model. 
Both in-domain and cross-domain evaluations were also applied.

3.2. NER-SA: The Modified Proof-of-Concept Model

3.2.1. Creation of Related-Entity and Unrelated-Entity Banks

According to Tsai and Xu [5], an authentic entity bank was created by crawling websites, searching for relevant and irrelevant articles, and utilizing NER techniques to extract and filter out a number of named entities in the initial stage of processing this model. We solved the inherent problem of typical encyclopedia-like information sources with limited detail, incomplete coverage, and infrequent updates by conducting detailed fact-checking processes for specific events. We conducted two Google searches for each legitimate–fake pair of news articles to obtain the pre-trained data in the corpus. In the first search, we adopted the headline of the news article as the query string. Then, the first ten news articles published within five days of the release date of the paired news article were collected and converted into text files containing the headline and content text using the Python library Newspaper [31]. The first search result, document A, was considered a corpus of related news articles. For the second search, the same query string was used to conduct the same process; however, only articles published within a five-day window one year before the release date of the paired news article were recorded. The first ten news articles were collected and processed according to the same procedure. The second search result, document B, was regarded as a corpus of unrelated news articles. Through these two searches, we were able to gather a comprehensive set of up-to-date and historical news articles related to each legitimate–fake pair in our corpus, enabling us to conduct thorough fact-checking and ensure the accuracy of our analysis.

After parsing the document, the variety of named entities that appear in the headline and content text and the reoccurrences of each distinct entity (including the number of reoccurrences of its pronouns) within the document were recorded through the NER technique. These named entities were categorized as proper nouns, numbers, and dates using spacy.load (with the “en_core_web_sm” model) for NER [5]. Compiling the named entities extracted from documents A and B, two entity banks named “related-entity bank” and “unrelated-entity bank” were created, respectively. The “related-entity bank” included all named entities that reoccurred at least once in document A, and the “unrelated-entity bank” included all named entities that reoccurred at least once in document B.
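As a minimal sketch of the bank-building step, the snippet below counts entity mentions that have already been extracted from a document and keeps those that reoccur. It assumes that "reoccurred at least once" means an entity is mentioned at least twice; the function name `build_entity_bank`, the `min_mentions` parameter, and the sample mentions are hypothetical and not taken from the paper.

```python
from collections import Counter

def build_entity_bank(entity_mentions, min_mentions=2):
    """Return {entity: mention count} for entities that reoccur.

    Assumption: "reoccurred at least once" is read as "mentioned at
    least twice", hence the default min_mentions=2.
    """
    counts = Counter(entity_mentions)
    return {entity: n for entity, n in counts.items() if n >= min_mentions}

# Hypothetical mentions, as an NER pass over each document might return them.
doc_a_mentions = ["Apple", "Tim Cook", "Apple", "Cupertino", "Tim Cook", "2023"]
doc_b_mentions = ["Apple", "Apple", "Samsung", "Samsung", "2022"]

related_bank = build_entity_bank(doc_a_mentions)
unrelated_bank = build_entity_bank(doc_b_mentions)
print(related_bank)
print(unrelated_bank)
```

Keeping the counts alongside the entity names, rather than a bare set, leaves the reoccurrence information available for later index calculations.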

3.2.2. Filtering Out the Outliers and Calculating the Reoccurrence Index and Variety Index

The next stage was a process for filtering out irrelevant, invalid, and unnecessary named entities through the comparison of the related-entity bank and the unrelated-entity bank [5]. It was necessary to remove the named entities that co-occurred in both the unrelated-entity bank and the related-entity bank. This filtering process helped identify named entities that were more likely to be relevant to a particular topic or domain and filter out irrelevant or misleading information. Two indexes, the reoccurrence index (RI) and the variety index (VI), which measured the weight of a named entity and the diversity of all named entities, respectively, within an article, were calculated to identify the outliers. If the RI or VI of any article originating from document A was less than the RI or VI of the article originating from document B, this article was considered an outlier that needed to be removed.

RI = \frac{\text{number of named entities with reoccurrences}}{\text{total number of named entities (including reoccurrences)}}

(1)

VI = \frac{\text{number of distinct named entities within an article}}{\text{total number of distinct named entities within the corresponding entity bank to which it belongs}}

(2)
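The filtering step and Equations (1) and (2) can be sketched as follows. This is only one reading of the definitions: RI is taken as the number of distinct entities mentioned more than once divided by the total number of mentions, and VI as the number of distinct entities in the article divided by the size of the bank. All function names and the sample data are hypothetical.

```python
from collections import Counter

def filter_bank(related_bank, unrelated_bank):
    """Drop entities that co-occur in both banks (set difference)."""
    return set(related_bank) - set(unrelated_bank)

def reoccurrence_index(article_mentions):
    """Eq. (1), assumed reading: reoccurring entities / total mentions."""
    counts = Counter(article_mentions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    reoccurring = sum(1 for n in counts.values() if n > 1)
    return reoccurring / total

def variety_index(article_mentions, bank):
    """Eq. (2): distinct entities in the article / distinct entities in the bank."""
    if not bank:
        return 0.0
    return len(set(article_mentions)) / len(set(bank))

# Hypothetical example data.
filtered = filter_bank({"Apple", "Tim Cook"}, {"Apple", "Samsung"})
mentions = ["Apple", "Tim Cook", "Apple", "Cupertino"]
bank = {"Apple", "Tim Cook", "Cupertino", "iPhone", "2023"}

ri = reoccurrence_index(mentions)   # 1 reoccurring entity / 4 mentions
vi = variety_index(mentions, bank)  # 3 distinct entities / 5 bank entries
print(filtered, ri, vi)
```

With these sample values, RI is 1/4 and VI is 3/5; an article whose indexes fall below those of its document B counterpart would then be treated as an outlier.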

3.2.3. Developing the Mathematical Model to Form the Legitimate Area

The legitimate area was developed by calculating the ranges of RI and VI of all pre-trained articles. Any news article with an RI and VI falling within the legitimate area was classified as a legitimate news article; otherwise, it was classified as a fake news article. For each news article, k, we looked for the possible RI(k) values, which had to fall into the feasible range between the lower bound m_RI and the upper bound n_RI (i.e., m_RI ≤ RI(k) ≤ n_RI). This range was determined by simultaneously raising m_RI from its lowest value and lowering n_RI from its highest value to form a range containing a maximum number of legitimate news articles and a minimum number of fake news articles. Likewise, a feasible range of the VI was obtained through the same process of simultaneously adjusting the lower bound m_VI and the upper bound n_VI so that the possible VI(k) values satisfied m_VI ≤ VI(k) ≤ n_VI. Accordingly, the objective was to find a range of values that maximized the number of legitimate news articles while minimizing the number of fake news articles.

We assumed that N was the total number of news articles. For each news article k, RI(k) and VI(k) are its respective RI and VI values. The legitimate area can be defined as:

\text{Legitimate area} = \{ (RI(k), VI(k)) \mid m_{RI} \leq RI(k) \leq n_{RI} \text{ and } m_{VI} \leq VI(k) \leq n_{VI} \}

(3)

A news article k is classified as legitimate if and only if (RI(k), VI(k)) is within the legitimate area. Otherwise, the article is classified as fake. L(RI, VI) is the number of legitimate news articles with the corresponding RI(k) and VI(k) values falling within a given range of values, and F(RI, VI) is the number of fake news articles with the corresponding RI(k) and VI(k) values falling within a given range of values. This study involved M(RI, VI), which defines the number of misclassified news articles within the legitimate area, and a parameter, p, which controls the trade-off between maximizing the number of legitimate news articles and minimizing the number of fake news articles. To obtain the optimal values of m_RI, n_RI, m_VI, and n_VI, the objective function maximizes the difference between the number of correctly classified legitimate news articles and the number of falsely classified fake news articles while penalizing the number of misclassified news articles. The constraints ensure that all news articles fall within the specified RI and VI ranges, and the parameter p can be tuned to achieve the desired balance between maximizing the number of legitimate news articles and minimizing the number of fake news articles. Thus, the mathematical formulas of the NER-SA model are:

\text{Maximize } L(RI, VI) - F(RI, VI) - p \cdot M(RI, VI)

(4)

subject to:

m_{RI} \leq RI(k) \leq n_{RI}

(5)

m_{VI} \leq VI(k) \leq n_{VI}

(6)

L(RI, VI) + F(RI, VI) = N

(7)
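To make the objective in Equation (4) concrete, the sketch below scores one candidate rectangle against labeled articles. The text leaves the exact counting of M open, so the code assumes M is the number of fake articles inside the area plus the number of legitimate articles outside it; this is an illustrative reading, not the author's confirmed definition, and Equation (7) is not enforced here. The names `in_area` and `objective` and the sample data are hypothetical.

```python
def in_area(ri, vi, bounds):
    """Eq. (3): True if (ri, vi) lies inside the rectangle given by bounds."""
    m_ri, n_ri, m_vi, n_vi = bounds
    return m_ri <= ri <= n_ri and m_vi <= vi <= n_vi

def objective(articles, bounds, p):
    """Score L - F - p * M for one candidate legitimate area.

    articles: iterable of (ri, vi, is_legitimate) tuples.
    Assumption: M = fake articles inside + legitimate articles outside.
    """
    legit_in = sum(1 for ri, vi, ok in articles if ok and in_area(ri, vi, bounds))
    fake_in = sum(1 for ri, vi, ok in articles if not ok and in_area(ri, vi, bounds))
    legit_out = sum(1 for ri, vi, ok in articles if ok and not in_area(ri, vi, bounds))
    misclassified = fake_in + legit_out
    return legit_in - fake_in - p * misclassified

# Hypothetical pre-trained articles: (RI, VI, is_legitimate).
articles = [
    (0.30, 0.50, True),
    (0.35, 0.55, True),
    (0.90, 0.10, False),
    (0.32, 0.52, False),
    (0.80, 0.90, True),
]
bounds = (0.25, 0.40, 0.45, 0.60)  # (m_RI, n_RI, m_VI, n_VI)
score = objective(articles, bounds, p=0.5)
print(score)  # 2 - 1 - 0.5 * 2 = 0.0
```

The same `in_area` check doubles as the classifier for unseen articles: a point inside the rectangle is labeled legitimate, and anything outside is labeled fake.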

The formation process of the legitimate area can be viewed as a search across the RI–VI multi-dimensional surface, aiming at identifying an appropriate maximal intersecting region that encompasses the maximum number of legitimate news articles while minimizing the inclusion of false news articles. To simplify the visual complexity of multi-dimensional data, Figure 1a illustrates a 3D representation when both RI and VI have 3 dimensions (features) each. Figure 1b displays a contour plot that depicts the intersecting surface obtained from the first dimension of RI and the first dimension of VI. The legitimate area, also shown in Figure 1b, is graphically represented by a dotted rectangle.

We used the simulated annealing algorithm to optimize the mathematical model where the legitimate area is developed to classify news articles as legitimate or fake. Simulated annealing is a global search algorithm well suited for non-convex problems with multiple locally optimal solutions. Additionally, simulated annealing exploits randomness and disorder to avoid getting stuck in local optima. In the NER-SA model, tuning the value of p might create different feasible regions in the solution space, and simulated annealing can aid in exploring these regions. The algorithmic process is as follows:

Step 1. Initialization:

-: Initialize ‘p’ to an initial value ‘initial_p’ (the parameter setting = 0.5).
-: Set the maximum number of iterations ‘max_iterations’ (the parameter setting = 5000).
-: Set the initial temperature ‘initial_temp’ (the parameter setting = 100).
-: Set the cooling rate ‘cooling_rate’ (the parameter setting = 0.95).

Step 2. Objective Function:

-: Define the objective function ‘objective_function(L, F, M, p)’ that calculates the objective score based on the given ‘p’.

Step 3. Simulated Annealing Process:

-: Enter the main loop and iterate from 1 to ‘max_iterations’.
-: In each iteration:

-: Calculate the current objective score for the current ‘p’, ‘current_score’.
-: If ‘current_score’ is better than the previous best score ‘best_score’, update ‘best_score’ and ‘best_p’.
-: Generate a new ‘p’ value ‘new_p’, usually by adding a small random decimal to the current ‘p’.
-: Calculate the acceptance probability ‘acceptance_prob’ using the formula ‘exp((current_score - objective_function(L, F, M, new_p))/initial_temp)’.
-: Generate a random number between 0 and 1; if it is less than ‘acceptance_prob’, accept the new solution by updating ‘current_p’ to ‘new_p’.
-: Multiply the temperature ‘initial_temp’ by the cooling rate ‘cooling_rate’ to decrease the temperature.
-: Update the new temperature to ‘initial_temp’.

Step 4. Output Results:

-: After optimization, output the best ‘p’ value ‘best_p’ and its corresponding objective score ‘best_score’.
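The steps above can be sketched as a small, generic simulated annealing loop over a single scalar (such as ‘p’). The parameter defaults follow the settings listed in Step 1; the step size, the random seed, and the toy score function are assumptions added for illustration. The acceptance rule used here is the standard Metropolis criterion for a maximization problem, exp((new_score - current_score) / temperature), which is an assumed sign convention rather than a transcription of the formula in Step 3.

```python
import math
import random

def simulated_annealing(score_fn, initial_x, max_iterations=5000,
                        initial_temp=100.0, cooling_rate=0.95,
                        step=0.05, seed=0):
    """Maximize score_fn over one scalar with simulated annealing."""
    rng = random.Random(seed)  # seeded for reproducibility (assumption)
    current_x = initial_x
    current_score = score_fn(current_x)
    best_x, best_score = current_x, current_score
    temp = initial_temp
    for _ in range(max_iterations):
        # Propose a neighbor by adding a small random decimal.
        new_x = current_x + rng.uniform(-step, step)
        new_score = score_fn(new_x)
        delta = new_score - current_score
        # Always accept improvements; accept worse moves with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / max(temp, 1e-300)):
            current_x, current_score = new_x, new_score
        if current_score > best_score:
            best_x, best_score = current_x, current_score
        temp *= cooling_rate  # geometric cooling schedule
    return best_x, best_score

def toy_score(p):
    """Hypothetical stand-in for the objective, peaked at p = 0.2."""
    return -(p - 0.2) ** 2

best_p, best_score = simulated_annealing(toy_score, initial_x=0.5)
print(best_p, best_score)
```

In a fuller version, the search variable could be the four bounds (m_RI, n_RI, m_VI, n_VI) instead of a single scalar, with the neighbor proposal perturbing one bound at a time.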

3.2.4. Differentiating between Legitimate News and Fake News

After establishing the legitimate area through all pre-trained data, the RI and VI of each test news article were calculated and plotted as X and Y coordinates, respectively. The point coordinate of a news article falling within the legitimate area was marked as a legitimate article, while the point coordinate of a news article falling outside was marked as fake news.

The concept of fake news is still vague. Potthast et al. [6] considered that any biased reports or completely fraudulent articles may be included as fake news. This study classified legitimate news as reports based on factual evidence. We still cannot judge reports based purely on opinions, even though their opinions may be correct. An article was classified as illegitimate if it lacked real facts or evidence to support its claims.

3.3. Performance Evaluation

Wang [12] emphasized the importance of benchmark datasets in enabling a fair and accurate evaluation of the NLP approach. According to the study of Khan et al. [21], the performance evaluation of the detection model was assessed using accuracy (A), precision (P), recall (R), and an F1-score (F1), where the F1 is derived from the harmonic mean of P and R. A confusion matrix was utilized to record the number of true and false matches for each label. To obtain a consistent baseline for evaluation, we adopted the concept of macro-average to calculate precision and recall, which will be equal to the average of P(real) and P(fake) and the average of R(real) and R(fake), respectively. Accordingly, the four evaluation indexes, A, P, R, and F1, were then calculated as follows [21].

\text{Accuracy } (A) = \frac{TP + TN}{TP + FN + TN + FP}

(8)

\text{Precision } (P) = \frac{P(\text{real}) + P(\text{fake})}{2}, \quad \text{where } P(\text{real}) = \frac{TP}{TP + FP}, \quad P(\text{fake}) = \frac{TN}{TN + FN}

(9)

\text{Recall } (R) = \frac{R(\text{real}) + R(\text{fake})}{2}, \quad \text{where } R(\text{real}) = \frac{TP}{TP + FN}, \quad R(\text{fake}) = \frac{TN}{TN + FP}

(10)

F1 = \frac{2 \times P \times R}{P + R}

(11)
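As a worked illustration of Equations (8) to (11), the following sketch computes the four indexes from the cells of a binary confusion matrix, treating "real" as the positive class. The function name `evaluate` and the example counts are hypothetical, and the code assumes every denominator is nonzero.

```python
def evaluate(tp, fn, tn, fp):
    """Return (accuracy, macro precision, macro recall, F1) per Eqs. (8)-(11).

    Assumes all denominators are nonzero (each class is both present
    and predicted at least once).
    """
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    p_real = tp / (tp + fp)
    p_fake = tn / (tn + fn)
    r_real = tp / (tp + fn)
    r_fake = tn / (tn + fp)
    precision = (p_real + p_fake) / 2  # macro-averaged precision
    recall = (r_real + r_fake) / 2     # macro-averaged recall
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical confusion matrix: 50 real and 50 fake test articles.
a, p, r, f1 = evaluate(tp=40, fn=10, tn=30, fp=20)
print(a, p, r, f1)
```

For these counts, A = 0.7, P = 17/24, R = 0.7, and F1 = 119/169. Note that F1 is computed here from the macro-averaged P and R, as in Equation (11), rather than by averaging per-class F1 scores.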

4. Experimental Results and Discussion

4.1. Statistical Evaluation

The NER-SA model’s training set accuracy, test set accuracy, root mean square error (RMSE), and R-square values were used to evaluate the experimental results. We also used the test samples of each dataset to carry out a paired t-test [32] on the actual and predicted values. All t values were statistically significant (p < 0.001), implying that the NER-SA model is reliable and feasible. The results of the statistical evaluation for the NER-SA model are shown in Table 2.

4.2. In-Domain Analysis Results

We used the LIAR and fake or real news datasets for the in-domain analysis of fake news detection. The performance of the various models is shown in Table 3, which also reports two machine learning models and two deep learning models from Khan et al. [21] for comparison. The NER-SA model achieves better detection performance: on the LIAR dataset, accuracy and F1 increase by 8.94% and 24.04% on average, respectively, and on the fake or real news dataset, accuracy and F1 both increase by 19.36% on average. Compared with the results of Khan et al. [21], the NER-SA model performs considerably better in style-based classification tasks.
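
The text does not spell out how the "increase on average" is calculated. One reading that reproduces the LIAR figures in Table 3 is the mean of the relative increases over each of the four baseline models; the short sketch below works through that reading. The function name is hypothetical, and the formula is an assumption inferred from the tabulated values.

```python
def average_increase(ours, baselines):
    """Mean of the per-baseline relative increases, in percent."""
    return 100 * sum(ours / b - 1 for b in baselines) / len(baselines)


# LIAR rows of Table 3: SVM, Naive Bayes, CNN, LSTM versus NER-SA.
liar_accuracy = average_increase(0.62, [0.56, 0.60, 0.58, 0.54])
liar_f1 = average_increase(0.61, [0.48, 0.59, 0.58, 0.38])

print(round(liar_accuracy, 2), round(liar_f1, 2))  # 8.94 24.04
```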

4.3. Cross-Domain Analysis Results

The six news domains in the FakeNews AMT dataset were also used to assess cross-domain classification performance. Five of the six domains serve as training samples, and the remaining domain serves as test samples. The results are shown in Table 4. Compared with the experimental results of the two existing models [22], the NER-SA model also achieves better performance in cross-domain classification tasks, since its accuracy and F1 are markedly higher in the business, sports, and entertainment domains. Pérez-Rosas et al. [22] observed that the technology, education, and politics domains reach higher accuracy with the readability feature, which suggests that legitimate and fake news within each of these three domains might be similar in structure and content. The NER-SA model leads to the same conclusion. The descriptive style in the technology, education, and politics domains often takes the form of names and numbers: all four of the “who, what, when, where” terms tend to be proper nouns, while the articles are decomposed into separate paragraphs or sentences.
Although Pérez-Rosas et al. [22] stated that domains such as business, sports, and entertainment are less generalizable and may therefore only rely on their own specific domains, the NER-SA model still achieves higher accuracy and F1-score in these three domains. We found that the “who”, “when”, and “where” terms in the business, sports, and entertainment domains often refer to specific proper nouns that express concern about what is happening at the moment of the event. For example, in the business domain, the “who” term may refer to the name of a company or a person involved in a particular deal or transaction, the “when” term may refer to the date and time when a company will release its financial results, and the “where” term may refer to the location of a company’s headquarters or the site where a new project is planned.
Moreover, fake articles in the education and politics domains may more often focus on the “who” term, which involves high-profile individuals and political parties. Fake news creators can take advantage of fabricated content to create sensational headlines and stories that are designed to grab readers’ attention and generate clicks. Similarly, fake news articles in the business and entertainment domains may focus on the “what” term to make the stories seem more credible and convincing by providing specific details and facts. However, it is important to note that the sense of realism shown may be entirely fabricated, and the stories themselves may be completely false.
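
The cross-domain protocol described above, in which five domains are used for training and the remaining one for testing, can be sketched as a leave-one-domain-out loop. The tuple layout, the function name, and the placeholder corpus below are hypothetical and serve only to illustrate the split; the FakeNews AMT records themselves are not reproduced.

```python
DOMAINS = ["technology", "education", "business",
           "sports", "politics", "entertainment"]


def leave_one_domain_out(articles):
    """Yield (held_out_domain, train, test) once per domain.

    `articles` is a list of (domain, text, label) tuples.
    """
    for held_out in DOMAINS:
        train = [a for a in articles if a[0] != held_out]
        test = [a for a in articles if a[0] == held_out]
        yield held_out, train, test


# Placeholder corpus: one "real" and one "fake" item per domain.
articles = [
    (domain, f"{domain} article {i}", label)
    for domain in DOMAINS
    for i, label in enumerate(["real", "fake"])
]

splits = list(leave_one_domain_out(articles))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Each held-out domain never appears in its own training split, which is what makes the reported scores a measure of cross-domain generalization.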

4.4. Discussion

In Section 4.2 and Section 4.3, we designed experiments to carry out in-domain and cross-domain analyses. From the NER perspective, the NER-SA model achieves significantly better detection performance than the models published in the two studies [21,22] on the same datasets, because the words, sentences, and semantics hidden in fake news writing are analyzed using stylistic and topical patterns. This suggests that the proposed model is a promising approach for detecting fake news when compared with current state-of-the-art models.

Accuracy (A) is the proportion of correct predictions among all predictions, which means that accurately identifying true and false news samples is crucial to maintaining high accuracy. In NER for fake news detection, a higher accuracy demonstrates that the NER-SA model possesses a higher proportion of correctly recognized named entities, such as people, places, organizations, etc., among all the named entities in the text. Accurate identification of entities is an important and reliable foundation for NLP.
Precision (P) is the average of the proportion of true positive predictions among all positive predictions and the proportion of true negative predictions among all negative predictions. In NER for fake news detection, a higher precision means that when the NER-SA model marks an entity, it is more likely to be a true entity and not a false one. Precise identification of trustworthy entities in the extracted information helps in avoiding the inclusion of fabricated or misleading entities.
Recall (R) is the average of the proportion of true positive predictions among all actual positive samples and the proportion of true negative predictions among all actual negative samples. In NER for fake news detection, a higher recall implies that the NER-SA model is more effective at capturing a significant portion of the entities, including those that might be intentionally obscured or hidden within the fake news text. The entities recovered in this way provide valuable clues about the authenticity of the information.
The F1-score (F1) is the harmonic mean of precision and recall, considering the trade-off between false positives and false negatives. In NER for fake news detection, achieving a higher F1 implies that the NER-SA model not only correctly identifies named entities (precision) but also successfully captures a substantial portion of all entities present (recall). Moreover, a high F1 leads to a more robust and reliable NER-SA model for detecting fake news.
For cross-domain analysis, the NER-SA model reports higher recall along with lower precision in all domains. Real news articles must share some similarities, because they reference similar evidence to support their reporting on events; fake news, however, is not constrained by a limited pool of evidence, so its descriptions must be conjured up from the author’s imagination.
For the mathematical model, some fake news articles will inevitably have coordinates on the variety index (VI) that fall inside the legitimate area. This problem can be effectively addressed by including the penalty function in the NER-SA model so as to avoid excessive expansion of the legitimate area. Judging from the numerical results for A and F1, the model indeed performs better than the machine learning and deep learning models. Since fake news producers continuously employ adversarial techniques that intentionally modify linguistic features to deceive NLP models, we suggest that more diverse reoccurrence index (RI) and VI shape frameworks be developed to maximize the legitimate area, in order to enhance the robustness of the NER-SA model against adversarial attacks and improve detection performance in the future.
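
To illustrate how a penalty term can keep a legitimate area from expanding excessively, the following toy sketch runs simulated annealing over a one-dimensional interval. It is a deliberately simplified stand-in, not the paper's mathematical model: the single score per article, the interval-shaped area, the objective, the penalty weight, the step size, and the linear cooling schedule are all assumptions made for illustration.

```python
import math
import random


def objective(lo, hi, real, fake, penalty_weight):
    """Real articles covered, minus fake articles covered, minus a width penalty."""
    inside_real = sum(lo <= x <= hi for x in real)
    inside_fake = sum(lo <= x <= hi for x in fake)
    return inside_real - inside_fake - penalty_weight * (hi - lo)


def anneal(real, fake, penalty_weight=1.0, steps=2000, seed=0):
    """Search for an interval [lo, hi] in [0, 1] that maximizes the objective."""
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0  # start from the widest possible area
    current = objective(lo, hi, real, fake, penalty_weight)
    best = (current, lo, hi)
    for step in range(steps):
        temperature = 1.0 - step / steps  # linear cooling, always > 0
        new_lo = min(max(lo + rng.uniform(-0.05, 0.05), 0.0), 1.0)
        new_hi = min(max(hi + rng.uniform(-0.05, 0.05), 0.0), 1.0)
        if new_lo > new_hi:
            continue  # skip invalid intervals
        candidate = objective(new_lo, new_hi, real, fake, penalty_weight)
        delta = candidate - current
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            lo, hi, current = new_lo, new_hi, candidate
            if current > best[0]:
                best = (current, lo, hi)
    return best


# Hypothetical one-dimensional scores, for illustration only.
real = [0.42, 0.45, 0.50, 0.52, 0.55, 0.58]
fake = [0.10, 0.20, 0.48, 0.80, 0.90, 0.95]

best_score, lo, hi = anneal(real, fake)
print(best_score, lo, hi)
```

Without the width penalty, two intervals that cover the same articles would score equally regardless of size; the penalty breaks that tie in favor of the tighter area, which is the behavior the paragraph above attributes to the penalty function.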

4.5. Theoretical and Practical Implications

Fake news mainly consists of fictitious statements relating to actual events, so its content is a presentation of artificial fiction. Fake news often causes people to misunderstand real events; it can blur, exaggerate, or distort them, or fabricate rumors out of nothing [27,33]. Comparing all the words in the named entity banks, we observe that these fake news articles are not noticeably similar to one another. Furthermore, conflicts and inconsistencies in content and writing style indicate that fictitious and unreasonable content has been forced into the narrative of fake news. While it is perhaps not easy to detect these fake news articles manually, they can be discerned clearly by applying the natural language processing approach.

Although the majority of studies on fake news detection are restricted to English materials, the NER-SA model proposed in this study is still applicable to articles in the Chinese language since it is based on the NLP and NER approaches. There are some related studies in this field [23]. The development of language-specific models associated with detecting fake news using NER techniques is worth further discussion.

Because the popularity of social media has soared in recent years, disseminated information is no longer exclusively text but rather big data that combines pictures, audio, video, etc. Bonifazi et al. [20] and Corradini et al. [25] highlighted the role of linguistic features of NLP techniques in detecting rumors and hurtful remarks on social media. Zubiaga et al. [33] also considered that NLP techniques play a crucial role in integrating multimodal analysis and uncovering inconsistencies and discrepancies that may signal the presence of misinformation or manipulated content.

Nowadays, it is very easy for people to access all kinds of information on social media platforms, and they can also easily spread this potentially problematic information with just one click or one retweet. Although these increasingly complex problems make detection systems less and less useful, we may still try to introduce an automatic fake news detection system on social media platforms and provide users with appropriate detection results based on a certain range of factual events as an early warning. Since fake news spreads faster and has a wider influence on social media, the NER techniques proposed in this study are able to help detect fake news spread on various social media platforms. In addition, NER techniques can also assist government authorities in fighting against illegitimate fake news and allow ordinary people to evaluate and access reliable news sources so as not to be deceived.

5. Conclusions

In this study, we presented the NER-SA model, which is based on the natural language processing approach, to conduct fake news detection. The model recognizes features of content and writing style in articles by building an extensive corpus of named entities, and the penalty function further improves its overall performance. NER provides insights into the legitimacy of an article. A legitimate news article is expected to contain accurate and relevant named entities, whereas a fake or misleading article may contain inaccurate or irrelevant ones. If an article contains a high frequency of named entities that are relevant to the topic and consistent with other reputable sources, it is more likely to be trustworthy; if it contains a high frequency of named entities that are irrelevant to the topic or inconsistent with other sources, it is more likely to be suspicious. Since legitimate descriptions must be derived from real events, the content and writing style of reports on the same topic are similar. This study used two datasets, LIAR and fake or real news, for in-domain analysis, and the FakeNews AMT dataset with six domains for cross-domain analysis. Whether using 12,791 records from the LIAR dataset, 6335 records from the fake or real news dataset, or only 480 records from the six domains of the FakeNews AMT dataset, the NER-SA model obtains good detection results. This also demonstrates that stylometric detection based on the Undeutsch Hypothesis can be successfully applied to the detection of fake news.

Our findings also align with the advantages of NLP approaches, which are shown in other studies [10,11,13,17], emphasizing fact extraction, linguistic manipulations, sentiment analysis, and the ability to handle large volumes of textual data and extract valuable insights. By analyzing linguistic patterns, syntactic structures, semantic relationships, cultural references, and temporal dependencies through NER techniques, we can better distinguish between factual information and fabricated claims through the NER-SA model to improve the accuracy of detection. Moreover, fake news producers often employ adversarial strategies to evade detection algorithms [34]. In this study, we have made progress in developing defenses against adversarial attacks, such as incorporating robust linguistic features and interpretable models with NER techniques that can identify subtle linguistic manipulations and improve the robustness of NLP approaches.

The complexity of human language poses challenges for accurately detecting fake news. Fake news detection also raises ethical concerns, including user privacy, freedom of speech, and potential biases. A limitation of this study is that these ethical implications, which bear on fairness, transparency, and user privacy, were not considered in designing and deploying the NER-based detection model. In addition, as the volume of information increases, processing and analyzing vast amounts of textual data in real time can be computationally intensive; parallel computing and distributed systems should be explored to overcome the scalability limitations of NLP-based approaches. Future research on the NER-SA model should continue to refine detection based on subtle, difficult-to-capture linguistic cues and to improve the efficiency of the computational algorithms, in order to identify the sophisticated techniques employed by fake news producers.

Funding

This research received no external funding.

Data Availability Statement

Links to the datasets used in this study are provided in Section 3.1.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Hutchinson, A. New Research Shows that 71% of Americans Now Get News Content via Social Platforms. 2021. Available online: https://www.socialmediatoday.com/news/new-research-shows-that-71-of-americans-now-get-news-content-via-social-pl/593255/ (accessed on 31 October 2022).
2. Ellerbeck, S. Most People Get Their News Online—But Many Are Switching off Altogether. Here’s Why. 2022. Available online: https://www.weforum.org/agenda/2022/09/news-online-europe-social-media/ (accessed on 31 October 2022).
3. Majid, A. Survey: Google Is Most Trusted Tech Platform for News, TikTok the Least. 2022. Available online: https://pressgazette.co.uk/data-shows-broad-trust-gap-between-news-in-general-and-news-on-social-media/ (accessed on 31 October 2022).
4. Shahsavari, S.; Holur, P.; Tangherlini, T.R.; Roychowdhury, V. Conspiracy in the time of corona: Automatic detection of COVID-19 conspiracy theories in social media and the news. J. Comput. Soc. Sci. 2020, 3, 279–317.
5. Tsai, C.M.; Xu, B.S. Automatic differentiation between legitimate and fake news using named entity recognition. In Proceedings of the 2020 3rd International Conference on Artificial Intelligence and Pattern Recognition, Xiamen, China, 26–28 June 2020; pp. 74–78.
6. Potthast, M.; Kiesel, J.; Reinartz, K.; Bevendorff, J.; Stein, B. A stylometric inquiry into hyperpartisan and fake news. arXiv 2017, arXiv:1702.05638.
7. Nadeem, M.I.; Ahmed, K.; Zheng, Z.; Li, D.; Assam, M.; Ghadi, Y.Y.; Alghamedy, F.H.; Eldin, E.T. SSM: Stylometric and semantic similarity oriented multimodal fake news detection. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101559.
8. Abeynayake, A.D.L.; Sunethra, A.A.; Deshani, K.A.D. A stylometric approach for reliable news detection using machine learning methods. In Proceedings of the 2022 22nd International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, 30 November–1 December 2022; pp. 1–6.
9. Wang, Y.; Qian, S.; Hu, J.; Fang, Q.; Xu, C. Fake news detection via knowledge-driven multimodal graph convolutional networks. In Proceedings of the 10th International Conference on Multimedia Retrieval, Dublin, Ireland, 8–11 June 2020; pp. 540–547.
10. Torabi Asr, F.; Taboada, M. Big Data and quality data for fake news and misinformation detection. Big Data Soc. 2019, 6.
11. Himdi, H.; Weir, G.; Assiri, F.; Al-Barhamtoshy, H. Arabic fake news detection based on textual analysis. Arab. J. Sci. Eng. 2022, 47, 10453–10469.
12. Wang, W.Y. Liar, liar pants on fire: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 422–426.
13. Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data 2020, 8, 171–188.
14. Cauteruccio, F.; Stamile, C.; Terracina, G.; Ursino, D.; Sappey-Marinier, D. An automated string-based approach to extracting and characterizing White Matter fiber-bundles. Comput. Biol. Med. 2016, 77, 64–75.
15. Cauteruccio, F.; Stamile, C.; Terracina, G.; Ursino, D.; Sappey-Marinier, D. An automated string-based approach to White Matter fiber-bundles clustering. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–8.
16. Saikh, T.; De, A.; Ekbal, A.; Bhattacharyya, P. A deep learning approach for automatic detection of fake news. In Proceedings of the 16th International Conference on Natural Language Processing, Hyderabad, India, 18–21 December 2019; pp. 230–238.
17. Amer, E.; Kwak, K.-S.; El-Sappagh, S. Context-based fake news detection model relying on deep learning models. Electronics 2022, 11, 1255.
18. Rasool, A.; Tao, R.; Kamyab, M.; Hayat, S. GAWA—A feature selection method for hybrid sentiment classification. IEEE Access 2020, 8, 191850–191861.
19. Lai, C.-M.; Chen, M.-H.; Kristiani, E.; Verma, V.K.; Yang, C.-T. Fake News Classification Based on Content Level Features. Appl. Sci. 2022, 12, 1116.
20. Bonifazi, G.; Cauteruccio, F.; Corradini, E.; Marchetti, M.; Sciarretta, L.; Ursino, D.; Virgili, L. A Space-Time Framework for Sentiment Scope Analysis in Social Media. Big Data Cogn. Comput. 2022, 6, 130.
21. Khan, J.Y.; Khondaker, M.T.I.; Afroz, S.; Uddin, G.; Iqbal, A. A benchmark study of machine learning models for online fake news detection. Mach. Learn. Appl. 2021, 4, 100032.
22. Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic detection of fake news. In Proceedings of the International Conference on Computational Linguistics, Yangon, Myanmar, 16–18 August 2017.
23. Wang, H.; Wang, S.; Han, Y.H. Detecting fake news on Chinese social media based on hybrid feature fusion method. Expert Syst. Appl. 2022, 208, 118111.
24. Alghamdi, J.; Lin, Y.; Luo, S. A comparative study of machine learning and deep learning techniques for fake news detection. Information 2022, 13, 576.
25. Corradini, E. The dark threads that weave the web of shame: A network science-inspired analysis of body shaming on Reddit. Information 2023, 14, 436.
26. Kishwar, A.; Zafar, A. Fake news detection on Pakistani news using machine learning and deep learning. Expert Syst. Appl. 2023, 211, 118558.
27. Song, C.; Yang, C.; Chen, H.; Tu, C.; Liu, Z.; Sun, M. CED: Credible early detection of social media rumors. IEEE Trans. Knowl. Data Eng. 2019, 1, 99.
28. Zhang, H.; Fang, Q.; Qian, S.; Xu, C. Multi-modal knowledge-aware event memory network for social media rumor detection. In Proceedings of the 27th International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1942–1951.
29. Trueman, T.E.; Ashok Kumar, J.; Narayanasamy, P.; Vidya, J. Attention-based C-BiLSTM for fake news detection. Appl. Soft. Comput. 2021, 110, 107600.
30. Segura-Bedmar, I.; Alonso-Bartolome, S. Multimodal fake news detection. Information 2022, 13, 284.
31. Yang, L.O. Newspaper3k: Article Scraping & Curation. 2020. Available online: https://newspaper.readthedocs.io (accessed on 15 December 2020).
32. Xu, M.; Fralick, D.; Zheng, J.Z.; Wang, B.; Tu, X.M.; Feng, C. The differences and similarities between two-sample t-test and paired t-test. Shanghai Arch. Psychiatry 2017, 29, 184–188.
33. Zubiaga, A.; Aker, A.; Bontcheva, K.; Liakata, M.; Procter, R. Detection and resolution of rumours in social media: A survey. ACM Comput. Surv. 2018, 51, 1–36.
34. DSouza, K.M.; French, A.M. Social media and fake news detection using adversarial collaboration. In Proceedings of the 55th Hawaii International Conference on System Sciences, Maui, HI, USA, 4–7 January 2022; pp. 115–123.

Figure 1. Visual representation of the legitimate area. (a) A 3D representation when both RI and VI have 3 dimensions (features). (b) A contour plot depicting the intersecting surfaces of RI feature 1 and VI feature 1.

Table 1. Summary of fake news detection using NLP.

Reference	Method/Contributions	Limitations
Shu et al. [13]	Their content-based LSTM model is beneficial for fake news detection, fake news evolution, fake news mitigation, and malicious account detection.	People belonging to the same social communities often express the same interests and viewpoints. Their opinions are more likely to be manipulated.
Cauteruccio et al. [14,15]	They proposed a particular string-based fiber representation to solve the string clustering problem of fibers. Their method can be applied to heterogeneous data from different domains.	The model can easily extract those fibers with the same structure. The contextual multi-modal approach might not be considered.
Amer et al. [17]	They developed a more effective linguistic model with contextual feature extraction and showed that their model outperformed the accuracy of BERT. Their model also achieves the same accuracy as the LSTM and the Gated Recurrent Unit (GRU) models.	The authors said their model needs to be continuously retrained or updated to deal with a larger volume of malicious fake news generation tools.
Rasool et al. [18]	They proposed the GAWA method, supported by using hybrid sentiment classification to find the optimal features for feature reductions. The results revealed that GAWA can reduce the feature set by up to 61.95% without reducing the accuracy level and enhance the efficiency of the Naive Bayes algorithm for better accuracy.	The GAWA method may not consider interdependencies between features. In cases where some features are correlated with each other, the GAWA method may select all features, resulting in a redundant feature set.
Lai et al. [19]	They verified that both ML models and neural network models implementing NLP have better performance than traditional ML models, CNN, and LSTM in both accuracy and precision.	Lack of consistency in writing patterns, stylistic analysis, and sentiment analysis can lead manipulators to intentionally use common words or synonyms to mimic legitimate texts.
Bonifazi et al. [20]	They extracted the sentiment scope of a user on any topic on the social platform Reddit and proposed an approach that expands the sentiment scope to integration with spatial and temporal scopes.	Authors argued that the main limitations of their approach include: 1. the interconnectivity between different social platforms is not considered; 2. the interference provided by different users potentially having specific sentiments regarding a topic on a given user cannot be analyzed.
Alghamdi et al. [24]	They proposed an approach that combines the bidirectional encoder representation from the transformer (BERT) and CNN to capture semantic and contextual information of a given news text using NLP and demonstrated the improvement of the performance of fake news detection.	The context-based information needs to incorporate style and sentiment analysis for better performance.
Trueman et al. [29]	They proposed the attention-based convolutional bidirectional long short-term memory (AC-BiLSTM) model, which can simultaneously capture the meaning of all sentences and memorize long input sequences. The hybrid model provides an impressive improvement in accuracy and F1-score in comparison with other existing SVM, CNN, and LSTM models.	The AC-BiLSTM may face limitations such as computational complexity, overfitting, and limited interpretability, which make it less suitable for some specific tasks.
Segura-Bedmar and Alonso-Bartolome [30]	They adopted both uni-modal (using only texts) and multi-modal (integrating texts and images) approaches to detect fake news on the Fakeddit dataset. The results revealed that BERT is the most accurate of the uni-modal models, while the CNN-based multi-modal approach obtains the best accuracy overall.	The proposed multi-modal approach does not involve the similarity and association detection methods.

Table 2. Statistical study for the NER-SA model.

Datasets	Training Set Accuracy	Test Set Accuracy	RMSE	R-Square	t Value
LIAR	0.6426	0.6218	0.4045	0.703	3.867 ***
Fake or real news	0.9382	0.9304	0.3640	0.814	4.748 ***
FakeNews AMT (Technology domain)	0.9463	0.9378	0.2454	0.837	4.506 ***
FakeNews AMT (Education domain)	0.9451	0.9365	0.2271	0.860	4.454 ***
FakeNews AMT (Business domain)	0.8926	0.8731	0.2689	0.815	3.980 ***
FakeNews AMT (Sports domain)	0.8705	0.8582	0.2904	0.796	4.014 ***
FakeNews AMT (Politics domain)	0.9502	0.9376	0.2487	0.842	4.621 ***
FakeNews AMT (Entertainment domain)	0.8677	0.8563	0.3104	0.775	3.859 ***

*** p < 0.001.

Table 3. The experimental results for the fake news detection of in-domain analysis.

Dataset	Metric	Machine Learning *: SVM (Lexical Features)	Machine Learning *: Naive Bayes (Bigram Feature)	Deep Learning *: CNN	Deep Learning *: LSTM	NER-SA Model (Average Increase %)
LIAR	A	0.56	0.60	0.58	0.54	0.62 (8.94%)
	P	0.56	0.59	0.58	0.29	0.61 (31.96%)
	R	0.56	0.60	0.58	0.54	0.61 (7.18%)
	F1	0.48	0.59	0.58	0.38	0.61 (24.04%)
Fake or real news	A	0.67	0.86	0.86	0.76	0.93 (19.36%)
	P	0.67	0.86	0.86	0.78	0.93 (18.58%)
	R	0.67	0.86	0.86	0.76	0.93 (19.36%)
	F1	0.67	0.86	0.86	0.76	0.93 (19.36%)

* The results of machine learning and deep learning models refer to Khan et al. [21].

Table 4. Experimental results for the fake news detection of cross-domain analysis.

Domain	Model 1 * (Readability Feature): A	Model 1 *: F1	Model 2 * (All Features): A	Model 2 *: F1	NER-SA: A (Average Increase %)	NER-SA: P	NER-SA: R	NER-SA: F1 (Average Increase %)
Technology	0.90	0.90	0.80	0.79	0.94 (10.97%)	0.84	0.96	0.90 (6.96%)
Education	0.84	0.84	0.84	0.84	0.94 (11.90%)	0.82	0.96	0.88 (4.76%)
Business	0.53	0.41	0.85	0.85	0.87 (43.97%)	0.83	0.88	0.85 (53.66%)
Sports	0.51	0.45	0.81	0.81	0.86 (50.18%)	0.80	0.88	0.84 (45.19%)
Politics	0.91	0.91	0.75	0.75	0.94 (14.32%)	0.86	0.96	0.91 (10.67%)
Entertainment	0.61	0.60	0.75	0.75	0.86 (39.72%)	0.82	0.86	0.84 (26.00%)

* The results of model 1 (readability feature) and model 2 (all features) refer to Pérez-Rosas et al. [22].

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tsai, C.-M. Stylometric Fake News Detection Based on Natural Language Processing Using Named Entity Recognition: In-Domain and Cross-Domain Analysis. Electronics 2023, 12, 3676. https://doi.org/10.3390/electronics12173676

