1. Introduction
Since the emergence of information distribution channels, the propagation of fake news has posed a persistent challenge to society. In the early days, responsible institutions and broadcasting agencies mitigated such issues through editorial checks and mutual trust. However, the introduction of the Internet brought an information explosion that made it difficult to isolate genuine content from malicious content. The abundance of information empowers users with a wealth of knowledge, but it has also led to the dissemination of false or misleading stories, commonly referred to as fake news. The impact of such fake news can be severe, affecting domains such as public health, politics, and social harmony. Meanwhile, generative artificial intelligence models such as large language models (LLMs) have recently become capable of producing coherent, human-like text. This supercharges the ability to produce large volumes of content but also makes its origin and veracity more ambiguous.
Although these models can be exploited by malicious actors to create deceptive narratives at scale, it is equally important to recognize that not all AI-generated material is necessarily fake or harmful. Nevertheless, the availability of tools like ChatGPT, Claude, and Gemini, and their open-source counterparts such as Llama, Mistral, and BLOOM, has provided bad actors with the ability to produce fake news at scale, amplifying the reach and speed of disinformation campaigns. Authorship, in particular, carries significant weight in our society because it confers responsibility. When a specific individual or organization claims authorship, they accept liability for the claims and perspectives they present. In a world where AI can rapidly generate large volumes of text, it becomes imperative to distinguish who (or what) authored a piece of content. Failing to do so could erode accountability, making it easier for malicious actors to publish fabricated stories without fear of repercussion.
This dynamic poses intensified concerns for a secure and intelligent civilization: we must ensure the authenticity of our information and hold its creators accountable. Such measures become especially crucial in contexts like smart cities, where digital infrastructure underpins public services, governance, and everyday citizen interactions. Fake or misleading information, whether produced by humans or AI, can disrupt civic processes, spread disinformation about local events, and undermine trust in essential urban systems. It is therefore vital to develop advanced detection frameworks that simultaneously assess the authenticity of a text (fake vs. real) and its authorship (human vs. AI). To contend with these intertwined problems, this study explores a unified strategy. We propose a single, integrated framework, called the Shared–Private Synergy Model (SPSM), which leverages advanced transformer-based architectures to analyze text from both perspectives simultaneously. In essence, the SPSM architecture shares certain layers that capture general linguistic cues while dedicating private layers to each classification objective, maximizing synergy across tasks without sacrificing task-specific specialization. In doing so, it addresses the pressing need for a holistic system that can efficiently detect fake news and clarify whether content is produced by humans or AI. By unifying these tasks within a single architectural framework that learns shared linguistic cues but specializes in each classification goal, practitioners can address both the factual reliability of content and its source. This dual approach has the potential to protect the integrity of the information that circulates in a smart city environment, enhancing public confidence and ensuring that accountability remains intact even as AI continues to evolve.
1.1. Interdependent Challenges
With the rise of AI-generated fake news, two main classification tasks arise within the domain of AI-aided content generation: (i) authenticity classification, which determines whether a piece of content is fake or real, and (ii) authorship classification, which determines whether it was written by a human or an AI.
Although these two tasks may have appeared unrelated in earlier eras, they are now correlated due to the increasing use of AI in online content generation. For instance, fake news is increasingly generated, or at least aided, by AI systems that can produce deceptive content at scale. Conversely, AI-generated content that is benign or journalistic in nature might still be “authored by AI” but not necessarily fake. Hence, knowledge of authorship is likely to reveal insights into the authenticity of content. In real-world scenarios, platforms or fact checkers may first discover that a large volume of suspicious text shares an AI fingerprint. This finding can then guide more in-depth authenticity checks. Moreover, if an established news source is known to produce mostly genuine (real) articles, discovering that certain content strays from the typical writing style of that source might indicate both potential AI involvement and deceptive intentions.
Beyond textual plausibility, generative models allow adversaries to tailor messages to specific audiences, weaving false narratives with alarming efficiency. Recognizing that text is AI-written raises the prior probability that it may be an attempt at mass-produced disinformation. Of course, legitimate AI usage (e.g., automated journalism for sports or financial updates) complicates the matter; therefore, we do not equate “AI-generated” with “fake”. Still, statistically, false or misleading content is more likely to exploit the scale and speed of AI generation.
1.2. Motivations
Escalation of Fake News: The global spread of misleading information has real-world consequences, ranging from undermining public health campaigns to fueling polarized politics. Accurate fake vs. real classification systems remain an open challenge as adversaries adapt new methods of content creation.
Rise of AI-Generated Text: Large language models are now capable of drafting entire articles that are difficult to distinguish from those authored by humans. Distinguishing between human and AI authorship has become essential to ensuring accountability.
Potential Synergy: If authorship classification can improve the accuracy of identifying misinformation (and vice versa), a hierarchical or multitask approach may outperform traditional single-task systems. Consequently, it is valuable to investigate whether incorporating both tasks, authenticity and authorship, within a unified or cascaded pipeline increases overall classification performance.
1.3. Research Objectives
In light of the challenges outlined above both in detecting misinformation (fake vs. real) and attributing authorship (human vs. AI), this work pursues the following primary objectives:
Dataset Creation and Curation: We assembled and refined a novel dataset that explicitly labels each article for both authenticity (fake vs. real) and authorship (human vs. AI). This dataset is drawn from multiple sources, ensuring a diversity of topics, writing styles, and content generation methods. By incorporating comprehensive synthetic data generation for both tasks, our dataset aims to facilitate richer experimentation and cross-task insights.
PLM-Based Investigations: We implement and compare single-task and novel multitask fine-tuning of pretrained language encoder models (e.g., BERT, DistilBERT, ELECTRA), examining how they perform on authenticity and authorship classification. Moreover, we aim to develop synergistic models capable of handling both tasks.
Ablation Studies: To identify the most critical model components, we conduct systematic ablations, removing or altering features (e.g., loss weighting, shared layers, specific stylometric inputs) and evaluating the resulting performance impact.
Hierarchical Modeling: We introduce and evaluate hierarchical pipelines in which the authorship task (human vs. AI) informs authenticity classification (fake vs. real). Our goal is to empirically validate the hypothesis that knowledge of authorship significantly aids the detection of fake vs. real content.
Stylometric Feature Analysis: We incorporate stylistic and linguistic features, such as reading ease, lexical diversity, and syntactic complexity, to probe whether certain characteristics are especially indicative of AI-generated text or fake news content.
LLM-Based Approaches: Finally, we assess zero-shot or few-shot prompt engineering with state-of-the-art LLMs, contrasting their classification capabilities (for both tasks) against fully fine-tuned models.
1.4. Contributions
Novel Dual-Labeled Dataset: We present a newly curated dataset that explicitly labels each article for both authenticity (fake vs. real) and authorship (human vs. AI), capturing diverse writing styles and domains.
Unified Dual-Task Framework: We introduce a method that simultaneously classifies text authenticity (fake vs. real) and authorship (AI vs. human), addressing overlapping challenges in today’s AI-driven information landscape.
SPSM: We propose a shared–private multitask framework design, in which common layers capture universal text patterns while specialized layers focus on each classification task. This structure demonstrates superior performance and interpretability.
Comprehensive Experimental Analysis: We offer exhaustive comparisons, including classical ML baselines, stylometric methods, PLM-based ablations, hierarchical setups, and prompt-based evaluations, culminating in a deeper understanding of the strengths and limitations of each approach.
The remainder of this paper is structured as follows.
Section 2 discusses the relevant literature on fake news detection, authorship attribution, and hierarchical/multitasking strategies.
Section 3 describes our dataset and preprocessing steps.
Section 4 outlines the novel methods we propose, PLM-based approaches, hierarchical setups, stylometric approaches, and LLM-based classification.
Section 5 further details the experiments including setup, baselines, and ablation studies.
Section 6 presents comprehensive results for each method, including performance tables, ablation findings, stylometry interpretations, and prompt-based outcomes.
Section 7 offers further analysis on experimental results that tie together the quantitative results with practical insights into intertask synergy and model efficacy. Finally,
Section 8 concludes this paper by summarizing the major findings and suggesting directions for future research.
2. Related Work
Research on misinformation detection has rapidly evolved alongside the growing availability of advanced natural language processing tools, introducing a wide spectrum of methods ranging from classical feature-based approaches to PLM-based neural models. At the same time, the authorship attribution domain, once focused on human stylometry, has pivoted toward recognizing machine-generated text, fueled by large language models capable of producing near-human writing. This section reviews key developments in fake news detection, AI authorship classification, popular architectures for interconnected NLP tasks, and zero-shot classifiers, including LLM-based prompt classifiers, providing context for our integrated approach to authenticity and authorship classification.
2.1. Fake News Detection
The detection of fake news has been extensively studied, with various machine learning (ML) and deep learning (DL) approaches proposed. Saleh et al. introduced OPCNN-FAKE, an optimized Convolutional Neural Network (CNN) that demonstrated superior performance over RNN, LSTM, and traditional ML methods in benchmark datasets, using TF-IDF and GloVe embeddings [
1]. Dou et al. proposed the User Preference-Aware Fake News Detection (UPFD) framework, which integrates user preferences and social context using graph neural networks (GNNs) to enhance detection accuracy [
2]. Another approach emphasized linguistic feature-based learning models, extracting textual and syntactic attributes to effectively classify fake news [
3]. Furthermore, a lightweight DL-based model was developed for the detection of fast fake news in cyber-physical social services, optimizing speed and accuracy for real-time applications [
4].
Alonso et al. (2021) [
5] highlight sentiment analysis as a key tool in identifying emotional manipulation, advocating for multilingual and multimedia capabilities in detection systems. Addressing domain-specific challenges, Nan et al. (2021) propose MDFEND [
6], a model that leverages domain-specific features to improve accuracy across multiple domains. Khanam et al. (2021) focus on supervised learning, emphasizing feature extraction and vectorization using tools like Python’s scikit-learn for precise classification [
7]. Other studies extend detection capabilities by integrating multimodal features such as text, metadata, and images through deep learning [
8].
A hybrid CNN-BiLSTM-AM model was proposed to detect fake news related to COVID-19, integrating convolutional and recurrent networks with attention mechanisms for feature extraction and classification [
9]. Another innovative model, QMFND, employed quantum-inspired multimodal fusion techniques, combining text, images, and metadata to enhance detection accuracy on social media platforms [
10]. A dual-emotion-based fake news detection framework utilized deep attention weight updates to capture emotional patterns from text, highlighting the role of sentiment in assessing news credibility [
11]. Lastly, a multimodal approach using data augmentation-based contrastive learning improved feature representation, leveraging complementary modalities to enhance detection robustness [
12]. These studies collectively underscore the evolution of fake news detection, moving toward sophisticated, domain-aware, and multimodal approaches.
2.2. Machine-Generated Content Attribution
Recent studies have explored challenges in distinguishing human-authored texts from machine-generated ones. Uchendu et al. investigated authorship attribution across tasks like identifying the generation model or distinguishing human from machine texts, highlighting the superior quality of GPT-2 and GROVER in evading detection [
13]. Clark et al. examined the human evaluation of generated texts, finding non-experts performed poorly in distinguishing GPT-3 outputs and achieved only marginal improvements with training, raising concerns about the reliability of human assessments [
14]. Guo et al. analyzed ChatGPT’s responses across domains, noting its gaps in contextual understanding compared to human experts, and proposed effective detection systems to address potential misuse [
15].
DetectGPT, introduced by Mitchell et al., leverages the curvature of a model’s probability function for zero-shot detection, improving accuracy without requiring additional training data or fine-tuning, and performs notably better in detecting LLM-generated fake news [
16]. Sadasivan et al. demonstrated vulnerabilities in current detectors, including watermarking techniques, by developing a paraphrasing attack that significantly reduces detection accuracy, raising concerns about detector reliability in practical scenarios [
17]. Schuster et al. explored stylometry’s limitations in distinguishing malicious from legitimate uses of LLMs, noting that language models maintain stylistic consistency, complicating the detection of machine-generated misinformation [
18].
Studies emphasize the growing complexity of detecting AI-generated texts due to advancements in paraphrasing and sophisticated language models. Krishna et al. demonstrated that paraphrasing can evade state-of-the-art detection methods like GPTZero and DetectGPT but proposed retrieval-based defenses as a potential solution [
19]. Orenstrakh et al. evaluated eight detection tools in educational contexts, highlighting limitations in accuracy and resilience, particularly when handling paraphrased or non-English content, and called for enhancements to better preserve academic integrity [
20]. Dergaa et al. explored ChatGPT’s role in academic writing, discussing its potential to enhance efficiency but also warning about risks to authenticity and credibility, stressing ethical guidelines and transparency [
21].
2.3. AI-Generated Misinformation and Multitask Approaches
Recent research underscores the escalating complexity of detecting and mitigating AI-generated misinformation. Zhou et al. analyzed AI-generated misinformation’s linguistic distinctiveness, finding it to be more detailed and emotionally engaging than human-crafted content, which often misleads existing detection systems [
22]. Shoaib et al. explored the implications of generative AI for creating deepfakes and misinformation, advocating for a multimodal defense framework integrating digital watermarking and machine learning [
23]. Kreps et al. highlighted AI’s capability to produce credible-sounding media at scale, raising concerns about its potential misuse in misinformation campaigns targeting public opinion [
24]. Demartini et al. proposed a hybrid human-in-the-loop approach to misinformation detection, combining AI scalability with human expertise to enhance reliability and fairness [
25].
Chen and Shu investigated the challenges posed by large language models (LLMs) such as ChatGPT, finding that misinformation generated by LLMs is harder to detect than human-written misinformation due to its deceptive styles and semantic preservation. Their taxonomy categorizes LLM misinformation based on types, sources, and intents, highlighting the need for robust detection mechanisms [
26]. Blauth et al. reviewed the malicious uses and abuses of AI, including misinformation, social engineering, and hacking, emphasizing the importance of global collaboration to develop effective mitigation strategies [
27]. Kumari et al. introduced a multitask learning framework that integrates novelty detection and emotion recognition to improve misinformation detection accuracy across multiple datasets [
28]. Choudhry et al. introduced an emotion-aware multitask approach using transfer learning to detect fake news and rumors, demonstrating that emotions serve as domain-independent features that enhance performance in cross-domain settings [
29]. Jing et al. proposed TRANSFAKE, a multimodal transformer-based model that integrates text, images, and user comments, leveraging multimodal interactions and sentiment variances to improve detection accuracy [
30]. Wu et al. highlighted the vulnerability of existing detectors to style-based attacks facilitated by large language models (LLMs). They proposed SheepDog, a robust, style-agnostic detector that prioritizes content over style, achieving resilience against LLM-empowered adversarial attacks [
31].
2.4. LLM-Based and Zero-Shot Detectors
Advancements in misinformation detection emphasize novel frameworks leveraging LLMs. Satapara et al. proposed an adversarial prompting approach to generate misinformation datasets with controlled factual inaccuracies, enabling improved training for detection models [
32]. Hu et al. introduced knowledgeable prompt tuning (KPT), incorporating external knowledge into prompt verbalizers to enhance zero-shot and few-shot text classification performance, a potential asset for misinformation detection [
33]. Thaminkaew et al. presented a prompt-based label-aware framework (PLAML) for multi-label text classification, integrating token weighting, label-aware templates, and dynamic thresholds to boost accuracy in few-shot settings [
34].
On the other hand, Hou et al. proposed PROMPTBOOSTING, a black-box approach utilizing Adaboost for ensemble learning, significantly improving computational efficiency and performance in few-shot classification tasks [
35]. Sun et al. introduced CARP, a reasoning-focused prompting strategy for large language models (LLMs), addressing complex linguistic phenomena in text classification and achieving state-of-the-art results on multiple benchmarks [
36]. Chen et al. developed Concept Decomposition (CD), a framework for interpretable text classification by aligning continuous prompts with human-readable concepts, enhancing both performance and explainability [
37]. Balkus and Yan explored GPT-3’s self-augmentation capabilities, demonstrating improved accuracy in short-text classification by generating high-quality training examples [
38]. Xie and Li presented DLM-SCS, leveraging discriminative language models for semantic consistency scoring, outperforming other prompt-based methods in few-shot classification [
39].
Overall, these studies underscore the increasing sophistication and scale of both fake news and AI-generated content, necessitating ever more capable detection frameworks. As PLM-based architectures gain prominence, researchers have combined traditional feature engineering (e.g., stylometric and sentiment cues) with modern deep learning (e.g., multimodal PLMs) to handle diverse sources, from textual posts to images and metadata. Meanwhile, prompt-based and zero-shot techniques leverage LLMs for flexible, domain-adaptive classification, although they face challenges such as prompt sensitivity and adversarial paraphrasing. Finally, multitask and hierarchical strategies demonstrate that integrating tasks (e.g., authorship and authenticity) or modalities (e.g., textual, social, multimodal features) can significantly strengthen misinformation detection. Building on these advancements, our research explores novel dataset construction, shared–private transformer layers, hierarchical pipelines, and stylometric analysis to further unify—and improve upon—the approaches highlighted in this section.
3. Dataset and Preprocessing
In real-world environments, datasets are often complex, encompassing diverse content across various domains and mixing multiple challenges such as authenticity and authorship. To better mimic these real-world conditions, we extend our dataset to include both fake vs. real and AI vs. human labels, thereby reflecting the multifaceted nature of modern information channels. We designate this extended resource as the Fused Authenticity–Authorship News Resource (FAANR) to underscore its dual focus on both content reliability and text origin.
The scarcity of comprehensive datasets tailored for detecting AI-generated fake news remains a significant challenge in advancing this field. While datasets such as LIAR [
40] and FakeNewsNet [
41] have been instrumental in detecting human-generated misinformation, they lack instances of AI-generated fake news, which limits the development of robust detection models capable of addressing this growing challenge. The majority of existing resources either focus on specific domains or omit key linguistic variations that can be exploited by modern LLMs [
42,
43].
To address these limitations, we developed a novel dataset by combining previously established human-generated news datasets across domains such as politics, health, and technology with AI-generated fake news. Using both open-source and proprietary LLMs, we designed innovative prompting strategies to create diverse and realistic fake news instances.
3.1. Data Collection and Generation
We aim to build a dataset (FAANR) that explicitly labels each article for both authenticity (fake vs. real) and authorship (human vs. AI), capturing a broad spectrum of writing styles and textual properties. The dataset created is an extension of two open-source fake news detection datasets, namely, the ISOT [
44,
45] and the news dataset labeled by the news media bias lab (nmbbias) [
46]. Combining these two sources, we obtain a diverse database of articles with two labels, fake and real, spanning domains from politics to healthcare. The ISOT and nmbbias datasets contain 44,898 and 9513 articles, respectively. The former covers global and US news from 2016 to 2017, and the latter covers election news articles from 2024. This provides data from both before and after the emergence of LLMs, yielding a balanced distribution in the time domain.
However, they do not adequately address the AI-generated aspect of fake news. To remedy this, we randomly selected a subset of 20,000 articles from the real set of the FAANR dataset. These articles were then fed into four LLMs, namely GPT-4o, Llama3.2, Gemma, and Mistral, to produce corresponding AI-generated articles using carefully designed prompts, as shown in
Table 1. To ensure carefully controlled prompts, in one subset the model was instructed to maintain factual accuracy and a coherent style, yielding AI Real; in the other subset, it was guided to create deliberately false or misleading information, forming AI Fake. Each LLM was given 5000 articles and produced both the real-phrased and fake-fabricated versions. Using multiple LLMs ensures diversity in the AI-authored content and prevents models from overfitting to the stylistic attributes of a single generator.
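To make this generation step concrete, the sketch below shows how such a rephrase-vs-fabricate loop can be driven with the OpenAI Python client (assumed here for illustration); the prompt strings are simplified placeholders rather than the exact templates of Table 1, and the same pattern applies to the locally hosted Llama3.2, Gemma, and Mistral models.

```python
# Illustrative generation loop for AI Real / AI Fake variants.
# Assumptions: openai>=1.x client, OPENAI_API_KEY in the environment; prompts are placeholders.
from openai import OpenAI

client = OpenAI()

REAL_PROMPT = "Rewrite the following news article in your own words while preserving every fact."
FAKE_PROMPT = "Rewrite the following news article, deliberately introducing false or misleading claims."

def generate_variant(article_text: str, mode: str, model: str = "gpt-4o") -> str:
    """Return an AI Real (mode='real') or AI Fake (mode='fake') version of a human-written article."""
    instruction = REAL_PROMPT if mode == "real" else FAKE_PROMPT
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": article_text},
        ],
    )
    return response.choices[0].message.content
```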
When generating AI content, we cross-checked linguistic and thematic variety, minimizing repetitive patterns that might bias subsequent classification tasks. This process resulted in a total of 83,068 articles, of which around 58% are human-generated and the rest are AI-authored.
Table 2 depicts the exact authorship source for the articles in the dataset. This approach allowed us to unify both human-origin (already labeled real/fake) and AI-origin (specially created) articles under a single dataset that covered four possible categories of interest.
3.2. Class Distribution and Preprocessing
After combining the prelabeled human-authored articles with the newly generated AI content, we shuffled and partitioned the dataset into train, validation, and test splits. To maintain balanced class distributions, we employed a stratification strategy, respecting both the real/fake and human/AI labels. As a result, each split contained a comparable proportion of Human Real, Human Fake, AI Real, and AI Fake samples. The final distribution is summarized in
Table 3, which reports the number of articles in each partition, the average word count per category, and the general proportions. Furthermore, in
Figure 1, bar graphs illustrate how the dataset is distributed between fake vs. real, AI vs. human, and different text sources. This visualization confirms that we have a substantial number of samples from each class, including those derived from GPT-4, Llama, Mistral, and a large body of human-written examples. Additionally,
Figure 2 displays a correlation heatmap that indicates how basic numeric variables, such as sentiment, text length, and the two labels (fake, is AI) relate to each other. Although the correlations remain relatively modest, the map offers hints that text length may differ slightly between AI and human samples, and sentiment could have a minor relationship with the fake vs. real label.
Before feeding these articles into our experiments, we applied a uniform preprocessing pipeline. Each article was lowercased and trimmed of extraneous whitespace, removing any non-ASCII artifacts that did not convey useful information. We retained punctuation that was potentially relevant to stylometric or linguistic features. For feature-based methods (e.g., stylometric analysis), a tokenization step (via Python’s NLTK or a simple whitespace tokenizer) was used to compute attributes such as word counts, sentence lengths, or readability metrics. In contrast, PLM-based models employed subword tokenizers (e.g., BertTokenizer) that automatically handled splits and special tokens. Finally, we excluded articles with too few tokens (e.g., fewer than 20) or containing repetitive filler text, ensuring that all remaining samples had meaningful linguistic content for downstream tasks.
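A minimal sketch of this preprocessing filter is given below, assuming NLTK and its punkt tokenizer data are installed; the repetitive-filler check is omitted for brevity, and the 20-token cutoff mirrors the threshold mentioned above.

```python
# Minimal preprocessing sketch: lowercase, strip non-ASCII artifacts and extra whitespace,
# and drop articles with fewer than 20 tokens.
import re
from nltk.tokenize import word_tokenize  # requires nltk plus the 'punkt' data package

MIN_TOKENS = 20

def preprocess(text: str):
    """Return the cleaned article, or None if it is too short to be useful."""
    text = text.lower()
    text = text.encode("ascii", errors="ignore").decode()  # remove non-ASCII artifacts
    text = re.sub(r"\s+", " ", text).strip()                # collapse whitespace (punctuation kept)
    tokens = word_tokenize(text)
    return text if len(tokens) >= MIN_TOKENS else None
```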
As part of an exploratory study on linguistic differences, we generated WordCloud visualizations (shown in
Figure 3) to compare the most frequent words in fake articles (left) versus real articles (right). Some terms, like “Trump” or “said”, appear prominently across both categories, underscoring ongoing political topics; however, the emphasis on certain key verbs or repeated names can differ in placement or frequency. Taken together, these observations, supported by the correlation heatmap, the distribution bar plots, and the WordClouds, help contextualize how authenticity and authorship may manifest in textual patterns. Our stratified train/validation/test splits in the ratio of 70/10/20, balanced across the four categories, ensure that subsequent experiments, whether using multitask or hierarchical models, can effectively learn from both AI-derived and human-authored content.
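As a sketch of the stratification described above, the snippet below splits a dataframe 70/10/20 while preserving the joint Human/AI and Real/Fake distribution; the file name and the 'fake' / 'is_ai' column names are assumptions for illustration.

```python
# Stratified 70/10/20 split over the four joint categories (file and column names assumed).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("faanr.csv")  # hypothetical export with 'fake' and 'is_ai' label columns
df["category"] = df["fake"].astype(str) + "_" + df["is_ai"].astype(str)

train_df, rest_df = train_test_split(df, test_size=0.30, stratify=df["category"], random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=2/3, stratify=rest_df["category"], random_state=42)
# -> ~70% train, ~10% validation, ~20% test, each balanced over Human Real / Human Fake / AI Real / AI Fake
```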
4. Proposed Approaches
Building on the challenges outlined in the earlier sections, we propose a series of novel solutions aimed at tackling the dual tasks of authenticity and authorship detection in a unified manner. Our goal is to move beyond standard single-task or naive multitask setups by introducing advanced architectures and training strategies that explicitly account for both shared linguistic cues and task-specific nuances. Key contributions include our Shared–Private Synergy Model (SPSM), hierarchical task designs, and adaptive loss weighting methods. These innovations ensure that each classification task benefits from the other’s insights while maintaining sufficient specialization to address its unique requirements. Below, we present the core components of our proposed approaches, detailing how they collectively enhance fake news and AI authorship detection.
4.1. PLM-Based Approaches
These approaches are built on modern pretrained encoders such as BERT, ELECTRA, or DistilBERT. Depending on the task setup, we perform single-task fine-tuning (for authenticity or authorship alone) or introduce a multitask framework (jointly learning both tasks). We first describe the single and naive multitask approaches to further introduce the novel SPSM architecture.
Regardless of the specific pretrained model (BERT, ELECTRA, DistilBERT), we can conceptualize it as a function
\[ \mathbf{H} = \mathrm{Encoder}(\mathbf{X}), \]
where \(\mathbf{X} = (x_1, \ldots, x_n)\) are the token embeddings for an input sequence and \(\mathbf{H}\) denotes the sequence of hidden states (or a single pooled embedding, depending on the architecture). Each model’s tokenizer transforms raw text into subword IDs, which feed into this encoder.
4.1.1. Single-Task Fine-Tuning
In the single-task approach, we follow a standard end-to-end fine-tuning procedure, targeting either authenticity (fake vs. real) or authorship (human vs. AI). Let \(\mathbf{h}_{\mathrm{[CLS]}}\) be the final hidden state corresponding to the special [CLS] token:
\[ \mathbf{h}_{\mathrm{[CLS]}} = \mathrm{BERT}(\mathbf{X})_{\mathrm{[CLS]}}, \]
where BERT denotes a BERT-base or BERT-large pretrained model. We then apply a single linear classification head:
\[ \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{W}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}), \]
where \(\mathbf{W}\) and \(\mathbf{b}\) are trainable parameters, mapping \(\mathbf{h}_{\mathrm{[CLS]}}\) to a probability distribution over the two classes, e.g., {fake, real}. We optimize the negative log-likelihood (cross-entropy) loss on the labeled dataset for that specific task. For instance, if \(\ell(\cdot,\cdot)\) is cross-entropy with ground truth \(y\), we minimize
\[ \mathcal{L}_{\mathrm{single}} = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{\mathbf{y}}_i, y_i). \]
Moreover,
Figure 4 depicts the separate single-task fine-tuning process: raw text tokenized into subwords, passed into BERT, and the final [CLS] embedding going through one linear head for classification. The figure highlights how only one output layer is trained for a single classification objective.
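The snippet below is a minimal PyTorch sketch of this single-task setup (class and variable names are ours for illustration, not the paper's code): the encoder's final [CLS] state feeds one linear head trained with cross-entropy.

```python
# Single-task fine-tuning sketch: BERT [CLS] embedding -> one linear classification head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SingleTaskClassifier(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]                      # final hidden state of the [CLS] token
        logits = self.head(cls)                 # W h_cls + b
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        return loss, logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SingleTaskClassifier()
batch = tokenizer(["example article text"], return_tensors="pt", truncation=True)
loss, logits = model(batch["input_ids"], batch["attention_mask"], labels=torch.tensor([1]))
```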
4.1.2. BERT Multitask Learning
When we wish to classify both authenticity and authorship simultaneously, we can adapt BERT to a multitask setup. Let \(\mathbf{h}_{\mathrm{[CLS]}}\) still represent the pooled embedding, but now we have two heads:
\[ \hat{\mathbf{y}}_{\mathrm{auth}} = \mathrm{softmax}(\mathbf{W}_{\mathrm{auth}}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_{\mathrm{auth}}), \qquad \hat{\mathbf{y}}_{\mathrm{fake}} = \mathrm{softmax}(\mathbf{W}_{\mathrm{fake}}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_{\mathrm{fake}}). \]
Here, \(\hat{\mathbf{y}}_{\mathrm{auth}}\) is the probability distribution over {human, AI} and \(\hat{\mathbf{y}}_{\mathrm{fake}}\) is over {real, fake}. The training objective combines both cross-entropy losses:
\[ \mathcal{L}_{\mathrm{multi}} = \mathcal{L}_{\mathrm{auth}} + \mathcal{L}_{\mathrm{fake}}, \]
where \(\mathcal{L}_{\mathrm{auth}}\) handles authorship classification and \(\mathcal{L}_{\mathrm{fake}}\) handles authenticity. This single [CLS]-based embedding thus influences two separate classification heads.
In place of BERT, we can load alternative pretrained encoders, ELECTRA and DistilBERT, while following the same multitask design. The only major difference lies in the underlying architecture of the encoder. For example, we have the following:
ELECTRA is pretrained with a “discriminator” objective that distinguishes real tokens from those replaced by a “generator”. At fine-tuning time, it also produces a hidden state for [CLS] that we map to our tasks.
DistilBERT is a lighter, distilled version of BERT with roughly 40% fewer parameters, but retaining enough representational power for many tasks. It omits the [CLS] pooled output layer in the original sense, so we often take the first token embedding or a special vector from the final layer.
Figure 5 illustrates this multitask approach. The same input tokens pass through the chosen encoder (BERT, ELECTRA, DistilBERT), producing a final hidden state for [CLS]. Two parallel heads transform that embedding into predicted labels: \(\hat{\mathbf{y}}_{\mathrm{auth}}\) for authorship and \(\hat{\mathbf{y}}_{\mathrm{fake}}\) for authenticity. The combined loss in Equation (6) backpropagates through shared parameters, encouraging the model to learn features that benefit both tasks.
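A minimal sketch of this two-head multitask model is shown below (names are illustrative); swapping the encoder for ELECTRA or DistilBERT only changes the model_name, with the first-token embedding used as the pooled representation as noted above.

```python
# Naive multitask sketch: one pooled [CLS] embedding shared by two linear heads.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskClassifier(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.authorship_head = nn.Linear(hidden, 2)    # human vs. AI
        self.authenticity_head = nn.Linear(hidden, 2)  # real vs. fake

    def forward(self, input_ids, attention_mask, authorship_labels=None, authenticity_labels=None):
        cls = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        auth_logits = self.authorship_head(cls)
        fake_logits = self.authenticity_head(cls)
        loss = None
        if authorship_labels is not None and authenticity_labels is not None:
            ce = nn.functional.cross_entropy
            loss = ce(auth_logits, authorship_labels) + ce(fake_logits, authenticity_labels)
        return loss, auth_logits, fake_logits
```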
4.1.3. Shared–Private Synergy Model (SPSM)
While standard multitask learning on a PLM encoder (e.g., BERT, ELECTRA, DistilBERT) places multiple classification heads atop the same final hidden representation, we innovatively extend this approach by introducing shared and private layers. In other words, beyond the base PLM encoder, we split the network into two sets of feed-forward layers:
Shared Layers (\(F_s\)): A stack of additional feed-forward blocks (or MLPs) that process the pooled embedding for all tasks, thus capturing common features relevant to both authenticity (fake vs. real) and authorship (human vs. AI).
Private Layers (\(F_p^{\mathrm{auth}}, F_p^{\mathrm{fake}}\)): For each task, there is a distinct feed-forward network that takes the output of the shared layers as input. Each private sub-network then specializes in distinguishing either human vs. AI or fake vs. real, preserving nuances unique to the respective classification objective.
Formally, let \(\mathbf{h}_{\mathrm{[CLS]}}\) be the final hidden state from the PLM. We first compute
\[ \mathbf{z}_s = F_s(\mathbf{h}_{\mathrm{[CLS]}}), \]
where \(F_s\) denotes \(k\) layers of feed-forward blocks (potentially interleaved with dropout and activation functions). We then branch into two private sub-networks:
\[ \mathbf{z}_{\mathrm{auth}} = F_p^{\mathrm{auth}}(\mathbf{z}_s), \qquad \mathbf{z}_{\mathrm{fake}} = F_p^{\mathrm{fake}}(\mathbf{z}_s), \]
where \(\mathbf{z}_{\mathrm{auth}}\) and \(\mathbf{z}_{\mathrm{fake}}\) are the private representations passed to task-specific classification heads. Finally, each task head predicts a probability distribution over its labels (human vs. AI or fake vs. real):
\[ \hat{\mathbf{y}}_{\mathrm{auth}} = \mathrm{softmax}(\mathbf{W}_{\mathrm{auth}}\,\mathbf{z}_{\mathrm{auth}} + \mathbf{b}_{\mathrm{auth}}), \qquad \hat{\mathbf{y}}_{\mathrm{fake}} = \mathrm{softmax}(\mathbf{W}_{\mathrm{fake}}\,\mathbf{z}_{\mathrm{fake}} + \mathbf{b}_{\mathrm{fake}}). \]
This design boosts accuracy via shared and private layers by capitalizing on two complementary ideas:
Shared Feature Extraction: By letting both tasks pass through \(F_s\), we encourage the model to learn common linguistic cues that might benefit both authorship and authenticity classification (e.g., capturing writing style, sentiment usage, or certain domain-specific patterns).
Task-Specific Nuance: The private layers preserve specialized features that differ between tasks (e.g., lexical signals of AI text vs. stylistic cues of misinformation). Thus, each private sub-network can focus on the subtleties of its classification goal without being overwhelmed by the competing task’s objectives.
In practice, we find that introducing shared and private layers significantly enhances performance compared to a simple “two-head” approach that shares all downstream parameters. Empirically, this hierarchy yields an improvement in accuracy/F1 because the shared block learns robust, general representations, whereas private blocks can refine them for authenticity or authorship independently. The overall architecture is depicted in
Figure 6. We detail quantitative improvements in the Results Section, showing that this SPSM strategy consistently outperforms both single-task and naive multitask (one [CLS] embedding, two linear heads) baselines.
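For concreteness, the following PyTorch sketch mirrors the shared–private structure just described; the layer widths, dropout rate, and loss weights are assumed values for illustration rather than the exact configuration reported in the paper.

```python
# SPSM sketch: a shared feed-forward block F_s followed by per-task private blocks F_p.
import torch.nn as nn
from transformers import AutoModel

class SPSM(nn.Module):
    def __init__(self, model_name="bert-base-uncased", shared_dim=512, private_dim=256,
                 w_auth=1.0, w_fake=1.0):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.shared = nn.Sequential(nn.Linear(hidden, shared_dim), nn.ReLU(), nn.Dropout(0.1))
        self.private_auth = nn.Sequential(nn.Linear(shared_dim, private_dim), nn.ReLU(), nn.Dropout(0.1))
        self.private_fake = nn.Sequential(nn.Linear(shared_dim, private_dim), nn.ReLU(), nn.Dropout(0.1))
        self.head_auth = nn.Linear(private_dim, 2)   # human vs. AI
        self.head_fake = nn.Linear(private_dim, 2)   # real vs. fake
        self.w_auth, self.w_fake = w_auth, w_fake    # per-task loss weights

    def forward(self, input_ids, attention_mask, authorship_labels=None, authenticity_labels=None):
        cls = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        z_s = self.shared(cls)                                   # shared representation
        auth_logits = self.head_auth(self.private_auth(z_s))     # private authorship branch
        fake_logits = self.head_fake(self.private_fake(z_s))     # private authenticity branch
        loss = None
        if authorship_labels is not None and authenticity_labels is not None:
            ce = nn.functional.cross_entropy
            loss = (self.w_auth * ce(auth_logits, authorship_labels)
                    + self.w_fake * ce(fake_logits, authenticity_labels))
        return loss, auth_logits, fake_logits
```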
4.2. Hierarchical Approaches
In addition to the shared–private multitask paradigm, we explore hierarchical strategies for learning authorship (human vs. AI) and authenticity (fake vs. real) in a sequential or cascade manner. Unlike the multitask approach—where both tasks are trained in parallel from a common [CLS] embedding—hierarchical setups explicitly make the output (or an intermediate representation) of the first task feed into the second task. This design can capture the intuitive notion that AI-generated articles may, statistically, be more likely to be fake or vice versa, thus exploiting potential dependencies between tasks.
4.2.1. Two-Stage Pipeline
In the two-stage pipeline, we first train or compute an authorship prediction for each article and then use that authorship signal (predicted logits or probabilities) as an additional input feature to the authenticity classifier. More concretely, we define two passes:
Authorship Pass: Given a PLM encoder (e.g., BERT) and a classification head for human vs. AI, we compute the authorship logits \(\mathbf{z}_{\mathrm{auth}} = \mathbf{W}_{\mathrm{auth}}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_{\mathrm{auth}}\). If we produce logits \(\mathbf{z}_{\mathrm{auth}}\), we can convert them to probabilities via softmax.
Authenticity Pass: We embed the predicted authorship distribution \(\hat{\mathbf{y}}_{\mathrm{auth}}\) (or the raw logits, \(\mathbf{z}_{\mathrm{auth}}\)) into the input for the second classification pass, e.g., concatenating or otherwise merging the predicted authorship probabilities with the PLM embedding. A second head, \(\mathbf{W}_{\mathrm{fake}}\), then performs a cross-entropy classification for fake vs. real.
Thus, the second stage explicitly conditions on the authorship model’s outcome. In practice, this approach can be implemented by first training an authorship model, saving predictions, and then injecting them as features (or using a single PLM forward pass that computes both). An outline of the approach is shown in
Figure 7. The advantage is that authorship predictions might highlight subtle cues (e.g., certain writing style or repeated tokens) that help the authenticity model, especially if human- vs. AI-authored texts differ in the likelihood of containing misinformation.
If we let \(\hat{\mathbf{y}}_{\mathrm{auth}}\) be the authorship classifier’s output distribution, the second classifier’s decision for fake vs. real is
\[ \hat{\mathbf{y}}_{\mathrm{fake}} = \mathrm{softmax}\big(\mathbf{W}_{\mathrm{fake}}\,[\mathbf{h}_{\mathrm{[CLS]}};\hat{\mathbf{y}}_{\mathrm{auth}}] + \mathbf{b}_{\mathrm{fake}}\big), \]
where we might incorporate \(\hat{\mathbf{y}}_{\mathrm{auth}}\) either by concatenation or by an auxiliary feed-forward layer that fuses the probabilities/logits into the authenticity embedding.
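As an illustration of this second stage, the sketch below fuses the saved stage-1 authorship probabilities with the PLM embedding before the authenticity head; the hidden size and fusion by concatenation are assumptions consistent with the description above.

```python
# Two-stage fusion sketch: concatenate stage-1 authorship probabilities with the [CLS] embedding.
import torch
import torch.nn as nn

class AuthenticityWithAuthorship(nn.Module):
    def __init__(self, hidden_size: int = 768, num_authorship_classes: int = 2):
        super().__init__()
        self.head = nn.Linear(hidden_size + num_authorship_classes, 2)  # real vs. fake

    def forward(self, cls_embedding, authorship_probs):
        fused = torch.cat([cls_embedding, authorship_probs], dim=-1)
        return self.head(fused)

# cls_embedding comes from a PLM forward pass; authorship_probs are the softmaxed predictions
# saved from the stage-1 human-vs-AI classifier.
```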
4.2.2. Single-Pass Cascade
Rather than explicitly saving intermediate outputs and re-feeding them, we can implement a single-pass hierarchical model that internally first computes authorship logits and then conditions authenticity logits on them. Concretely, let \(\mathbf{h}_{\mathrm{[CLS]}}\) be the pooled representation from the PLM, and define
\[ \mathbf{z}_{\mathrm{auth}} = \mathbf{W}_{\mathrm{auth}}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_{\mathrm{auth}}, \]
which is the authorship logit vector. We then transform \(\mathbf{z}_{\mathrm{auth}}\) into a hidden dimension equal to that of \(\mathbf{h}_{\mathrm{[CLS]}}\) or another suitable size, e.g.,
\[ \mathbf{u} = \mathrm{ReLU}(\mathbf{W}_{u}\,\mathbf{z}_{\mathrm{auth}} + \mathbf{b}_{u}). \]
Then, the concatenation \([\mathbf{h}_{\mathrm{[CLS]}};\mathbf{u}]\) is fed to the authenticity head:
\[ \hat{\mathbf{y}}_{\mathrm{fake}} = \mathrm{softmax}\big(\mathbf{W}_{\mathrm{fake}}\,[\mathbf{h}_{\mathrm{[CLS]}};\mathbf{u}] + \mathbf{b}_{\mathrm{fake}}\big). \]
Hence, the authenticity classification layer sees not only the base embedding but also authorship information encoded in \(\mathbf{u}\). The entire forward pass is differentiable end-to-end, allowing authorship gradients to propagate back into the encoder while also steering the second-stage authenticity head.
Figure 8 represents the overall flow of the cascade process and highlights the difference between the approaches.
If \(\mathcal{L}_{\mathrm{auth}}\) and \(\mathcal{L}_{\mathrm{fake}}\) are cross-entropy losses for each subtask (using authorship labels and authenticity labels, respectively), then the single-pass model minimizes
\[ \mathcal{L}_{\mathrm{cascade}} = \mathcal{L}_{\mathrm{auth}} + \mathcal{L}_{\mathrm{fake}}, \]
similar to a multitask scenario, but with an internal cascade from authorship logits into authenticity.
By enforcing this stepwise or cascade logic, hierarchical approaches may exploit the prior knowledge that AI-written texts differ in style or distribution and that misleading articles may be more prevalent among certain authorship types. Thus, the second task (authenticity) can condition on the first, often improving overall performance. Nevertheless, the results in the next section will confirm the extent to which hierarchical constraints outperform simpler multitask baselines.
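A minimal end-to-end sketch of the single-pass cascade follows; the projection width and module names are illustrative assumptions, but the flow matches the equations above (authorship logits are projected and concatenated with the pooled embedding before the authenticity head).

```python
# Single-pass cascade sketch: authorship logits condition the authenticity head in one forward pass.
import torch
import torch.nn as nn
from transformers import AutoModel

class CascadeModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased", proj_dim=64):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.auth_head = nn.Linear(hidden, 2)                              # authorship logits z_auth
        self.auth_proj = nn.Sequential(nn.Linear(2, proj_dim), nn.ReLU())  # u = ReLU(W_u z_auth + b_u)
        self.fake_head = nn.Linear(hidden + proj_dim, 2)                   # authenticity head over [h; u]

    def forward(self, input_ids, attention_mask, authorship_labels=None, authenticity_labels=None):
        cls = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        auth_logits = self.auth_head(cls)
        u = self.auth_proj(auth_logits)
        fake_logits = self.fake_head(torch.cat([cls, u], dim=-1))
        loss = None
        if authorship_labels is not None and authenticity_labels is not None:
            ce = nn.functional.cross_entropy
            loss = ce(auth_logits, authorship_labels) + ce(fake_logits, authenticity_labels)
        return loss, auth_logits, fake_logits
```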
4.3. Stylometric Feature Analysis
We additionally perform stylometric feature extraction to investigate whether certain lexical, syntactic, or readability attributes are indicative of human vs. AI authorship or fake vs. real authenticity. These features are inspired by prior studies in the domain of stylometric-based classification networks [
47], and each feature contributes to a comprehensive profile of the text, encapsulating various aspects of its structure and composition [
48]. Below, we first provide a concise list of the features and their mathematical definitions in
Table 4 and then offer an expanded description of each feature.
The following are the detailed explanations of the features:
Word Count, Unique Word Count, Character Count: These capture the basic lexical footprint of each text. Word Count (W) is the total token count, Unique Word Count (U) measures vocabulary breadth, and Character Count (C) reflects orthographic length.
TTR and Hapax Legomenon: TTR (Type-Token Ratio) gauges lexical diversity, while Hapax Legomenon checks how many tokens appear exactly once, possibly indicating subject-specific or AI-like lexical patterns.
Sentence Count, Average Sentence Length, Average Sentence Complexity: Each measure describes syntactic distribution. Sentence Count simply counts the total sentences (S), while Average Sentence Length and AvgSentenceComplexity parse how dense or frequent these sentences are relative to word counts.
Punctuation Count: This summarizes how frequently punctuation tokens appear, which can be distinctive for emotional or AI-generated texts (e.g., repetitive exclamation marks).
Noun Count, Verb Count, Adjective Count, Adverb Count: Part-of-speech tallies can reveal shifts in style or usage, e.g., certain content might rely heavily on adjectives for sensational language (common in fake news or AI descriptions).
Stopword Count: This measures how many tokens come from a standard stopword list (the, and, of), which can reflect typical English usage.
Complex Sentence Count: This tracks how many sentences have additional subordinate or conjunctive clauses, often flagged by relations like advcl or conj in dependency parsing.
Question Mark Count, Exclamation Mark Count: These tally rhetorical or emotional punctuation, potential hints of sensational or interactive text (AI prompts or clickbait).
Flesch Reading Ease, Gunning Fog Index: Classic readability formulas. Flesch Reading Ease uses average sentence length and average word length, whereas Gunning Fog Index leverages sentence complexity.
First Person Pronoun Count: This identifies I, we, our, etc., which can differ between personal or AI-generated writing.
Person Entity Count, Date Entity Count: Named Entity Recognition tallies, checking if the mention of people or dates correlates with authenticity or authorship cues.
Uniqueness Bigram, Uniqueness Trigram: The ratio of distinct bigrams/trigrams to total. High uniqueness might suggest creative or unusual phrasing; low uniqueness could indicate repeated patterns.
SyntaxVariety: The number of unique POS tags (e.g., NOUN, VERB, ADV), reflecting diversity in syntactic structures.
After computing these stylometric attributes, we scale the resulting feature vectors (e.g., via MinMax scaling) and feed them into classical ML algorithms or shallow neural networks. This approach can shed light on which linguistic cues dominate classification. For instance, when we use a Random Forest or Logistic Regression model, we can derive feature importance values or coefficient weights, thus explaining whether frequent punctuation, certain entity mentions, or high sentence complexity correlates strongly with fake vs. real or human vs. AI predictions. Consequently, stylometric analysis offers an explainable contrast to purely deep contextual embeddings, potentially highlighting domain-specific patterns that advanced PLMs might learn in a less interpretable way.
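The sketch below computes a representative subset of the Table 4 features with NLTK and textstat, scales them, and fits a Random Forest to expose feature importances; the toy texts and labels are placeholders for the FAANR splits, and the exact formulas in Table 4 may differ slightly from these library defaults.

```python
# Stylometric baseline sketch: a few Table 4 features -> MinMax scaling -> Random Forest importances.
import string
import numpy as np
import textstat
from nltk import word_tokenize, sent_tokenize, pos_tag   # requires punkt + POS tagger data
from nltk.corpus import stopwords                         # requires stopwords data
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

STOPWORDS = set(stopwords.words("english"))

def stylometric_features(text):
    words = word_tokenize(text)
    sents = sent_tokenize(text)
    tags = {tag for _, tag in pos_tag(words)}
    return [
        len(words),                                        # Word Count
        len(set(words)) / max(len(words), 1),              # Type-Token Ratio
        len(sents),                                        # Sentence Count
        len(words) / max(len(sents), 1),                   # Average Sentence Length
        sum(w in string.punctuation for w in words),       # Punctuation Count
        sum(w.lower() in STOPWORDS for w in words),        # Stopword Count
        textstat.flesch_reading_ease(text),                # Flesch Reading Ease
        textstat.gunning_fog(text),                        # Gunning Fog Index
        len(tags),                                         # SyntaxVariety (unique POS tags)
    ]

texts = ["A short real example article.", "A short fabricated example article!"]  # placeholders
labels = [0, 1]                                                                   # placeholders
X = MinMaxScaler().fit_transform(np.array([stylometric_features(t) for t in texts]))
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, labels)
print(clf.feature_importances_)
```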
In the next section, we compare these feature-based classifiers against our multitask PLM approaches. We find that certain stylometric cues (e.g., frequent first-person pronouns, low lexical diversity) can partially reveal AI authorship, while strong readability variation or emotional punctuation can hint at manipulative fake news. Ultimately, these stylometric signals can complement or explain model decisions, even if end-to-end PLMs yield higher raw accuracy.
4.4. Prompt-Based Classifiers
In addition to supervised neural models and stylometric approaches, we employ prompt-based classification with large language models, specifically GPT-4o and Llama3.2. Rather than training a classifier directly, we issue targeted system instructions and user prompts, asking the model to output a single label (AI vs. Human or Real vs. Fake). This leverages the LLM’s internal knowledge from pretraining, enabling classification in a near-zero-shot setting. Following are the LLMs we selected.
GPT-4o: A variant of GPT-4 accessed via a chat-completions API. It offers high-quality outputs at the expense of potential rate or cost limitations.
Llama3.2: An open-source (or locally hosted) model, also responding to system plus user prompts for classification tasks.
We craft two specialized tasks: AI vs. Human for authorship and Real vs. Fake for authenticity. Each prompt pair consists of a system instruction defining the classification role and a user prompt providing the article title and text and then requesting a direct label.
Table 5 summarizes these templates.
At inference, we insert an article’s {title} and {text} into the relevant user prompt and then send both system instruction and user prompt to GPT-4o or Llama3.2. The LLM responds with a single token (“AI” or “Human”, “Real” or “Fake”). Temperature is set to zero (temperature = 0), aiming for deterministic behavior. We record these outputs as predicted labels and evaluate them against ground truth.
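For illustration, the snippet below issues the authorship query to GPT-4o through the OpenAI Python client (an assumed interface; the Llama3.2 call is analogous); the system and user strings paraphrase, rather than reproduce, the templates in Table 5.

```python
# Prompt-based authorship classification sketch (zero-shot, temperature = 0).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

def classify_authorship(title: str, text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic single-label output
        messages=[
            {"role": "system",
             "content": "You decide whether a news article was written by a human or by an AI. "
                        "Answer with exactly one word: AI or Human."},
            {"role": "user", "content": f"Title: {title}\n\nText: {text}\n\nLabel:"},
        ],
    )
    return response.choices[0].message.content.strip()
```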
Compared to fine-tuned or feature-based models, LLM prompts require minimal additional data or training steps—each classification query essentially leverages the LLM’s pretraining. However, the method can be sensitive to prompt phrasing, and it may produce off-format answers if not carefully constrained. Nonetheless, prompt-based classification can be deployed quickly for new tasks, provided the instructions are unambiguous and the LLM comprehends the textual domain. We later discuss how these responses compare to supervised baselines in the results section, assessing whether LLMs can robustly infer authenticity and authorship from a straightforward question-and-answer prompt.
5. Experiments
To rigorously evaluate the proposed methods, we conduct a series of controlled experiments that aim to gauge effectiveness in both authenticity (fake vs. real) and authorship (human vs. AI) classification. In this section, we describe the experimental design, including data splits, training and validation protocols, hyperparameter configurations, and the metrics used for evaluation. We also provide details on ablation studies and other relevant implementation aspects. Following this comprehensive overview, we present all quantitative and qualitative findings in the subsequent Results Section, where we compare the performance of our novel approaches against relevant baselines and discuss the implications in detail.
5.1. Experimental Setup
All experiments were conducted on a system equipped with an NVIDIA A4000 GPU featuring 24 GB of VRAM and 2 virtual CPUs (vCPUs). This setup provided sufficient memory to handle large PLMs (e.g., BERT, DistilBERT, GPT-based architectures) and multitask or hierarchical pipelines without frequent out-of-memory errors. We relied on the Hugging Face Transformers library for model definitions, tokenization, and training routines, and we leveraged the built-in Trainer class to streamline fine-tuning, evaluation, and checkpoint management.
For single-task experiments, we typically used a batch size of 8, trained for 3 epochs, and set load-best-model-at-end = True to ensure that the best checkpoint according to validation loss was restored before final evaluation. The same batch size, number of epochs, and best-model restoration logic were applied to multitask and hierarchical models as well, though we occasionally experimented with ablation runs (e.g., removing stylometric features, adjusting task-specific loss weights) under similar or identical hyperparameters to maintain comparability. Our logging interval (logging-steps = 50) provided enough granularity to monitor training progress without incurring excessive overhead.
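The configuration below sketches these settings with Hugging Face TrainingArguments; note that the evaluation-strategy argument is named eval_strategy in recent Transformers releases and evaluation_strategy in older ones, and the output directory is a placeholder.

```python
# Trainer configuration mirroring the setup described above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",            # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",               # 'evaluation_strategy' in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,         # restore the best validation checkpoint before final evaluation
    metric_for_best_model="eval_loss",
    logging_steps=50,
)
# args is then passed to Trainer(model=..., args=args, train_dataset=..., eval_dataset=...)
# for the single-task, multitask, and hierarchical runs alike.
```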
Additionally, we occasionally invoked ablation protocols that removed or altered certain features (e.g., stylometric input, shared layers in a multitask architecture) while preserving the same batch size, number of epochs, and validation strategy. This consistency allowed us to draw fair comparisons across different approaches without conflating changes in hyperparameters with changes in model design.
5.2. Traditional Baseline Using Classical ML Methods
Our traditional (baseline) approach employs bag-of-words representations with TF-IDF weighting to classify texts as either fake vs. real (authenticity) or human vs. AI (authorship). In our TF-IDF baseline, each document is first transformed into a high-dimensional vector \(\mathbf{x} \in \mathbb{R}^{d}\), where \(d\) corresponds to the size of the TF-IDF feature space (e.g., unigrams or n-grams). We then feed \(\mathbf{x}\) into one of several classical machine learning classifiers to predict the label \(y\) (e.g., fake vs. real or human vs. AI). An overview of the approach is depicted in Figure 9. Specifically, we explore the following algorithms:
First, we consider Logistic Regression. Given a weight vector \(\mathbf{w}\) and bias \(b\), logistic regression models the probability that \(y = 1\) as
\[ P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b), \]
where \(\sigma(z) = 1/(1 + e^{-z})\) is the sigmoid function. The prediction \(\hat{y}\) is then obtained by thresholding \(P(y = 1 \mid \mathbf{x})\) at 0.5.
Next, Random Forest is an ensemble of \(M\) decision trees. Each tree is trained on a bootstrapped sample of the data and uses a subset of features to split nodes. The final classification is determined by majority vote among the \(M\) trees:
\[ \hat{y} = \mathrm{mode}\{T_1(\mathbf{x}), T_2(\mathbf{x}), \ldots, T_M(\mathbf{x})\}. \]
Further, an SVC attempts to find a maximum-margin hyperplane in \(\mathbb{R}^{d}\). In its basic linear form,
\[ \hat{y} = \mathrm{sign}(\mathbf{w}^{\top}\mathbf{x} + b), \]
though we often employ a kernelized SVC for non-linear decision boundaries (e.g., the RBF kernel).
Also, Multinomial Naive Bayes assumes word occurrences follow a multinomial distribution and treats features independently given the class. For a vocabulary of size \(d\), the posterior probability of class \(k\) given \(\mathbf{x}\) is
\[ P(y = k \mid \mathbf{x}) \propto P(y = k)\prod_{j=1}^{d} P(w_j \mid y = k)^{x_j}, \]
where \(x_j\) is the count (or TF-IDF weight) of the \(j\)-th token and \(P(w_j \mid y = k)\) is estimated from training data.
Moreover, XGBoost is a gradient boosting framework that incrementally fits an ensemble of decision trees to minimize a specified loss (often the logistic loss for classification). Let \(F_{m-1}(\mathbf{x})\) be the ensemble’s prediction after \(m-1\) trees; the \(m\)-th tree is fitted to the negative gradient of the loss with respect to \(F_{m-1}(\mathbf{x})\), producing \(h_m(\mathbf{x})\). The model update is then
\[ F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta\, h_m(\mathbf{x}), \]
where \(\eta\) is the learning rate. The final prediction is obtained by the sign or threshold of \(F_M(\mathbf{x})\) after \(M\) trees.
All of these methods take as input the same TF-IDF vectors derived from lowercased, tokenized text. We tune hyperparameters (e.g., regularization strength in logistic regression, max depth in XGBoost) via a validation set to maximize classification performance on each specific task (authenticity or authorship). Despite the more limited representational capacity of these models compared to deep neural networks, they serve as a computationally efficient and interpretable baseline that often performs competitively on text classification tasks involving TF-IDF features.
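A compact scikit-learn sketch of this baseline is shown below for the authenticity task; the n-gram range, feature cap, and placeholder texts/labels are illustrative assumptions, and the authorship task reuses the same pipeline with human/AI labels.

```python
# TF-IDF + classical classifier baseline sketch (Logistic Regression shown; others swap in).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train_texts = ["a genuine news report about local policy", "a fabricated claim about a cure"]  # placeholders
train_labels = [0, 1]  # 0 = real, 1 = fake; replace with the FAANR train split

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2), max_features=50000)),
    ("clf", LogisticRegression(max_iter=1000)),  # or RandomForestClassifier, SVC, MultinomialNB, XGBClassifier
])

pipeline.fit(train_texts, train_labels)
print(classification_report(train_labels, pipeline.predict(train_texts)))  # evaluate on the test split in practice
```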
5.3. Ablation Studies
A key contribution of our approach is the flexible architecture that can toggle on or off: (i) the shared layer after the PLM, (ii) the private layers for each task, and (iii) different loss weights for authenticity (fake vs. real) and authorship (human vs. AI). To systematically evaluate which components and hyperparameters lead to gains, we conduct a set of ablation experiments, modifying the configuration in the following ways:
Shared Layer On/Off: When the shared layer is used, we insert an additional linear + ReLU block to transform the pooled BERT output (\(\mathbf{h}_{\mathrm{[CLS]}}\)). If set to False, the model directly passes \(\mathbf{h}_{\mathrm{[CLS]}}\) into either private layers or heads without the extra shared projection.
Mathematically, if the shared layer is used, we compute
\[ \mathbf{z}_s = \mathrm{ReLU}(\mathbf{W}_s\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_s); \]
otherwise we set \(\mathbf{z}_s = \mathbf{h}_{\mathrm{[CLS]}}\) directly.
Private Layers On/Off: If private layers are turned on, each task obtains its own feed-forward (FF) head, e.g., \(F_p^{\mathrm{auth}}\) for authorship and \(F_p^{\mathrm{fake}}\) for authenticity. Otherwise, the model simply applies two linear heads (one for authenticity, one for authorship) directly to \(\mathbf{z}_s\). In the private layers scenario,
\[ \mathbf{z}_{\mathrm{auth}} = F_p^{\mathrm{auth}}(\mathbf{z}_s), \qquad \mathbf{z}_{\mathrm{fake}} = F_p^{\mathrm{fake}}(\mathbf{z}_s), \]
which then feed two distinct classifier heads. Disabling private layers merges the two tasks’ feed-forward paths so that each classification head sees the same representation.
Loss Weight Modifications: By varying the authenticity loss weight (\(\lambda_{\mathrm{fake}}\)) and the authorship loss weight (\(\lambda_{\mathrm{auth}}\)), we emphasize one task’s loss over the other. Our total loss is
\[ \mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{fake}}\,\mathcal{L}_{\mathrm{fake}} + \lambda_{\mathrm{auth}}\,\mathcal{L}_{\mathrm{auth}}, \]
where each \(\mathcal{L}\) represents cross-entropy for its respective output logits. Typically, \(\lambda_{\mathrm{fake}} = \lambda_{\mathrm{auth}} = 1\) indicates equal importance, but we test an ablation where authenticity is weighted more heavily.
For implementation, each ablation run instantiates the base BERT model and applies the corresponding toggles (a minimal configuration sketch follows below). During training, the forward pass produces both authenticity logits and authorship logits from whichever structure emerges after toggling. These logits go into separate cross-entropy losses, scaled by \(\lambda_{\mathrm{fake}}\) or \(\lambda_{\mathrm{auth}}\), then summed. This design lets us confirm whether the shared block or private layers significantly boost synergy between tasks.
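The configuration sketch below enumerates these toggles; the variant names follow Table 6, while the dictionary keys are illustrative and map onto constructor arguments of an SPSM-style model such as the one sketched earlier.

```python
# Ablation toggle sketch: each variant changes only the architecture flags or loss weights.
ABLATIONS = {
    "full_model":           dict(use_shared=True,  use_private=True,  w_fake=1.0, w_auth=1.0),
    "no_shared_layer":      dict(use_shared=False, use_private=True,  w_fake=1.0, w_auth=1.0),
    "no_private_layers":    dict(use_shared=True,  use_private=False, w_fake=1.0, w_auth=1.0),
    "authenticity_loss_2x": dict(use_shared=True,  use_private=True,  w_fake=2.0, w_auth=1.0),
}

for name, cfg in ABLATIONS.items():
    # every run keeps the same training hyperparameters (3 epochs, batch size 8); only cfg changes
    print(name, cfg)
```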
Table 6 summarizes the ablation variants we explore. In practice, each variant is launched with the same training hyperparameters (3 epochs, batch size 8, and the same learning rate), and a final evaluation is performed on test data for both authorship and authenticity. We detail performance differences in the Results Section.
Table 6 displays the four main variations: (1) full_model sets everything on with equal weighting, (2) no_shared_layer removes the additional linear + ReLU block, (3) no_private_layers omits separate feed-forward transformations for each task, and (4) authenticity_loss_2x doubles the weighting for fake vs. real. By evaluating test accuracy and F1 for each configuration, we ascertain the individual impact of each design element (shared layer, private layers, task weighting) on overall performance.
We anticipate that no_shared_layer and no_private_layers might hinder the model’s ability to capture cross-task or task-specific nuances, thus lowering the synergy between authorship and authenticity. On the other hand, multiplying the authenticity loss weight by 2.0 in authenticity_loss_2x should strengthen fake vs. real classification, albeit possibly at the cost of slightly weaker human vs. AI metrics. Results from these ablation runs, compared to full_model, clarify how each structural or weighting choice contributes to the model’s final accuracy and F1-scores.
6. Results
In this section, we present our empirical findings for classifying authenticity (fake vs. real) and authorship (human vs. AI). Our experiments encompass a broad range of models, starting with baselines, which include not only straightforward approaches (e.g., classical machine learning pipelines) but also some of the most recent and best performing models in the domain. Further, we move into PLM-based approaches, including single-task and multitask setups. We then perform ablation studies to measure how toggling key architectural elements (such as shared or private layers) affects performance on both tasks.
Next, we explore a stylometric analysis, relying solely on handcrafted linguistic features to highlight whether explicit lexical and syntactic cues can match or complement the accuracy of end-to-end neural methods. Finally, we evaluate prompt-based classification with large language models, comparing how zero- or few-shot queries to GPT-4o and Llama3.2 stack up against our supervised or feature-driven methods. Throughout these experiments, we report both accuracy and F1-scores (macro or weighted) to account for any class imbalance in real vs. fake or human vs. AI. In some cases, we also examine precision, recall, and confusion matrices to clarify which error patterns arise most often.
By contrasting all these results, we aim to identify which techniques effectively capture the nuances of authorship and authenticity and which design decisions—hierarchical constraints, prompt-based questions, or stylometric signals—meaningfully enhance performance and explainability. In the paragraphs that follow, we present each method’s outcomes and discuss their comparative advantages.
6.1. Baseline Results
We begin our empirical evaluation with traditional machine learning baselines, which leverage a TF-IDF feature pipeline to classify authenticity (fake vs. real) and authorship (human vs. AI). Our evaluation initially considers five classifiers: Logistic Regression, Random Forest, SVC, Multinomial Naive Bayes, and XGBoost (XGB). Their performance—reported in terms of accuracy, precision, recall, and F1-scores (both weighted and macro)—is summarized in
Table 7.
In addition to these traditional methods, we extend our analysis to include several state-of-the-art models. First, we evaluate graph neural network (GNN)-based algorithms [49] (GCN, GraphSAGE, GIN, and GAT), which treat TF-IDF vectorized articles as nodes in a synthetic graph to capture inter-document relationships. We then consider deep learning architectures from the recent literature [50,51]: CNN, LSTM, and a bi-LSTM attention-based model. In these neural models, fastText [52] embeddings are used for tokenization and feature representation. Finally, for authorship detection, we incorporate two popular machine-generated text (MGT) detection algorithms, GPTZero [53] and DetectGPT [16].
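The sketch below indicates how such a document-graph baseline might be assembled; the k-nearest-neighbour cosine-similarity edges and the two-layer GCN are assumptions for illustration and do not necessarily mirror the exact graph construction of [49].

```python
# Hedged sketch of a document-graph baseline: articles as nodes with TF-IDF features,
# connected by kNN cosine-similarity edges (an assumption; the original construction may differ).
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.neighbors import kneighbors_graph
from torch_geometric.nn import GCNConv

def build_edge_index(tfidf_matrix, k=5):
    # connect each document to its k most similar neighbours under cosine distance
    adj = kneighbors_graph(tfidf_matrix, n_neighbors=k, metric="cosine")
    rows, cols = adj.nonzero()
    return torch.tensor(np.vstack([rows, cols]), dtype=torch.long)

class DocGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=128, num_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):          # x: dense TF-IDF node features
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)        # per-document class logits
```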
Table 7 reveals several trends. Among the traditional baselines, Logistic Regression and XGB yield robust performance, achieving authenticity accuracies of 89.16% and 88.39% and authorship accuracies of 90.91% and 90.21%, respectively. In contrast, deep learning models significantly outperform these methods: the CNN, LSTM, and bi-LSTM attention networks reach authenticity accuracies of 92.92%, 93.37%, and 94.61% and authorship accuracies of 94.53%, 95.20%, and 95.17%, respectively. These results indicate that convolutional and recurrent architectures—especially when augmented with attention mechanisms—are highly effective at capturing complex textual patterns.
On the other hand, the GNN-based models (GCN, GraphSAGE, GAT, and GIN) exhibit considerably lower performance, with authenticity accuracies hovering around 54% and authorship accuracies ranging between 41.47% and 58.53%. This suggests that representing articles as nodes with TF-IDF features and synthetic connections does not capture the nuanced relationships required for accurate classification in these tasks. For authorship detection specifically, the specialized MGT detection algorithms—GPTZero and DetectGPT—yield much lower accuracies (64.74% and 49.89%, respectively). Their subpar performance underscores the challenges these approaches face when distinguishing human-written texts from AI-generated content, as compared to both traditional and deep learning classifiers.
Overall, while traditional methods such as Logistic Regression and XGB remain competitive, the deep learning architectures, particularly the bi-LSTM attention model, deliver superior performance across both authenticity and authorship classification tasks.
6.2. PLM-Based Approaches
In the prior literature, the best-performing strategies for both authenticity and authorship classification have been based on pretrained language models (PLMs) [54,55]. In our work, we have incorporated all of these top-performing strategies to ensure a comprehensive comparison. We now present our PLM-based approaches for both tasks.
Table 8 shows the accuracy, precision, recall, and F1-scores (weighted and macro) achieved by various configurations, including single-task versus multitask fine-tuning, a shared–private sub-network (SPSM) approach, hierarchical variants, and an authenticity_loss_2x configuration that emphasizes the authenticity objective.
The SPSM surpasses naive multitask or separate training (96.97% vs. 95–96% authenticity), indicating that partial parameter sharing plus private sub-networks effectively capture cross-task cues.
Notably, authenticity_loss_2x improves authenticity accuracy to 97.09%—a jump from the original BERT metrics—by weighting authenticity’s cross-entropy loss more heavily. Though this also slightly elevates authorship performance (98.92% accuracy), the difference is smaller, highlighting how scaling one task’s objective can yield an overall synergy without excessively penalizing the other task.
Meanwhile, DistilBERT (Multitask) and ELECTRA (Multitask) remain competitive, with ELECTRA reaching 96.84% for authenticity and 98.80% for authorship. Finally, the Hierarchical A and B variants show how authorship signals can be used to drive authenticity classification in a single-pass or two-stage approach, with Hierarchical B surpassing basic multitask BERT in authenticity by about 0.26 percentage points (96.20% vs. 95.94%).
Overall, these results reinforce that multitask architectures leveraging shared representation and task-specific specialization can exceed simpler single-task or naive two-head solutions. Moreover, adjusting loss weights (e.g., doubling authenticity’s contribution) can significantly boost that particular task’s performance without unduly harming the other classification goal.
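To make the shared–private design concrete, the following sketch shows one way such a multitask head can sit on top of a BERT encoder; the layer sizes, pooling choice, and loss weights are illustrative assumptions rather than the tuned configuration reported in Table 8. Setting authenticity_weight to 2.0 in this sketch corresponds to the authenticity_loss_2x configuration discussed above.

```python
# Hedged sketch of a shared–private multitask head on a BERT encoder (illustrative).
import torch
import torch.nn as nn
from transformers import AutoModel

class SPSM(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", hidden=768,
                 authenticity_weight=1.0, authorship_weight=1.0):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # shared feed-forward block capturing cross-task linguistic cues
        self.shared = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        # private per-task sub-networks preserving task-specific capacity
        self.private_authenticity = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.private_authorship = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.head_authenticity = nn.Linear(hidden, 2)   # fake vs. real
        self.head_authorship = nn.Linear(hidden, 2)     # human vs. AI
        self.w_auth, self.w_author = authenticity_weight, authorship_weight
        self.ce = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, y_auth=None, y_author=None):
        pooled = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        shared = self.shared(pooled)
        logits_auth = self.head_authenticity(self.private_authenticity(shared))
        logits_author = self.head_authorship(self.private_authorship(shared))
        loss = None
        if y_auth is not None and y_author is not None:
            loss = (self.w_auth * self.ce(logits_auth, y_auth)
                    + self.w_author * self.ce(logits_author, y_author))
        return logits_auth, logits_author, loss
```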
6.3. Ablation Studies
To isolate the impact of shared layers, private layers, and loss weighting, we carry out a series of ablation experiments.
Table 9 summarizes how toggling each component affects authenticity (fake vs. real) and authorship (human vs. AI) metrics. Below, we discuss the four configurations:
full_model: It enables both shared and private layers with the default, equal loss weighting (a weight of 1.0 on each task's loss term).
no_shared_layer: It disables the shared feed-forward transformation while keeping private layers.
no_private_layers: It retains the shared layer but removes per-task private layers, mapping the shared representation directly to logits.
authenticity_loss_2x: It doubles the authenticity loss weight (from 1.0 to 2.0), amplifying the fake vs. real objective (a minimal configuration sketch follows this list).
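The configuration sketch below maps these four variants to concrete toggles; the flag names are hypothetical and chosen purely to make the ablation switches explicit.

```python
# Illustrative mapping of the ablation configurations to model/loss settings.
from dataclasses import dataclass

@dataclass
class AblationConfig:
    use_shared_layer: bool = True
    use_private_layers: bool = True
    authenticity_weight: float = 1.0
    authorship_weight: float = 1.0

CONFIGS = {
    "full_model":           AblationConfig(),
    "no_shared_layer":      AblationConfig(use_shared_layer=False),
    "no_private_layers":    AblationConfig(use_private_layers=False),
    "authenticity_loss_2x": AblationConfig(authenticity_weight=2.0),
}
```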
The full_model configuration achieves 96.95% authenticity accuracy and 98.53% authorship accuracy, validating that including both shared and private layers balances synergy and task-specific nuance. no_shared_layer slightly reduces authenticity accuracy to 96.71%, indicating that the additional feed-forward block does help the general representation. Interestingly, no_private_layers yields metrics identical to the full model in this case, suggesting that the shared feed-forward transformation alone can suffice under certain data conditions.
Finally, authenticity_loss_2x raises authenticity accuracy to 97.09% while also nudging authorship up to 98.92%, showing that boosting one task's loss weight (here, doubling the authenticity term) can still improve joint performance without noticeably harming the secondary objective. This further underscores that controlling relative loss terms can help a multitask network prioritize the more difficult task (fake vs. real in this scenario) while retaining high performance on authorship.
Figure 10 presents a superimposed radar chart comparing the performance of four model variants across multiple authenticity and authorship metrics. Each axis represents an individual metric, with higher values toward the outer edge indicating better performance. To accentuate subtle differences, the radial axis is limited to the range 96–100. The authenticity_loss_2x variant, highlighted in purple, demonstrates consistently strong performance across all metrics. In contrast, the ablated variants—where either shared or private layers are removed—exhibit declines in specific areas. For example, the removal of shared layers primarily affects authenticity metrics, suggesting that shared representations are critical for accurate authenticity classification. Similarly, the drop in authorship-related performance observed in the variant lacking private layers indicates that these layers are important for capturing nuances required to distinguish between authors.
6.4. Stylometric-Based Approaches
We next evaluate how well stylometric feature vectors (Section 4.3) alone can classify authenticity (fake vs. real) and authorship (human vs. AI). Table 10 presents performance for a shallow neural network (Stylometric NN) and three classical ML algorithms (Logistic Regression, Random Forest, SVC) trained solely on handcrafted linguistic attributes (e.g., word count, TTR, punctuation usage, reading ease).
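The following sketch illustrates a stylometric feature extractor of this kind; the feature subset shown (word count, type–token ratio, punctuation usage, reading ease) is representative rather than exhaustive, and the readability score assumes the third-party textstat package.

```python
# Hedged sketch of a stylometric feature extractor (a representative subset of features).
import string
import textstat

def stylometric_features(text: str) -> dict:
    tokens = text.split()
    n_words = len(tokens)
    type_token_ratio = len({t.lower() for t in tokens}) / max(n_words, 1)
    punct_count = sum(text.count(p) for p in string.punctuation)
    return {
        "word_count": n_words,
        "avg_word_length": sum(len(t) for t in tokens) / max(n_words, 1),
        "type_token_ratio": type_token_ratio,
        "exclamation_marks": text.count("!"),
        "question_marks": text.count("?"),
        "punctuation_per_word": punct_count / max(n_words, 1),
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
    }
```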
For authenticity, both the shallow NN and Random Forest surpass 86% accuracy, suggesting that certain textual cues (e.g., average word length, punctuation usage, or entity counts) correlate with whether an article is real or fake. In contrast, Stylometric LR lags behind, possibly due to linear constraints on feature interactions. Meanwhile, Stylometric SVC and Stylometric RF offer comparable authenticity metrics (around 85–86%), showing that tree-based or margin-based classifiers can partially exploit these linguistic signals.
For authorship, performance is notably lower—ranging from 55–58% for the classical methods to 85.06% for the NN. This indicates that identifying AI vs. human texts purely from stylometric attributes is more challenging than detecting factual consistency. The shallow NN, possibly benefiting from non-linear transformations of the stylometric features, outperforms the classical ML models by a wide margin. This underscores that while stylometry alone may capture some aspects of AI writing, a more flexible (e.g., neural) model is needed to glean the subtler patterns.
Although stylometric classifiers remain behind PLM-based pipelines (Section 6.2), they provide interpretability and computational efficiency. Feature importances (e.g., from Random Forest or logistic coefficients) can explain whether, for instance, an unusual ratio of first-person pronouns or a high Gunning Fog Index signals AI authorship or misinformation. As a standalone method, stylometry yields moderate success—particularly for authenticity—but struggles more with authorship classification, possibly because advanced LLMs adopt varied and increasingly human-like styles. Nonetheless, these results highlight how explicit linguistic features can still distinguish certain phenomena in real vs. fake or human vs. AI content.
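As a simple illustration of this interpretability, the sketch below reads the top importances off a Random Forest; the random toy data stand in for the real stylometric feature matrix.

```python
# Minimal sketch of extracting interpretable importances from a stylometric Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_features(clf, feature_names, k=10):
    order = np.argsort(clf.feature_importances_)[::-1][:k]
    return [(feature_names[i], float(clf.feature_importances_[i])) for i in order]

# Toy illustration in place of the real stylometric matrix:
feature_names = ["word_count", "type_token_ratio", "exclamation_marks", "flesch_reading_ease"]
X = np.random.rand(200, len(feature_names))
y = np.random.randint(0, 2, size=200)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(top_features(rf, feature_names, k=4))
```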
6.5. Prompt-Based Classification
Finally, we explore prompt-based classification using large language models, specifically a version of GPT (labeled Prompt classifier gpt) and Llama3.2 (Prompt classifier llama). Rather than fine-tuning, each article’s title and text are fed into a carefully designed system plus user prompt (Section 4.4), and the model returns a short label (Real or Fake, Human or AI). Table 11 lists the accuracy, precision, recall, and F1-scores (weighted and macro) for both authenticity and authorship tasks under these zero-/few-shot paradigms.
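The sketch below shows the general shape of such a prompt-based classifier using the OpenAI chat-completions interface; the prompt wording is illustrative only (the actual prompts are described in Section 4.4), and an analogous call can be issued to a locally hosted Llama3.2 model.

```python
# Hedged sketch of zero-shot authenticity classification via a system + user prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify_authenticity(title: str, text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a fact-checking assistant. Answer with a single word: Real or Fake."},
            {"role": "user",
             "content": f"Title: {title}\n\nArticle: {text}\n\nIs this article Real or Fake?"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```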
For authenticity, the GPT-based prompt classification achieves a respectable 88.95% accuracy, with strong precision and recall. However, the same GPT prompt falls sharply on authorship accuracy (51.86%), suggesting the model struggles to discern subtle stylistic cues of AI vs. human text under the provided instructions. Conversely, Llama3.2 yields only 73.08% authenticity accuracy but obtains a slightly higher authorship accuracy (52.77%), though with lower F1 metrics due to class imbalances and off-format answers in some cases.
These results confirm that while LLM prompts can perform reasonably well on simpler tasks (like identifying factual consistency), distinguishing human vs. AI writing may require more carefully engineered prompts or additional context. The typical zero-shot or few-shot setup also lacks the domain-specific tuning that our supervised or multitask models enjoy. Nevertheless, such prompt-based classifiers remain appealing for quick deployment, especially if training data or computation for fine-tuning are limited, and can serve as a flexible baseline for new tasks with minimal overhead.
7. Further Analysis of Experimental Results
Overall, our results highlight a clear performance gap between classical or stylometric methods and the top-scoring PLM-based architectures (see Figure 11). Proposed models with the SPSM architecture secure the highest accuracy on both authorship (human vs. AI) and authenticity (fake vs. real). In contrast, stylometric-only or traditional ML baselines typically hover below 90% accuracy for authenticity and even lower for authorship—suggesting that detecting AI-written text from handcrafted features alone can be more challenging.
A closer inspection of feature importance (shown in the top and bottom panels of Figure 12) helps explain why stylometric methods can still provide meaningful signals for each task, even if they do not match PLM performance. For authorship, features like Date Entity Count, Noun Count, and Syntax Variety rank among the highest in importance scores. This suggests that AI-generated text might differ slightly in how it employs (or omits) date references, uses certain noun constructions, or varies its part-of-speech patterns. Seeing Complex Sentence Count and Average Word Length relatively high also indicates that advanced LLMs may simulate complex grammar but do so in ways detectably different from human authors. Hence, while these explicit counts do not guarantee perfect classification, they do reveal observable linguistic traits that can be used for partial explainability—e.g., if a model leans on unusual date usage to flag an article as AI-written.
For authenticity, the lower panel in Figure 12 shows that exclamation marks, question marks, and certain entity counts (particularly Date Entity Count and Person Entity Count) can prove indicative of fake vs. real. Evidently, emotional emphasis (many exclamation marks) or overuse of dates/people might signal attempts at sensationalism or fabricated storytelling. However, these features alone sometimes fail to capture the deeper context that PLMs learn from massive pretraining—explaining why stylometric models plateau around the mid-80% range of authenticity accuracy.
In terms of architectural differences, hierarchical architectures and the SPSM significantly outperform naive two-head multitask or separate-task baselines (Figure 11). The hierarchical models impose a cascade where authorship logits guide authenticity classification (or vice versa), which can boost synergy if AI writing correlates strongly with misinformation styles. Meanwhile, shared–private layers allow a common feed-forward network to model universal language cues while devoting private layers to each label’s unique nuances. This design often yields a higher synergy than purely “two tasks, two heads”, as it enforces partial parameter sharing and preserves specialized capacity for each objective.
Lastly, prompt-based classifiers using GPT or Llama for zero-/few-shot classification exhibit variable success. They score moderate to strong accuracy on authenticity—arguably because factual correctness can be partially inferred by large LLMs—but they fall behind for authorship (especially Llama) with near 50–55% accuracy. This discrepancy indicates that distinguishing AI from human text requires more precise prompting or additional training. In all, the SPSM(authenticity_loss_2x) approach shows that weighting tasks differently can improve performance on the more difficult classification (fake vs. real) while still retaining high performance on the secondary authorship task.
In summary, these findings illustrate that PLM-based pipelines—particularly those with multitask synergy and carefully tuned architectures—consistently yield top-tier results. Stylistic features, though less powerful in isolation, provide valuable explanatory insights: we can pinpoint which punctuation marks, entity references, or syntactic complexities signal AI authorship or potential misinformation. Meanwhile, hierarchical and shared–private models leverage both task interdependence and partial specialization to outperform naive baselines, reaffirming that how tasks share or separate parameters profoundly affects real-world classification accuracy.
7.1. Computational and Resource Requirements for Deployment
The SPSM multitask framework has substantial computational requirements owing to the large parameter count of its transformer backbone. A standard BERT-base model consists of 110 million parameters, with a hidden size of 768 and 12 attention heads, resulting in a memory footprint of approximately 400 MB when stored on disk. The BERT-large variant, which scales up to 340 million parameters with 1024 hidden dimensions and 24 attention heads, requires even more computational power, often demanding 32 GB+ of GPU VRAM for training. Since the SPSM shares a single transformer encoder across multiple tasks, the additional parameters come primarily from task-specific classification heads, each adding 1–2 million parameters per task. While this setup reduces the total parameter count compared to maintaining a separate model per task, the overall computational demand still grows with the number of tasks and therefore requires careful optimization.
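These figures can be sanity-checked with a short script such as the one below; it is an illustrative check of the encoder alone and does not include the SPSM classification heads.

```python
# Rough sanity check of the parameter and memory figures quoted above.
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")
n_params = sum(p.numel() for p in encoder.parameters())
print(f"Encoder parameters  : {n_params / 1e6:.1f} M")       # roughly 110 M
print(f"FP32 weights on disk: {n_params * 4 / 1e6:.0f} MB")   # ~4 bytes per parameter
```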
Training is computationally intensive due to the quadratic complexity, O(n²), of self-attention with respect to sequence length, which makes processing long input sequences expensive in both memory and computation time. A typical BERT-based SPSM with two to three tasks requires at least 16 GB of GPU VRAM (e.g., NVIDIA RTX 3090, A100, or V100) to handle batch sizes of 16–32 efficiently. For larger models like BERT-large, training becomes impractical on consumer-grade GPUs, necessitating gradient accumulation or very small batch sizes (≤8) to avoid out-of-memory (OOM) errors. On CPU-only systems, training is virtually infeasible, as the heavy matrix multiplications in the self-attention layers lead to training times that can stretch over several days per epoch.
Inference with the SPSM also presents scalability challenges, especially in real-time applications. On high-end GPUs such as the NVIDIA A100 or RTX 4090, inference times range from 20 to 50 ms per query, making deployment feasible in cloud-based environments. However, on mid-range GPUs (RTX 3060, T4), inference latency increases to 50–100 ms per query, while on a CPU-only system it can take 300–500 ms per query, making real-time response impractical. The problem worsens for edge devices such as the Jetson Nano, Raspberry Pi, or mobile processors, where memory limitations (1–2 GB of RAM per model instance) and slow processing speeds (1–2 s per query) severely limit feasibility. Furthermore, high power consumption makes battery-powered deployment inefficient, significantly reducing device runtime.
The primary challenges in deploying models in real time or on edge devices stem from high memory usage, latency issues, and power constraints. Real-time applications, such as fraud detection, chatbots, and speech recognition, require inference times below 10 ms, which BERT-based models struggle to achieve without optimization. To make deployment practical, various optimization techniques can be applied, including model distillation (TinyBERT, MiniLM), quantization (reducing model precision to 8-bit), and hardware acceleration (using TensorRT, ONNX for faster inference). Without such optimization, deploying the SPSM directly on edge devices remains impractical, making cloud-based hosting the most viable option for real-world applications requiring multitask transformers.
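As one example of the optimizations mentioned above, the sketch below applies post-training dynamic INT8 quantization to the linear layers of a BERT classifier; this is a generic PyTorch recipe shown for illustration, with a plain BERT classifier standing in for a trained SPSM checkpoint, not the exact deployment pipeline used in this work.

```python
# Generic post-training dynamic INT8 quantization of linear layers (illustrative).
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)            # stand-in for a trained checkpoint
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)  # quantize only the Linear layers
torch.save(quantized.state_dict(), "bert_int8.pt")  # noticeably smaller than FP32
```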
7.2. Explainability Using LIME and SHAP
In order to provide deeper insight into the model’s decision-making process for both the authenticity (real vs. fake) and authorship (human vs. AI) tasks, we employed two popular post hoc explainability methods: LIME (Local Interpretable Model-Agnostic Explanations) and SHAP (SHapley Additive exPlanations) on our best performing SPSM model. Below, we summarize the key findings from both local and global perspectives.
Figure 13 shows the top 30 global LIME features for both tasks. On the x-axis, we have the average LIME weight of each word, and on the y-axis, we list the most influential words. Positive bars (to the right) indicate that a word contributes positively to predicting the class at index 1 (e.g., fake for authenticity or AI for authorship), whereas negative bars (to the left) indicate a contribution toward the class at index 0 (e.g., real or human).
Several interesting observations can be drawn: Words such as “russia”, “taiwan”, and “propaganda” appear to strongly influence the authenticity classifier, suggesting that the model associates these terms more with “fake” content. For authorship, words like “emerging” and “agency” show high positive weights, indicating they push the model’s decision toward AI-generated text, whereas words such as “none” and “senator” steer the model toward human-generated text.
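For reference, the sketch below shows how such LIME explanations can be produced; predict_proba is a hypothetical wrapper that returns class probabilities from the SPSM authenticity head for a list of texts.

```python
# Sketch of local LIME explanations; predict_proba is a hypothetical wrapper around
# the SPSM authenticity head returning an (n_samples, 2) array of class probabilities.
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["real", "fake"])

def explain_article(text, predict_proba, num_features=30):
    exp = explainer.explain_instance(text, predict_proba, num_features=num_features)
    return exp.as_list()  # [(token, weight), ...]; positive weights push toward "fake"

# Global importances (as in Figure 13) can be approximated by averaging the per-token
# weights returned here over a large sample of explained test articles.
```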
Figure 14 displays the global SHAP feature importance for authenticity, showing the top 30 tokens ranked by their average absolute SHAP value. Larger values indicate greater impact on the prediction (real vs. fake). Common words (e.g., “the”, “of”, “and”) appear among the top tokens, likely because they frequently co-occur in certain linguistic patterns within either real or fake news articles. Proper nouns such as “trump” and “donald” have a high impact, suggesting the classifier uses political references to distinguish real from fake.
Such tokens can reflect data distribution biases, especially if the training corpus contains articles heavily referencing political figures in fake contexts.
Figure 15 presents a SHAP summary plot over many tokens (the x-axis is the mean absolute SHAP value). It highlights how a large portion of the predictive power can be attributed to a small set of highly influential tokens. Additionally, it underscores the long tail of “other features”, each contributing marginally. This aligns with the intuition that in text classification, many words contribute small, context-dependent signals.
Figure 16 shows a pair of local LIME bar charts for a single test sample, one for authenticity (left) and one for authorship (right). The top portion of each bar chart indicates a feature (token) that pushes the model toward predicting “fake” or “AI”, whereas negative bars push toward “real” or “human.” The predicted probabilities are included in the figure captions. For authenticity, words like “u-”, “congress”, or “raucous” are more strongly associated with the model’s decision to classify the article as real. In contrast, “contentious”, “polling”, and “gamered” nudge the authorship classifier toward AI, illustrating how distinct features can influence the two tasks differently.
Overall, these explainability methods help validate that our model is leveraging semantically meaningful features rather than relying on spurious correlations. Global explanations reveal the top words driving the classifier across the entire dataset, whereas local explanations show which tokens matter most for a single example. These insights can inform data curation (e.g., identifying biases or over-reliance on specific keywords) and guide model improvements in future work.
8. Conclusions
In this study, we proposed and evaluated a comprehensive framework for addressing the challenges of classifying news articles across two dimensions: authenticity (real vs. fake) and authorship (human vs. AI). Our approach spanned multiple methodologies, including traditional machine learning baselines, stylometric feature-based analysis, prompt-based classifiers, and advanced PLM-based architectures with multitask and hierarchical capabilities. Through extensive experimentation, we demonstrated that our proposed SPSM architecture with synergetic layers and hierarchical task dependencies consistently outperforms simpler baselines, achieving state-of-the-art results with accuracies exceeding 96% for authenticity and 98% for authorship. Furthermore, our analysis highlighted the value of stylometric features in providing interpretable insights, revealing linguistic and structural patterns unique to fake news and AI-generated content. These findings underscore the importance of integrating modern architectures with interpretable features for addressing misinformation in the age of AI.
8.1. Limitations
Despite the promising results, this work has several limitations that warrant further investigation. First, our dataset was limited to textual news articles, which may not capture the nuances of multimodal misinformation that often includes images, videos, or other contextual metadata. Incorporating such multimodal data could improve the applicability and robustness of our models. Second, while stylometric features offered interpretability, they were less effective for standalone classification compared to PLM-based methods, especially as AI-generated content continues to evolve in sophistication. This limitation highlights the need for richer feature engineering to capture subtle patterns in modern AI-generated text.
Another limitation lies in the performance of prompt-based classifiers, which were sensitive to prompt design and often underperformed compared to fine-tuned PLMs. This suggests the need for more advanced prompt optimization techniques. Additionally, our PLM-based models, while highly accurate, rely heavily on pretrained models like BERT and ELECTRA, limiting their scalability to less-resourced languages or domains. Finally, the computational cost of training and evaluating these models presents a challenge for deployment in real-time or resource-constrained settings.
Furthermore, the current version of the FAANR dataset was constructed using outputs from a select group of popular, high-performing generative models—specifically, GPT-4, Llama, Gemma, and Mistral. While these models are widely recognized for their performance and were chosen to represent diverse characteristics of AI-generated content, their inclusion may introduce inherent biases. In particular, since these models can share similar stylistic or linguistic features, the dataset might not fully capture the broader spectrum of AI-generated text produced by other emerging architectures. Moreover, as the dataset primarily comprises English-language content, its applicability to non-English or cross-lingual contexts remains limited.
8.2. Future Work
Building on this work, several avenues for future research can be pursued to address the limitations and further advance the field. A key direction is the integration of multimodal data, such as text combined with images, videos, or metadata, using advanced architectures like multimodal transformers or graph neural networks. This would enable a more holistic approach to misinformation detection. Additionally, improving the performance of prompt-based classifiers through automated prompt optimization techniques, such as reinforcement learning or prompt tuning, could make these methods more robust and adaptable to diverse tasks and domains.
Extending this study to multilingual datasets or domain-specific corpora, such as scientific misinformation, is another important direction. Evaluating cross-lingual and cross-domain transferability would provide insights into the generalizability of the proposed methods. Enhancing interpretability remains a critical area, and future work could combine feature importance metrics with attention visualizations to make model predictions more transparent and trustworthy. Future iterations of the FAANR dataset will also seek to incorporate a wider array of generative models, including emerging ones that may offer different stylistic or domain-specific outputs. We plan to expand the dataset by integrating cross-lingual content to enhance its robustness and ensure broader applicability. Further empirical studies will also be conducted to assess the impact of these biases on detection performance, with the aim of refining the dataset and improving the generalizability of the associated detection frameworks. Lastly, the development of lightweight and parameter-efficient architectures, or the pruning of existing models, could reduce computational costs, making these solutions accessible for real-time applications and deployment in low-resource settings.
Moreover, we will focus on developing advanced fusion strategies to better integrate stylometric features with deep transformer-based representations. Our initial experiments using simple concatenation showed a decrease in performance, likely due to the inherent heterogeneity between the explicit, surface-level signals of stylometric features and the deep, context-rich embeddings produced by transformers. To address this, we plan to explore techniques such as attention-based mechanisms and multi-view learning frameworks that can dynamically weight and harmonize these disparate feature sets. Future studies can aim to conduct a more granular analysis of the interactions between stylistic and semantic features, with the goal of enhancing both detection performance and model interpretability in practical deployment scenarios.
In conclusion, this work provides a robust foundation for tackling the dual challenges of misinformation detection and authorship classification in the context of AI-generated content. By leveraging the strengths of advanced PLM-based architectures alongside interpretable stylometric features, we achieved not only high classification performance but also meaningful insights into the linguistic and structural patterns of fake and AI-generated text. These contributions lay the groundwork for future innovations in this rapidly evolving field.