1. Introduction
Since the emergence of information distribution channels, the propagation of fake news has posed a persistent challenge to society. In the early days, responsible institutions and broadcasting agencies mitigated such issues through editorial checks and mutual trust. However, the introduction of the Internet brought an information explosion that made it difficult to isolate genuine content from malicious content. The abundance of information empowers users with a wealth of knowledge, but it has also led to the dissemination of false or misleading stories, commonly referred to as fake news. The impact of such fake news can be severe, affecting domains such as public health, politics, and social harmony. Meanwhile, generative artificial intelligence models such as large language models (LLMs) have recently become capable of producing coherent, human-like text. This supercharges the ability to produce large volumes of content but also makes its origin and veracity more ambiguous.
Although these models can be exploited by malicious actors to create deceptive narratives at scale, it is equally important to recognize that not all AI-generated material is necessarily fake or harmful. Nevertheless, the availability of tools like ChatGPT, Claude, and Gemini, and their open-source counterparts such as Llama, Mistral, and BLOOM, has provided bad actors with the ability to produce fake news at scale, amplifying the reach and speed of disinformation campaigns. Authorship, in particular, carries significant weight in our society because it confers responsibility. When a specific individual or organization claims authorship, they accept liability for the claims and perspectives they present. In a world where AI can rapidly generate large volumes of text, it becomes imperative to distinguish who (or what) authored a piece of content. Failing to do so could erode accountability, making it easier for malicious actors to publish fabricated stories without fear of repercussion.
This dynamic poses intensified concerns for a secure and intelligent civilization: we must ensure the authenticity of our information and hold its creators accountable. Such measures become especially crucial in contexts like smart cities, where digital infrastructure underpins public services, governance, and everyday citizen interactions. Fake or misleading information, whether produced by humans or AI, can disrupt civic processes, spread disinformation about local events, and undermine trust in essential urban systems. It is therefore vital to develop advanced detection frameworks that simultaneously assess the authenticity of a text (fake vs. real) and its authorship (human vs. AI). To contend with these intertwined problems, this study explores a unified strategy. We propose a single, integrated framework, called the Shared–Private Synergy Model (SPSM), which leverages advanced transformer-based architectures to analyze text from both perspectives simultaneously. In essence, the SPSM architecture shares certain layers that capture general linguistic cues while dedicating private layers to each classification objective, maximizing synergy across tasks without sacrificing task-specific specialization. In doing so, it addresses the pressing need for a holistic system that can efficiently detect fake news and clarify whether content is produced by humans or AI. By unifying these tasks within a single architectural framework that learns shared linguistic cues but specializes in each classification goal, practitioners can address both the factual reliability of content and its source. This dual approach has the potential to protect the integrity of the information that circulates in a smart city environment, enhancing public confidence and ensuring that accountability remains intact even as AI continues to evolve.
1.1. Interdependent Challenges
With the rise of AI-generated fake news, two main classification tasks arise within the domain of AI-aided content generation: (i) authenticity classification, which determines whether a piece of content is fake or real, and (ii) authorship classification, which determines whether it was written by a human or an AI.
Although these two tasks may have appeared unrelated in earlier eras, they are now correlated due to the increasing use of AI in online content generation. For instance, fake news is increasingly generated, or at least aided, by AI systems that can produce deceptive content at scale. Conversely, AI-generated content that is benign or journalistic in nature might still be “authored by AI” but not necessarily fake. Hence, knowledge of authorship is likely to reveal insights into the authenticity of content. In real-world scenarios, platforms or fact checkers may first discover that a large volume of suspicious text shares an AI fingerprint. This finding can then guide more in-depth authenticity checks. Moreover, if an established news source is known to produce mostly genuine (real) articles, discovering that certain content strays from the typical writing style of that source might indicate both potential AI involvement and deceptive intentions.
Beyond textual plausibility, generative models allow adversaries to tailor messages to specific audiences, weaving false narratives with alarming efficiency. Recognizing that text is AI-written raises the prior probability that it may be an attempt at mass-produced disinformation. Of course, legitimate AI usage (e.g., automated journalism for sports or financial updates) complicates the matter; therefore, we do not equate “AI-generated” with “fake”. Still, statistically, false or misleading content is more likely to exploit the scale and speed of AI generation.
1.2. Motivations
Escalation of Fake News: The global spread of misleading information has real-world consequences, ranging from undermining public health campaigns to fueling polarized politics. Accurate fake vs. real classification systems remain an open challenge as adversaries adapt new methods of content creation.
Rise of AI-Generated Text: Large language models are now capable of drafting entire articles that are difficult to distinguish from those authored by humans. Distinguishing between human and AI authorship has become essential to ensuring accountability.
Potential Synergy: If authorship classification can improve the accuracy of identifying misinformation (and vice versa), a hierarchical or multitask approach may outperform traditional single-task systems. Consequently, it is valuable to investigate whether incorporating both tasks, authenticity and authorship, within a unified or cascaded pipeline increases overall classification performance.
1.3. Research Objectives
In light of the challenges outlined above both in detecting misinformation (fake vs. real) and attributing authorship (human vs. AI), this work pursues the following primary objectives:
Dataset Creation and Curation: We assembled and refined a novel dataset that explicitly labels each article for both authenticity (fake vs. real) and authorship (human vs. AI). This dataset is drawn from multiple sources, ensuring a diversity of topics, writing styles, and content generation methods. By incorporating comprehensive synthetic data generation for both tasks, our dataset aims to facilitate richer experimentation and cross-task insights.
PLM-Based Investigations: We implement and compare single-task and novel multitask fine-tuning of pretrained language encoder models (e.g., BERT, DistilBERT, ELECTRA), examining how they perform on authenticity and authorship classification. Moreover, we aim to develop synergistic models capable of handling both tasks.
Ablation Studies: To identify the most critical model components, we conduct systematic ablations, removing or altering features (e.g., loss weighting, shared layers, specific stylometric inputs) and evaluating the resulting performance impact.
Hierarchical Modeling: We introduce and evaluate hierarchical pipelines in which the authorship task (human vs. AI) informs authenticity classification (fake vs. real). Our goal is to empirically validate the hypothesis that knowledge of authorship significantly aids the detection of fake vs. real content.
Stylometric Feature Analysis: We incorporate stylistic and linguistic features, such as reading ease, lexical diversity, and syntactic complexity, to probe whether certain characteristics are especially indicative of AI-generated text or fake news content.
LLM-Based Approaches: Finally, we assess zero-shot or few-shot prompt engineering with state-of-the-art LLMs, contrasting their classification capabilities (for both tasks) against fully fine-tuned models.
1.4. Contributions
Novel Dual-Labeled Dataset: We present a newly curated dataset that explicitly labels each article for both authenticity (fake vs. real) and authorship (human vs. AI), capturing diverse writing styles and domains.
Unified Dual-Task Framework: We introduce a method that simultaneously classifies text authenticity (fake vs. real) and authorship (AI vs. human), addressing overlapping challenges in today’s AI-driven information landscape.
SPSM: We propose a shared–private multitask framework design, in which common layers capture universal text patterns while specialized layers focus on each classification task. This structure demonstrates superior performance and interpretability.
Comprehensive Experimental Analysis: We offer exhaustive comparisons, including classical ML baselines, stylometric methods, PLM-based ablations, hierarchical setups, and prompt-based evaluations, culminating in a deeper understanding of the strengths and limitations of each approach.
The remainder of this paper is structured as follows.
Section 2 discusses the relevant literature on fake news detection, authorship attribution, and hierarchical/multitasking strategies.
Section 3 describes our dataset and preprocessing steps.
Section 4 outlines the novel methods we propose, PLM-based approaches, hierarchical setups, stylometric approaches, and LLM-based classification.
Section 5 further details the experiments including setup, baselines, and ablation studies.
Section 6 presents comprehensive results for each method, including performance tables, ablation findings, stylometry interpretations, and prompt-based outcomes.
Section 7 offers further analysis on experimental results that tie together the quantitative results with practical insights into intertask synergy and model efficacy. Finally,
Section 8 concludes this paper by summarizing the major findings and suggesting directions for future research.
2. Related Work
Research on misinformation detection has rapidly evolved alongside the growing availability of advanced natural language processing tools, introducing a wide spectrum of methods ranging from classical feature-based approaches to PLM-based neural models. At the same time, the authorship attribution domain, once focused on human stylometry, has pivoted toward recognizing machine-generated text, fueled by large language models capable of producing near-human writing. This section reviews key developments in fake news detection, AI authorship classification, popular architectures for interconnected NLP tasks, and zero-shot classifiers, including LLM-based prompt classifiers, providing context for our integrated approach to authenticity and authorship classification.
2.1. Fake News Detection
The detection of fake news has been extensively studied, with various machine learning (ML) and deep learning (DL) approaches proposed. Saleh et al. introduced OPCNN-FAKE, an optimized Convolutional Neural Network (CNN) that demonstrated superior performance over RNN, LSTM, and traditional ML methods in benchmark datasets, using TF-IDF and GloVe embeddings [
1]. Dou et al. proposed the User Preference-Aware Fake News Detection (UPFD) framework, which integrates user preferences and social context using graph neural networks (GNNs) to enhance detection accuracy [
2]. Another approach emphasized linguistic feature-based learning models, extracting textual and syntactic attributes to effectively classify fake news [
3]. Furthermore, a lightweight DL-based model was developed for the detection of fast fake news in cyber-physical social services, optimizing speed and accuracy for real-time applications [
4].
Alonso et al. (2021) [
5] highlight sentiment analysis as a key tool in identifying emotional manipulation, advocating for multilingual and multimedia capabilities in detection systems. Addressing domain-specific challenges, Nan et al. (2021) propose MDFEND [
6], a model that leverages domain-specific features to improve accuracy across multiple domains. Khanam et al. (2021) focus on supervised learning, emphasizing feature extraction and vectorization using tools like Python’s scikit-learn for precise classification [
7]. Other studies extend detection capabilities by integrating multimodal features such as text, metadata, and images through deep learning [
8].
A hybrid CNN-BiLSTM-AM model was proposed to detect fake news related to COVID-19, integrating convolutional and recurrent networks with attention mechanisms for feature extraction and classification [
9]. Another innovative model, QMFND, employed quantum-inspired multimodal fusion techniques, combining text, images, and metadata to enhance detection accuracy on social media platforms [
10]. A dual-emotion-based fake news detection framework utilized deep attention weight updates to capture emotional patterns from text, highlighting the role of sentiment in assessing news credibility [
11]. Lastly, a multimodal approach using data augmentation-based contrastive learning improved feature representation, leveraging complementary modalities to enhance detection robustness [
12]. These studies collectively underscore the evolution of fake news detection, moving toward sophisticated, domain-aware, and multimodal approaches.
2.2. Machine-Generated Content Attribution
Recent studies have explored challenges in distinguishing human-authored texts from machine-generated ones. Uchendu et al. investigated authorship attribution across tasks like identifying the generation model or distinguishing human from machine texts, highlighting the superior quality of GPT-2 and GROVER in evading detection [
13]. Clark et al. examined the human evaluation of generated texts, finding non-experts performed poorly in distinguishing GPT-3 outputs and achieved only marginal improvements with training, raising concerns about the reliability of human assessments [
14]. Guo et al. analyzed ChatGPT’s responses across domains, noting its gaps in contextual understanding compared to human experts, and proposed effective detection systems to address potential misuse [
15].
DetectGPT, introduced by Mitchell et al., leverages the curvature of a model’s probability function for zero-shot detection, improving accuracy without requiring additional training data or fine-tuning, and performs notably better in detecting LLM-generated fake news [
16]. Sadasivan et al. demonstrated vulnerabilities in current detectors, including watermarking techniques, by developing a paraphrasing attack that significantly reduces detection accuracy, raising concerns about detector reliability in practical scenarios [
17]. Schuster et al. explored stylometry’s limitations in distinguishing malicious from legitimate uses of LLMs, noting that language models maintain stylistic consistency, complicating the detection of machine-generated misinformation [
18].
Studies emphasize the growing complexity of detecting AI-generated texts due to advancements in paraphrasing and sophisticated language models. Krishna et al. demonstrated that paraphrasing can evade state-of-the-art detection methods like GPTZero and DetectGPT but proposed retrieval-based defenses as a potential solution [
19]. Orenstrakh et al. evaluated eight detection tools in educational contexts, highlighting limitations in accuracy and resilience, particularly when handling paraphrased or non-English content, and called for enhancements to better preserve academic integrity [
20]. Dergaa et al. explored ChatGPT’s role in academic writing, discussing its potential to enhance efficiency but also warning about risks to authenticity and credibility, stressing ethical guidelines and transparency [
21].
2.3. AI-Generated Misinformation and Multitask Approaches
Recent research underscores the escalating complexity of detecting and mitigating AI-generated misinformation. Zhou et al. analyzed AI-generated misinformation’s linguistic distinctiveness, finding it to be more detailed and emotionally engaging than human-crafted content, which often misleads existing detection systems [
22]. Shoaib et al. explored the implications of generative AI for creating deepfakes and misinformation, advocating for a multimodal defense framework integrating digital watermarking and machine learning [
23]. Kreps et al. highlighted AI’s capability to produce credible-sounding media at scale, raising concerns about its potential misuse in misinformation campaigns targeting public opinion [
24]. Demartini et al. proposed a hybrid human-in-the-loop approach to misinformation detection, combining AI scalability with human expertise to enhance reliability and fairness [
25].
Chen and Shu investigated the challenges posed by large language models (LLMs) such as ChatGPT, finding that misinformation generated by LLMs is harder to detect than human-written misinformation due to its deceptive styles and semantic preservation. Their taxonomy categorizes LLM misinformation based on types, sources, and intents, highlighting the need for robust detection mechanisms [
26]. Blauth et al. reviewed the malicious uses and abuses of AI, including misinformation, social engineering, and hacking, emphasizing the importance of global collaboration to develop effective mitigation strategies [
27]. Kumari et al. introduced a multitask learning framework that integrates novelty detection and emotion recognition to improve misinformation detection accuracy across multiple datasets [
28]. Choudhry et al. introduced an emotion-aware multitask approach using transfer learning to detect fake news and rumors, demonstrating that emotions serve as domain-independent features that enhance performance in cross-domain settings [
29]. Jing et al. proposed TRANSFAKE, a multimodal transformer-based model that integrates text, images, and user comments, leveraging multimodal interactions and sentiment variances to improve detection accuracy [
30]. Wu et al. highlighted the vulnerability of existing detectors to style-based attacks facilitated by large language models (LLMs). They proposed SheepDog, a robust, style-agnostic detector that prioritizes content over style, achieving resilience against LLM-empowered adversarial attacks [
31].
2.4. LLM-Based and Zero-Shot Detectors
Advancements in misinformation detection emphasize novel frameworks leveraging LLMs. Satapara et al. proposed an adversarial prompting approach to generate misinformation datasets with controlled factual inaccuracies, enabling improved training for detection models [
32]. Hu et al. introduced knowledgeable prompt tuning (KPT), incorporating external knowledge into prompt verbalizers to enhance zero-shot and few-shot text classification performance, a potential asset for misinformation detection [
33]. Thaminkaew et al. presented a prompt-based label-aware framework (PLAML) for multi-label text classification, integrating token weighting, label-aware templates, and dynamic thresholds to boost accuracy in few-shot settings [
34].
On the other hand, Hou et al. proposed PROMPTBOOSTING, a black-box approach utilizing Adaboost for ensemble learning, significantly improving computational efficiency and performance in few-shot classification tasks [
35]. Sun et al. introduced CARP, a reasoning-focused prompting strategy for large language models (LLMs), addressing complex linguistic phenomena in text classification and achieving state-of-the-art results on multiple benchmarks [
36]. Chen et al. developed Concept Decomposition (CD), a framework for interpretable text classification by aligning continuous prompts with human-readable concepts, enhancing both performance and explainability [
37]. Balkus and Yan explored GPT-3’s self-augmentation capabilities, demonstrating improved accuracy in short-text classification by generating high-quality training examples [
38]. Xie and Li presented DLM-SCS, leveraging discriminative language models for semantic consistency scoring, outperforming other prompt-based methods in few-shot classification [
39].
Overall, these studies underscore the increasing sophistication and scale of both fake news and AI-generated content, necessitating ever more capable detection frameworks. As PLM-based architectures gain prominence, researchers have combined traditional feature engineering (e.g., stylometric and sentiment cues) with modern deep learning (e.g., multimodal PLMs) to handle diverse sources, from textual posts to images and metadata. Meanwhile, prompt-based and zero-shot techniques leverage LLMs for flexible, domain-adaptive classification, although they face challenges such as prompt sensitivity and adversarial paraphrasing. Finally, multitask and hierarchical strategies demonstrate that integrating tasks (e.g., authorship and authenticity) or modalities (e.g., textual, social, multimodal features) can significantly strengthen misinformation detection. Building on these advancements, our research explores novel dataset construction, shared–private transformer layers, hierarchical pipelines, and stylometric analysis to further unify—and improve upon—the approaches highlighted in this section.
3. Dataset and Preprocessing
In real-world environments, datasets are often complex, encompassing diverse content across various domains and mixing multiple challenges such as authenticity and authorship. To better mimic these real-world conditions, we extend our dataset to include both fake vs. real and AI vs. human labels, thereby reflecting the multifaceted nature of modern information channels. We designate this extended resource as the Fused Authenticity–Authorship News Resource (FAANR) to underscore its dual focus on both content reliability and text origin.
The scarcity of comprehensive datasets tailored for detecting AI-generated fake news remains a significant challenge in advancing this field. While datasets such as LIAR [
40] and FakeNewsNet [
41] have been instrumental in detecting human-generated misinformation, they lack instances of AI-generated fake news, which limits the development of robust detection models capable of addressing this growing challenge. The majority of existing resources either focus on specific domains or omit key linguistic variations that can be exploited by modern LLMs [
42,
43].
To address these limitations, we developed a novel dataset by combining previously established human-generated news datasets across domains such as politics, health, and technology with AI-generated fake news. Using both open-source and proprietary LLMs, we designed innovative prompting strategies to create diverse and realistic fake news instances.
3.1. Data Collection and Generation
We aim to build a dataset (FAANR) that explicitly labels each article for both authenticity (fake vs. real) and authorship (human vs. AI), capturing a broad spectrum of writing styles and textual properties. The dataset created is an extension of two open-source fake news detection datasets, namely, the ISOT [
44,
45] and the news dataset labeled by the news media bias lab (nmbbias) [
46]. Combining these two sources, we obtain a diverse database of articles with two labels, fake and real, spanning domains from politics to healthcare. The ISOT and nmbbias datasets contain 44,898 and 9513 articles, respectively. The former covers global and US news from 2016 to 2017, and the latter covers election news articles from 2024. This provides data from both before and after the emergence of LLMs, yielding a balanced distribution in the time domain.
However, they do not adequately address the AI-generated aspect of fake news. To remedy this, we randomly selected a subset of 20,000 articles from the real set of the FAANR dataset. These articles were then fed into four LLMs, namely GPT-4o, Llama3.2, Gemma, and Mistral, to produce corresponding AI-generated articles using carefully designed prompts, as shown in
Table 1. To ensure carefully controlled prompts, in one subset the model was instructed to maintain factual accuracy and a coherent style, yielding AI Real; in the other subset, it was guided to create deliberately false or misleading information, forming AI Fake. Each LLM was given 5000 articles and produced both the real-phrased and fake-fabricated versions. Using multiple LLMs ensures diversity in the AI-authored content and prevents models from overfitting to the stylistic attributes of a single generator.
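To make this generation step concrete, the sketch below shows how such a rephrase-vs-fabricate loop can be driven with the OpenAI Python client (assumed here for illustration); the prompt strings are simplified placeholders rather than the exact templates of Table 1, and the same pattern applies to the locally hosted Llama3.2, Gemma, and Mistral models.

```python
# Illustrative generation loop for AI Real / AI Fake variants.
# Assumptions: openai>=1.x client, OPENAI_API_KEY in the environment; prompts are placeholders.
from openai import OpenAI

client = OpenAI()

REAL_PROMPT = "Rewrite the following news article in your own words while preserving every fact."
FAKE_PROMPT = "Rewrite the following news article, deliberately introducing false or misleading claims."

def generate_variant(article_text: str, mode: str, model: str = "gpt-4o") -> str:
    """Return an AI Real (mode='real') or AI Fake (mode='fake') version of a human-written article."""
    instruction = REAL_PROMPT if mode == "real" else FAKE_PROMPT
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": article_text},
        ],
    )
    return response.choices[0].message.content
```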
When generating AI content, we cross-checked linguistic and thematic variety, minimizing repetitive patterns that might bias subsequent classification tasks. This process resulted in a total of 83,068 articles, of which around 58% are human-generated and the rest are AI-authored.
Table 2 depicts the exact authorship source for the articles in the dataset. This approach allowed us to unify both human-origin (already labeled real/fake) and AI-origin (specially created) articles under a single dataset that covered four possible categories of interest.
3.2. Class Distribution and Preprocessing
After combining the prelabeled human-authored articles with the newly generated AI content, we shuffled and partitioned the dataset into train, validation, and test splits. To maintain balanced class distributions, we employed a stratification strategy, respecting both the real/fake and human/AI labels. As a result, each split contained a comparable proportion of Human Real, Human Fake, AI Real, and AI Fake samples. The final distribution is summarized in
Table 3, which reports the number of articles in each partition, the average word count per category, and the general proportions. Furthermore, in
Figure 1, bar graphs illustrate how the dataset is distributed between fake vs. real, AI vs. human, and different text sources. This visualization confirms that we have a substantial number of samples from each class, including those derived from GPT-4, Llama, Mistral, and a large body of human-written examples. Additionally,
Figure 2 displays a correlation heatmap that indicates how basic numeric variables, such as sentiment, text length, and the two labels (fake, is AI) relate to each other. Although the correlations remain relatively modest, the map offers hints that text length may differ slightly between AI and human samples, and sentiment could have a minor relationship with the fake vs. real label.
Before feeding these articles into our experiments, we applied a uniform preprocessing pipeline. Each article was lowercased and trimmed of extraneous whitespace, removing any non-ASCII artifacts that did not convey useful information. We retained punctuation that was potentially relevant to stylometric or linguistic features. For feature-based methods (e.g., stylometric analysis), a tokenization step (via Python’s NLTK or a simple whitespace tokenizer) was used to compute attributes such as word counts, sentence lengths, or readability metrics. In contrast, PLM-based models employed subword tokenizers (e.g., BertTokenizer) that automatically handled splits and special tokens. Finally, we excluded articles with too few tokens (e.g., fewer than 20) or containing repetitive filler text, ensuring that all remaining samples had meaningful linguistic content for downstream tasks.
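A minimal sketch of this preprocessing filter is given below, assuming NLTK and its punkt tokenizer data are installed; the repetitive-filler check is omitted for brevity, and the 20-token cutoff mirrors the threshold mentioned above.

```python
# Minimal preprocessing sketch: lowercase, strip non-ASCII artifacts and extra whitespace,
# and drop articles with fewer than 20 tokens.
import re
from nltk.tokenize import word_tokenize  # requires nltk plus the 'punkt' data package

MIN_TOKENS = 20

def preprocess(text: str):
    """Return the cleaned article, or None if it is too short to be useful."""
    text = text.lower()
    text = text.encode("ascii", errors="ignore").decode()  # remove non-ASCII artifacts
    text = re.sub(r"\s+", " ", text).strip()                # collapse whitespace (punctuation kept)
    tokens = word_tokenize(text)
    return text if len(tokens) >= MIN_TOKENS else None
```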
As part of an exploratory study on linguistic differences, we generated WordCloud visualizations (shown in
Figure 3) to compare the most frequent words in fake articles (left) versus real articles (right). Some terms, like “Trump” or “said”, appear prominently across both categories, underscoring ongoing political topics; however, the emphasis on certain key verbs or repeated names can differ in placement or frequency. Taken together, these observations, supported by the correlation heatmap, the distribution bar plots, and the WordClouds, help contextualize how authenticity and authorship may manifest in textual patterns. Our stratified train/validation/test splits in the ratio of 70/10/20, balanced across the four categories, ensure that subsequent experiments, whether using multitask or hierarchical models, can effectively learn from both AI-derived and human-authored content.
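As a sketch of the stratification described above, the snippet below splits a dataframe 70/10/20 while preserving the joint Human/AI and Real/Fake distribution; the file name and the 'fake' / 'is_ai' column names are assumptions for illustration.

```python
# Stratified 70/10/20 split over the four joint categories (file and column names assumed).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("faanr.csv")  # hypothetical export with 'fake' and 'is_ai' label columns
df["category"] = df["fake"].astype(str) + "_" + df["is_ai"].astype(str)

train_df, rest_df = train_test_split(df, test_size=0.30, stratify=df["category"], random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=2/3, stratify=rest_df["category"], random_state=42)
# -> ~70% train, ~10% validation, ~20% test, each balanced over Human Real / Human Fake / AI Real / AI Fake
```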
4. Proposed Approaches
Building on the challenges outlined in the earlier sections, we propose a series of novel solutions aimed at tackling the dual tasks of authenticity and authorship detection in a unified manner. Our goal is to move beyond standard single-task or naive multitask setups by introducing advanced architectures and training strategies that explicitly account for both shared linguistic cues and task-specific nuances. Key contributions include our Shared–Private Synergy Model (SPSM), hierarchical task designs, and adaptive loss weighting methods. These innovations ensure that each classification task benefits from the other’s insights while maintaining sufficient specialization to address its unique requirements. Below, we present the core components of our proposed approaches, detailing how they collectively enhance fake news and AI authorship detection.
4.1. PLM-Based Approaches
These approaches are built on modern pretrained encoders such as BERT, ELECTRA, or DistilBERT. Depending on the task setup, we perform single-task fine-tuning (for authenticity or authorship alone) or introduce a multitask framework (jointly learning both tasks). We first describe the single and naive multitask approaches to further introduce the novel SPSM architecture.
Regardless of the specific pretrained model (BERT, ELECTRA, DistilBERT), we can conceptualize it as a function
\[ \mathbf{H} = \mathrm{Encoder}(\mathbf{X}), \]
where \(\mathbf{X} = (x_1, \ldots, x_n)\) are the token embeddings for an input sequence and \(\mathbf{H}\) denotes the sequence of hidden states (or a single pooled embedding, depending on the architecture). Each model’s tokenizer transforms raw text into subword IDs, which feed into this encoder.
4.1.1. Single-Task Fine-Tuning
In the single-task approach, we follow a standard end-to-end fine-tuning procedure, targeting either authenticity (fake vs. real) or authorship (human vs. AI). Let \(\mathbf{h}_{\mathrm{[CLS]}}\) be the final hidden state corresponding to the special [CLS] token:
\[ \mathbf{h}_{\mathrm{[CLS]}} = \mathrm{BERT}(\mathbf{X})_{\mathrm{[CLS]}}, \]
where BERT denotes a BERT-base or BERT-large pretrained model. We then apply a single linear classification head:
\[ \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{W}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}), \]
where \(\mathbf{W}\) and \(\mathbf{b}\) are trainable parameters, mapping \(\mathbf{h}_{\mathrm{[CLS]}}\) to a probability distribution over the two classes, e.g., {fake, real}. We optimize the negative log-likelihood (cross-entropy) loss on the labeled dataset for that specific task. For instance, if \(\ell(\cdot,\cdot)\) is cross-entropy with ground truth \(y\), we minimize
\[ \mathcal{L}_{\mathrm{single}} = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{\mathbf{y}}_i, y_i). \]
Moreover,
Figure 4 depicts the separate single-task fine-tuning process: raw text tokenized into subwords, passed into BERT, and the final [CLS] embedding going through one linear head for classification. The figure highlights how only one output layer is trained for a single classification objective.
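The snippet below is a minimal PyTorch sketch of this single-task setup (class and variable names are ours for illustration, not the paper's code): the encoder's final [CLS] state feeds one linear head trained with cross-entropy.

```python
# Single-task fine-tuning sketch: BERT [CLS] embedding -> one linear classification head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SingleTaskClassifier(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]                      # final hidden state of the [CLS] token
        logits = self.head(cls)                 # W h_cls + b
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        return loss, logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SingleTaskClassifier()
batch = tokenizer(["example article text"], return_tensors="pt", truncation=True)
loss, logits = model(batch["input_ids"], batch["attention_mask"], labels=torch.tensor([1]))
```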
4.1.2. BERT Multitask Learning
When we wish to classify both authenticity and authorship simultaneously, we can adapt BERT to a multitask setup. Let \(\mathbf{h}_{\mathrm{[CLS]}}\) still represent the pooled embedding, but now we have two heads:
\[ \hat{\mathbf{y}}_{\mathrm{auth}} = \mathrm{softmax}(\mathbf{W}_{\mathrm{auth}}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_{\mathrm{auth}}), \qquad \hat{\mathbf{y}}_{\mathrm{fake}} = \mathrm{softmax}(\mathbf{W}_{\mathrm{fake}}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_{\mathrm{fake}}). \]
Here, \(\hat{\mathbf{y}}_{\mathrm{auth}}\) is the probability distribution over {human, AI} and \(\hat{\mathbf{y}}_{\mathrm{fake}}\) is over {real, fake}. The training objective combines both cross-entropy losses:
\[ \mathcal{L}_{\mathrm{multi}} = \mathcal{L}_{\mathrm{auth}} + \mathcal{L}_{\mathrm{fake}}, \]
where \(\mathcal{L}_{\mathrm{auth}}\) handles authorship classification and \(\mathcal{L}_{\mathrm{fake}}\) handles authenticity. This single [CLS]-based embedding thus influences two separate classification heads.
In place of BERT, we can load alternative pretrained encoders, ELECTRA and DistilBERT, while following the same multitask design. The only major difference lies in the underlying architecture of the encoder. For example, we have the following:
ELECTRA is pretrained with a “discriminator” objective that distinguishes real tokens from those replaced by a “generator”. At fine-tuning time, it also produces a hidden state for [CLS] that we map to our tasks.
DistilBERT is a lighter, distilled version of BERT with roughly 40% fewer parameters, but retaining enough representational power for many tasks. It omits the [CLS] pooled output layer in the original sense, so we often take the first token embedding or a special vector from the final layer.
Figure 5 illustrates this multitask approach. The same input tokens pass through the chosen encoder (BERT, ELECTRA, DistilBERT), producing a final hidden state for [CLS]. Two parallel heads transform that embedding into predicted labels: \(\hat{\mathbf{y}}_{\mathrm{auth}}\) for authorship and \(\hat{\mathbf{y}}_{\mathrm{fake}}\) for authenticity. The combined loss in Equation (6) backpropagates through shared parameters, encouraging the model to learn features that benefit both tasks.
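A minimal sketch of this two-head multitask model is shown below (names are illustrative); swapping the encoder for ELECTRA or DistilBERT only changes the model_name, with the first-token embedding used as the pooled representation as noted above.

```python
# Naive multitask sketch: one pooled [CLS] embedding shared by two linear heads.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskClassifier(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.authorship_head = nn.Linear(hidden, 2)    # human vs. AI
        self.authenticity_head = nn.Linear(hidden, 2)  # real vs. fake

    def forward(self, input_ids, attention_mask, authorship_labels=None, authenticity_labels=None):
        cls = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        auth_logits = self.authorship_head(cls)
        fake_logits = self.authenticity_head(cls)
        loss = None
        if authorship_labels is not None and authenticity_labels is not None:
            ce = nn.functional.cross_entropy
            loss = ce(auth_logits, authorship_labels) + ce(fake_logits, authenticity_labels)
        return loss, auth_logits, fake_logits
```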
4.1.3. Shared–Private Synergy Model (SPSM)
While standard multitask learning on a PLM encoder (e.g., BERT, ELECTRA, DistilBERT) places multiple classification heads atop the same final hidden representation, we innovatively extend this approach by introducing shared and private layers. In other words, beyond the base PLM encoder, we split the network into two sets of feed-forward layers:
Shared Layers (\(F_s\)): A stack of additional feed-forward blocks (or MLPs) that process the pooled embedding for all tasks, thus capturing common features relevant to both authenticity (fake vs. real) and authorship (human vs. AI).
Private Layers (\(F_p^{\mathrm{auth}}, F_p^{\mathrm{fake}}\)): For each task, there is a distinct feed-forward network that takes the output of the shared layers as input. Each private sub-network then specializes in distinguishing either human vs. AI or fake vs. real, preserving nuances unique to the respective classification objective.
Formally, let \(\mathbf{h}_{\mathrm{[CLS]}}\) be the final hidden state from the PLM. We first compute
\[ \mathbf{z}_s = F_s(\mathbf{h}_{\mathrm{[CLS]}}), \]
where \(F_s\) denotes \(k\) layers of feed-forward blocks (potentially interleaved with dropout and activation functions). We then branch into two private sub-networks:
\[ \mathbf{z}_{\mathrm{auth}} = F_p^{\mathrm{auth}}(\mathbf{z}_s), \qquad \mathbf{z}_{\mathrm{fake}} = F_p^{\mathrm{fake}}(\mathbf{z}_s), \]
where \(\mathbf{z}_{\mathrm{auth}}\) and \(\mathbf{z}_{\mathrm{fake}}\) are the private representations passed to task-specific classification heads. Finally, each task head predicts a probability distribution over its labels (human vs. AI or fake vs. real):
\[ \hat{\mathbf{y}}_{\mathrm{auth}} = \mathrm{softmax}(\mathbf{W}_{\mathrm{auth}}\,\mathbf{z}_{\mathrm{auth}} + \mathbf{b}_{\mathrm{auth}}), \qquad \hat{\mathbf{y}}_{\mathrm{fake}} = \mathrm{softmax}(\mathbf{W}_{\mathrm{fake}}\,\mathbf{z}_{\mathrm{fake}} + \mathbf{b}_{\mathrm{fake}}). \]
This design boosts accuracy via shared and private layers by capitalizing on two complementary ideas:
Shared Feature Extraction: By letting both tasks pass through \(F_s\), we encourage the model to learn common linguistic cues that might benefit both authorship and authenticity classification (e.g., capturing writing style, sentiment usage, or certain domain-specific patterns).
Task-Specific Nuance: The private layers preserve specialized features that differ between tasks (e.g., lexical signals of AI text vs. stylistic cues of misinformation). Thus, each private sub-network can focus on the subtleties of its classification goal without being overwhelmed by the competing task’s objectives.
In practice, we find that introducing shared and private layers significantly enhances performance compared to a simple “two-head” approach that shares all downstream parameters. Empirically, this hierarchy yields an improvement in accuracy/F1 because the shared block learns robust, general representations, whereas private blocks can refine them for authenticity or authorship independently. The overall architecture is depicted in
Figure 6. We detail quantitative improvements in the Results Section, showing that this SPSM strategy consistently outperforms both single-task and naive multitask (one [CLS] embedding, two linear heads) baselines.
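For concreteness, the following PyTorch sketch mirrors the shared–private structure just described; the layer widths, dropout rate, and loss weights are assumed values for illustration rather than the exact configuration reported in the paper.

```python
# SPSM sketch: a shared feed-forward block F_s followed by per-task private blocks F_p.
import torch.nn as nn
from transformers import AutoModel

class SPSM(nn.Module):
    def __init__(self, model_name="bert-base-uncased", shared_dim=512, private_dim=256,
                 w_auth=1.0, w_fake=1.0):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.shared = nn.Sequential(nn.Linear(hidden, shared_dim), nn.ReLU(), nn.Dropout(0.1))
        self.private_auth = nn.Sequential(nn.Linear(shared_dim, private_dim), nn.ReLU(), nn.Dropout(0.1))
        self.private_fake = nn.Sequential(nn.Linear(shared_dim, private_dim), nn.ReLU(), nn.Dropout(0.1))
        self.head_auth = nn.Linear(private_dim, 2)   # human vs. AI
        self.head_fake = nn.Linear(private_dim, 2)   # real vs. fake
        self.w_auth, self.w_fake = w_auth, w_fake    # per-task loss weights

    def forward(self, input_ids, attention_mask, authorship_labels=None, authenticity_labels=None):
        cls = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        z_s = self.shared(cls)                                   # shared representation
        auth_logits = self.head_auth(self.private_auth(z_s))     # private authorship branch
        fake_logits = self.head_fake(self.private_fake(z_s))     # private authenticity branch
        loss = None
        if authorship_labels is not None and authenticity_labels is not None:
            ce = nn.functional.cross_entropy
            loss = (self.w_auth * ce(auth_logits, authorship_labels)
                    + self.w_fake * ce(fake_logits, authenticity_labels))
        return loss, auth_logits, fake_logits
```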
4.2. Hierarchical Approaches
In addition to the shared–private multitask paradigm, we explore hierarchical strategies for learning authorship (human vs. AI) and authenticity (fake vs. real) in a sequential or cascade manner. Unlike the multitask approach—where both tasks are trained in parallel from a common [CLS] embedding—hierarchical setups explicitly make the output (or an intermediate representation) of the first task feed into the second task. This design can capture the intuitive notion that AI-generated articles may, statistically, be more likely to be fake or vice versa, thus exploiting potential dependencies between tasks.
4.2.1. Two-Stage Pipeline
In the two-stage pipeline, we first train or compute an authorship prediction for each article and then use that authorship signal (predicted logits or probabilities) as an additional input feature to the authenticity classifier. More concretely, we define two passes:
Authorship Pass: Given a PLM encoder (e.g., BERT) and a classification head for human vs. AI, we compute the authorship logits \(\mathbf{z}_{\mathrm{auth}} = \mathbf{W}_{\mathrm{auth}}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_{\mathrm{auth}}\). If we produce logits \(\mathbf{z}_{\mathrm{auth}}\), we can convert them to probabilities via softmax.
Authenticity Pass: We embed the predicted authorship distribution \(\hat{\mathbf{y}}_{\mathrm{auth}}\) (or the raw logits, \(\mathbf{z}_{\mathrm{auth}}\)) into the input for the second classification pass, e.g., concatenating or otherwise merging the predicted authorship probabilities with the PLM embedding. A second head, \(\mathbf{W}_{\mathrm{fake}}\), then performs a cross-entropy classification for fake vs. real.
Thus, the second stage explicitly conditions on the authorship model’s outcome. In practice, this approach can be implemented by first training an authorship model, saving predictions, and then injecting them as features (or using a single PLM forward pass that computes both). An outline of the approach is shown in
Figure 7. The advantage is that authorship predictions might highlight subtle cues (e.g., certain writing style or repeated tokens) that help the authenticity model, especially if human- vs. AI-authored texts differ in the likelihood of containing misinformation.
If we let \(\hat{\mathbf{y}}_{\mathrm{auth}}\) be the authorship classifier’s output distribution, the second classifier’s decision for fake vs. real is
\[ \hat{\mathbf{y}}_{\mathrm{fake}} = \mathrm{softmax}\big(\mathbf{W}_{\mathrm{fake}}\,[\mathbf{h}_{\mathrm{[CLS]}};\hat{\mathbf{y}}_{\mathrm{auth}}] + \mathbf{b}_{\mathrm{fake}}\big), \]
where we might incorporate \(\hat{\mathbf{y}}_{\mathrm{auth}}\) either by concatenation or by an auxiliary feed-forward layer that fuses the probabilities/logits into the authenticity embedding.
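As an illustration of this second stage, the sketch below fuses the saved stage-1 authorship probabilities with the PLM embedding before the authenticity head; the hidden size and fusion by concatenation are assumptions consistent with the description above.

```python
# Two-stage fusion sketch: concatenate stage-1 authorship probabilities with the [CLS] embedding.
import torch
import torch.nn as nn

class AuthenticityWithAuthorship(nn.Module):
    def __init__(self, hidden_size: int = 768, num_authorship_classes: int = 2):
        super().__init__()
        self.head = nn.Linear(hidden_size + num_authorship_classes, 2)  # real vs. fake

    def forward(self, cls_embedding, authorship_probs):
        fused = torch.cat([cls_embedding, authorship_probs], dim=-1)
        return self.head(fused)

# cls_embedding comes from a PLM forward pass; authorship_probs are the softmaxed predictions
# saved from the stage-1 human-vs-AI classifier.
```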
4.2.2. Single-Pass Cascade
Rather than explicitly saving intermediate outputs and re-feeding them, we can implement a single-pass hierarchical model that internally first computes authorship logits and then conditions authenticity logits on them. Concretely, let \(\mathbf{h}_{\mathrm{[CLS]}}\) be the pooled representation from the PLM, and define
\[ \mathbf{z}_{\mathrm{auth}} = \mathbf{W}_{\mathrm{auth}}\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_{\mathrm{auth}}, \]
which is the authorship logit vector. We then transform \(\mathbf{z}_{\mathrm{auth}}\) into a hidden dimension equal to that of \(\mathbf{h}_{\mathrm{[CLS]}}\) or another suitable size, e.g.,
\[ \mathbf{u} = \mathrm{ReLU}(\mathbf{W}_{u}\,\mathbf{z}_{\mathrm{auth}} + \mathbf{b}_{u}). \]
Then, the concatenation \([\mathbf{h}_{\mathrm{[CLS]}};\mathbf{u}]\) is fed to the authenticity head:
\[ \hat{\mathbf{y}}_{\mathrm{fake}} = \mathrm{softmax}\big(\mathbf{W}_{\mathrm{fake}}\,[\mathbf{h}_{\mathrm{[CLS]}};\mathbf{u}] + \mathbf{b}_{\mathrm{fake}}\big). \]
Hence, the authenticity classification layer sees not only the base embedding but also authorship information encoded in \(\mathbf{u}\). The entire forward pass is differentiable end-to-end, allowing authorship gradients to propagate back into the encoder while also steering the second-stage authenticity head.
Figure 8 represents the overall flow of the cascade process and highlights the difference between the approaches.
If \(\mathcal{L}_{\mathrm{auth}}\) and \(\mathcal{L}_{\mathrm{fake}}\) are cross-entropy losses for each subtask (using authorship labels and authenticity labels, respectively), then the single-pass model minimizes
\[ \mathcal{L}_{\mathrm{cascade}} = \mathcal{L}_{\mathrm{auth}} + \mathcal{L}_{\mathrm{fake}}, \]
similar to a multitask scenario, but with an internal cascade from authorship logits into authenticity.
By enforcing this stepwise or cascade logic, hierarchical approaches may exploit the prior knowledge that AI-written texts differ in style or distribution and that misleading articles may be more prevalent among certain authorship types. Thus, the second task (authenticity) can condition on the first, often improving overall performance. Nevertheless, the results in the next section will confirm the extent to which hierarchical constraints outperform simpler multitask baselines.
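A minimal end-to-end sketch of the single-pass cascade follows; the projection width and module names are illustrative assumptions, but the flow matches the equations above (authorship logits are projected and concatenated with the pooled embedding before the authenticity head).

```python
# Single-pass cascade sketch: authorship logits condition the authenticity head in one forward pass.
import torch
import torch.nn as nn
from transformers import AutoModel

class CascadeModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased", proj_dim=64):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.auth_head = nn.Linear(hidden, 2)                              # authorship logits z_auth
        self.auth_proj = nn.Sequential(nn.Linear(2, proj_dim), nn.ReLU())  # u = ReLU(W_u z_auth + b_u)
        self.fake_head = nn.Linear(hidden + proj_dim, 2)                   # authenticity head over [h; u]

    def forward(self, input_ids, attention_mask, authorship_labels=None, authenticity_labels=None):
        cls = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        auth_logits = self.auth_head(cls)
        u = self.auth_proj(auth_logits)
        fake_logits = self.fake_head(torch.cat([cls, u], dim=-1))
        loss = None
        if authorship_labels is not None and authenticity_labels is not None:
            ce = nn.functional.cross_entropy
            loss = ce(auth_logits, authorship_labels) + ce(fake_logits, authenticity_labels)
        return loss, auth_logits, fake_logits
```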
4.3. Stylometric Feature Analysis
We additionally perform stylometric feature extraction to investigate whether certain lexical, syntactic, or readability attributes are indicative of human vs. AI authorship or fake vs. real authenticity. These features are inspired by prior studies in the domain of stylometric-based classification networks [
47], and each feature contributes to a comprehensive profile of the text, encapsulating various aspects of its structure and composition [
48]. Below, we first provide a concise list of the features and their mathematical definitions in
Table 4 and then offer an expanded description of each feature.
The following are the detailed explanations of the features:
Word Count, Unique Word Count, Character Count: These capture the basic lexical footprint of each text. Word Count (W) is the total token count, Unique Word Count (U) measures vocabulary breadth, and Character Count (C) reflects orthographic length.
TTR and Hapax Legomenon: TTR (Type-Token Ratio) gauges lexical diversity, while Hapax Legomenon checks how many tokens appear exactly once, possibly indicating subject-specific or AI-like lexical patterns.
Sentence Count, Average Sentence Length, Average Sentence Complexity: Each measure describes syntactic distribution. Sentence Count simply counts the total sentences (S), while Average Sentence Length and AvgSentenceComplexity parse how dense or frequent these sentences are relative to word counts.
Punctuation Count: This summarizes how frequently punctuation tokens appear, which can be distinctive for emotional or AI-generated texts (e.g., repetitive exclamation marks).
Noun Count, Verb Count, Adjective Count, Adverb Count: Part-of-speech tallies can reveal shifts in style or usage, e.g., certain content might rely heavily on adjectives for sensational language (common in fake news or AI descriptions).
Stopword Count: This measures how many tokens come from a standard stopword list (the, and, of), which can reflect typical English usage.
Complex Sentence Count: This tracks how many sentences have additional subordinate or conjunctive clauses, often flagged by relations like advcl or conj in dependency parsing.
Question Mark Count, Exclamation Mark Count: These tally rhetorical or emotional punctuation, potential hints of sensational or interactive text (AI prompts or clickbait).
Flesch Reading Ease, Gunning Fog Index: Classic readability formulas. Flesch Reading Ease uses average sentence length and average word length, whereas Gunning Fog Index leverages sentence complexity.
First Person Pronoun Count: This identifies I, we, our, etc., which can differ between personal or AI-generated writing.
Person Entity Count, Date Entity Count: Named Entity Recognition tallies, checking if the mention of people or dates correlates with authenticity or authorship cues.
Uniqueness Bigram, Uniqueness Trigram: The ratio of distinct bigrams/trigrams to total. High uniqueness might suggest creative or unusual phrasing; low uniqueness could indicate repeated patterns.
SyntaxVariety: The number of unique POS tags (e.g., NOUN, VERB, ADV), reflecting diversity in syntactic structures.
After computing these stylometric attributes, we scale the resulting feature vectors (e.g., via MinMax scaling) and feed them into classical ML algorithms or shallow neural networks. This approach can shed light on which linguistic cues dominate classification. For instance, when we use a Random Forest or Logistic Regression model, we can derive feature importance values or coefficient weights, thus explaining whether frequent punctuation, certain entity mentions, or high sentence complexity correlates strongly with fake vs. real or human vs. AI predictions. Consequently, stylometric analysis offers an explainable contrast to purely deep contextual embeddings, potentially highlighting domain-specific patterns that advanced PLMs might learn in a less interpretable way.
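The sketch below computes a representative subset of the Table 4 features with NLTK and textstat, scales them, and fits a Random Forest to expose feature importances; the toy texts and labels are placeholders for the FAANR splits, and the exact formulas in Table 4 may differ slightly from these library defaults.

```python
# Stylometric baseline sketch: a few Table 4 features -> MinMax scaling -> Random Forest importances.
import string
import numpy as np
import textstat
from nltk import word_tokenize, sent_tokenize, pos_tag   # requires punkt + POS tagger data
from nltk.corpus import stopwords                         # requires stopwords data
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

STOPWORDS = set(stopwords.words("english"))

def stylometric_features(text):
    words = word_tokenize(text)
    sents = sent_tokenize(text)
    tags = {tag for _, tag in pos_tag(words)}
    return [
        len(words),                                        # Word Count
        len(set(words)) / max(len(words), 1),              # Type-Token Ratio
        len(sents),                                        # Sentence Count
        len(words) / max(len(sents), 1),                   # Average Sentence Length
        sum(w in string.punctuation for w in words),       # Punctuation Count
        sum(w.lower() in STOPWORDS for w in words),        # Stopword Count
        textstat.flesch_reading_ease(text),                # Flesch Reading Ease
        textstat.gunning_fog(text),                        # Gunning Fog Index
        len(tags),                                         # SyntaxVariety (unique POS tags)
    ]

texts = ["A short real example article.", "A short fabricated example article!"]  # placeholders
labels = [0, 1]                                                                   # placeholders
X = MinMaxScaler().fit_transform(np.array([stylometric_features(t) for t in texts]))
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, labels)
print(clf.feature_importances_)
```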
In the next section, we compare these feature-based classifiers against our multitask PLM approaches. We find that certain stylometric cues (e.g., frequent first-person pronouns, low lexical diversity) can partially reveal AI authorship, while strong readability variation or emotional punctuation can hint at manipulative fake news. Ultimately, these stylometric signals can complement or explain model decisions, even if end-to-end PLMs yield higher raw accuracy.
4.4. Prompt-Based Classifiers
In addition to supervised neural models and stylometric approaches, we employ prompt-based classification with large language models, specifically GPT-4o and Llama3.2. Rather than training a classifier directly, we issue targeted system instructions and user prompts, asking the model to output a single label (AI vs. Human or Real vs. Fake). This leverages the LLM’s internal knowledge from pretraining, enabling classification in a near-zero-shot setting. Following are the LLMs we selected.
GPT-4o: A variant of GPT-4 accessed via a chat-completions API. It offers high-quality outputs at the expense of potential rate or cost limitations.
Llama3.2: An open-source (or locally hosted) model, also responding to system plus user prompts for classification tasks.
We craft two specialized tasks: AI vs. Human for authorship and Real vs. Fake for authenticity. Each prompt pair consists of a system instruction defining the classification role and a user prompt providing the article title and text and then requesting a direct label.
Table 5 summarizes these templates.
At inference, we insert an article’s {title} and {text} into the relevant user prompt and then send both system instruction and user prompt to GPT-4o or Llama3.2. The LLM responds with a single token (“AI” or “Human”, “Real” or “Fake”). Temperature is set to zero (temperature = 0), aiming for deterministic behavior. We record these outputs as predicted labels and evaluate them against ground truth.
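For illustration, the snippet below issues the authorship query to GPT-4o through the OpenAI Python client (an assumed interface; the Llama3.2 call is analogous); the system and user strings paraphrase, rather than reproduce, the templates in Table 5.

```python
# Prompt-based authorship classification sketch (zero-shot, temperature = 0).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

def classify_authorship(title: str, text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic single-label output
        messages=[
            {"role": "system",
             "content": "You decide whether a news article was written by a human or by an AI. "
                        "Answer with exactly one word: AI or Human."},
            {"role": "user", "content": f"Title: {title}\n\nText: {text}\n\nLabel:"},
        ],
    )
    return response.choices[0].message.content.strip()
```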
Compared to fine-tuned or feature-based models, LLM prompts require minimal additional data or training steps—each classification query essentially leverages the LLM’s pretraining. However, the method can be sensitive to prompt phrasing, and it may produce off-format answers if not carefully constrained. Nonetheless, prompt-based classification can be deployed quickly for new tasks, provided the instructions are unambiguous and the LLM comprehends the textual domain. We later discuss how these responses compare to supervised baselines in the results section, assessing whether LLMs can robustly infer authenticity and authorship from a straightforward question-and-answer prompt.
5. Experiments
To rigorously evaluate the proposed methods, we conduct a series of controlled experiments that aim to gauge effectiveness in both authenticity (fake vs. real) and authorship (human vs. AI) classification. In this section, we describe the experimental design, including data splits, training and validation protocols, hyperparameter configurations, and the metrics used for evaluation. We also provide details on ablation studies and other relevant implementation aspects. Following this comprehensive overview, we present all quantitative and qualitative findings in the subsequent Results Section, where we compare the performance of our novel approaches against relevant baselines and discuss the implications in detail.
5.1. Experimental Setup
All experiments were conducted on a system equipped with an NVIDIA A4000 GPU featuring 24 GB of VRAM and 2 virtual CPUs (vCPUs). This setup provided sufficient memory to handle large PLMs (e.g., BERT, DistilBERT, GPT-based architectures) and multitask or hierarchical pipelines without frequent out-of-memory errors. We relied on the Hugging Face Transformers library for model definitions, tokenization, and training routines, and we leveraged the built-in Trainer class to streamline fine-tuning, evaluation, and checkpoint management.
For single-task experiments, we typically used a batch size of 8, trained for 3 epochs, and set load-best-model-at-end = True to ensure that the best checkpoint according to validation loss was restored before final evaluation. The same batch size, number of epochs, and best-model restoration logic were applied to multitask and hierarchical models as well, though we occasionally experimented with ablation runs (e.g., removing stylometric features, adjusting task-specific loss weights) under similar or identical hyperparameters to maintain comparability. Our logging interval (logging-steps = 50) provided enough granularity to monitor training progress without incurring excessive overhead.
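The configuration below sketches these settings with Hugging Face TrainingArguments; note that the evaluation-strategy argument is named eval_strategy in recent Transformers releases and evaluation_strategy in older ones, and the output directory is a placeholder.

```python
# Trainer configuration mirroring the setup described above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",            # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",               # 'evaluation_strategy' in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,         # restore the best validation checkpoint before final evaluation
    metric_for_best_model="eval_loss",
    logging_steps=50,
)
# args is then passed to Trainer(model=..., args=args, train_dataset=..., eval_dataset=...)
# for the single-task, multitask, and hierarchical runs alike.
```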
Additionally, we occasionally invoked ablation protocols that removed or altered certain features (e.g., stylometric input, shared layers in a multitask architecture) while preserving the same batch size, number of epochs, and validation strategy. This consistency allowed us to draw fair comparisons across different approaches without conflating changes in hyperparameters with changes in model design.
5.2. Traditional Baseline Using Classical ML Methods
Our traditional (baseline) approach employs bag-of-words representations with TF-IDF weighting to classify texts as either fake vs. real (authenticity) or human vs. AI (authorship). In our TF-IDF baseline, each document is first transformed into a high-dimensional vector \(\mathbf{x} \in \mathbb{R}^{d}\), where \(d\) corresponds to the size of the TF-IDF feature space (e.g., unigrams or n-grams). We then feed \(\mathbf{x}\) into one of several classical machine learning classifiers to predict the label \(y\) (e.g., fake vs. real or human vs. AI). An overview of the approach is depicted in Figure 9. Specifically, we explore the following algorithms:
First, we consider Logistic Regression. Given a weight vector \(\mathbf{w}\) and bias \(b\), logistic regression models the probability that \(y = 1\) as
\[ P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b), \]
where \(\sigma(z) = 1/(1 + e^{-z})\) is the sigmoid function. The prediction \(\hat{y}\) is then obtained by thresholding \(P(y = 1 \mid \mathbf{x})\) at 0.5.
Next, Random Forest is an ensemble of \(M\) decision trees. Each tree is trained on a bootstrapped sample of the data and uses a subset of features to split nodes. The final classification is determined by majority vote among the \(M\) trees:
\[ \hat{y} = \mathrm{mode}\{T_1(\mathbf{x}), T_2(\mathbf{x}), \ldots, T_M(\mathbf{x})\}. \]
Further, an SVC attempts to find a maximum-margin hyperplane in \(\mathbb{R}^{d}\). In its basic linear form,
\[ \hat{y} = \mathrm{sign}(\mathbf{w}^{\top}\mathbf{x} + b), \]
though we often employ a kernelized SVC for non-linear decision boundaries (e.g., the RBF kernel).
Also, Multinomial Naive Bayes assumes word occurrences follow a multinomial distribution and treats features independently given the class. For a vocabulary of size \(d\), the posterior probability of class \(k\) given \(\mathbf{x}\) is
\[ P(y = k \mid \mathbf{x}) \propto P(y = k)\prod_{j=1}^{d} P(w_j \mid y = k)^{x_j}, \]
where \(x_j\) is the count (or TF-IDF weight) of the \(j\)-th token and \(P(w_j \mid y = k)\) is estimated from training data.
Moreover, XGBoost is a gradient boosting framework that incrementally fits an ensemble of decision trees to minimize a specified loss (often the logistic loss for classification). Let \(F_{m-1}(\mathbf{x})\) be the ensemble’s prediction after \(m-1\) trees; the \(m\)-th tree is fitted to the negative gradient of the loss with respect to \(F_{m-1}(\mathbf{x})\), producing \(h_m(\mathbf{x})\). The model update is then
\[ F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta\, h_m(\mathbf{x}), \]
where \(\eta\) is the learning rate. The final prediction is obtained by the sign or threshold of \(F_M(\mathbf{x})\) after \(M\) trees.
All of these methods take as input the same TF-IDF vectors derived from lowercased, tokenized text. We tune hyperparameters (e.g., regularization strength in logistic regression, max depth in XGBoost) via a validation set to maximize classification performance on each specific task (authenticity or authorship). Despite the more limited representational capacity of these models compared to deep neural networks, they serve as a computationally efficient and interpretable baseline that often performs competitively on text classification tasks involving TF-IDF features.
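A compact scikit-learn sketch of this baseline is shown below for the authenticity task; the n-gram range, feature cap, and placeholder texts/labels are illustrative assumptions, and the authorship task reuses the same pipeline with human/AI labels.

```python
# TF-IDF + classical classifier baseline sketch (Logistic Regression shown; others swap in).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train_texts = ["a genuine news report about local policy", "a fabricated claim about a cure"]  # placeholders
train_labels = [0, 1]  # 0 = real, 1 = fake; replace with the FAANR train split

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2), max_features=50000)),
    ("clf", LogisticRegression(max_iter=1000)),  # or RandomForestClassifier, SVC, MultinomialNB, XGBClassifier
])

pipeline.fit(train_texts, train_labels)
print(classification_report(train_labels, pipeline.predict(train_texts)))  # evaluate on the test split in practice
```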
5.3. Ablation Studies
A key contribution of our approach is the flexible architecture that can toggle on or off: (i) the shared layer after the PLM, (ii) the private layers for each task, and (iii) different loss weights for authenticity (fake vs. real) and authorship (human vs. AI). To systematically evaluate which components and hyperparameters lead to gains, we conduct a set of ablation experiments, modifying the configuration in the following ways:
Shared Layer On/Off: When the shared layer is used, we insert an additional linear + ReLU block to transform the pooled BERT output (\(\mathbf{h}_{\mathrm{[CLS]}}\)). If set to False, the model directly passes \(\mathbf{h}_{\mathrm{[CLS]}}\) into either private layers or heads without the extra shared projection.
Mathematically, if the shared layer is used, we compute
\[ \mathbf{z}_s = \mathrm{ReLU}(\mathbf{W}_s\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}_s); \]
otherwise we set \(\mathbf{z}_s = \mathbf{h}_{\mathrm{[CLS]}}\) directly.
Private Layers On/Off: If private layers are turned on, each task obtains its own feed-forward (FF) head, e.g., \(F_p^{\mathrm{auth}}\) for authorship and \(F_p^{\mathrm{fake}}\) for authenticity. Otherwise, the model simply applies two linear heads (one for authenticity, one for authorship) directly to \(\mathbf{z}_s\). In the private layers scenario,
\[ \mathbf{z}_{\mathrm{auth}} = F_p^{\mathrm{auth}}(\mathbf{z}_s), \qquad \mathbf{z}_{\mathrm{fake}} = F_p^{\mathrm{fake}}(\mathbf{z}_s), \]
which then feed two distinct classifier heads. Disabling private layers merges the two tasks’ feed-forward paths so that each classification head sees the same representation.
Loss Weight Modifications: By varying the authenticity loss weight (\(\lambda_{\mathrm{fake}}\)) and the authorship loss weight (\(\lambda_{\mathrm{auth}}\)), we emphasize one task’s loss over the other. Our total loss is
\[ \mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{fake}}\,\mathcal{L}_{\mathrm{fake}} + \lambda_{\mathrm{auth}}\,\mathcal{L}_{\mathrm{auth}}, \]
where each \(\mathcal{L}\) represents cross-entropy for its respective output logits. Typically, \(\lambda_{\mathrm{fake}} = \lambda_{\mathrm{auth}} = 1\) indicates equal importance, but we test an ablation where authenticity is weighted more heavily.
For implementation, each ablation run instantiates the base BERT model and applies the corresponding toggles (a minimal configuration sketch follows below). During training, the forward pass produces both authenticity logits and authorship logits from whichever structure emerges after toggling. These logits go into separate cross-entropy losses, scaled by \(\lambda_{\mathrm{fake}}\) or \(\lambda_{\mathrm{auth}}\), then summed. This design lets us confirm whether the shared block or private layers significantly boost synergy between tasks.
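The configuration sketch below enumerates these toggles; the variant names follow Table 6, while the dictionary keys are illustrative and map onto constructor arguments of an SPSM-style model such as the one sketched earlier.

```python
# Ablation toggle sketch: each variant changes only the architecture flags or loss weights.
ABLATIONS = {
    "full_model":           dict(use_shared=True,  use_private=True,  w_fake=1.0, w_auth=1.0),
    "no_shared_layer":      dict(use_shared=False, use_private=True,  w_fake=1.0, w_auth=1.0),
    "no_private_layers":    dict(use_shared=True,  use_private=False, w_fake=1.0, w_auth=1.0),
    "authenticity_loss_2x": dict(use_shared=True,  use_private=True,  w_fake=2.0, w_auth=1.0),
}

for name, cfg in ABLATIONS.items():
    # every run keeps the same training hyperparameters (3 epochs, batch size 8); only cfg changes
    print(name, cfg)
```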
Table 6 summarizes the ablation variants we explore. In practice, each variant is launched with the same training hyperparameters (3 epochs, batch size 8, and the same learning rate), and a final evaluation is performed on test data for both authorship and authenticity. We detail performance differences in the Results Section.
Table 6 displays the four main variations: (1) full_model sets everything on with equal weighting, (2) no_shared_layer removes the additional linear + ReLU block, (3) no_private_layers omits separate feed-forward transformations for each task, and (4) authenticity_loss_2x doubles the weighting for fake vs. real. By evaluating test accuracy and F1 for each configuration, we ascertain the individual impact of each design element (shared layer, private layers, task weighting) on overall performance.
We anticipate that no_shared_layer and no_private_layers might hinder the model’s ability to capture cross-task or task-specific nuances, thus lowering the synergy between authorship and authenticity. On the other hand, multiplying the authenticity loss weight by 2.0 in authenticity_loss_2x should strengthen fake vs. real classification, albeit possibly at the cost of slightly weaker human vs. AI metrics. Results from these ablation runs, compared to full_model, clarify how each structural or weighting choice contributes to the model’s final accuracy and F1-scores.
6. Results
In this section, we present our empirical findings for classifying authenticity (fake vs. real) and authorship (human vs. AI). Our experiments encompass a broad range of models, starting with baselines, which include not only straightforward approaches (e.g., classical machine learning pipelines) but also some of the most recent and best performing models in the domain. Further, we move into PLM-based approaches, including single-task and multitask setups. We then perform ablation studies to measure how toggling key architectural elements (such as shared or private layers) affects performance on both tasks.
Next, we explore a stylometric analysis, relying solely on handcrafted linguistic features to highlight whether explicit lexical and syntactic cues can match or complement the accuracy of end-to-end neural methods. Finally, we evaluate prompt-based classification with large language models, comparing how zero- or few-shot queries to GPT-4o and Llama3.2 stack up against our supervised or feature-driven methods. Throughout these experiments, we report both accuracy and F1-scores (macro or weighted) to account for any class imbalance in real vs. fake or human vs. AI. In some cases, we also examine precision, recall, and confusion matrices to clarify which error patterns arise most often.
By contrasting all these results, we aim to identify which techniques effectively capture the nuances of authorship and authenticity and which design decisions—hierarchical constraints, prompt-based questions, or stylometric signals—meaningfully enhance performance and explainability. In the paragraphs that follow, we present each method’s outcomes and discuss their comparative advantages.
6.1. Baseline Results
We begin our empirical evaluation with traditional machine learning baselines, which leverage a TF-IDF feature pipeline to classify authenticity (fake vs. real) and authorship (human vs. AI). Our evaluation initially considers five classifiers: Logistic Regression, Random Forest, SVC, Multinomial Naive Bayes, and XGBoost (XGB). Their performance—reported in terms of accuracy, precision, recall, and F1-scores (both weighted and macro)—is summarized in
Table 7.
In addition to these traditional methods, we extend our analysis to include several state-of-the-art models. First, we evaluate graph neural network (GNN)-based algorithms [49] (GCN, GraphSAGE, GIN, and GAT), which treat TF-IDF vectorized articles as nodes in a synthetic graph to capture inter-document relationships. We then consider deep learning architectures from the recent literature [50,51]: CNN, LSTM, and a bi-LSTM attention-based model. In these neural models, fastText [52] embeddings are used for tokenization and feature representation. Finally, for authorship detection, we incorporate two popular machine-generated text (MGT) detection algorithms, GPTZero [53] and DetectGPT [16].
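The sketch below indicates how such a document-graph baseline might be assembled; the k-nearest-neighbour cosine-similarity edges and the two-layer GCN are assumptions for illustration and do not necessarily mirror the exact graph construction of [49].

```python
# Hedged sketch of a document-graph baseline: articles as nodes with TF-IDF features,
# connected by kNN cosine-similarity edges (an assumption; the original construction may differ).
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.neighbors import kneighbors_graph
from torch_geometric.nn import GCNConv

def build_edge_index(tfidf_matrix, k=5):
    # connect each document to its k most similar neighbours under cosine distance
    adj = kneighbors_graph(tfidf_matrix, n_neighbors=k, metric="cosine")
    rows, cols = adj.nonzero()
    return torch.tensor(np.vstack([rows, cols]), dtype=torch.long)

class DocGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=128, num_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):          # x: dense TF-IDF node features
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)        # per-document class logits
```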
Table 7 reveals several trends. Among the traditional baselines, Logistic Regression and XGB yield robust performance, achieving authenticity accuracies of 89.16% and 88.39% and authorship accuracies of 90.91% and 90.21%, respectively. In contrast, deep learning models significantly outperform these methods: the CNN, LSTM, and bi-LSTM attention networks reach authenticity accuracies of 92.92%, 93.37%, and 94.61% and authorship accuracies of 94.53%, 95.20%, and 95.17%, respectively. These results indicate that convolutional and recurrent architectures—especially when augmented with attention mechanisms—are highly effective at capturing complex textual patterns.
On the other hand, the GNN-based models (GCN, GraphSAGE, GAT, and GIN) exhibit considerably lower performance, with authenticity accuracies hovering around 54% and authorship accuracies ranging between 41.47% and 58.53%. This suggests that representing articles as nodes with TF-IDF features and synthetic connections does not capture the nuanced relationships required for accurate classification in these tasks. For authorship detection specifically, the specialized MGT detection algorithms—GPTZero and DetectGPT—yield much lower accuracies (64.74% and 49.89%, respectively). Their subpar performance underscores the challenges these approaches face when distinguishing human-written texts from AI-generated content, as compared to both traditional and deep learning classifiers.
Overall, while traditional methods such as Logistic Regression and XGB remain competitive, the deep learning architectures, particularly the bi-LSTM attention model, deliver superior performance across both authenticity and authorship classification tasks.
6.2. PLM-Based Approaches
In the prior literature, the best-performing strategies for both authenticity and authorship classification have been based on pretrained language models (PLMs) [54,55]. In our work, we have incorporated all of these top-performing strategies to ensure a comprehensive comparison. We now present our PLM-based approaches for both tasks.
Table 8 shows the accuracy, precision, recall, and F1-scores (weighted and macro) achieved by various configurations, including single-task versus multitask fine-tuning, a shared–private sub-network (SPSM) approach, hierarchical variants, and an authenticity_loss_2x configuration that emphasizes the authenticity objective.
The SPSM surpasses naive multitask or separate training (96.97% vs. 95–96% authenticity), indicating that partial parameter sharing plus private sub-networks effectively capture cross-task cues.
Notably, authenticity_loss_2x improves authenticity accuracy to 97.09%—a jump from the original BERT metrics—by weighting authenticity’s cross-entropy loss more heavily. Though this also slightly elevates authorship performance (98.92% accuracy), the difference is smaller, highlighting how scaling one task’s objective can yield an overall synergy without excessively penalizing the other task.
Meanwhile, DistilBERT (Multitask) and ELECTRA (Multitask) remain competitive, with ELECTRA reaching 96.84% for authenticity and 98.80% for authorship. Finally, the Hierarchical A and B variants show how authorship signals can be used to drive authenticity classification in a single-pass or two-stage approach, with Hierarchical B surpassing basic multitask BERT in authenticity by about 0.26 percentage points (96.20% vs. 95.94%).
Overall, these results reinforce that multitask architectures leveraging shared representation and task-specific specialization can exceed simpler single-task or naive two-head solutions. Moreover, adjusting loss weights (e.g., doubling authenticity’s contribution) can significantly boost that particular task’s performance without unduly harming the other classification goal.
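To make the shared–private design concrete, the following sketch shows one way such a multitask head can sit on top of a BERT encoder; the layer sizes, pooling choice, and loss weights are illustrative assumptions rather than the tuned configuration reported in Table 8. Setting authenticity_weight to 2.0 in this sketch corresponds to the authenticity_loss_2x configuration discussed above.

```python
# Hedged sketch of a shared–private multitask head on a BERT encoder (illustrative).
import torch
import torch.nn as nn
from transformers import AutoModel

class SPSM(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", hidden=768,
                 authenticity_weight=1.0, authorship_weight=1.0):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # shared feed-forward block capturing cross-task linguistic cues
        self.shared = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        # private per-task sub-networks preserving task-specific capacity
        self.private_authenticity = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.private_authorship = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.head_authenticity = nn.Linear(hidden, 2)   # fake vs. real
        self.head_authorship = nn.Linear(hidden, 2)     # human vs. AI
        self.w_auth, self.w_author = authenticity_weight, authorship_weight
        self.ce = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, y_auth=None, y_author=None):
        pooled = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        shared = self.shared(pooled)
        logits_auth = self.head_authenticity(self.private_authenticity(shared))
        logits_author = self.head_authorship(self.private_authorship(shared))
        loss = None
        if y_auth is not None and y_author is not None:
            loss = (self.w_auth * self.ce(logits_auth, y_auth)
                    + self.w_author * self.ce(logits_author, y_author))
        return logits_auth, logits_author, loss
```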
6.3. Ablation Studies
To isolate the impact of shared layers, private layers, and loss weighting, we carry out a series of ablation experiments.
Table 9 summarizes how toggling each component affects authenticity (fake vs. real) and authorship (human vs. AI) metrics. Below, we discuss the four configurations:
full_model: It enables both shared and private layers with the default, equal loss weighting (a weight of 1.0 on each task's loss term).
no_shared_layer: It disables the shared feed-forward transformation while keeping private layers.
no_private_layers: It retains the shared layer but removes per-task private layers, mapping the shared representation directly to logits.
authenticity_loss_2x: It doubles the authenticity loss weight (from 1.0 to 2.0), amplifying the fake vs. real objective (a minimal configuration sketch follows this list).
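The configuration sketch below maps these four variants to concrete toggles; the flag names are hypothetical and chosen purely to make the ablation switches explicit.

```python
# Illustrative mapping of the ablation configurations to model/loss settings.
from dataclasses import dataclass

@dataclass
class AblationConfig:
    use_shared_layer: bool = True
    use_private_layers: bool = True
    authenticity_weight: float = 1.0
    authorship_weight: float = 1.0

CONFIGS = {
    "full_model":           AblationConfig(),
    "no_shared_layer":      AblationConfig(use_shared_layer=False),
    "no_private_layers":    AblationConfig(use_private_layers=False),
    "authenticity_loss_2x": AblationConfig(authenticity_weight=2.0),
}
```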
The full_model configuration achieves 96.95% authenticity accuracy and 98.53% authorship accuracy, validating that including both shared and private layers balances synergy and task-specific nuance. no_shared_layer slightly reduces authenticity accuracy to 96.71%, indicating that the additional feed-forward block does help the general representation. Interestingly, no_private_layers yields metrics identical to the full model in this case, suggesting that the shared feed-forward transformation alone can suffice under certain data conditions.
Finally, authenticity_loss_2x raises authenticity accuracy to 97.09% while also nudging authorship up to 98.92%, showing that boosting one task's loss weight (here, doubling the authenticity term) can still improve joint performance without noticeably harming the secondary objective. This further underscores that controlling relative loss terms can help a multitask network prioritize the more difficult task (fake vs. real in this scenario) while retaining high performance on authorship.
Figure 10 presents a superimposed radar chart comparing the performance of four model variants across multiple authenticity and authorship metrics. Each axis represents an individual metric, with higher values toward the outer edge indicating better performance. To accentuate subtle differences, the radial axis is limited to the range 96–100. The authenticity_loss_2x variant, highlighted in purple, demonstrates consistently strong performance across all metrics. In contrast, the ablated variants—where either shared or private layers are removed—exhibit declines in specific areas. For example, the removal of shared layers primarily affects authenticity metrics, suggesting that shared representations are critical for accurate authenticity classification. Similarly, the drop in authorship-related performance observed in the variant lacking private layers indicates that these layers are important for capturing nuances required to distinguish between authors.
6.4. Stylometric-Based Approaches
We next evaluate how well stylometric feature vectors (Section 4.3) alone can classify authenticity (fake vs. real) and authorship (human vs. AI). Table 10 presents performance for a shallow neural network (Stylometric NN) and three classical ML algorithms (Logistic Regression, Random Forest, SVC) trained solely on handcrafted linguistic attributes (e.g., word count, TTR, punctuation usage, reading ease).
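The following sketch illustrates a stylometric feature extractor of this kind; the feature subset shown (word count, type–token ratio, punctuation usage, reading ease) is representative rather than exhaustive, and the readability score assumes the third-party textstat package.

```python
# Hedged sketch of a stylometric feature extractor (a representative subset of features).
import string
import textstat

def stylometric_features(text: str) -> dict:
    tokens = text.split()
    n_words = len(tokens)
    type_token_ratio = len({t.lower() for t in tokens}) / max(n_words, 1)
    punct_count = sum(text.count(p) for p in string.punctuation)
    return {
        "word_count": n_words,
        "avg_word_length": sum(len(t) for t in tokens) / max(n_words, 1),
        "type_token_ratio": type_token_ratio,
        "exclamation_marks": text.count("!"),
        "question_marks": text.count("?"),
        "punctuation_per_word": punct_count / max(n_words, 1),
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
    }
```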
For authenticity, both the shallow NN and Random Forest surpass 86% accuracy, suggesting that certain textual cues (e.g., average word length, punctuation usage, or entity counts) correlate with whether an article is real or fake. In contrast, Stylometric LR lags behind, possibly due to linear constraints on feature interactions. Meanwhile, Stylometric SVC and Stylometric RF offer comparable authenticity metrics (around 85–86%), showing that tree-based or margin-based classifiers can partially exploit these linguistic signals.
For authorship, performance is notably lower—ranging from 55–58% for the classical methods to 85.06% for the NN. This indicates that identifying AI vs. human texts purely from stylometric attributes is more challenging than detecting factual consistency. The shallow NN, possibly benefiting from non-linear transformations of the stylometric features, outperforms the classical ML models by a wide margin. This underscores that while stylometry alone may capture some aspects of AI writing, a more flexible (e.g., neural) model is needed to glean the subtler patterns.
Although stylometric classifiers remain behind PLM-based pipelines (Section 6.2), they provide interpretability and computational efficiency. Feature importances (e.g., from Random Forest or logistic coefficients) can explain whether, for instance, an unusual ratio of first-person pronouns or a high Gunning Fog Index signals AI authorship or misinformation. As a standalone method, stylometry yields moderate success—particularly for authenticity—but struggles more with authorship classification, possibly because advanced LLMs adopt varied and increasingly human-like styles. Nonetheless, these results highlight how explicit linguistic features can still distinguish certain phenomena in real vs. fake or human vs. AI content.
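As a simple illustration of this interpretability, the sketch below reads the top importances off a Random Forest; the random toy data stand in for the real stylometric feature matrix.

```python
# Minimal sketch of extracting interpretable importances from a stylometric Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_features(clf, feature_names, k=10):
    order = np.argsort(clf.feature_importances_)[::-1][:k]
    return [(feature_names[i], float(clf.feature_importances_[i])) for i in order]

# Toy illustration in place of the real stylometric matrix:
feature_names = ["word_count", "type_token_ratio", "exclamation_marks", "flesch_reading_ease"]
X = np.random.rand(200, len(feature_names))
y = np.random.randint(0, 2, size=200)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(top_features(rf, feature_names, k=4))
```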
6.5. Prompt-Based Classification
Finally, we explore prompt-based classification using large language models, specifically a version of GPT (labeled Prompt classifier gpt) and Llama3.2 (Prompt classifier llama). Rather than fine-tuning, each article’s title and text are fed into a carefully designed system plus user prompt (Section 4.4), and the model returns a short label (Real or Fake, Human or AI). Table 11 lists the accuracy, precision, recall, and F1-scores (weighted and macro) for both authenticity and authorship tasks under these zero-/few-shot paradigms.
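The sketch below shows the general shape of such a prompt-based classifier using the OpenAI chat-completions interface; the prompt wording is illustrative only (the actual prompts are described in Section 4.4), and an analogous call can be issued to a locally hosted Llama3.2 model.

```python
# Hedged sketch of zero-shot authenticity classification via a system + user prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify_authenticity(title: str, text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a fact-checking assistant. Answer with a single word: Real or Fake."},
            {"role": "user",
             "content": f"Title: {title}\n\nArticle: {text}\n\nIs this article Real or Fake?"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```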
For authenticity, the GPT-based prompt classification achieves a respectable 88.95% accuracy, with strong precision and recall. However, the same GPT prompt falls sharply on authorship accuracy (51.86%), suggesting the model struggles to discern subtle stylistic cues of AI vs. human text under the provided instructions. Conversely, Llama3.2 yields only 73.08% authenticity accuracy but obtains a slightly higher authorship accuracy (52.77%), though with lower F1 metrics due to class imbalances and off-format answers in some cases.
These results confirm that while LLM prompts can perform reasonably well on simpler tasks (like identifying factual consistency), distinguishing human vs. AI writing may require more carefully engineered prompts or additional context. The typical zero-shot or few-shot setup also lacks the domain-specific tuning that our supervised or multitask models enjoy. Nevertheless, such prompt-based classifiers remain appealing for quick deployment, especially if training data or computation for fine-tuning are limited, and can serve as a flexible baseline for new tasks with minimal overhead.
7. Further Analysis of Experimental Results
Overall, our results highlight a clear performance gap between classical or stylometric methods and the top-scoring PLM-based architectures (see Figure 11). Proposed models with the SPSM architecture secure the highest accuracy on both authorship (human vs. AI) and authenticity (fake vs. real). In contrast, stylometric-only or traditional ML baselines typically hover below 90% accuracy for authenticity and even lower for authorship—suggesting that detecting AI-written text from handcrafted features alone can be more challenging.
A closer inspection of feature importance (shown in the top and bottom panels of Figure 12) helps explain why stylometric methods can still provide meaningful signals for each task, even if they do not match PLM performance. For authorship, features like Date Entity Count, Noun Count, and Syntax Variety rank among the highest in importance scores. This suggests that AI-generated text might differ slightly in how it employs (or omits) date references, uses certain noun constructions, or varies its part-of-speech patterns. Seeing Complex Sentence Count and Average Word Length relatively high also indicates that advanced LLMs may simulate complex grammar but do so in ways detectably different from human authors. Hence, while these explicit counts do not guarantee perfect classification, they do reveal observable linguistic traits that can be used for partial explainability—e.g., if a model leans on unusual date usage to flag an article as AI-written.
For authenticity, the lower panel in Figure 12 shows that exclamation marks, question marks, and certain entity counts (particularly Date Entity Count and Person Entity Count) can prove indicative of fake vs. real. Evidently, emotional emphasis (many exclamation marks) or overuse of dates/people might signal attempts at sensationalism or fabricated storytelling. However, these features alone sometimes fail to capture the deeper context that PLMs learn from massive pretraining—explaining why stylometric models plateau around the mid-80% range of authenticity accuracy.
In terms of architectural differences, hierarchical architectures and the SPSM significantly outperform naive two-head multitask or separate-task baselines (Figure 11). The hierarchical models impose a cascade where authorship logits guide authenticity classification (or vice versa), which can boost synergy if AI writing correlates strongly with misinformation styles. Meanwhile, shared–private layers allow a common feed-forward network to model universal language cues while devoting private layers to each label’s unique nuances. This design often yields a higher synergy than purely “two tasks, two heads”, as it enforces partial parameter sharing and preserves specialized capacity for each objective.
Lastly, prompt-based classifiers using GPT or Llama for zero-/few-shot classification exhibit variable success. They score moderate to strong accuracy on authenticity—arguably because factual correctness can be partially inferred by large LLMs—but they fall behind for authorship (especially Llama) with near 50–55% accuracy. This discrepancy indicates that distinguishing AI from human text requires more precise prompting or additional training. In all, the SPSM(authenticity_loss_2x) approach shows that weighting tasks differently can improve performance on the more difficult classification (fake vs. real) while still retaining high performance on the secondary authorship task.
In summary, these findings illustrate that PLM-based pipelines—particularly those with multitask synergy and carefully tuned architectures—consistently yield top-tier results. Stylistic features, though less powerful in isolation, provide valuable explanatory insights: we can pinpoint which punctuation marks, entity references, or syntactic complexities signal AI authorship or potential misinformation. Meanwhile, hierarchical and shared–private models leverage both task interdependence and partial specialization to outperform naive baselines, reaffirming that how tasks share or separate parameters profoundly affects real-world classification accuracy.
7.1. Computational and Resource Requirements for Deployment
The SPSM multitask framework has substantial computational requirements owing to the large parameter count of its transformer backbone. A standard BERT-base model consists of 110 million parameters, with a hidden size of 768 and 12 attention heads, resulting in a memory footprint of approximately 400 MB when stored on disk. The BERT-large variant, which scales up to 340 million parameters with 1024 hidden dimensions and 24 attention heads, requires even more computational power, often demanding 32 GB+ of GPU VRAM for training. Since the SPSM shares a single transformer encoder across multiple tasks, the additional parameters come primarily from task-specific classification heads, each adding 1–2 million parameters per task. While this setup reduces the total parameter count compared to maintaining a separate model per task, the overall computational demand still grows with the number of tasks and therefore requires careful optimization.
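These figures can be sanity-checked with a short script such as the one below; it is an illustrative check of the encoder alone and does not include the SPSM classification heads.

```python
# Rough sanity check of the parameter and memory figures quoted above.
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")
n_params = sum(p.numel() for p in encoder.parameters())
print(f"Encoder parameters  : {n_params / 1e6:.1f} M")       # roughly 110 M
print(f"FP32 weights on disk: {n_params * 4 / 1e6:.0f} MB")   # ~4 bytes per parameter
```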
Training is computationally intensive due to the quadratic complexity, O(n²), of self-attention with respect to sequence length, which makes processing long input sequences expensive in both memory and computation time. A typical BERT-based SPSM with two to three tasks requires at least 16 GB of GPU VRAM (e.g., NVIDIA RTX 3090, A100, or V100) to handle batch sizes of 16–32 efficiently. For larger models like BERT-large, training becomes impractical on consumer-grade GPUs, necessitating gradient accumulation or very small batch sizes (≤8) to avoid out-of-memory (OOM) errors. On CPU-only systems, training is virtually infeasible, as the heavy matrix multiplications in the self-attention layers lead to training times that can stretch over several days per epoch.
Inference with the SPSM also presents scalability challenges, especially in real-time applications. On high-end GPUs such as the NVIDIA A100 or RTX 4090, inference times range from 20 to 50 ms per query, making deployment feasible in cloud-based environments. However, on mid-range GPUs (RTX 3060, T4), inference latency increases to 50–100 ms per query, while on a CPU-only system it can take 300–500 ms per query, making real-time response impractical. The problem worsens for edge devices such as the Jetson Nano, Raspberry Pi, or mobile processors, where memory limitations (1–2 GB of RAM per model instance) and slow processing speeds (1–2 s per query) severely limit feasibility. Furthermore, high power consumption makes battery-powered deployment inefficient, significantly reducing device runtime.
The primary challenges in deploying models in real time or on edge devices stem from high memory usage, latency issues, and power constraints. Real-time applications, such as fraud detection, chatbots, and speech recognition, require inference times below 10 ms, which BERT-based models struggle to achieve without optimization. To make deployment practical, various optimization techniques can be applied, including model distillation (TinyBERT, MiniLM), quantization (reducing model precision to 8-bit), and hardware acceleration (using TensorRT, ONNX for faster inference). Without such optimization, deploying the SPSM directly on edge devices remains impractical, making cloud-based hosting the most viable option for real-world applications requiring multitask transformers.
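As one example of the optimizations mentioned above, the sketch below applies post-training dynamic INT8 quantization to the linear layers of a BERT classifier; this is a generic PyTorch recipe shown for illustration, with a plain BERT classifier standing in for a trained SPSM checkpoint, not the exact deployment pipeline used in this work.

```python
# Generic post-training dynamic INT8 quantization of linear layers (illustrative).
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)            # stand-in for a trained checkpoint
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)  # quantize only the Linear layers
torch.save(quantized.state_dict(), "bert_int8.pt")  # noticeably smaller than FP32
```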
7.2. Explainability Using LIME and SHAP
In order to provide deeper insight into the model’s decision-making process for both the authenticity (real vs. fake) and authorship (human vs. AI) tasks, we employed two popular post hoc explainability methods: LIME (Local Interpretable Model-Agnostic Explanations) and SHAP (SHapley Additive exPlanations) on our best performing SPSM model. Below, we summarize the key findings from both local and global perspectives.
Figure 13 shows the top 30 global LIME features for both tasks. On the x-axis, we have the average LIME weight of each word, and on the y-axis, we list the most influential words. Positive bars (to the right) indicate that a word contributes positively to predicting the class at index 1 (e.g., fake for authenticity or AI for authorship), whereas negative bars (to the left) indicate a contribution toward the class at index 0 (e.g., real or human).
Several interesting observations can be drawn: Words such as “russia”, “taiwan”, and “propaganda” appear to strongly influence the authenticity classifier, suggesting that the model associates these terms more with “fake” content. For authorship, words like “emerging” and “agency” show high positive weights, indicating they push the model’s decision toward AI-generated text, whereas words such as “none” and “senator” steer the model toward human-generated text.
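For reference, the sketch below shows how such LIME explanations can be produced; predict_proba is a hypothetical wrapper that returns class probabilities from the SPSM authenticity head for a list of texts.

```python
# Sketch of local LIME explanations; predict_proba is a hypothetical wrapper around
# the SPSM authenticity head returning an (n_samples, 2) array of class probabilities.
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["real", "fake"])

def explain_article(text, predict_proba, num_features=30):
    exp = explainer.explain_instance(text, predict_proba, num_features=num_features)
    return exp.as_list()  # [(token, weight), ...]; positive weights push toward "fake"

# Global importances (as in Figure 13) can be approximated by averaging the per-token
# weights returned here over a large sample of explained test articles.
```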
Figure 14 displays the global SHAP feature importance for authenticity, showing the top 30 tokens ranked by their average absolute SHAP value. Larger values indicate greater impact on the prediction (real vs. fake). Common words (e.g., “the”, “of”, “and”) appear among the top tokens, likely because they frequently co-occur in certain linguistic patterns within either real or fake news articles. Proper nouns such as “trump” and “donald” have a high impact, suggesting the classifier uses political references to distinguish real from fake.
Such tokens can reflect data distribution biases, especially if the training corpus contains articles heavily referencing political figures in fake contexts.
Figure 15 presents a SHAP summary plot over many tokens (the x-axis is the mean absolute SHAP value). It highlights how a large portion of the predictive power can be attributed to a small set of highly influential tokens. Additionally, it underscores the long tail of “other features”, each contributing marginally. This aligns with the intuition that in text classification, many words contribute small, context-dependent signals.
Figure 16 shows a pair of local LIME bar charts for a single test sample, one for authenticity (left) and one for authorship (right). The top portion of each bar chart indicates a feature (token) that pushes the model toward predicting “fake” or “AI”, whereas negative bars push toward “real” or “human.” The predicted probabilities are included in the figure captions. For authenticity, words like “u-”, “congress”, or “raucous” are more strongly associated with the model’s decision to classify the article as real. In contrast, “contentious”, “polling”, and “gamered” nudge the authorship classifier toward AI, illustrating how distinct features can influence the two tasks differently.
Overall, these explainability methods help validate that our model is leveraging semantically meaningful features rather than relying on spurious correlations. Global explanations reveal the top words driving the classifier across the entire dataset, whereas local explanations show which tokens matter most for a single example. These insights can inform data curation (e.g., identifying biases or over-reliance on specific keywords) and guide model improvements in future work.
8. Conclusions
In this study, we proposed and evaluated a comprehensive framework for addressing the challenges of classifying news articles across two dimensions: authenticity (real vs. fake) and authorship (human vs. AI). Our approach spanned multiple methodologies, including traditional machine learning baselines, stylometric feature-based analysis, prompt-based classifiers, and advanced PLM-based architectures with multitask and hierarchical capabilities. Through extensive experimentation, we demonstrated that our proposed SPSM architecture with synergetic layers and hierarchical task dependencies consistently outperforms simpler baselines, achieving state-of-the-art results with accuracies exceeding 96% for authenticity and 98% for authorship. Furthermore, our analysis highlighted the value of stylometric features in providing interpretable insights, revealing linguistic and structural patterns unique to fake news and AI-generated content. These findings underscore the importance of integrating modern architectures with interpretable features for addressing misinformation in the age of AI.
8.1. Limitations
Despite the promising results, this work has several limitations that warrant further investigation. First, our dataset was limited to textual news articles, which may not capture the nuances of multimodal misinformation that often includes images, videos, or other contextual metadata. Incorporating such multimodal data could improve the applicability and robustness of our models. Second, while stylometric features offered interpretability, they were less effective for standalone classification compared to PLM-based methods, especially as AI-generated content continues to evolve in sophistication. This limitation highlights the need for richer feature engineering to capture subtle patterns in modern AI-generated text.
Another limitation lies in the performance of prompt-based classifiers, which were sensitive to prompt design and often underperformed compared to fine-tuned PLMs. This suggests the need for more advanced prompt optimization techniques. Additionally, our PLM-based models, while highly accurate, rely heavily on pretrained models like BERT and ELECTRA, limiting their scalability to less-resourced languages or domains. Finally, the computational cost of training and evaluating these models presents a challenge for deployment in real-time or resource-constrained settings.
Furthermore, the current version of the FAANR dataset was constructed using outputs from a select group of popular, high-performing generative models—specifically, GPT-4, Llama, Gemma, and Mistral. While these models are widely recognized for their performance and were chosen to represent diverse characteristics of AI-generated content, their inclusion may introduce inherent biases. In particular, since these models can share similar stylistic or linguistic features, the dataset might not fully capture the broader spectrum of AI-generated text produced by other emerging architectures. Moreover, as the dataset primarily comprises English-language content, its applicability to non-English or cross-lingual contexts remains limited.
8.2. Future Work
Building on this work, several avenues for future research can be pursued to address the limitations and further advance the field. A key direction is the integration of multimodal data, such as text combined with images, videos, or metadata, using advanced architectures like multimodal transformers or graph neural networks. This would enable a more holistic approach to misinformation detection. Additionally, improving the performance of prompt-based classifiers through automated prompt optimization techniques, such as reinforcement learning or prompt tuning, could make these methods more robust and adaptable to diverse tasks and domains.
Extending this study to multilingual datasets or domain-specific corpora, such as scientific misinformation, is another important direction. Evaluating cross-lingual and cross-domain transferability would provide insights into the generalizability of the proposed methods. Enhancing interpretability remains a critical area, and future work could combine feature importance metrics with attention visualizations to make model predictions more transparent and trustworthy. Future iterations of the FAANR dataset will also seek to incorporate a wider array of generative models, including emerging ones that may offer different stylistic or domain-specific outputs. We plan to expand the dataset by integrating cross-lingual content to enhance its robustness and ensure broader applicability. Further empirical studies will also be conducted to assess the impact of these biases on detection performance, with the aim of refining the dataset and improving the generalizability of the associated detection frameworks. Lastly, the development of lightweight and parameter-efficient architectures, or the pruning of existing models, could reduce computational costs, making these solutions accessible for real-time applications and deployment in low-resource settings.
Moreover, we will focus on developing advanced fusion strategies to better integrate stylometric features with deep transformer-based representations. Our initial experiments using simple concatenation showed a decrease in performance, likely due to the inherent heterogeneity between the explicit, surface-level signals of stylometric features and the deep, context-rich embeddings produced by transformers. To address this, we plan to explore techniques such as attention-based mechanisms and multi-view learning frameworks that can dynamically weight and harmonize these disparate feature sets. Future studies can aim to conduct a more granular analysis of the interactions between stylistic and semantic features, with the goal of enhancing both detection performance and model interpretability in practical deployment scenarios.
In conclusion, this work provides a robust foundation for tackling the dual challenges of misinformation detection and authorship classification in the context of AI-generated content. By leveraging the strengths of advanced PLM-based architectures alongside interpretable stylometric features, we achieved not only high classification performance but also meaningful insights into the linguistic and structural patterns of fake and AI-generated text. These contributions lay the groundwork for future innovations in this rapidly evolving field.