Background: Artificial intelligence poses a critical challenge to the authenticity of journalistic documents.
Objectives: This research proposes a method to automatically identify AI-generated news articles based on various stylistic features.
Methods/Approach: BERTopic was used to extract salient keywords from a corpus of journalistic news articles, and these keywords were then used to prompt Google’s Gemini to generate artificial articles on the same topics. Five machine learning classifiers were trained to distinguish the human-written articles from their AI-generated counterparts based on lexical, syntactic, and readability features.
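A minimal sketch of this keyword-to-prompt generation step is given below, assuming the bertopic and google-generativeai packages; the corpus loader, API key, and Gemini model name are placeholders for illustration, not details taken from the paper.

```python
from bertopic import BERTopic
import google.generativeai as genai

# Hypothetical loader returning a list of human-written news article texts.
news_articles = load_news_corpus()

# Fit BERTopic and collect the salient keywords of each article's topic.
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(news_articles)

# get_topic() returns (keyword, weight) pairs; topic -1 is the outlier bucket.
keywords_per_article = [
    [word for word, _ in topic_model.get_topic(t)] if t != -1 else []
    for t in topics
]

# Prompt Gemini to write a new article around the same keywords.
genai.configure(api_key="YOUR_API_KEY")            # placeholder credential
model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption

def generate_counterpart(keywords: list[str]) -> str:
    prompt = ("Write a news article covering the following topics: "
              + ", ".join(keywords))
    return model.generate_content(prompt).text

ai_articles = [generate_counterpart(kw) for kw in keywords_per_article if kw]
```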
Results: The Random Forest classifier performed best (accuracy = 98.3%, precision = 0.984, recall = 0.983, F1-score = 0.983). Random Forest feature importance, Analysis of Variance (ANOVA), Mutual Information, and Recursive Feature Elimination identified the five most important features: sentence length range, paragraph length coefficient of variation, verb ratio, sentence complex tags, and paragraph length range.
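A hedged scikit-learn sketch of this classification and feature-ranking step follows; the synthetic feature matrix stands in for the paper's stylistic features, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif, RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder stylistic features: rows = articles, columns = features such as
# sentence length range, verb ratio, paragraph length statistics, etc.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))        # synthetic feature matrix
y = rng.integers(0, 2, size=400)      # 0 = human-written, 1 = AI-generated

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Random Forest classifier and its impurity-based feature importances.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
rf_ranking = np.argsort(rf.feature_importances_)[::-1]

# ANOVA F-test and mutual information scores per feature.
f_scores, _ = f_classif(X_train, y_train)
mi_scores = mutual_info_classif(X_train, y_train, random_state=0)

# Recursive Feature Elimination down to the five strongest features.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=5).fit(X_train, y_train)
top_five_mask = rfe.support_
```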
Conclusions: This research introduces an innovative approach to prompt engineering using BERTopic topic modelling and identifies key stylistic features that distinguish AI-generated content from human-written content. It thereby contributes to ongoing efforts to combat disinformation and to enhance the credibility of content in fields such as academic research, education, and journalism.