1. Introduction
Health misinformation, which spreads quickly on social media, poses serious health risks, especially in times of crisis [1,2]. During Coronavirus Disease 2019 (COVID-19), false or misleading claims promoted harmful treatments, such as drinking bleach to prevent or cure the virus, and encouraged anti-vaccination sentiment, placing individual and public health at risk [3,4]. In addition, studies have shown that continuous exposure to misinformation leads to psychological stress and anxiety [2,5]. Although the COVID-19 emergency has ended worldwide, misinformation about the virus still circulates on social media. For example, during the recent war between Iran and Israel, which resulted in the deaths of several Iranian military leaders [6], a false claim emerged that Israel had tracked these commanders through the COVID-19 vaccine [7]. This conspiracy theory, promoted by a cleric practicing Islamic medicine [7], highlights how outdated health misinformation continues to evolve and attach itself to geopolitical events.
Efforts have been made to address ongoing misinformation risks using various machine learning and deep learning methods [8,9,10,11,12]. While these methods are promising, they depend heavily on large, annotated datasets for effective performance [13,14]. Manually labeling datasets is time-consuming and labor-intensive, making these approaches impractical for addressing misinformation, given its dynamic and evolving nature [15].
The advent of generative AI models, especially large language models (LLMs) such as GPT-4 [16], Gemini [17], and Llama [18], has allowed researchers to leverage their potential to address the limitations of ML methods in detecting misinformation [19,20,21,22,23]. LLMs possess broad knowledge and robust reasoning capabilities owing to their training on massive text corpora [24]. In addition, LLMs can perform well with little or no additional data, which reduces their dependency on large, labeled datasets [25,26,27,28]. While LLMs offer valuable approaches to detecting misinformation, they still face some limitations [24,29]. One of their main challenges is a lack of domain-specific knowledge, which can result in hallucination and reduced accuracy in their output [24,30]. While previous work has explored external knowledge sources, such as search engines [21,31,32] and retrieval-based augmentation [33], to address this gap, the potential of debunk lists, which are collections of fact-checked false claims, remains largely unexplored. This study addresses this gap by evaluating how integrating debunk lists as external knowledge sources can improve the performance of LLMs in detecting health misinformation.
The objective of this study is to assess the effectiveness of LLMs in detecting health misinformation when augmented with external COVID-19 knowledge in the form of debunk lists. Specifically, it examines whether leveraging these curated, fact-checked resources enhances the detection accuracy and reliability of LLMs compared to a zero-shot approach. The following research questions were designed to address this objective:
RQ1: To what extent does augmenting LLMs with debunk lists improve their performance in detecting health misinformation?
RQ2: How accurately do LLMs apply debunk lists in their reasoning justifications when identifying health misinformation?
RQ3: How do LLMs apply debunk entities during misinformation detection across different debunk list sizes?
2. Literature Review
The use of LLMs for detecting misinformation has been increasing in recent years. Researchers have explored the use of LLMs for detecting misinformation from different perspectives, including prompting settings [19,34,35], leveraging their multilingual abilities [20,36] and multimodal capabilities [19,37,38], fine-tuning [37,39], and Retrieval-Augmented Generation (RAG) approaches [21,22]. Perline et al. [20] evaluated the performance of GPT-4 using a zero-shot prompting approach on a four-way misinformation classification task in a multilingual setting (English and German). GPT-4, under zero-shot settings, obtained better results than other benchmarks, reaching F1 scores of 42.8% (English) and 38.7% (German). In another study, Chen et al. [35] used NoCoT and zero-shot CoT strategies on GPT-3.5, GPT-4, and Llama2-7b/13b-chat to detect both human-generated and LLM-generated misinformation. Their results show that the CoT prompting technique outperforms NoCoT in most of their experiments. Wu et al. [19] leveraged GPT-3.5 as a feature extractor to improve the detection of out-of-context image–caption pairs. Their results show that their proposed method, GPT+RF, outperforms previous approaches, reaching an accuracy of 92.9%.
Fine-tuning LLMs has also been explored in various studies to improve misinformation detection [31,39]. Pavlyshenko [39] explored fine-tuning Llama 2 for various disinformation tasks, including fake news detection, propaganda analysis, fact checking, and sentiment extraction. The results highlight that the fine-tuned Llama 2 model is capable of analyzing complex styles and narratives as well as detecting sentiment that can be used as a predictive feature in machine learning models. In another study, various LLMs, including GPT, Llama, and Gemini, were explored for fake news detection [31] on the ISOT dataset [40]. Their results show that fine-tuned models such as RoBERTa (F1 score of 0.99) significantly outperform prompted LLMs (average F1 score of 55.29%); even fine-tuned Llama 3.1-8B reaches an F1 score of only 60%. LoRA, introduced in [41], has been studied as an efficient approach for fine-tuning LLMs in misinformation detection tasks [37,39]. Mura et al. [37] reported that combining LoRA fine-tuning with data augmentation techniques significantly improved the model’s performance.
Some studies have explored improving LLMs’ detection performance using the RAG method, which incorporates external knowledge and tools [22,42]. RAG can potentially improve fake news detection by combining a retrieval system with an LLM. For instance, Wan et al. (2024) [21] prompted LLMs to identify the key elements in a news article and then retrieve related Wikipedia content. The retrieved passages are then prepended to the original article to improve the LLM’s contextual understanding in the misinformation detection task. Cheung and Lam’s [22] experiments highlight that integrating external knowledge with LLMs can lead to better fact-checking performance. Their proposed method, Fact Lama+, augmented with external knowledge, outperforms previous approaches, reaching an F1 score of 30.44% on the LIAR dataset [43].
The debunk list approach used in this study, while conceptually related to RAG, differs in several important ways. RAG retrieves large volumes of uncurated documents through similarity search, and the retrieved content is then fed to a downstream model for further processing [44]. However, this approach does not validate the factual correctness of the retrieved content. Consequently, irrelevant but superficially similar passages may be included, potentially introducing noise that degrades model performance [45,46]. In contrast, our curated debunk lists were collected from health authorities and contain pre-verified false claims. This provides high-quality, reliable external knowledge that aligns semantically with common health misinformation patterns. Debunk lists allow the model to match claims conceptually rather than relying on surface similarity, as in RAG. This helps reduce hallucination and improve misinformation detection. Although debunk lists require some level of human effort to collect and verify, they are lightweight and fast to implement: new entries can be added immediately without building retrieval infrastructure, making debunk lists especially practical for rapid deployment during emerging health misinformation events. Although RAG is generally more scalable across domains, debunk lists potentially offer higher accuracy and lower noise.
3. Materials and Methods
This study utilized a multistage methodological framework to evaluate the performance of LLMs in detecting health misinformation, as illustrated in Figure 1. In the first stage (A), a debunk list is created. This list consists of health-related claims that have been verified as false information [47]. The list was extracted from trustworthy health sources such as the World Health Organization (WHO) [48], the Centers for Disease Control and Prevention (CDC) [49], and Johns Hopkins University (JHU) [50]. In the second stage (B), the test data are collected from social media posts. Both datasets are processed through an extract–transform–load (ETL) process, through which raw data are cleaned, standardized, and formatted for the prompt design phase. The transformed data are then stored in a database to be used in the experimental settings. In the third stage (C), prompts are designed for the Debunk-Augmented setting, which uses both the test data and the debunk entries. In the fourth stage (D), prompts include only the test data to represent the baseline zero-shot setting. In the fifth stage (E), prompts are sent to LLMs to detect misinformation in social media posts. In the sixth and final stage (F), the outputs are analyzed and evaluated using standard performance metrics.
3.1. Debunk-Augmented Setting
In the Debunk-Augmented setting, prompts include both the statement and the debunk items. Models are guided to classify the claim using the debunk list if a relevant entry exists; if no relevant match is found, they are instructed to fall back on their general medical knowledge. Output is returned in JSON format, comprising the predicted label, whether a debunk entry was used, the ID of that entry if applicable, and a brief explanation. A sample prompt is shown in Figure 2.
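The snippet below sketches how such a prompt might be assembled and what the returned JSON looks like; the wording and key names are illustrative assumptions rather than the exact prompt shown in Figure 2.

import json

def build_debunk_prompt(statement: str, debunk_entries: list[dict]) -> str:
    # List each pre-verified false claim with its ID so the model can cite it.
    debunk_block = "\n".join(f'#{d["id"]}: {d["text"]}' for d in debunk_entries)
    return (
        "You are a health misinformation classifier.\n"
        f"Debunked false claims:\n{debunk_block}\n\n"
        f'Statement: "{statement}"\n'
        "If a debunked claim is relevant, use it; otherwise rely on your general "
        "medical knowledge. Respond in JSON with keys: label (True/False), "
        "used_debunk (yes/no), debunk_id (or null), explanation."
    )

# Shape of a typical response under this setting:
example_output = json.loads(
    '{"label": "False", "used_debunk": "yes", "debunk_id": 15, '
    '"explanation": "Matches a debunked claim about tracking via vaccines."}'
)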
3.2. Zero-Shot Setting
In the zero-shot setting, the prompts are designed to use only the test statements, without access to additional external knowledge or the debunk entries. The models are guided to classify each statement as True or False based on their pretrained knowledge and reasoning. The output is required to be in JSON format, comprising the predicted label and a short explanation. An example prompt is shown in Figure 3.
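For comparison, a zero-shot prompt in the same illustrative style would omit the debunk block entirely:

def build_zero_shot_prompt(statement: str) -> str:
    return (
        "You are a health misinformation classifier.\n"
        f'Statement: "{statement}"\n'
        "Classify the statement as True or False using only your own knowledge. "
        "Respond in JSON with keys: label (True/False), explanation."
    )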
3.3. LLM Selection and Setup
In this study, we selected five state-of-the-art LLMs to evaluate their performance in detecting health misinformation. The chosen models were Llama-3.1-8B-instruct [18], Mistral-large-2411 [51], GPT-4o-mini [16], Claude-3.5-haiku-20241022 [52], and Gemini-1.5-flash [17]. Llama-3.1-8B is an 8-billion-parameter model designed for multilingual dialog and code tasks [53]. Mistral-large-2411 has 123 billion parameters and is optimized for reasoning, coding, and long-context use [54]. The parameter sizes of Claude-3.5-haiku, Gemini-1.5-flash, and GPT-4o-mini are not publicly disclosed. Claude-3.5-haiku is designed for tasks such as code completion, interactive chatbots, and data extraction and labeling in domains like finance, healthcare, and research [52]. Gemini-1.5-flash is a multimodal model optimized for efficient performance across text, image, audio, video, and long-context inputs [17]. The domain focus of GPT-4o-mini has not been publicly specified.
We selected these models because they were released in 2024 and support context windows of more than 100k tokens. Large context windows are critical in this study because they allow the models to process both the statement and the debunk list entries in a single prompt. In addition, these models are widely used in the literature, are publicly available, and represent a diverse range of capabilities. This selection enables us to evaluate the generalizability of our framework across different model types while considering computational limitations that affect reproducibility for other researchers. We did not include larger models such as GPT-4o and Llama-3.1-70B due to their high computational cost. The outputs of all models were returned in JSON format through API requests, except for Llama, which was run on a local server.
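A minimal sketch of one such API request is shown below, using the OpenAI Python client for GPT-4o-mini (the other hosted models would use their providers' SDKs, and Llama-3.1-8B ran locally); the temperature value is an assumption for illustration, not a reported hyperparameter.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify(prompt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # assumed setting for more deterministic outputs
        response_format={"type": "json_object"},  # ask for a JSON-only reply
    )
    return json.loads(response.choices[0].message.content)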
3.4. Evaluation Setup
The evaluation framework for this study includes assessing model performance using five standard metrics: accuracy (1), precision (2), recall (3), F1 score (4), and the Matthews Correlation Coefficient (MCC) (5) [55,56]. These metrics provide a comprehensive view of model performance beyond simple correctness. Precision and recall capture the balance between false positives and false negatives, while the F1 score reflects overall classification quality. MCC evaluates the level of agreement beyond chance.
The performance of each model is evaluated under three experimental conditions: the zero-shot setting and the Debunk-Augmented setting with 50 and 100 entries. These configurations allow us to measure the impact of external knowledge on misinformation detection. In addition to the models’ classification performance, we evaluate how well each model recognizes and reports its use of the debunk entries. Specifically, we explore whether models that report using debunk information are more likely to generate accurate predictions. Each metric’s formula is provided below [55,56]:
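With TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives, the standard definitions of these metrics are:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)

\text{Precision} = \frac{TP}{TP + FP} \quad (2)

\text{Recall} = \frac{TP}{TP + FN} \quad (3)

F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4)

\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (5)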
3.5. Dataset
In this study, we used a publicly available COVID-19 misinformation dataset [57] to test our proposed framework. This dataset includes 10,700 English social media claims about COVID-19, each labeled as either “real” or “fake”. Fake or misleading claims were collected from public fact-checking websites such as PolitiFact and Snopes and were manually verified against the original source documents. Real news was collected from verified official Twitter accounts (e.g., ICMR and CDC) and labeled as real if the content included factual information about COVID-19. The dataset is balanced, with 47.66% fake posts and 52.34% real posts. Following standard practice in machine learning, we randomly sampled 20% of the dataset (2,140 claims) to create a balanced, fixed test dataset. This test dataset was sufficiently large and balanced to provide a reliable estimate of model performance.
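A minimal sketch of drawing such a fixed, label-balanced 20% test split is shown below; the file name, column names, and random seed are illustrative assumptions rather than the authors' exact procedure.

import pandas as pd
from sklearn.model_selection import train_test_split

claims = pd.read_csv("covid19_claims.csv")  # assumed columns: 'text', 'label'
_, test_set = train_test_split(
    claims,
    test_size=0.20,                  # 2,140 of the 10,700 claims
    stratify=claims["label"],        # preserve the real/fake proportions
    random_state=42,                 # fixed seed for a reproducible test set
)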
In addition, we manually collected 120 debunked false claims about COVID-19 from trustworthy health sources, such as the WHO, CDC, and Johns Hopkins University, from which the debunk settings (50 and 100 entries) were sampled. Claims were included if they were explicitly labeled as false by at least one of these agencies. Each claim was screened for clarity, and any duplicates were removed. The remaining claims were reformatted into a standard labeling format to maintain consistency across the list. The entries cover the major thematic categories of conspiracy theories, prevention, and treatment.
Table 1 summarizes the distribution of the datasets.
4. Results
This study evaluated the performance of five state-of-the-art LLMs under three experimental settings: the zero-shot setting and the Debunk-Augmented settings with 50 and 100 items. The goal of these evaluations is to assess the effectiveness of structured external knowledge in improving models’ misinformation detection performance. Model performance is examined across the three configurations to highlight the impact of debunk lists on various classification metrics (RQ1). The analysis also explores whether models correctly identify and apply relevant debunk information in their reasoning justifications (RQ2). In addition, the most used debunk list entries are identified to provide insights for developing functional debunk lists for real-world deployment (RQ3).
4.1. Performance Comparison Across Debunk List Size (RQ1)
Table 2 illustrates the performance of the five LLMs across five evaluation metrics (accuracy, precision, recall, F1 score, and MCC) under the zero-shot setting and the 50- and 100-item Debunk-Augmented settings. Overall, the results show that augmenting LLMs with structured external knowledge in the form of debunk lists improves the performance of most models in detecting misinformation. In addition, increasing the number of debunk items from 50 to 100 generally improves performance further; however, the magnitude of the improvement depends on the model.
As shown in Figure 4, the most notable improvement is observed in Claude, whose F1 score rises from 76.1% under the zero-shot setting to 81.4% under 50-Debunk (a 6.96% relative improvement) and further to 84.5% under 100-Debunk. Similarly, Gemini and Mistral gain 9.26% and 3.39% improvements in F1 score, respectively, from the zero-shot to the 100-Debunk setting. These results highlight the role of a larger debunk list in providing broader misinformation coverage, assisting these models in making more accurate classifications. GPT-4o-mini shows a smaller improvement in F1 score (79.2% to 81.3%) but gains considerably in precision, rising from 79.5% to 89.4% in the 100-Debunk setting. This suggests that the model becomes more cautious in classifying misinformation when exposed to larger external knowledge. In contrast, Llama shows inconsistent results, with its F1 score dropping under 50-Debunk (79.7% to 79.5%) and then recovering slightly with 100-Debunk (79.8%). This suggests that a model’s design and reasoning process play a critical role in effectively using debunk augmentation.
In addition to the standard evaluation metrics, MCC provides an overall measure of prediction quality by considering all four elements of the confusion matrix, as shown in Figure 5. For instance, while GPT-4o-mini shows only a slight improvement in F1 score, its MCC increases from 60.5% to 67.96%. This highlights the model’s capability for more balanced predictions. Similarly, Mistral reaches the highest MCC of 76.61% in the 100-Debunk setting, which reflects not only its high precision and recall but also its consistent predictions over the whole dataset.
While augmenting LLMs with debunk lists improves their overall performance, it tends to lower recall for most models. As shown in Table 2 and Figure 6, Claude, GPT-4o-mini, and Mistral decline in recall when augmented with debunk lists. For instance, Claude’s recall drops from 89.5% under the zero-shot setting to 84.7% under 50-Debunk and recovers slightly to 85.95% under 100-Debunk. GPT-4o-mini shows a smaller decrease (from 78.9% to 74.6%), while Mistral’s recall drops from 85.4% to 80.7%. This suggests that when models are exposed to external knowledge, they may miss more true positives. In contrast, Llama shows an unpredictable pattern, dropping its recall to 73.9% under 50-Debunk but recovering substantially under 100-Debunk.
Figure 7 shows two sample responses from Claude-3.5-haiku, highlighting how external knowledge can guide the model’s predictions, especially in scenarios where the model lacks domain-specific knowledge. The Debunk-Augmented output references a relevant debunk entry (#15), while the zero-shot output is based only on the model’s internal knowledge. This illustrates the model’s strong reasoning and proper use of the related debunk item.
4.2. Evaluating Debunk Utilization and Correct Prediction Rates (RQ2)
To provide a better understanding of how models apply external knowledge during misinformation detection, we examined the frequency and correctness of debunk list use across both the 50-item and 100-item debunk lists.
Table 3 summarizes the percentage distribution for four model outcomes: (1) No|No: no debunk list was used and the model made incorrect predictions; (2) No|Yes: no debunk list was used and the model made correct predictions; (3) Yes|No: a debunk list was used and the model made incorrect predictions; (4) Yes|Yes: a debunk list was used and the model made correct predictions. We focused on the last two categories to investigate if models used debunk lists correctly.
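The percentages in Table 3 can be obtained by cross-tabulating the self-reported debunk usage flag with prediction correctness; the sketch below assumes the illustrative JSON keys introduced earlier, not the authors' exact field names.

from collections import Counter

def outcome_rates(results: list[dict]) -> dict:
    """Each result holds 'used_debunk' ('yes'/'no'), the predicted 'label', and the 'gold' label."""
    counts = Counter(
        (r["used_debunk"].capitalize(), "Yes" if r["label"] == r["gold"] else "No")
        for r in results
    )
    total = sum(counts.values())
    # Keys read as used|correct, e.g. 'Yes|No' means a debunk entry was used but the prediction was wrong.
    return {f"{used}|{correct}": round(100 * n / total, 1)
            for (used, correct), n in counts.items()}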
In most models, increasing the debunk list from 50 to 100 items improved correct use (Yes|Yes). However, the magnitude of this improvement varies by model. GPT-4o-mini’s correct use of debunk information increases from 10.3% to 15%, which indicates its ability to apply the additional debunk items. However, its incorrect usage also rises from 0.9% to 1.6%, highlighting the trade-off between higher usage and lower precision.
As shown in Figure 8, Llama shows the highest debunk list usage but also the highest rate of incorrect predictions. It shows the most pronounced change, with Yes|Yes increasing from 27.7% to 37.6% and Yes|No rising sharply from 5.1% to 11.6%. In contrast, Mistral shows the most stable trend, with its Yes|Yes rate increasing from 8.2% to 10.1% while its Yes|No rate rises only moderately from 1% to 1.2%. This highlights Mistral’s reliable and correct use of debunk lists. Claude and Gemini show minor changes, suggesting that their usage saturates regardless of debunk list length. These findings show that while larger debunk lists can improve model performance, they may introduce noise for some models.
4.3. Qualitative Analysis of Debunk Usage (RQ3)
To gain a better understanding of how LLMs apply the debunk list during misinformation detection, we analyzed the debunk items used most often during prediction. Our analysis covers two categories across the two debunk list sizes (50 vs. 100 items): cases where a debunk item is used and the prediction is correct (Yes|Yes) and cases where a debunk item is used but the prediction is incorrect (Yes|No). Table 4 and Table 5 summarize the most frequently used debunk items for each category.
4.3.1. Analysis of Debunk Usage in Correct Prediction (Yes|Yes)
As shown in Table 4, specific debunk items were utilized by multiple models during correct predictions. For example, Debunk item #32 (“COVID-19 is an engineered biological weapon created by the United States or China”) and Debunk item #2 (“Quercetin can protect you from the coronavirus or treat COVID-19”) were used repeatedly across various models. This shows that these debunk entries provide useful semantic cues for the test data and are well aligned with its structure and language. In addition, many of these top-used debunk items reflect the most common concerns raised by the public during the pandemic, such as remedies and the origin of the virus. Moving from 50-Debunk to 100-Debunk produces some interesting shifts. Several top entries from the 50-Debunk setting remain top entries under the 100-Debunk setting (e.g., Debunk items #2, #3, and #12), indicating that these core entries play a critical role in misinformation detection. However, under the 100-Debunk condition, new top debunk items appear, such as Debunk item #28 (“Taking in alcoholic drinks will protect you from getting COVID-19 or treat the disease”) and Debunk item #41 (“Social distancing is not important to reduce the spread of the Covid virus”). This shift suggests that increasing the number of debunk items to 100 gives the models access to broader and more diverse semantic content, which improves predictions over a wider range of misinformation topics.
4.3.2. Analysis of Debunk Usage in Incorrect Prediction (Yes|No)
To gain a deeper understanding of models’ misuse of debunk entries, we analyzed which debunk items were used most frequently when models misclassified statements (Yes|No). Table 5 summarizes the most frequently misused debunk items across both the 50- and 100-Debunk configurations. Most models, including Claude, GPT-4o-mini, Gemini, and Mistral, do not show an increased number of incorrect predictions when moving from the 50- to the 100-Debunk configuration. In fact, expanding the number of debunk entries lowered the incorrect use of debunk entries and improved prediction accuracy. This indicates that a more diverse debunk set allows for broader coverage, which helps these models avoid repeating similar mistakes.
However, Llama demonstrates a different pattern, showing a noticeable increase in incorrect predictions when the debunk list grows from 50 to 100 entries. For instance, Debunk item #41 (“Social distancing is not important to reduce the spread of the Covid virus”) was used 92 times in incorrect predictions. Similarly, Debunk items #40 and #45, which concern vaccine efficacy, were frequently misused by Llama, which negatively affected its performance. To investigate the root causes of Llama’s misclassifications, we analyzed various examples of its responses.
Table 6 shows a few instances where Llama used debunk entries that were only loosely related to the actual claims. For example, the input claim “Social distancing results in record low flu numbers” is classified as false, justified by Debunk item #41. Another example involves the factual claim “The US has now completed tests on over 1 million people”, which is incorrectly linked to Debunk item #46. Although the claim and the debunk entry look similar, Llama misapplies the entry, resulting in the rejection of a correct statement. These examples suggest that Llama’s errors in the 100-item setting stem from relying on thematic similarity to incorrectly reject true statements. This behavior contrasts with that of the other models, whose accuracy improves as the debunk list grows.
4.4. Comparison with Prior Work
While previous studies have explored strategies to improve misinformation detection using LLMs, direct comparison is challenging due to differences in study design.
Table 7 provides a high-level overview of recent studies that integrate external knowledge to improve LLM-based misinformation detection. Wan et al. [21] and Li et al. [32] explored misinformation across a range of topics by integrating external evidence from Wikipedia and web domains, respectively. While these studies utilized broad external sources, they did not focus on external health knowledge. Cao et al. [33] focused on scientific misinformation by integrating scientific abstracts from the CORD-19 database. Although their study also targets the COVID-19 domain, their work addresses verifying the accuracy of scientific reports. In contrast, our study explores whether augmenting LLMs with external knowledge can improve their ability to detect health misinformation in social media claims, where data are often noisy, informal, and filled with acronyms.
Unlike previous studies, our research directly evaluates how LLMs can utilize fact-checked false claims (debunk lists) to improve misinformation detection. In addition, we assess not only the models’ performance improvement but also whether they can correctly recognize and report their use of external knowledge. This approach offers valuable insights into LLMs’ accuracy, reasoning, and reliability in detecting misinformation.
5. Discussion
The findings of this study highlight the potential of using debunk lists as structured external knowledge to improve the performance of generative AI models, particularly LLMs, in misinformation detection. Interestingly, this improvement is achievable without retraining the models or modifying their architecture, making it a practical approach for mitigating misinformation, which is dynamic and rapidly evolving by nature. By simply augmenting LLMs with debunk lists, the models can be guided toward more accurate classifications, especially in cases where they lack specific domain knowledge.
Increasing the size of the debunk list improved most models’ performance; however, the magnitude of improvement varies across models. Claude and Gemini show the highest gains in F1 score, suggesting that these models have stronger reasoning abilities and are better optimized for integrating external knowledge effectively. GPT-4o-mini and Mistral also benefited consistently from increasing the size of the debunk list, though slightly less. On the other hand, Llama shows only minimal improvement between the zero-shot and Debunk-Augmented configurations, and its performance even declined with longer debunk lists. The exact root cause cannot be confirmed without full disclosure of the training details, but potential explanations include its smaller model size, weaker instruction-following training, and sensitivity to longer prompts. These findings suggest that the benefit of debunk augmentation depends on the underlying model architecture and training process.
Although most models showed lower recall when augmented with debunk lists, this pattern is expected. Debunk augmentation makes models more conservative in detecting false claims by prompting them to verify claims against the debunk items. As a result, some false claims that are only weakly similar to the debunk statements may be missed, which increases false negatives and lowers recall. However, the same mechanism also reduces false positives and improves the reliability of predictions, as reflected in higher precision and MCC. MCC evaluates classification performance using all four elements of the confusion matrix (true positives, true negatives, false positives, and false negatives), allowing it to capture the overall reliability of the models’ predictions when precision and recall move in opposite directions.
A deeper analysis of model justifications shows that models apply debunk lists differently. For instance, while Llama frequently reported using debunk lists in its explanations, it also had the highest rate of incorrect predictions. This raises concerns about the reliability of self-reported usage: higher usage does not necessarily reflect better application. In contrast, other models, such as Claude, Mistral, and Gemini, used the debunk entries less frequently but more effectively. These results show that recognizing external knowledge is not the same as integrating it effectively into reasoning justifications.
In addition to performance metrics, the way models draw on debunk entries provides valuable insight into their reasoning behavior. The models’ reliance on a small number of effective debunk entries, especially those related to common health misinformation topics such as remedies and prevention, highlights the role of specific debunk items as strong guiding cues during decision making. This suggests that a small but well-chosen debunk list can improve model performance without requiring a large knowledge base. Creating such a list requires an understanding of common misinformation themes on social media. The benefit of debunk lists ultimately depends on whether models can apply them correctly; if not, the external knowledge may confuse models and reduce performance.
In practice, a debunk list can be integrated into a real-time misinformation monitoring pipeline for social media platforms and online health systems. Debunk lists can be updated easily, since new entries can be added quickly during emerging public health events. This allows rapid deployment without the need for complex retrieval infrastructure. In addition, this approach can strengthen existing fact-checking systems by reducing the spread of false claims and improving trust in automated detection systems.
6. Limitations and Future Work
While this study shows the effectiveness of debunk list augmentation for detecting health misinformation, it also has some limitations. Additional investigation could examine the inconsistent behavior of Llama-3.1-8B by comparing model size, training methods, and sensitivity to longer prompts. The debunk lists used in this study mainly focus on COVID-19-related themes, including conspiracies, prevention, treatments, and remedies. These topics reflect the main public concerns during the early stage of the pandemic. Future work should explore whether expanding debunk lists to cover more diverse misinformation themes, such as political influence or financial health claims, can further improve the generalizability and performance of the models.
While curated debunk lists offer high precision, they require upfront manual effort. Future work should compare RAG implementations and debunk lists from different perspectives, such as accuracy, implementation complexity, and the risk of introducing noisy content. It would also be beneficial to investigate a hybrid framework that combines the scalability of RAG with the precision of debunk lists.
All LLMs evaluated in this study were released in 2024 and may therefore have been exposed to COVID-19-related content during pretraining. To reduce this potential effect, we instructed the models in the prompts to first use the debunk list as a reference and only then rely on their general knowledge. Future work could also evaluate models with a pre-COVID-19 knowledge cutoff, which would provide more accurate insights into how much of their performance stems from pretraining exposure versus use of the debunk list. It is also important to note that misinformation is shaped by social, cultural, and temporal factors. COVID-19 conspiracy theories spread extensively during the early stage of the pandemic, while later stages focused more on vaccine- and treatment-related content. Although our debunk lists capture some of these shifts, further research should examine how social context, cultural framing, and temporal factors influence model performance.
Additionally, the proposed framework can be generalized to other misinformation domains, like politics and finance. Future work should explore its adaptability across these areas. However, unlike the health domain, where false claims can be verified from authoritative sources, misinformation in politics or finance is usually more subjective. Applying this framework to other domains may require the development of domain-specific debunk lists from reputable fact-checking organizations. This may also require refining prompts to integrate domain-relevant context.