Mapping the Infodemic: Geolocating Reddit Users and Unsupervised Topic Modeling of COVID-19-Related Misinformation

Alarfaj, Lulu; Blackburn, Jeremy; Amjad, Maaz; Patel, Jay; Ertem, Zeynep

doi:10.3390/info16090748

by Lulu Alarfaj ¹,*, Jeremy Blackburn ², Maaz Amjad ³, Jay Patel ² and Zeynep Ertem ¹

¹ Department of System Science and Industrial Engineering, Binghamton University, Binghamton, NY 13902, USA

² School of Computing, Binghamton University, Binghamton, NY 13902, USA

³ Department of Computer Science, Texas Tech University, Lubbock, TX 79409, USA

* Author to whom correspondence should be addressed.

Information 2025, 16(9), 748; https://doi.org/10.3390/info16090748

Submission received: 9 July 2025 / Revised: 13 August 2025 / Accepted: 26 August 2025 / Published: 28 August 2025

Abstract

The problem of geolocating Reddit users without access to the author information API is tackled in this study. Using subreddit data, we analyzed and identified user location based on their interactions within location-specific subreddits. Using unsupervised learning methods such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) algorithms, we examined conversations about COVID-19 and immunization across the U.S., focusing on COVID-19 vaccination. Our topic modeling identifies four themes: humor and sarcasm (e.g., jokes about microchips), conspiracy theories (e.g., tracking devices and microchips in the COVID-19 vaccine), public skepticism (e.g., debates over vaccine safety and freedom), and vaccine brand concerns (e.g., Pfizer, Moderna, and booster shots). Our geolocation analysis shows that regions with lower vaccination rates often exhibit a higher prevalence of misinformation-labeled comments. For example, counties such as Ada County (Idaho), Newton County (Missouri), and Flathead County (Montana) showed both a low vaccine uptake and a high rate of false information. This study provides useful information on the many different examples of misinformation that are disseminated online. It gives us a better understanding of how people in different parts of the U.S. think about getting a COVID-19 vaccine.

Keywords:

fake news; COVID-19; geolocation; misinformation; unsupervised learning; topic modeling

1. Introduction

Digital media platforms now serve as modern meeting places, global commons, where people from all over the world can discuss and share many topics. During the COVID-19 pandemic, these online spaces became important places in which to share information, voice concerns, and, sometimes, spread false information about the virus and vaccines. The World Health Organization (WHO) called this an “infodemic,” which means that there is too much information, some of which is true and some of which is not, which makes it hard for people to find reliable sources and advice [1]. Cinelli et al. [2] also showed that social media made it easier for false information to spread during the pandemic. A recent systematic review found that people who were exposed to false information during the COVID-19 pandemic were less likely to get vaccinated and more likely not to trust public health measures [3]. We wanted to find out how false information about COVID-19 vaccines spreads on Reddit. Reddit is well suited to studying public discourse on COVID-19 vaccination because it has separate subreddits for different topics and places. A 2021 study published in the Journal of Infection and Public Health found that Reddit is a good place to find public health stories and concerns about vaccines because it contains a lot of information [4]. We examined how users interacted in location-specific subreddits to find geographic patterns. If a user always posted in a specific subreddit linked to a certain area, we linked that user to that area. We then compared this inferred geographic data to COVID-19 vaccination rates in the U.S. to find out how online discussions and vaccination patterns vary across regions.

Recent studies showed that in places where people do not trust the government and there is political division, false information spreads more easily [2,4]. We looked at some counties where few people got vaccinated and found that there was a lot of false information, and these counties represent just a small part of a bigger problem. In our study, we wanted to find useful information on how people interact regarding COVID-19 vaccinations, particularly the common issues, misunderstandings, and differences in how people react to vaccine-related topics in different parts of the U.S. One issue with the earlier research on this subject is that there have not been enough in-depth studies on how misinformation about COVID-19 vaccinations on Reddit and attitudes toward vaccines differ in the U.S. by region. In this study, we combined user mapping based on subreddit interactions, topic modeling, and misinformation analysis to fill this gap in the research. By looking at the data at the county level, we were able to explore the relationship between vaccination rates, the spread of false information, and regional patterns of public discussion. This method helps us to find out which areas need the most professional help and gives us a better idea of what local factors may be associated with the spread of misinformation. Location data can also help us to identify places where false information is likely to spread, and therefore regions where targeted communication strategies may improve public understanding. This study aims to answer the following question: How does COVID-19 vaccine misinformation on Reddit affect people’s decisions to get vaccinated in the U.S.?

2. Literature Review

2.1. Volume and Impact of COVID-19 Misinformation

The COVID-19 pandemic led to an information overload within social networks. Though the majority of this information entailed useful new information that had evidence behind it, some of it contained misleading or even false information. Such misinformation was highly disseminated, which had quite a significant impact on the attitude, behavior patterns, and judgment of the population during the pandemic [5]. This created more confusion and mistrust of health officials as people were frequently misinformed on how serious the virus was, on preventive actions, and on vaccine efficacy. Understanding the extent and impact of this miscommunication is vital to establishing improved methods of communication and improving our responses to future global public health emergencies.

2.2. Methods for Detecting Fake News

During the COVID-19 pandemic, fake news was identified through a variety of innovative computerized techniques. Zhang et al. proposed a method whereby recurrent neural networks and convolutional neural networks are applied to the study of the pattern of fake news in social networks [5]. This model proved to be promising as it could identify misinformation early by monitoring the spreading patterns on social media platforms such as Twitter. Patwa et al. compiled a dataset of fake news related to COVID-19 from multiple sources and used machine learning models, including logistic regression and decision trees [6]. Their models could accurately distinguish between real and fake news, identifying popular misinformation topics such as false cures, preventive myths, and conspiracy theories.

More work has been undertaken to investigate how to detect misinformation in non-English settings. Amjad et al. created a benchmark dataset of fake news in Urdu, showing that models such as logistic regression can achieve high levels of accuracy in detecting false news [7]. Their study showed that such methods can be employed across different languages and cultures and are therefore an effective means of combating misinformation worldwide. Similarly, Chen et al. employed deep learning models, including CNN and LSTM, to identify fake news about COVID-19 with high precision, which demonstrates that deep learning can be employed to categorize disinformation during a health crisis [8].

2.3. Topic and Sentiment Analysis Approaches

Another way in which this has been achieved is by looking at the content and the emotional tone of the COVID-19 discussion to unveil the themes of misinformation. An analysis of Twitter conversations by Sharma et al. based on keyword filtering manually coded several prevalent themes, such as conspiracy theories, ineffective treatments, and mistrust of pandemic mitigation measures [9]. This piece emphasized how regional misinformation should be addressed as beliefs and myths varied wildly depending on location. CoronaVis is a real-time tweet analysis system built by Kabir et al. It applies natural language processing to tweet content categorization and symptomatic analysis [10]. This tool provided a dynamic picture of the changes in the flow of public opinion and false news over the course of the pandemic.

Other platforms such as Reddit also played a major role in spreading or combating misinformation. Melton et al. reviewed vaccine discussions on Reddit and found that many users propagated inaccurate opinions and stories [4]. The unsupervised learning they used allowed for the identification of patterns of misinformation and regional issues in the conversations held in the U.S. Valdez et al. conducted a longitudinal analysis of tweets about mental health issues and revealed an increase in psychological stress after social media exposure to disinformation [11]. The emotional toll of the pandemic was found to be high in their study, reinforcing the need for mental health provisions and restrictions on digital misinformation.

2.4. Geolocation and Regional Dynamics

The geographical dynamics of misinformation have become a critical consideration in managing public health messaging. Mayank et al. proposed a knowledge graph-based system that mapped the relationships between stories in order to find deception more precisely [12]. This method revealed the connections between various fragments of information and the way false storylines are created. Oyebode et al. used natural language processing to find important issues and emotional triggers in social media discussions, which could aid researchers in comprehending what makes people believe or promote misinformation [13]. These results are critical to promoting public health messaging that can address specific fears and perceptions.

Another factor that determines the spreading of misinformation is the regional context. The study conducted by Alsudias et al. examined Twitter communication in the Arab world, and the researchers concluded that culturally sensitive and localized communication measures are needed to combat disinformation [14]. Ajao et al. demonstrated the use of multidimensional data to improve the accuracy of fake news detection with a hybrid model in which the traditional content analysis was complemented by metadata (i.e., timestamps and user data) [15]. Jlifi et al. created the Soft T-LVM model, which incorporated both linguistic patterns and environmental variables in order to enhance deception identification [16]. In a case study on Taiwan, Lin et al. highlighted the necessity of following a transparent communication approach and actively involving the media to foster the trust of the population and reduce misinformation [17].

Lastly, Qazi et al. presented GeoCoV19, a large multilingual geotagged dataset that allows for the geographical mapping of misinformation [18]. This tool has the capacity to help researchers and policymakers to determine misinformation hotspots and direct responses to reduce misinformation in identified locations. Taken together, these analyses show that responses to pandemic misinformation need to be culturally sensitive, evidence-based, and specific, taking geographic and linguistic factors into account.

This literature review leads us to our main research question:

How does COVID-19 vaccine misinformation on Reddit affect people’s decisions to get vaccinated in the U.S.? By addressing this question, our study links county-level topic modeling of COVID-19 vaccine narratives with previous work on geolocation techniques and misinformation analysis.

Table 1 provides an overview of prior work on COVID-19 misinformation detection, geolocation methods, and social media analysis, outlining the methods used, areas of focus, and research gaps that motivate our current study. This study aims to fill the gap in investigating misinformation related to COVID-19 vaccination at the county level within the United States using Reddit data and unsupervised learning techniques. By combining geolocation information with vaccination rates and misinformation prevalence, it provides an overview of the challenges posed by misinformation in different areas, contributing to informed public discourse and targeted interventions.

3. Methodology

To study how COVID-19 vaccine misinformation on Reddit affected vaccination decisions in the United States, we followed a five-phase approach (see Figure 1). First, using a keyword set from previous research on COVID-19 misinformation, we collected approximately 18 million Reddit comments posted between December 2020 and July 2022. Next, we performed geolocation analysis by linking users to specific U.S. counties based on their interactions in location-based subreddits. In the third phase, we manually labeled the comments as either true or false and then cleaned and preprocessed the data.

In the fourth phase, two topic modeling techniques were applied, Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF), to identify key themes in the data. Finally, in the fifth phase, we mapped the misinformation trends against county-level vaccination rates to explore regional patterns and their potential impact on public health discussions.

3.1. Phase 1: Data Collection

For this study, we collected a total of 17,748,278 Reddit data points. To gather these Reddit data, we used keywords from previous studies on COVID-19 misinformation, such as “vaccine,” “booster,” “unvaccinated,” “Pfizer,” “Moderna,” “5G,” “tower,” “chips,” “microchips,” “Coronavirus,” “COVID19,” “Corona,” “Covid,” and “pandemic” [21,22]. These keywords were used to create a dictionary that aimed to collect Reddit posts specifically related to COVID-19 vaccination. The data was saved in JSON format, which contained fields like the author’s username, the subreddit where the comment was posted, the comment text, the created-utc, which represents the timestamp of when a post or comment was created, and the score of the comment to show how many people liked it.
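The keyword-based collection step can be illustrated with a minimal sketch. It assumes the comments are stored as one JSON object per line and that the comment text lives in a field named `body`; that field name and the helper names `matches_keywords` and `filter_comments` are assumptions made for illustration, not the authors' collection code.

```python
import json
import re

# Keywords listed in the text, lowercased; matching is case-insensitive.
KEYWORDS = [
    "vaccine", "booster", "unvaccinated", "pfizer", "moderna", "5g", "tower",
    "chips", "microchips", "coronavirus", "covid19", "corona", "covid", "pandemic",
]

# A leading word boundary lets "vaccine" also match "vaccines" or "vaccinated"
# prefixes such as "vaccine-related", while avoiding matches that start
# in the middle of another word.
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")",
    re.IGNORECASE,
)

def matches_keywords(text):
    """Return True if the text contains at least one keyword."""
    return KEYWORD_RE.search(text) is not None

def filter_comments(json_lines):
    """Parse JSON lines and keep the records whose text matches a keyword."""
    kept = []
    for line in json_lines:
        record = json.loads(line)
        # "body" is an assumed field name for the comment text.
        if matches_keywords(record.get("body", "")):
            kept.append(record)
    return kept
```

Whether the original collection matched whole words or word prefixes is not stated in the text; the prefix behavior here is a design choice of this sketch.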

3.2. Phase 2: Geolocation Analysis

We filtered the Reddit dataset by matching subreddit names to U.S. cities and counties listed by SimpleMaps [23] in order to identify city-specific subreddits for geolocation purposes. If a user always commented on a specific city or county subreddit, we inferred the user was from that location. As a result of this filtering process, we obtained a sample of 5853 Reddit data points.
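The assignment rule described above can be sketched as follows. The sketch reads "always commented on a specific city or county subreddit" as "all of a user's location-subreddit comments point to a single location", which is one possible interpretation. The lookup table `location_subreddits`, its contents, and the function name are hypothetical; in practice the table would be built from the SimpleMaps city and county list.

```python
from collections import defaultdict

def infer_user_locations(comments, location_subreddits):
    """Map each user to a location when every location-specific subreddit
    they commented in resolves to the same location.

    comments: iterable of dicts with "author" and "subreddit" keys.
    location_subreddits: dict from lowercase subreddit name to a location
    (hypothetical lookup table, e.g. derived from a list of U.S. cities).
    """
    seen = defaultdict(set)
    for comment in comments:
        location = location_subreddits.get(comment["subreddit"].lower())
        if location is not None:
            seen[comment["author"]].add(location)
    # Users active in subreddits for several locations are left unassigned.
    return {
        user: next(iter(locations))
        for user, locations in seen.items()
        if len(locations) == 1
    }
```

For example, with a table mapping `"boise"` to `("Ada County", "ID")` and `"portland"` to `("Multnomah County", "OR")`, a user who only comments in r/Boise is assigned to Ada County, while a user who comments in both subreddits is not assigned at all.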

3.3. Phase 3: Data Preprocessing and Labeling

In the data labeling process, the 5853 comments were manually labeled as either “true” or “false” by a team of four researchers. We reviewed and evaluated each comment, focusing on the context, mood, and overall message to determine its accuracy. Some comments were removed during the preprocessing phase because they contained unrelated content or links. After filtering, we assembled a final dataset of 5014 labeled Reddit comments, with 3439 labeled as true and 1575 as false.

To assess the reliability of the manual annotation process, inter-annotator agreement was calculated using Cohen’s κ, which resulted in a value of 0.82, indicating nearly perfect agreement among the annotators.
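For reference, Cohen’s κ compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below covers the standard two-annotator case; how the four annotators were paired or aggregated is not stated in the text, and the function name is chosen here for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```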

Data Cleaning and Transformation

In order to get the Reddit comment data ready for unsupervised learning analysis, it first went through a number of cleaning and transformation stages. First, all of the text was changed to lowercase. We removed punctuation marks and special characters, and we used standard English stopword lists to remove stopwords. Tokenization was used to split the text into separate words. Next, lemmatization was used to reduce the words to their base or dictionary forms (for example, “running” to “run”) to make sure that all the different word forms were categorized as the same. The Term Frequency Inverse Document Frequency (TF-IDF) approach was then used to turn these cleaned tokens into numbers. This method gives more weight to unique and relevant terms in the dataset. This preparation procedure made sure that the text data was organized and ready for use in topic modeling techniques.
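The cleaning steps and the TF-IDF weighting can be sketched in a few lines. This is a toy illustration rather than the pipeline used in the study: the stopword set and the lemma lookup table are tiny hypothetical stand-ins for a standard stopword list and a real lemmatizer, and the weighting uses the plain textbook form tf × log(N/df), whereas libraries often apply smoothed variants.

```python
import math
import re
from collections import Counter

# Toy stand-ins (assumptions): a real pipeline would use a full English
# stopword list and a proper lemmatizer.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "in", "of", "to", "and", "my", "i"}
LEMMAS = {"running": "run", "vaccines": "vaccine", "chips": "chip"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stopwords, lemmatize."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # punctuation and special characters
    tokens = text.split()                       # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]   # map to base form if known

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights
```

A term that appears in every document receives a weight of zero under this formula, which is how the weighting favors distinctive terms over ubiquitous ones.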

3.4. Phase 4: Topic Modeling and Unsupervised Learning

The study by Chandrasekaran, Mehta, Valkunde, and Moustakas (2020), who analyzed tweets about COVID-19 using unsupervised analysis, was useful in guiding this study [24]. It encouraged us to look into similar approaches for studying the terrain of misinformation in our Reddit dataset.

In our study, we decided to use the Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). Among the various algorithms that can be used for topic modeling, LDA is an analysis tool used for discovering latent topics within a given set of documents. LDA assumes that each document is a mixture of topics, and each topic is a mixture of words. It is applied to identify the topics, as well as the keywords related to these topics, based on the co-occurrence patterns of the words found within the corpus.

On the other hand, NMF is a matrix factorization algorithm that decomposes the document-term (DT) matrix into two non-negative matrices of topic–document and word–topic distributions. NMF has been found to provide additive and interpretable topics; hence, it is a useful tool in topic modeling.
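In its usual textbook form (not taken from the paper; the symbols V, W, H, n, m, and k are introduced here for illustration), the factorization approximates the non-negative document-term matrix V, with n documents and m terms, by the product of a document–topic matrix W and a topic–term matrix H, where k is the number of topics (k = 4 in this study):

```latex
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{n \times m},\quad
W \in \mathbb{R}_{\ge 0}^{n \times k},\quad
H \in \mathbb{R}_{\ge 0}^{k \times m}

\min_{W \ge 0,\; H \ge 0} \; \lVert V - W H \rVert_F^{2}
```

Because all entries are non-negative, each document is reconstructed by adding topic contributions without cancellation, which is the source of the additive, interpretable topics mentioned above. The Frobenius-norm objective shown is one common choice; the text does not state which objective was used.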

LDA and NMF are the two most popular and efficient techniques for unsupervised topic modeling [25]. These methods help uncover the underlying structure in the text data without requiring any labeled data. We applied these algorithms to our preprocessed Reddit comments dataset, aiming to identify the main topics and narratives relating to COVID-19 vaccine misinformation. By applying the Latent Dirichlet Allocation (LDA) model and Non-Negative Matrix Factorization (NMF) algorithm, we were able to extract four significant topics, which help reveal different aspects of the misinformation environment [4].

3.4.1. Key Findings from Topic Modeling

We used two topic modeling methods for Reddit comments about COVID-19 vaccines: Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). The results from both methods were broadly consistent. The analysis brought up four main topics.

In Table 2, which presents the LDA’s four-topic model, the first topic was humorous and informal discussions about vaccinations, such as jokes about microchips, booster shots, vaccination cards, and testing; conspiracy theories emerged as the second topic, including discussions about tracking devices and microchips; the third topic concerned public debates over whether to get vaccinated or not, including opinions on personal rights, when vaccines should be given, and how effective they are, showing that users had different views; and the fourth topic concerned vaccine brands, especially Moderna and Pfizer, as well as booster shots and worries that technical terms like “phone” and “microchips” were being used in vaccine conversations.

In Table 3, which shows the NMF four-topic model, we can see that similar themes appear but with different emphases. The first topic focuses on public health concerns, such as virus spread, vaccination status, and health risks; the second topic concerns discussions around booster shots, vaccine brands like Moderna and Pfizer, and timing between doses; the third topic concerns conspiracy theories involving microchips, tracking, and mentions of Bill Gates; and the fourth topic concerns confusion and misinformation, blending vaccine-related terms with unrelated concepts like “flu,” “tower,” and “reception.”

The LDA and NMF topics give a clear picture of the most popular false conversations on Reddit about COVID-19 vaccines and set the stage for the geographic and misinformation analysis that will be covered in the next section.

3.4.2. Distribution of Misinformation Across Topics

As discussed in Section 3.3, our dataset contains more TRUE-labeled comments than FALSE-labeled ones. However, by using topic modeling in both LDA and NMF, our findings show that misinformation frequently concentrates on specific topics, particularly those related to conspiracy theories, vaccine skepticism, and technological myths, rather than being evenly spread.

In the LDA model, Topic 3, which focuses on vaccine brands and technical distrust, shows the highest concentration of FALSE-labeled comments (70.97% FALSE). Discussions included Moderna, Pfizer, booster shots, and terms like “phone” and “microchips”, suggesting overlap between brand skepticism and technological misinformation. Next is Topic 0 (67.26%), concerning humorous and informal vaccine discourse. This included jokes about microchips, booster shots, vaccination, and testing. These comments reflect a casual attitude toward vaccination, which may make it harder to tell whether the information is meant to be taken seriously. Topic 2 (60.49%) concerns public debates over vaccine necessity and efficacy. Users discussed personal rights, timing of doses, and skepticism about effectiveness, showing division in public opinion. These topics featured keywords such as “microchips”, “Moderna”, “activation”, and “unvaccinated”.

Similarly, in the NMF model, Topic 3, technological confusion and misinformation, comes to 58% FALSE. This topic blends vaccine-related terms with unrelated keywords like “flu”, “tower”, and “reception”, indicating confusion and misinformation. Topic 1 (57.50% FALSE) is also misinformation-dense, concerning booster shot discussions and brand skepticism. Conversations revolved around Moderna, Pfizer, timing between doses, and possible uncertainty about safety.
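Per-topic percentages of this kind can be computed by assigning each comment to a topic and counting labels within each topic. The sketch below assumes each comment is assigned to its single highest-weight (dominant) topic and that labels are the strings "TRUE" and "FALSE"; the text does not state how comments were assigned to topics, so both points are assumptions of this example.

```python
from collections import Counter

def false_share_by_topic(dominant_topics, labels):
    """Percentage of FALSE-labeled comments within each topic.

    dominant_topics: topic id per comment (assumed: highest-weight topic).
    labels: "TRUE" or "FALSE" per comment, in the same order.
    """
    if len(dominant_topics) != len(labels):
        raise ValueError("topics and labels must have the same length")
    totals = Counter(dominant_topics)
    false_counts = Counter(
        topic for topic, label in zip(dominant_topics, labels) if label == "FALSE"
    )
    return {
        topic: 100.0 * false_counts[topic] / totals[topic]
        for topic in sorted(totals)
    }
```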

Our findings show that COVID-19 vaccine misinformation is not random, instead being clustered around certain themes, especially those involving conspiracy theories, vaccine brand distrust, and technological myths.

3.5. Phase 5: Visualization, Mapping, and Insights

The fifth phase of this study investigates the spatial variability of vaccine misinformation topics by integrating the topic modeling results with the inferred user locations. To show these patterns, this study uses GIS-style (Geographic Information System) maps that present the data at the level of U.S. counties. Figure 2 shows the rate of vaccination at the county level, and Figure 3 presents the average percentage of false news at the county level.

Our geographic analysis (Figure 2 and Figure 3) reveals how these topics change from one area to the next. For instance, Multnomah County (Oregon), the District of Columbia, and New Castle County (Delaware) all had high vaccination rates and a low share of comments labeled as false.

On the other hand, counties with more false information tend to have lower vaccination rates and more social media discussions about conspiracy theories. For example, Ada County (Idaho), Newton County (Missouri), Flathead County (Montana), and Cloud County (Kansas) all show high levels of misinformation and low vaccine uptake. This suggests that strong local communication strategies may help to cut down on false information.
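One way to quantify such a pattern is to aggregate the labeled comments by county and compare the resulting false-comment rate with the county vaccination rate. The text does not report a correlation coefficient, so the sketch below is only an illustration of the comparison; the function names and the input shapes are assumptions.

```python
import math
from collections import defaultdict

def county_false_rate(records):
    """records: iterable of (county, label) pairs with label "TRUE"/"FALSE".
    Returns the percentage of FALSE-labeled comments per county."""
    counts = defaultdict(lambda: [0, 0])  # county -> [false count, total count]
    for county, label in records:
        counts[county][1] += 1
        if label == "FALSE":
            counts[county][0] += 1
    return {county: 100.0 * f / n for county, (f, n) in counts.items()}

def paired_rates(false_rate, vaccination_rate):
    """Align the two county-level dicts on the counties present in both."""
    shared = sorted(set(false_rate) & set(vaccination_rate))
    return [false_rate[c] for c in shared], [vaccination_rate[c] for c in shared]

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A negative coefficient would be consistent with the pattern described above, although counties with few geolocated comments would yield noisy rates, and a correlation alone would not establish that misinformation causes lower uptake.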

This study shows that some areas still have both low vaccination rates and high levels of misinformation. The results suggest that false information does not spread randomly; it tends to follow patterns based on location and topic. Our work is different from earlier studies like those of Melton et al. [4] and Chandrasekaran et al. [24] because it specifically looks at misinformation on Reddit. We also added a geographic layer that links online discussions to real-world vaccination outcomes.

4. Discussion and Analysis

The topic modeling results in Section 3.4.1, Table 2 and Table 3, show four main themes in Reddit discussions about COVID-19 vaccines. Conspiracy theories, especially worries about tracking devices, microchips, and telecommunication signals, were one of the most talked-about topics. This topic included a lot of keywords, such as “tracking”, “tracker”, “chips”, “microchips”, and “phone”. It sounds like people are talking about vaccines and technologies for spying on people and implementing chips and microchips via the vaccine. As demonstrated in other studies, such as the sentiment analysis by Chandrasekaran et al. [24], emotionally charged and tech-focused keywords often make false information more popular online. To come up with effective ways to deal with misinformation where it has the most impact, it is important to understand the context of these conversations.

The main contribution of this study is that it combines topic modeling of user comments with estimated user locations and public health data. This method helps us to better understand where false information spreads and how it might influence people’s behavior. Previous researchers have tended to examine misinformation without indicating where it was the most prevalent. Bozarth and Budak [26] discovered that the quantity of misinformation on Reddit was proportional to the population of users in each U.S. state, but they did not investigate the topics discussed. Our approach adds this detail, showing which themes can be found in various regions. By using Reddit as a data source and connecting regional discussions to county-level vaccination rates, this study offers helpful insights into health communication and future tools for tracking misinformation. Likewise, Habib and Nithyanand [27] performed topic modeling on Reddit vaccine-related conversations. In their study, safety concerns were one of the key topics, but regional variations were not analyzed. Our results indicate that regional patterns may be used to identify specific misinformation subjects that national averages may miss.

The advantage of this analysis is confirmed by recent research employing similar analyses on other platforms. As an example, Gozzi et al. [28] applied topic modeling and geolocation to Twitter to gain a better understanding of the public COVID-19 discourse, and Bozarth et al. [29] created a map of misinformation networks to develop more effective interventions. These demonstrate that a content analysis combined with geographic data may assist health officials in drawing region-specific plans.

By identifying user locations and analyzing online discussions using topic modeling, this study shows how vaccine misinformation varies across different regions, although other factors also shape certain misinformation patterns [24]. Multnomah County (Oregon), the District of Columbia, and New Castle County (Delaware), which have high vaccination rates and less misinformation, may reflect the impact of effective communication and public engagement. In contrast, areas with lower vaccination rates and more false information, such as Ada County (Idaho), Newton County (Missouri), Flathead County (Montana), and Cloud County (Kansas), indicate where better support and clear messages may be needed. These differences show how important it is to tailor health communication to the needs of each community.

5. Limitations

Our study provides useful information on how vaccine-related misinformation on Reddit varies by area, but it also has some limitations that need to be pointed out. First, geolocation based on subreddit activity might not always show where users really are because people from different regions may be participating in region-specific discussions or because the names of some subreddits are not always clear. Second, labeling comments by hand adds a subjective element, even though strict rules were followed. Third, our study only looked at the Reddit platform. This means that it may not be true for other sites or for the way people act in general. To conclude, linking subreddit activity to specific locations can help us estimate a user’s location based on their participation in location-based subreddits, but this method is not always accurate. These flaws show how important it is to employ a range of tools and methods and to pay close attention to the data.

6. Conclusions

The geographic analysis of misinformation patterns in relation to vaccination rates shows that we need to use different methods to identify and fight the misinformation. Vaccination rates and misinformation levels may be related, but other factors also contribute to how false information spreads. Geolocation data, vaccination rates, and the prevalence of false news all work together to provide a full picture of the problems that misinformation causes in different places.

In answering our research question, we found that counties with lower vaccination rates often had more vaccine misinformation on Reddit. This suggests that misinformation may be linked to how people decide whether or not to get vaccinated.

By examining social media conversations during the COVID-19 pandemic, this study offers insight into how misinformation may be linked to people’s decision-making. These results should be helpful in planning new ways to share accurate health information in the future, especially in places where people are more likely to be exposed to false information. Future research could expand this framework to other platforms or employ the real-time monitoring of misinformation trends by region.

Author Contributions

Conceptualization, L.A. and M.A.; methodology, L.A., M.A. and J.P.; validation, L.A., J.B. and Z.E.; formal analysis, L.A.; investigation, L.A.; resources, J.P. (data collection under the supervision of J.B.); data curation, L.A.; writing original draft preparation, L.A.; writing review and editing, L.A., J.B., Z.E. and M.A.; visualization, L.A.; supervision, J.B. and Z.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting this study were collected from publicly available Reddit discussions.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. World Health Organization. Managing the COVID-19 Infodemic: Promoting Healthy Behaviors and Mitigating the Harm from Misinformation and Disinformation. 2020. Available online: https://www.who.int/news/item/23-09-2020-managing-the-covid-19-infodemic-promoting-healthy-behaviours-and-mitigating-the-harm-from-misinformation-and-disinformation (accessed on 23 September 2020).
2. Cinelli, M.; Quattrociocchi, W.; Galeazzi, A.; Valensise, C.M.; Brugnoli, E.; Schmidt, A.L.; Zola, P.; Zollo, F.; Scala, A. The COVID-19 social media infodemic. Sci. Rep. 2020, 10, 16598.
3. Islam, M.S.; Sarkar, T.; Khan, S.I.; Kamal, A.H.M.; Hasan, S.M.M.; Kabir, A.; Yeasmin, D.; Islam, M.A.; Chowdhury, K.S.K.; Anwar, I.; et al. COVID-19–Related Infodemic and Its Impact on Public Health: A Global Social Media Analysis. Am. J. Trop. Med. Hyg. 2020, 103, 1621–1629.
4. Melton, C.A.; Olusanya, O.A.; Ammar, N.; Shaban-Nejad, A. Public Sentiment Analysis and Topic Modeling Regarding COVID-19 Vaccines on the Reddit Social Media Platform: A Call to Action for Strengthening Vaccine Confidence. J. Infect. Public Health 2021, 14, 1505–1512.
5. Zhang, Z.; Luo, L.; Fu, X.; Yang, J. Early Detection of Fake News on Social Media Through Propagation Path Classification with Recurrent and Convolutional Networks. Inf. Sci. 2019, 493, 298–315.
6. Patwa, P.; Sharma, S.; Pykl, S.; Guptha, V.; Kumari, G.; Akhtar, M.S.; Chakraborty, T. Fighting an Infodemic: COVID-19 Fake News Dataset. In Proceedings of the Combating Online Hostile Posts in Regional Languages During Emergency Situations: First International Workshop, CONSTRAINT 2020, Virtual, 8 February 2021; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 21–29.
7. Amjad, M.; Sidorov, G.; Zhila, A.; Gómez-Adorno, H.; Voronkov, I.; Gelbukh, A. Bend the Truth: Benchmark Dataset for Fake News Detection in Urdu Language and Its Evaluation. Data Brief 2020, 31, 105906.
8. Chen, Y.; Liu, S.; Yin, Y.; Jiang, W. Using Deep Learning Models to Detect Fake News About COVID-19. Chaos Solitons Fractals 2020, 140, 110122.
9. Sharma, K.; Yadav, K.; Yadav, N.; Ferdinand, K.C. Covid-19 on Social Media: Analyzing Misinformation in Twitter Conversations. J. Med. Syst. 2020, 44, 1–7.
10. Kabir, M.A.; Madria, S.K. CoronaVis: A Real-Time COVID-19 Tweets Data Analyzer and Data Repository. IEEE Access 2021, 9, 104515–104525.
11. Valdez, D.; Ten Thij, M.; Bathina, K.; Rutter, L.A.; Bollen, J. Social Media Insights Into US Mental Health During the COVID-19 Pandemic: Longitudinal Analysis of Twitter Data. J. Med. Internet Res. 2020, 22, e21418.
12. Mayank, M.; Sharma, S.; Sharma, R. DEAP-FAKED: Knowledge Graph-Based Approach for Fake News Detection. In Proceedings of the 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Istanbul, Turkey, 10–13 November 2022; pp. 47–51.
13. Oyebode, O.; Ndulue, C.; Mulchandani, D.; Suruliraj, B.; Adib, A.; Orji, F.A.; Orji, R. COVID-19 Pandemic: Identifying Key Issues Using Social Media and Natural Language Processing. J. Healthc. Inform. Res. 2022, 6, 174–207.
14. Alsudias, L.; Rayson, P. COVID-19 and Arabic Twitter: How Can Arab World Governments and Public Health Organizations Learn From Social Media? In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online, 5–10 July 2020.
15. Ajao, O.; Garg, A.; Da Costa-Abreu, M. Exploring Content-Based and Meta-Data Analysis for Detecting Fake News Infodemic: A Case Study on COVID-19. In Proceedings of the 2022 12th International Conference on Pattern Recognition Systems (ICPRS), Saint-Etienne, France, 7–10 June 2022; pp. 1–8.
16. Jlifi, B.; Sakrani, C.; Duvallet, C. Towards a Soft Three-Level Voting Model (Soft T-LVM) for Fake News Detection. J. Intell. Inf. Syst. 2022, 1–21.
17. Lin, Y.C.J. Establishing Legitimacy Through the Media and Combating Fake News on COVID-19: A Case Study of Taiwan. Chin. J. Commun. 2022, 15, 250–270.
18. Qazi, U.; Imran, M.; Ofli, F. GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets With Location Information. SIGSPATIAL Spec. 2020, 12, 6–15.
19. Wani, M.A.; Qazi, A.; Zahid, M.; Syed, T.A. Temporal analysis and detection of COVID-19 misinformation on Twitter using ensemble machine learning models. J. Comput. Soc. Sci. 2023, 6, 101–123.
20. Das, S.; Mishra, S.; Mukherjee, A. A hybrid BERT-LDA model for topic-wise fake news detection during COVID-19. Online Soc. Netw. Media 2024, 36, 100762.
21. Amin, M.H.; Madanu, H.; Lavu, S.; Mansourifar, H.; Alsagheer, D.; Shi, W. Detecting Conspiracy Theory Against COVID-19 Vaccines. arXiv 2022, arXiv:2211.13003.
22. Lindelöf, G.; Aledavood, T.; Keller, B. Vaccine Discourse on Twitter During the COVID-19 Pandemic. arXiv 2022, arXiv:2207.11521.
23. SimpleMaps. United States Cities Database. 2024. Available online: https://simplemaps.com/data/us-cities (accessed on 20 July 2024).
24. Chandrasekaran, R.; Mehta, V.; Valkunde, T.; Moustakas, E. Topics, Trends, and Sentiments of Tweets About the COVID-19 Pandemic: Temporal Infoveillance Study. J. Med. Internet Res. 2020, 22, e22624.
25. Mifrah, S.; Benlahmar, E.H. Topic Modeling Coherence: A Comparative Study between LDA and NMF Models using COVID-19 Corpus. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 5756–5761.
26. Bozarth, L.; Budak, C. Keyword expansion techniques for mining social movement data on social media. EPJ Data Sci. 2022, 11, 30.
27. Habib, H.; Nithyanand, R. Exploring the magnitude and effects of media influence on Reddit moderation. In Proceedings of the International AAAI Conference on Web and Social Media, online, 15 May–15 July 2022; Volume 16, pp. 275–286.
28. Gozzi, N.; Chinazzi, M.; Dean, N.E.; Longini, I.M.; Halloran, M.E.; Perra, N.; Vespignani, A. Estimating the impact of COVID-19 vaccine inequities: A modeling study. Nat. Commun. 2023, 14, 3272.
29. Bozarth, L.; Quercia, D.; Capra, L.; Šćepanović, S. The role of the big geographic sort in online news circulation among US Reddit users. Sci. Rep. 2023, 13, 6711.

Figure 1. Research methodology overview.

Figure 2. County-level vaccination percentages.

Figure 3. Percentage of false news in each county.
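
A per-county percentage of this kind can be computed as the share of a county's comments that carry a misinformation label. The following is a minimal sketch under stated assumptions: the record layout (a county name paired with a boolean label), the function name, and the toy data are all hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter


def misinformation_rate_by_county(records):
    """Return {county: percentage of comments labeled as misinformation}.

    `records` is an iterable of (county, is_misinformation) pairs. Both the
    record layout and the boolean label are assumptions for illustration.
    """
    totals = Counter()   # all comments seen per county
    flagged = Counter()  # comments labeled as misinformation per county
    for county, is_misinformation in records:
        totals[county] += 1
        if is_misinformation:
            flagged[county] += 1
    return {county: 100.0 * flagged[county] / totals[county] for county in totals}


# Hypothetical toy data, not taken from the study.
sample = [
    ("County A", True),
    ("County A", False),
    ("County A", True),
    ("County A", False),
    ("County B", False),
]
rates = misinformation_rate_by_county(sample)
```

Normalizing by each county's total comment count, rather than mapping raw counts, keeps counties with very different Reddit activity levels comparable, although percentages for counties with few comments remain noisy.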

Table 1. Overview of relevant studies on COVID-19 misinformation detection, highlighting key techniques, features, and gaps.

Study	Techniques	Features/Gaps Addressed
Zhang et al. [5]	Recurrent, convolutional neural networks	Early detection of fake news on social media
Patwa et al. [6]	Logistic regression, decision trees	Benchmark dataset for COVID-19 fake news detection
Amjad et al. [7]	Logistic regression	Fake news detection in low-resource languages
Chen et al. [8]	CNN, LSTM networks	Deep learning models for COVID-19 fake news detection
Sharma et al. [9]	Keyword filtering, manual coding	Misinformation patterns, geographical variability
Kabir and Madria [10]	NLP	Real-time analysis of COVID-19 tweets
Melton et al. [4]	Topic modeling, sentiment analysis	Addressing misinformation in Reddit discussions
Valdez et al. [11]	Longitudinal analysis	Impact of misinformation on mental health
Mayank et al. [12]	Knowledge graphs	Knowledge-driven approach for fake news detection
Oyebode et al. [13]	NLP	Extracting key insights from user-generated content
Alsudias and Rayson [14]	Linguistic analysis	Regional nuances in misinformation dissemination
Ajao et al. [15]	Content/meta-data analysis	Improved detection methods
Jlifi et al. [16]	Soft T-LVM	Novel model combining language and contextual factors
Lin [17]	Case study	Media’s role in combating fake news
Qazi et al. [18]	Dataset creation, multilingual geotagging	Cross-regional public health and misinformation analysis using geolocated tweets
Wani et al. [19]	Ensemble machine learning (XGBoost, RF)	Temporal modeling of COVID-19 misinformation spread on Twitter
Das et al. [20]	BERT + LDA hybrid model	Detection of topic-wise fake news clusters and misinformation intent classification

Table 2. LDA topic modeling results.

Topic	Description	Top Keywords
Topic 0	Humorous/Informal Vaccine Discussions	booster, vaccine, covid, dose, second, shot, people, card, months, day
Topic 1	Conspiracy Theories	vaccine, booster, dude, thank, microchips, tracking, control, gave, run, right
Topic 2	Public Debates on Vaccination	vaccine, people, covid, vaccines, vaccinated, virus, time, unvaccinated, spread, point
Topic 3	Brand Skepticism and Technical Concerns	booster, vaccine, moderna, shot, microchips, covid, pfizer, phone, yes, activation

Table 3. NMF topic modeling results.

Topic	Description	Top Keywords
Topic 0	Public Health Concerns	people, vaccinated, unvaccinated, virus, time, spread, point, cases, risk, health
Topic 1	Discussions Around Booster Shots/Vaccine Brands	booster, moderna, shot, pfizer, shots, second, dose, months, day, thanks
Topic 2	Conspiracy Theories	vaccine, microchips, chips, gates, tracking, right, chip, government, lol, thought
Topic 3	Confusion and Misinformation Blend	covid, vaccines, chips, flu, work, reception, prevent, tower, shot, vax
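
The "Top Keywords" columns in Tables 2 and 3 list the highest-weighted terms of each topic. For both LDA and NMF, such a list is commonly obtained by ranking the vocabulary by each topic's term weights (for instance, the rows of the `components_` attribute of a fitted scikit-learn model; the source does not state which library was used). A minimal sketch, assuming a hypothetical toy weight matrix and vocabulary rather than the study's fitted models:

```python
def top_keywords(topic_term_weights, vocabulary, n_top=3):
    """Return, for each topic, the n_top terms with the largest weights.

    `topic_term_weights` holds one row per topic and one weight per
    vocabulary term, in the same order as `vocabulary`.
    """
    result = []
    for weights in topic_term_weights:
        # Rank term indices from highest to lowest weight within this topic.
        ranked = sorted(range(len(vocabulary)), key=lambda i: weights[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n_top]])
    return result


# Hypothetical weights for illustration only; not the study's results.
vocabulary = ["booster", "vaccine", "microchips", "pfizer", "tracking"]
weights = [
    [0.40, 0.30, 0.05, 0.20, 0.05],  # a brand/booster-like topic
    [0.05, 0.25, 0.40, 0.02, 0.28],  # a conspiracy-like topic
]
keywords = top_keywords(weights, vocabulary)
```

The topic descriptions in the tables (for example, "Conspiracy Theories") are human-assigned interpretations of such keyword lists; neither model produces labels on its own.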

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

