Prompt-Driven LLM Pipeline for Topic Modeling of Multiple Sclerosis Social Media Discussions
Abstract
1. Introduction
- To empirically compare two prompt-based in-context learning strategies—zero-shot and few-shot prompting—for topic modeling of short, noisy, health-related social media texts.
- To assess topic quality through a multi-layered evaluation framework examining three dimensions—coherence, diversity, and document-level faithfulness—combining GPT-4-based automated scoring with human validation for coherence and faithfulness.
2. Materials and Methods
2.1. Data Collection
2.2. Data Pre-Processing
2.3. Prompt-Based Topic Modeling
- Extract Topics per Post (Task 1): The first task extracted candidate topics from each post, using the zero-shot prompt shown in Listing A1 (Appendix A). The LLM was asked to break the post down into its key ideas and implied meanings before assigning final topic labels, with the explicit instruction: “Now, based on these identified components, what specific topics accurately represent the content of the post? Consider the relationships between the different themes and how they connect to broader categories.” The prompt input was the post, and the output was a list of the most relevant topics (e.g., [topic1, topic2]), with the number of topics determined by the LLM based on the post’s complexity. Where no relevant topics could be identified, the LLM was instructed to output “NA” to reduce the risk of hallucinated outputs. This prompt structure was intended to tie the model’s output to the semantic evidence in the source text and thereby support document-level faithfulness. The few-shot approach supplemented the same instructions with five illustrative examples, shown in Listing A2 (Appendix A), each pairing a generated post with its corresponding set of extracted topics. These examples were initially generated by the LLM to mirror the style of the original dataset and were then reviewed and verified by humans to ensure correctness.
- Topic frequency calculation (Task 2): The purpose of this task was to identify the most prominent and recurring topics discussed by people with MS, enabling the filtering of broad or general topics that did not meaningfully contribute to the analysis. Following topic extraction, the frequency of each identified topic was calculated across the entire corpus, with a dictionary utilized to aggregate occurrences of each unique topic and generate a structured list pairing each topic with its corresponding frequency count. A filtering mechanism was then applied to discard overly broad topics—such as “Multiple Sclerosis”—as these referred to the condition itself rather than capturing a specific or meaningful aspect of patient experience.
- Semantic grouping (Clustering) (Task 3): In this task, the LLM was guided to identify underlying semantic relationships among the provided posts and their corresponding topics, including less obvious connections. The prompt directed the model to follow a structured chain-of-thought reasoning process, explicitly instructing: “As you analyze each topic, articulate your thought process in a structured, step-by-step manner”, with the key objective of discovering hidden themes: “Actively search for themes that may not be immediately apparent, employing a systematic approach to identify overarching concepts or narratives that connect seemingly disparate topics”. The LLM was further tasked with generating representative cluster names ensuring clarity, conciseness, and minimal overlap, as stated in the prompt: “Use clear, concise, and standardized language for each cluster name. Aim for consistency in terminology across multiple runs and ensure minimal overlap between clusters”. The input consisted of a list of posts and their corresponding topics, and the output required both a detailed chain-of-thought justification for each clustering decision and a structured presentation of the resulting clusters. Specific constraints were applied to ensure quality, reproducibility, and thematic coherence. Notably, no topics were discarded due to rare occurrences—given the relatively small dataset (n = 504), infrequent topics could still represent meaningful aspects of patient experience, such as discussions of specific symptoms or emotional states that are uncommon but deeply relevant to those affected. The full prompt is presented in Listing A3 given in Appendix A.
- Theme Assignment to Post (Task 4): In this task, the LLM was instructed to categorize a given post by assigning the most semantically similar theme from a strictly predefined list derived from Task 3. The process followed a structured sequence of instructions as defined in the prompt. First, the LLM interpreted the content of the post, as instructed: “Carefully read and understand the content of the provided post”. Second, the LLM interpreted the meaning of each theme in the predefined list, as directed: “Carefully read and understand the content of the Themes list”. Third, the semantic similarity between the post and each theme was evaluated, as stated: “For each post assess the semantic similarity between the post’s content and the Theme’s meaning”. Fourth, the single most semantically aligned theme was selected and returned, as explicitly constrained: “select the single theme from the themes list that exhibits the absolute highest degree of semantic similarity with the post’s content”. Finally, if no suitable theme was identified, the model was instructed to assign the label “Unrelated”. The input to the prompt consisted of the post and the predefined list of themes, and the expected output was a single theme label from that list. The full prompt is presented in Listing A4 (Appendix A).
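The post-processing around Tasks 1, 2, and 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`parse_topics`, `topic_frequencies`, `validate_theme`), the lower-casing of topic labels, the assumption that the model reply contains only the bracketed list or “NA”, and the fallback to “Unrelated” for out-of-list labels are all hypothetical choices. Only the dictionary-based frequency aggregation, the “NA” convention, and the removal of the overly broad topic “Multiple Sclerosis” come from the description above.

```python
from collections import Counter

# Assumed stop-list of overly broad labels; the text names only "Multiple Sclerosis".
BROAD_TOPICS = {"multiple sclerosis"}


def parse_topics(raw_output: str) -> list[str]:
    """Parse a Task 1 reply such as "[fatigue, MRI results]" or "NA".

    Assumes the reply contains only the final list, with no reasoning text.
    """
    text = raw_output.strip()
    if text.upper() == "NA":
        return []  # the prompt reserves "NA" for posts without identifiable topics
    text = text.strip("[]")
    # Lower-casing is an assumed normalisation so "Fatigue" and "fatigue" merge.
    return [t.strip().lower() for t in text.split(",") if t.strip()]


def topic_frequencies(outputs: list[str]) -> dict[str, int]:
    """Task 2: aggregate topic counts over the corpus and drop broad topics."""
    counts = Counter()
    for raw in outputs:
        counts.update(parse_topics(raw))
    # most_common() orders topics by descending frequency
    return {t: n for t, n in counts.most_common() if t not in BROAD_TOPICS}


def validate_theme(raw_output: str, themes: list[str]) -> str:
    """Task 4: accept only labels from the predefined list (assumed safeguard)."""
    label = raw_output.strip()
    return label if label in themes else "Unrelated"


# Hypothetical model replies for three posts
replies = ["[fatigue, Multiple Sclerosis]", "NA", "[Fatigue, MRI results]"]
print(topic_frequencies(replies))
print(validate_theme("Treatment Management", ["Treatment Management"]))
```

Checking the Task 4 label against the predefined list outside the model is a cheap guard: even with a strict prompt, a returned label that is not in the list can be caught before it enters the evaluation.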
2.4. Evaluation
2.4.1. Topic Coherence Evaluation
- LLM-based: The prompt used in this evaluation is presented in Listing A5 (Appendix A). For each set of generated topics obtained through zero-shot and few-shot prompting, GPT-4 was prompted to provide a coherence score on a scale of 1 to 5, where 1 indicates low coherence and 5 indicates high coherence. Because these were prompt-elicited judgements rather than formula-derived statistics, no arithmetic formula was applied to obtain an individual score. To account for run-to-run variability in LLM outputs, each configuration was evaluated across three independent runs and the mean score was computed as $\bar{S} = (S_1 + S_2 + S_3)/3$, where $S_1$, $S_2$, and $S_3$ denote the scores assigned by GPT-4 in runs 1, 2, and 3, respectively. The prompt also requested the LLM to provide a brief justification for the assigned score, thereby capturing qualitative insights into the model’s reasoning.
- Human assessment: A human-centered evaluation was conducted using a word intrusion task to assess how coherent the generated themes appeared to human annotators. The task was originally introduced as a method for evaluating topic model quality through human judgment [34]. Each theme was presented as a list of its constituent topics, with one unrelated intruder word inserted, selected randomly from a different cluster. Annotators were asked to identify the intruder word within each theme. Higher accuracy in identifying the intruder was interpreted as an indication of higher human-perceived coherence, as it suggested that the remaining topics formed a clearly recognizable and semantically coherent group. The survey was administered online via Google Forms to nine annotators, each holding a minimum of a bachelor’s degree, aged 25 years or older, and drawn from diverse educational and professional backgrounds. All annotators participated voluntarily, were fully informed of the task prior to participation, and completed the task independently for all themes across both prompt-based configurations (zero-shot and few-shot), resulting in nine judgements per cluster per configuration. Two complementary measures are reported. The first is percentage agreement, which captures the proportion of annotators who correctly identified the intruder and serves as the primary indicator of human-perceived coherence. It was calculated for each cluster $c$ as $\mathrm{PA}(c) = \frac{n_c}{N} \times 100$, where $n_c$ is the number of annotators who correctly identified the intruder in cluster $c$ and $N = 9$ is the number of annotators; PA(c) is reported as “Agreement (%)” in Section 3. The second is Fleiss’ kappa [35], reported as a supplementary reliability indicator. However, kappa values are subject to a well-documented paradox whereby high observed agreement can yield low or even negative kappa coefficients when one response category is highly prevalent [36,37]. Fleiss’ kappa was computed using the fleiss_kappa function from the statsmodels.stats.inter_rater module in Python 3.12.
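The two human-assessment measures, and the kappa paradox mentioned above, can be illustrated with a short dependency-free sketch. The study itself used the fleiss_kappa function from statsmodels; the code below is a plain re-implementation of the standard Fleiss’ kappa formula for illustration only. Coding each judgement as one of two categories (correct or incorrect identification) is an assumption about how the ratings were tabulated, and the input table is hypothetical data, not the study’s results.

```python
import math


def percentage_agreement(n_correct: int, n_annotators: int) -> float:
    """Share of annotators (in %) who correctly identified the intruder."""
    return 100.0 * n_correct / n_annotators


def fleiss_kappa(table: list[list[int]]) -> float:
    """Fleiss' kappa; table[i][j] = annotators assigning item i to category j."""
    n_items = len(table)
    n_raters = sum(table[0])
    if any(sum(row) != n_raters for row in table):
        raise ValueError("every item must be rated by the same number of annotators")
    n_cats = len(table[0])
    # Observed agreement: mean over items of the pairwise agreement proportion
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items
    # Expected chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    if math.isclose(p_e, 1.0):
        return math.nan  # undefined when every rating falls in a single category
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical data: 4 clusters, 9 annotators, 8 correct and 1 incorrect per cluster
table = [[8, 1]] * 4
agreement = [percentage_agreement(row[0], sum(row)) for row in table]
kappa = fleiss_kappa(table)
print(agreement, kappa)
```

In this hypothetical table every cluster reaches about 88.9% agreement, yet kappa is −0.125, because the dominant “correct” category pushes expected chance agreement above the observed pairwise agreement. This is the prevalence effect that motivates treating percentage agreement as the primary measure.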
2.4.2. Topic Diversity Evaluation
2.4.3. Document-Level Faithfulness Evaluation
3. Results
3.1. Extracted Themes
3.1.1. Themes Identified via Zero-Shot Prompting
3.1.2. Themes Identified via Few-Shot Prompting
3.2. Topic Modeling Evaluation
3.2.1. Coherence
- LLM-based evaluation: Both configurations achieved high coherence scores, with a mean score of 5.0 for the zero-shot configuration and 4.9 for the few-shot configuration. Each configuration was evaluated in three separate runs, and the mean score for each approach was then calculated, as presented in Table 4.
- Word intrusion task: The following results are based on a word intrusion task in which annotators were asked to identify an intruder word inserted among the topics of each theme. Higher accuracy in identifying the intruder reflected stronger human-perceived coherence of the theme. Two complementary measures are reported: percentage agreement, and Fleiss’ kappa as a supplementary reliability indicator. For the zero-shot configuration, the overall percentage agreement was 79.2%, with three clusters (Chronic Illness Experience, Healthcare Access and Costs, and Personal Growth and Acceptance) achieving perfect agreement (100%). The remaining clusters ranged from 55.6% to 77.8%, with Patient Experience and Treatment recording the lowest agreement at 55.6%, suggesting relatively weaker perceived coherence for this cluster. The inter-rater reliability for the zero-shot configuration, measured using Fleiss’ kappa, fell in the slight agreement range [39]. Per-cluster agreement rates are presented in Table 5. For the few-shot configuration, the overall percentage agreement was 87.5%, with Mental Health and Emotional Well-being and Community and Family Support achieving perfect agreement (100%). The remaining clusters ranged from 66.7% to 88.9%, with Coping Strategies and Resilience recording the lowest agreement at 66.7%, which may indicate that themes involving overlapping emotional and behavioral dimensions are harder for annotators to distinguish, regardless of prompting strategy. The inter-rater reliability for the few-shot configuration, measured using Fleiss’ kappa, was negative, classified as poor agreement [39]. As discussed in Section 2.4.1, this negative value is attributable to the prevalence effect [36,37] rather than genuine annotator disagreement: when one response category strongly dominates, expected chance agreement is inflated and kappa is artificially suppressed.
Percentage agreement was therefore retained as the primary measure of human-perceived coherence. Per-cluster agreement rates are presented in Table 6. Notably, the few-shot configuration produced no cluster below 66.7% agreement, whereas the zero-shot configuration recorded a minimum of 55.6% for Patient Experience and Treatment, suggesting that few-shot prompting yields more consistently coherent themes across clusters, not only on average.
3.2.2. Diversity
3.2.3. Document-Level Faithfulness
3.3. LLM Reproducibility
4. Discussion
Limitations
- Single dataset and LLM configuration: The analysis was conducted on a single social media dataset comprising 504 posts using one LLM configuration (GPT-4o-mini, temperature = 0.0). The reported findings are therefore indicative within this specific experimental setting and should not be generalized across datasets, domains, languages, or model architectures without further empirical validation.
- Language and platform scope: The dataset comprised English-language posts collected from a single platform (Platform X). The findings may not generalize to other languages, platforms, or communication styles, and extending the pipeline to multilingual and multi-platform datasets remains a direction for future work.
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
| Listing A1. Task 1: LLM prompting (zero-shot prompt). |
| You are an expert data analyst. Let’s analyze the following post step-by-step to identify the most relevant topics. post: ’p’ First, let’s break down the post into its key components and ideas. What are the main subjects or themes mentioned? Think carefully about the context and any implied meanings. Now, based on these identified components, what specific topics accurately represent the content of the post? Consider the relationships between the different themes and how they connect to broader categories. Finally, provide a list of the most relevant topics, formatted as: [topic1, topic2 (if exist), topic 3 (if exist)]. The number of topics should be determined by the model based on the complexity of the post. If after careful consideration, you determine that no relevant topics can be identified, or if the post lacks coherent themes, assign “NA”. Let’s think step by step. |
| Listing A2. Task 1: LLM prompting (examples for few-shot prompt). |
| .. .. .. .. Let’s think step by step. Here are some examples that might help: Example 1: Post: I have been taking Ocrevus for 2 years now and I have no complaints Topics: treatment plan, treatment side effects, Ocrevus Example 2: Post: I just had my first MRI after I started Ocrevus, and it showed no new lesions, but I still feel like my symptoms are getting worse. I’m confused. Topics: MRI results, brain lesions, treatment effectiveness Example 3: Post: I don’t think I’d ever stop taking a dmt. But after talking it over with my neurologist for several years, I decided to switch from Ocrevus to Kesimpta this October. Topics: medication plan Example 4: Post: I’m really heartbroken to share that I’ve been diagnosed with multiple sclerosis and that it’s only expected to progress. Topics: Disability Example 5: Post: I took four naps today and I’m still wiped out. Topics: Low power, fatigue |
| Listing A3. Task 3: clustering prompt. |
| You are an expert in semantic topic clustering and thematic analysis of social media posts. Your task is to analyze a list of posts and, if provided, related topics, and group them into coherent clusters (Themes) based on their underlying semantic relationships, including identifying hidden or less obvious themes. Given the following list of posts: {Posts} and the related Topics: {topics}. **Instructions:** 1. **Analyze the Posts and Topics:** Carefully read and understand the content of each post provided, and the corresponding list of topics that were previously extracted from each post. Consider the precise meaning of each topic, its potential connotations, and how they might relate to one another. Ensure that each topic is analyzed independently before comparing it to others. 2. **Structured Chain of Thought Reasoning:** As you analyze each topic, articulate your thought process in a structured, step-by-step manner. 3. **Rigorous Hidden Theme Discovery:** Actively search for themes that may not be immediately apparent, employing a systematic approach to identify overarching concepts or narratives that connect seemingly disparate topics. Document any implicit or latent connections you observe. 4. **Generate Standardized Cluster Names:** Use clear, concise, and standardized language for each cluster name (Theme). Aim for consistency in terminology across multiple runs. Prioritize commonly understood terms and ensuring minimal overlap between clusters. Strive for exhaustive coverage of all topics. 5. **Generate Representative Cluster Names:** For each cluster, create a concise and descriptive name that accurately reflects the included topics. Use the most representative topic or a general term that encapsulates the cluster’s content. **Output Format:** Please provide the clusters in the following format, including your chain of thought: **Chain of Thought:** * “Analyzing topic [topic1], I notice it shares [semantic connection] with [topic2]. 
This suggests a potential theme related to [theme].” * “Considering [topic3], its implication of [implication] strongly aligns with [theme].” * “I observe a less obvious theme connecting [topic4] and [topic5] through [hidden connection].” **Cluster Output:** Cluster Name 1: [topic1, topic2, …], Justification: [Detailed explanation using chain of thought] Cluster Name 2: [topic3, topic4, …], Justification: [Detailed explanation using chain of thought] **Example Output:** Cluster Name 1: **Chronic Illness Experience**: [chronic illness, multiple sclerosis (ms), chronic pain], Justification: This cluster encapsulates the shared experiences and challenges faced by individuals living with chronic conditions, particularly focusing on multiple sclerosis and the associated pain. **Constraints:** * Ensure that each cluster name is clear, concise, and representative of the grouped topics. * Ensure the reproducability of the clusters names. * Prioritize the identification of hidden themes and deeper semantic relationships. * Aim for thematic coherence and meaningful groupings. * Use the chain-of-thought process to thoroughly justify each cluster assignment. |
| Listing A4. Task 4: theme assignment to post prompt. |
| prompt = You are a highly skilled expert in social media content analysis and semantic similarity matching. Your task is to accurately categorize a given post based on its semantic alignment with a *strictly predefined* theme list. **Instructions:** 1. **Analyze Post Content:** Carefully read and understand the content of the provided post. 2. **Understand Themes list:** Carefully read and understand the content of the Themes list(clusters). 3. **Determine Semantic Similarity between the post and the appropiate theme within the given (predifined) theme List:** For each post assess the semantic similarity between the post’s content and the Theme’s meaning. *Crucially, your selection MUST be made ONLY from the themes present in the provided listThemes.* 4. **Assign Theme with Highest Semantic Similarity (From Predefined List ONLY):** Select the SINGLE Theme from Themes list that exhibits the ABSOLUTE HIGHEST degree of semantic similarity with the post’s content. *You MUST NOT generate or select themes outside of this list.* 5. **Handle Unrelated Cases (Within Predefined List):** If none of the major topics in Themes list are a good semantic fit for the post, assign the topic “Unrelated”. **Crucial Constraints:** * **STRICT ADHERENCE TO PREDEFINED LIST:** * You MUST ONLY select topics from the Themes list. Generating or selecting themes outside of this list is strictly prohibited.* * **Semantic Focus:** Prioritize semantic similarity over simple keyword matching. * **No Explanations:** Do not include any additional explanations or formatting when returning the assigned themes. * **Single Theme Output (Highest Similarity):** Return ONLY ONE Theme, the one with the HIGHEST semantic similarity. * **Handle Unrelated:** if no topic from the predfined Theme list is appropriate please assign “Unrelated” as a topic. 
* **If unable to complete the task, return “Unable to fulfill request”** **Provided post:** post **Themes list (STRICTLY USE ONLY THESE):** Themes **Begin Analysis:** |
| Listing A5. Coherence prompt. |
| Given a set of themes and their associated topic words in {Clusters} 1. **Assess Semantic Relatedness:** Critically evaluate the semantic relatedness of the keywords and phrases within each individual cluster.Consider the following questions: * Do the terms within a cluster share a common underlying concept or theme? * Are there any terms that appear to be outliers or do not logically fit with the other keywords in the cluster? 2. **Provide a Coherence Score (Quantitative):** derive a rough quantitative score on a scale of 1 to 5 that represents the *average* coherence of the clusters within that set. * **Scoring Guidelines:** * **5 (Near-Perfect Coherence):** All clusters within the set exhibit strong internal semantic relatedness, with keywords clearly belonging together. * **4 (Good Coherence):** Most clusters demonstrate good internal semantic relatedness. * **3 (Moderate Coherence):** Some clusters show reasonable internal relatedness, while others might contain more ambiguous or less relevant keywords. * **2 (Low Coherence):** Several clusters exhibit weak internal semantic relatedness, with multiple keywords appearing out of place. * **1 (Significant Incoherence):** Many or all clusters demonstrate very poor internal semantic relatedness, with keywords appearing largely unrelated. *Output Format:** Structure your evaluation as follows: Overall Coherence Score (1–5): [OVERALL_AVERAGE_COHERENCE_SCORE] (a numerical rating) |
| Listing A6. Diversity prompt. |
| Given a set of themes and their associated topic words in {Clusters} 1. **Assess Cluster Distinctiveness: Evaluate the thematic diversity represented by the set of clusters. Consider the following questions: * Are the central concepts of the different clusters within the same set clearly differentiated? * Does the collection of clusters within a set capture a broad range of distinct themes or are they focused on a narrow set of related concepts 2. **Provide a Diversity Score (Quantitative):** derive a rough quantitative score on a scale of 1 to 5 that represents the thematic diversity of the clusters within that set. * **Scoring Guidelines:** * **5 (High Diversity):** The clusters within the set represent highly distinct and varied themes with minimal semantic overlap. The set captures a broad spectrum of concepts. **4 (Good Diversity):** Most clusters within the set represent distinct themes with generally low semantic overlap. The set covers a reasonably broad range of concepts. * **3 (Moderate Diversity):** Some clusters represent distinct themes, while others may exhibit noticeable semantic overlap or the set might focus on a moderately narrow range of concepts. * **2 (Low Diversity):** Several clusters show significant semantic overlap, indicating limited thematic variety. The set likely focuses on a narrow range of interconnected concepts. * **1 (Very Low Diversity):** The clusters within the set are largely redundant, with substantial semantic overlap, indicating a lack of distinct themes. The set focuses on a very narrow or even singular concept. **Output Format:** Structure your evaluation as follows: Overall Diversity Score (1–5): [OVERALL_AVERAGE_DIVERSITY_SCORE] (a numerical rating) |
| Listing A7. Document-level faithfulness prompt. |
| You are an expert in media data analysis. Your task is to evaluate the faithfulness of a given ’Assigned Theme’ to the content of a ’post’. **Post:** “post” **Assigned Theme:** “assigned_theme” **Instructions:** 1. Determine if the ’Assigned Theme’ accurately reflects the primary subject matter or core message of the ’post’. 2. Consider if the theme captures the essence of what the post is primarily discussing. 3. Output ONLY one of the following labels: ’faithful’ or ’not faithful’. **Definition of Faithfulness:** * **faithful:** The assigned theme clearly and accurately represents the main topic, or experience conveyed in the post. * **not faithful:** The assigned theme is either irrelevant, too broad, too narrow, or misrepresents the primary content of the post. |
Appendix B
| Theme (Run 1) | Theme (Run 2) | Stability Status |
|---|---|---|
| Chronic Illness Experience | Chronic Illness Experience | Identical |
| Mental Health and Resilience | Mental Health and Resilience | Identical |
| Community and Family Support | Community and Support Systems | Semantic Match |
| Financial Aspects | Healthcare Access and Costs | Semantic Match |
| Humor as Coping | Humor as Coping | Identical |
| Personal Growth and Motivation | Personal Growth and Acceptance | Semantic Match |
| Self-Care and Well-being | Daily Life Challenges | Thematic Shift |
| Awareness and Advocacy | Patient Exp. and Treatment | Thematic Shift |

| Theme (Run 1) | Theme (Run 2) | Stability Status |
|---|---|---|
| Chronic Illness Experience | Chronic Illness Experience | Identical |
| Mental Health and Well-being | Mental Health and Well-being | Identical |
| Community and Family Support | Community and Family Support | Identical |
| Treatment Management | Treatment Management | Identical |
| Coping and Resilience | Coping Strategies and Resilience | Semantic Match |
| Economic Impact of Illness | Healthcare Access and Challenges | Semantic Match |
| Finding Joy in Struggles | Humor and Positivity | Semantic Match |
| Healthcare Access | Physical Limitations and Disability | Thematic Shift |
References
- Doshi, A.; Chataway, J. Multiple sclerosis, a treatable disease. Clin. Med. 2017, 17, 530–536. [Google Scholar] [CrossRef] [PubMed]
- Walton, C.; King, R.; Rechtman, L.; Kaye, W.; Leray, E.; Marrie, R.; Robertson, N.; Rocca, N.L.; Uitdehaag, B.; van der Mei, I.A.; et al. Rising prevalence of multiple sclerosis worldwide: Insights from the Atlas of MS, third edition. Mult. Scler. 2020, 26, 1816–1821. [Google Scholar] [CrossRef] [PubMed]
- Tahernia, H.; Esnaasharieh, F.; Amani, H.; Milanifard, M.; Mirakhori, F. Diagnosis and Treatment of MS in Patients Suffering from Various Degrees of the Disease with a Clinical Approach: The Orginal Article. J. Pharm. Negat. Results 2022, 13, 1908–1921. [Google Scholar]
- Topcu, G.; Mhizha-Murira, J.R.; Griffiths, H.; Bale, C.; Drummond, A.; Fitzsimmons, D.; Potter, K.J.; Evangelou, N.; das Nair, R. Experiences of receiving a diagnosis of multiple sclerosis: A meta-synthesis of qualitative studies. Disabil. Rehabil. 2023, 45, 772–783. [Google Scholar] [CrossRef]
- Amato, M.; Ponziani, G.; Rossi, F.; Liedl, C.; Stefanile, C.; Rossi, L. Quality of life in multiple sclerosis: The impact of depression, fatigue and disability. Mult. Scler. 2001, 7, 340–344. [Google Scholar] [CrossRef]
- Pompili, M.; Forte, A.; Palermo, M.; Stefani, H.; Lamis, D.A.; Serafini, G.; Amore, M.; Girardi, P. Suicide risk in multiple sclerosis: A systematic review of current literature. J. Psychosom. Res. 2012, 73, 411–417. [Google Scholar] [CrossRef]
- Eizaguirre, M.; Yastremiz, C.; Ciufia, N.; Roman, M.S.; Alonso, R.; Silva, B.A.; Garcea, O.; Cáceres, F.; Vanotti, S. Relevance and Impact of Social Support on Quality of Life for Persons With Multiple Sclerosis. Int. J. MS Care 2022, 25 3, 99–103. [Google Scholar] [CrossRef]
- Churchill, R.; Singh, L. The Evolution of Topic Modeling. ACM Comput. Surv. 2022, 54, 215. [Google Scholar] [CrossRef]
- Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan Yousef, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2022, 112, 102131. [Google Scholar] [CrossRef]
- Corti, L.; Zanetti, M.; Tricella, G.; Bonati, M. Social media analysis of Twitter tweets related to ASD in 2019–2020, with particular attention to COVID-19: Topic modelling and sentiment analysis. J. Big Data 2022, 9, 113. [Google Scholar] [CrossRef] [PubMed]
- Lyu, J.C.; Han, E.L.; Luli, G.K. COVID-19 Vaccine-Related Discussion on Twitter: Topic Modeling and Sentiment Analysis. J. Med. Internet Res. 2021, 23, e24435. [Google Scholar] [CrossRef]
- Gabarron, E.; Dorronzoro, E.; Reichenpfader, D.; Denecke, K. What do autistic people discuss on Twitter? An approach using BERTopic modelling. In Caring is Sharing – Exploiting the Value in Data for Health and Innovation, Proceedings of MIE 2023; SAGE Publications: Thousand Oaks, CA, USA, 2023; Volume 302, pp. 403–407. [Google Scholar] [CrossRef]
- Min, S.; Han, J. Topic Modeling Analysis of Diabetes-Related Health Information during the Coronavirus Disease Pandemic. Healthcare 2023, 11, 1871. [Google Scholar] [CrossRef]
- Shankar, R.; Yip, A.W. Sentiment analysis and topic modeling of social media data to explore public discourse on irritable bowel syndrome. Sci. Rep. 2025, 15, 21550. [Google Scholar] [CrossRef]
- Prasad, A.; Shalmani, S.A.; He, L.; Wang, Y.; McRoy, S. Identifying Themes in Social Media Discussions of Eating Disorders: A Quantitative Analysis of How Meaningful Guidance and Examples Improve LLM Classification. BioMedInformatics 2025, 5, 40. [Google Scholar] [CrossRef]
- Giunti, G.; Claes, M.; Zubiete, E.; Rivera, O.; Gabarron, E. Analysing Sentiment and Topics Related to Multiple Sclerosis on Twitter. In Digital Personalized Health and Medicine; IOS Press: Amsterdam, The Netherlands, 2020; Volume 270, pp. 911–915. [Google Scholar] [CrossRef]
- Haag, C.; Steinemann, N.; Ajdacic-Gross, V.; Schlomberg, J.; Ineichen, B.; Stanikić, M.; Dressel, H.; Daniore, P.; Roth, P.; Ammann, S.; et al. Natural language processing analysis of the theories of people with multiple sclerosis about causes of their disease. Commun. Med. 2024, 4, 122. [Google Scholar] [CrossRef]
- Ahmed, M.; Tiun, S.; Omar, N.; Sani, N. Short Text Clustering Algorithms, Application and Challenges: A Survey. Appl. Sci. 2023, 13, 342. [Google Scholar] [CrossRef]
- Akash, P.S.; Chang, K.C.C. Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs. arXiv 2024, arXiv:2410.03071. [Google Scholar] [CrossRef]
- Egger, R.; Yu, J. A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Front. Sociol. 2022, 7, 886498. [Google Scholar] [CrossRef]
- Gupta, P.; Ding, B.; Guan, C.; Ding, D. Generative AI: A systematic review using topic modelling techniques. Data Inf. Manag. 2024, 8, 100066. [Google Scholar] [CrossRef]
- Janssens, W.; Bogaert, M.; den Poel, D.V. LLM-Assisted Topic Reduction for BERTopic on Social Media Data. arXiv 2025, arXiv:2509.19365. [Google Scholar] [CrossRef]
- Wang, H.; Prakash, N.; Hoang, N.; Hee, M.S.; Naseem, U.; Lee, R.K.W. Prompting Large Language Models for Topic Modeling. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 1236–1241. [Google Scholar] [CrossRef]
- He, W.; Hou, B.; Zheng, A.; Feng, Y.; Klein, A.; Oconnor, K.; Yang, S.; Shang, T.; Demiris, G.; Gonzalez, G.; et al. Advanced topic modeling with large language models: Analyzing social media content from dementia caregivers. Innov. Aging 2025, 9, S38–S47. [Google Scholar] [CrossRef]
- Doi, T.; Isonuma, M.; Yanaka, H. Topic Modeling for Short Texts with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop); Fu, X., Fleisig, E., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 21–33.
- Mu, Y.; Dong, C.; Bontcheva, K.; Song, X. Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling. arXiv 2024, arXiv:2403.16248.
- Guo, Y.; Liu, J.; Wang, L.; Qin, W.; Hao, S.; Hong, R. A Prompt-Based Topic-Modeling Method for Depression Detection on Low-Resource Data. IEEE Trans. Comput. Soc. Syst. 2024, 11, 1430–1439.
- Newman, D.; Lau, J.H.; Grieser, K.; Baldwin, T. Automatic Evaluation of Topic Coherence. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics; Kaplan, R., Burstein, J., Harper, M., Penn, G., Eds.; Association for Computational Linguistics: Los Angeles, CA, USA, 2010; pp. 100–108.
- Hoyle, A.; Goel, P.; Peskov, D.; Hian-Cheong, A.; Boyd-Graber, J.; Resnik, P. Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence. arXiv 2021, arXiv:2107.02173.
- Stammbach, D.; Zouhar, V.; Hoyle, A.; Sachan, M.; Ash, E. Revisiting Automated Topic Model Evaluation with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 9348–9357.
- Alansari, A.; Luqman, H. Large Language Models Hallucination: A Comprehensive Survey. arXiv 2026, arXiv:2510.06265.
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv 2023, arXiv:2205.11916.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903.
- Chang, J.; Boyd-Graber, J.; Gerrish, S.; Wang, C.; Blei, D. Reading Tea Leaves: How Humans Interpret Topic Models. Adv. Neural Inf. Process. Syst. 2009, 22, 288–296.
- Artstein, R.; Poesio, M. Survey Article: Inter-Coder Agreement for Computational Linguistics. Comput. Linguist. 2008, 34, 555–596.
- Feinstein, A.R.; Cicchetti, D.V. High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 1990, 43, 543–549.
- Dettori, J.R.; Norvell, D.C. Kappa and Beyond: Is There Agreement? Glob. Spine J. 2020, 10, 499–501.
- Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46.
- Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174.

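The human-validation tables below report, for each theme, how many of nine judgments were marked correct, together with an overall agreement percentage and Fleiss’ κ. The following Python sketch shows how such figures can be derived from the per-theme counts. It assumes that each theme is one item receiving nine binary (correct/incorrect) judgments. That reading is an assumption rather than something the tables state, although it does reproduce the reported values. The function names are illustrative only.

```python
from typing import Sequence


def percent_agreement(correct: Sequence[int], n_raters: int) -> float:
    """Share of all judgments marked correct, as a percentage."""
    return 100.0 * sum(correct) / (n_raters * len(correct))


def fleiss_kappa_binary(correct: Sequence[int], n_raters: int) -> float:
    """Fleiss' kappa for items that each received n_raters binary judgments.

    Assumes two categories (correct / incorrect); `correct` holds the
    number of "correct" judgments per item.
    """
    n_items = len(correct)
    pairs = n_raters * (n_raters - 1)
    # Per-item agreement: fraction of judgment pairs with the same label.
    p_items = [
        (c * (c - 1) + (n_raters - c) * (n_raters - c - 1)) / pairs
        for c in correct
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall label proportions.
    p_correct = sum(correct) / (n_raters * n_items)
    p_expected = p_correct ** 2 + (1 - p_correct) ** 2
    # Undefined (division by zero) if every judgment has the same label.
    return (p_bar - p_expected) / (1 - p_expected)


# "Correct (n/9)" columns copied from the two human-validation tables.
zero_shot = [9, 7, 5, 6, 9, 6, 6, 9]
few_shot = [8, 9, 9, 8, 6, 7, 8, 8]

for name, counts in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    agreement = round(percent_agreement(counts, 9), 1)
    kappa = round(fleiss_kappa_binary(counts, 9), 3)
    print(f"{name}: agreement = {agreement}%, Fleiss' kappa = {kappa}")
# zero-shot: agreement = 79.2%, Fleiss' kappa = 0.074
# few-shot: agreement = 87.5%, Fleiss' kappa = -0.016
```

Under this reading, most judgments fall in the "correct" category, so the expected chance agreement is already high (about 0.67 for zero-shot and 0.78 for few-shot). This leaves little room for κ to rise above zero even when raw agreement is high, which is the situation described by Feinstein and Cicchetti in the reference list.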

Themes and their comprising topics identified with zero-shot prompting.

| # | Theme Name | Comprising Topics |
|---|---|---|
| 1 | Chronic Illness Experience | chronic illness, multiple sclerosis (ms), chronic pain, chronic fatigue |
| 2 | Mental Health and Resilience | mental health, emotional well-being, resilience, coping mechanisms |
| 3 | Patient Experience and Treatment | medication, treatment efficacy, patient experience |
| 4 | Community and Support Systems | community support, family dynamics, support systems, emotional support |
| 5 | Healthcare Access and Costs | healthcare access, healthcare costs, insurance coverage |
| 6 | Humor as a Coping Mechanism | humor, emotional expression |
| 7 | Daily Life Challenges | fatigue, daily life challenges, symptoms |
| 8 | Personal Growth and Acceptance | personal growth, self-acceptance |

Themes and their comprising topics identified with few-shot prompting.

| # | Theme Name | Comprising Topics |
|---|---|---|
| 1 | Chronic Illness Experience | chronic illness, multiple sclerosis (ms), chronic pain, symptoms, symptom management, physical symptoms |
| 2 | Mental Health and Emotional Well-being | mental health, emotional impact, emotional resilience, emotional well-being |
| 3 | Community and Family Support | community support, family support, support systems |
| 4 | Treatment Management | treatment effectiveness, treatment options, medication management, treatment plan |
| 5 | Coping Strategies and Resilience | coping mechanisms, self-care, resilience, emotional well-being |
| 6 | Physical Limitations and Disability | disability, mobility challenges, physical limitations |
| 7 | Humor and Positivity | humor, positivity, positive mindset |
| 8 | Healthcare Access and Systemic Challenges | healthcare access, insurance coverage, healthcare costs |

Summary of topic quality scores by prompting method.

| Method | Coherence (LLM-Based) | Human Agreement | Diversity (LLM-Based) |
|---|---|---|---|
| Zero-Shot Prompting | 5.0 | 79.2% | 4.6 |
| Few-Shot Prompting | 4.9 | 87.5% | 4.6 |

LLM-based coherence scores across three runs.

| Approach | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| Zero-shot | 5.0 | 5.0 | 5.0 | 5.0 |
| Few-shot | 4.9 | 4.8 | 5.0 | 4.9 |

Human validation results per theme for zero-shot prompting.

| Cluster | Theme Name | Correct (n/9) | Agreement (%) |
|---|---|---|---|
| 1 | Chronic Illness Experience | 9/9 | 100.0 |
| 2 | Mental Health and Resilience | 7/9 | 77.8 |
| 3 | Patient Experience and Treatment | 5/9 | 55.6 |
| 4 | Community and Support Systems | 6/9 | 66.7 |
| 5 | Healthcare Access and Costs | 9/9 | 100.0 |
| 6 | Humor as a Coping Mechanism | 6/9 | 66.7 |
| 7 | Daily Life Challenges | 6/9 | 66.7 |
| 8 | Personal Growth and Acceptance | 9/9 | 100.0 |
| Overall | | | 79.2 |
| Fleiss’ κ | | | 0.074 (Slight) [39] |

Human validation results per theme for few-shot prompting.

| Cluster | Theme Name | Correct (n/9) | Agreement (%) |
|---|---|---|---|
| 1 | Chronic Illness Experience | 8/9 | 88.9 |
| 2 | Mental Health and Emotional Well-being | 9/9 | 100.0 |
| 3 | Community and Family Support | 9/9 | 100.0 |
| 4 | Treatment Management | 8/9 | 88.9 |
| 5 | Coping Strategies and Resilience | 6/9 | 66.7 |
| 6 | Physical Limitations and Disability | 7/9 | 77.8 |
| 7 | Humor and Positivity | 8/9 | 88.9 |
| 8 | Healthcare Access and Systemic Challenges | 8/9 | 88.9 |
| Overall | | | 87.5 |
| Fleiss’ κ | | | −0.016 (Poor) [39] |

LLM-based diversity scores across three runs.

| Approach | Run 1 | Run 2 | Run 3 | Average |
|---|---|---|---|---|
| Zero-shot | 4 | 5 | 5 | 4.6 |
| Few-shot | 5 | 5 | 4 | 4.6 |

Faithful rates and pairwise Cohen’s κ by rater for zero-shot and few-shot prompting.

| Scope | Rater/Pair | Zero-Shot Faithful Rate (%) | Zero-Shot Cohen’s κ | Few-Shot Faithful Rate (%) | Few-Shot Cohen’s κ |
|---|---|---|---|---|---|
| Stratified Sample (n = 126) | Ann1 | 66.7 | – | 69.8 | – |
| | Ann2 | 77.8 | – | 77.8 | – |
| | GPT-4 | 81.0 | – | 81.0 | – |
| | Ann1 vs. Ann2 | – | 0.688 | – | 0.756 |
| | Ann1 vs. GPT-4 | – | 0.440 | – | 0.411 |
| | Ann2 vs. GPT-4 | – | 0.661 | – | 0.613 |
| Full Corpus (n = 494/490) | Ann1 | 82.6 | – | 85.9 | – |
| | GPT-4 | 87.2 | – | 89.0 | – |
| | Ann1 vs. GPT-4 | – | 0.403 | – | 0.545 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Alamoudi, Y.; Babour, A.; Almatrafi, O. Prompt-Driven LLM Pipeline for Topic Modeling of Multiple Sclerosis Social Media Discussions. Appl. Sci. 2026, 16, 5316. https://doi.org/10.3390/app16115316

