Prompt-Driven LLM Pipeline for Topic Modeling of Multiple Sclerosis Social Media Discussions

Alamoudi, Yasmeen; Babour, Amal; Almatrafi, Omaima

doi:10.3390/app16115316

Open AccessArticle

Prompt-Driven LLM Pipeline for Topic Modeling of Multiple Sclerosis Social Media Discussions

by

Yasmeen Alamoudi

^*,

Amal Babour

and

Omaima Almatrafi

Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5316; https://doi.org/10.3390/app16115316

Submission received: 5 April 2026 / Revised: 17 May 2026 / Accepted: 18 May 2026 / Published: 26 May 2026

Download

Browse Figures

Versions Notes

Abstract

Despite advances in topic modeling, extracting coherent themes from short, noisy, health-related social media texts remains a methodological challenge. This paper presents a preliminary and exploratory empirical investigation of a prompt-driven Large Language Model (LLM) pipeline for topic modeling of 504 multiple sclerosis (MS)-related posts from Platform X. Two in-context learning strategies—zero-shot and few-shot prompting—were compared using GPT-4o-mini, with topic quality assessed through a multi-layered evaluation framework that leveraged GPT-4 for automated coherence and diversity scoring, complemented by human validation for coherence and document-level faithfulness. Within this experimental setting, few-shot prompting achieved the highest human agreement score (87.5%) alongside high coherence (4.9/5.0), while zero-shot prompting yielded the highest coherence score (5.0/5.0), with 79.2% human agreement. Both configurations demonstrated high thematic diversity (4.6/5.0) and produced themes largely judged as faithfully grounded in the source posts across the full corpus, with few-shot prompting demonstrating consistently stronger faithfulness and greater thematic stability. Nonetheless, these findings suggest that prompt-driven LLM pipelines — particularly few-shot approaches—show promise as a human-aligned and interpretable method for topic modeling of short, noisy, health-related social media texts.

Keywords:

multiple sclerosis; social media; natural language processing; topic modeling; large language models; prompt engineering; coherence; diversity

1. Introduction

Multiple Sclerosis (MS) is a chronic autoimmune disorder that affects the central nervous system [1], estimated to affect more than 2.8 million people worldwide [2]. MS is diagnosed primarily through a combination of clinical evaluations and imaging tests, such as magnetic resonance imaging (MRI). Currently, there is no known cure for MS, but various treatments are available that aim to manage symptoms and slow the progression of the disease. Physical symptoms of MS may include symptoms such as fatigue, weakness, numbness or tingling, muscle spasms, coordination difficulties, vision problems, and cognitive impairment [3]. Apart from physical symptoms, MS patients can develop cognitive and psychological symptoms. Coping with a diagnosis of MS can be challenging, often triggering a range of emotions such as denial, fear, distress, and anger as individuals navigate the changes and uncertainties it brings [4]. Furthermore, studies have shown that most people with MS frequently experience depression and fatigue, which have a high impact on their quality of life compared to their physical state or other characteristics [5]. In addition, people with MS have a higher risk of suicide compared to the general population, with the incidence of suicide among MS patients being seven times higher [6]. These symptoms are strongly associated with the social support patients receive. Those with a strong support system tend to experience lower levels of depression and fatigue compared to those without sufficient support [7]. Social media platforms have emerged as a digital space where people with MS openly share personal experiences, coping strategies, and evolving emotional states. This unstructured content represents a valuable resource for researchers and healthcare providers to understand the unmet needs of this population and inform the design of more effective health interventions. To effectively extract and analyze valuable insights from this unstructured social media content, topic modeling techniques provide a powerful method for identifying the most prevalent themes in a corpus and have been been widely applied across diverse domains, including political science, neuroscience, psychology, and bioinformatics [8,9], with notable applications in health-related social media research. Health-related social media data presents a particularly demanding context for topic modeling, combining informal language, clinical terminology, and emotional complexity in ways that make coherent theme extraction especially challenging. Studies have employed topic modeling to analyze public data across a range of health conditions, including COVID-19 [10,11], autism spectrum disorder [12], diabetes [13], irritable bowel syndrome [14], and eating disorders [15].

This paper deliberately focuses on MS-related social media specifically because patient-generated discussions about chronic conditions represent an underexplored yet clinically valuable source of insight into lived experiences that are absent from formal clinical records. Furthermore, the heterogeneous nature of MS—where patients experience widely varying symptoms and emotional responses—makes social media discussions a particularly compelling area for topic modeling, as they capture the full diversity of the patient experience. Giunti et al. [16] and Haag et al. [17] are among the few studies to have applied topic modeling to MS-related data, underscoring the relative scarcity of work in this area. However, extracting insights from unstructured, short-form text remains a persistent challenge for traditional topic modeling approaches [18] as they often struggle to capture context and semantic nuances in short texts due to their reliance on word co-occurrence patterns [19]. In related work, several studies have explored the use of BERT-based models for topic modeling of short medical texts [12,13,17]. Although BERT-based models improve upon traditional topic modeling approaches by leveraging word embeddings [20], they can still encounter challenges when applied to short texts, such as handling corpora with dynamic and rapidly evolving topics and interpreting the resulting topic representations [21]. To overcome the problem of excessive topics generated by the BERTopic model, Janssens et al. [22] proposed an LLM-assisted topic reduction approach for social media data. Their methodology first applies BERTopic to generate an initial set of topics, and subsequently leverages LLMs to iteratively identify and merge semantically similar topics. Recently, prompting LLMs has gained significant attention, leading researchers to experiment with different methods of prompting specifically for topic modeling. Wang et al. [23] proposed the PromptTopic approach and compared its performance against the state-of-the-art baseline. Additionally, Weiqing et al. [24] explored the direct application of LLMs for topic modeling of caregiver posts through a two-stage approach. Furthermore, Doi et al. [25] studied two different prompting approaches to prevent LLMs from processing many texts at once. Moreover, Yida et al. [26] assessed the performance of prompting GPT-3.5 and LLaMA-2-7B for topic modeling. Guo et al. [27] proposed a prompt-based method for detecting depression, and Prasad et al. [15] experimented with different prompting approaches to identify themes related to eating disorders. Despite the advances in prompting-based topic modeling, their evaluation still largely relies on traditional topic modeling metrics. Classical word co-occurrence measures played a foundational role in assessing frameworks like LDA, and Newman et al. [28] first demonstrated that such automated coherence measures could correlate with human judgement for classical topic models. However, these measures were developed for statistical models, and have not been extended or validated for advanced neural models including LLMs—a limitation that Hoyle et al. [29] characterize as a fundamental validation gap in the field. To bridge this gap, a more robust evaluation framework is necessary, one that is complemented by human judgment tasks such as word intrusion to ensure that automated assessments align with human perception. Additionally, automated quantitative evaluation often struggles to capture the deeper semantic meaning of topics or how well words relate to each other within a broader context. In response to these limitations, recent research by Stammbach et al. [30] has suggested using LLMs to evaluate topic models directly, showing that when LLMs are prompted to assess topic quality, their judgments correlate substantially more strongly with human evaluations than traditional automated metrics. Building on these insights and the recent advancements in prompt-based topic modeling, this paper focuses on the application of a prompt-driven LLM pipeline for topic modeling of MS-related social media data as an exploratory empirical test case. The primary contributions of this work are as follows.

To empirically compare two prompt-based in-context learning strategies—zero-shot and few-shot prompting—for topic modeling of short, noisy, health-related social media texts.
To assess topic quality through a multi-layered evaluation framework examining three dimensions—coherence, diversity, and document-level faithfulness—combining GPT-4-based automated scoring with human validation for coherence and faithfulness.

The remainder of this paper is organized as follows: Section 2 details the methodology, Section 3 presents the results, Section 4 discusses the findings and limitations, and Section 5 outlines the conclusion and directions for future work.

2. Materials and Methods

The methodology comprises four stages, illustrated in Figure 1: data collection, pre-processing, prompt-based topic modeling, and multi-layered evaluation. Initially, posts centering on MS were collected from Platform X to serve as a focused test case for short-form health discussions. These posts underwent multi-step pre-processing to reduce noise before entering the modeling phase. Next, GPT-4o-mini was applied using two in-context learning strategies—zero-shot and few-shot prompting—to identify themes through a shared pipeline of topic extraction and semantic grouping. Finally, the multi-layered evaluation stage assessed the results through a combination of GPT-4 automated scoring for coherence and diversity, alongside human validation via a word-intrusion task and document-level faithfulness assessment.

2.1. Data Collection

A corpus of MS-related posts was initially collected from Platform X using relevant hashtags, including #ms, #multiplesclerosis, #msawarenessmonth, #ocrevus, #msawareness, #mswarrior, #ppms, #mylifewithms, #thisisms, #livewithms, #kesimpta, #mri, #multiplessclerosis, and #mscommunity. The hashtag “ms” appeared as the most frequently used, followed by “multiplesclerosis” and “msawarenessmonth.” This suggests that people with MS actively engage in online discussions about the condition, especially during awareness campaigns. Furthermore, to maintain data consistency and facilitate analysis, only English-language posts were considered.

2.2. Data Pre-Processing

The collected posts were pre-processed prior to topic modeling to enhance data quality and facilitate a more robust analytical process. Irrelevant information was removed, and text consistency was ensured through several key steps. First, URLs, which are often irrelevant to the core content and not effectively processed by LLMs, were identified and replaced with spaces. Next, usernames—personal identifiers usually preceded by the (@) symbol—were removed to protect user privacy. To address formatting issues, multiple consecutive spaces were consolidated into single spaces, thus improving word segmentation. Punctuation marks, with the exception of specific symbols such as hashtags (#), were separated from words to prevent misinterpretation during analysis. Moreover, case folding was applied by converting all text to lowercase, ensuring that words with different capitalizations were recognized as the same term. Additionally, redundant entries—duplicate posts appearing more than once in the corpus—were identified and removed to ensure that each post contributed independently to the analysis. Following the application of these pre-processing steps, a final corpus of 504 MS-related posts was retained for analysis.

2.3. Prompt-Based Topic Modeling

The topic modeling process focused on evaluating two in-context learning approaches within a prompt-driven LLM pipeline: zero-shot prompting and few-shot prompting. Both strategies were applied using GPT-4o-mini, with the temperature set to 0.0 to minimize randomness, while all remaining hyperparameters were kept at their default settings. Unlike statistical or neural topic models, this pipeline did not involve parameters or mathematical formulas; the technical specification of each task was therefore fully defined by the prompt design, where each task was carefully developed and iteratively refined to address known LLMs limitations, such as hallucination or the generation of inaccurate content [31]. Chain-of-thought (CoT) prompting was implemented to guide the model through logical, step-by-step reasoning. The phrase “Let’s think step by step” was incorporated to further enhance accuracy, as demonstrated by Kojima et al. [32]. Additionally, role-based instructions (e.g., “You are an expert data analyst…”) were used to shape the model’s response style, with strict output constraints specifying the required response format. Two prompt engineering approaches were employed: zero-shot and few-shot prompting. In the zero-shot approach, the model was instructed to generate responses based solely on its existing knowledge. In contrast, few-shot prompting offered a more explicit and guided approach, providing the model with several detailed examples that illustrated the desired reasoning process [33]. Prompt-based topic modeling was performed in four sequential tasks, as illustrated in Figure 2. The first task used a prompt-based LLM to extract potential topics from each post. Next, topic frequencies were computed to assess their occurrence rates. After that, semantically similar topics were grouped into higher-level categories to identify the general themes of the dataset. Finally, upon the identification of the thematic clusters, each post within the corpus was assigned to the most relevant theme. This assignment allowed for a systematic categorization of the entire dataset.

Extract Topics per Post (Task 1):
The initial task involved the extraction of potential topics from each post, as illustrated in the zero-shot prompt provided in Listing A1 given in Appendix A. The LLM was tasked with deconstructing the provided post to identify its key ideas and implied meanings before assigning final topic labels and explicitly directing: “Now, based on these identified components, what specific topics accurately represent the content of the post? Consider the relationships between the different themes and how they connect to broader categories.” The prompt input consisted of the post, and the desired output was formatted as a list of the most relevant topics (e.g., [topic1, topic2]), with the number of identified topics determined by the LLM based on the post’s complexity. Importantly, in instances where no relevant topics could be identified, the LLM was instructed to output “NA” to minimize the risk of hallucinated outputs. By following this prompt structure, the model was compelled to map its output directly to the semantic evidence within the source text, thereby ensuring document-level faithfulness. In contrast, the few-shot approach supplemented these same instructions with five illustrative examples, as illustrated in Listing A2 given in Appendix A, each pairing a generated post with its corresponding set of extracted topics. These examples were initially generated by the LLM to mirror the style of the original dataset and were subsequently subjected to human review and verification to ensure correctness.
Topic frequency calculation (Task 2): The purpose of this task was to identify the most prominent and recurring topics discussed by people with MS, enabling the filtering of broad or general topics that did not meaningfully contribute to the analysis. Following topic extraction, the frequency of each identified topic was calculated across the entire corpus, with a dictionary utilized to aggregate occurrences of each unique topic and generate a structured list pairing each topic with its corresponding frequency count. A filtering mechanism was then applied to discard overly broad topics—such as “Multiple Sclerosis”—as these referred to the condition itself rather than capturing a specific or meaningful aspect of patient experience.
Semantic grouping (Clustering) (Task 3): In this task, the LLM was guided to identify underlying semantic relationships among the provided posts and their corresponding topics, including less obvious connections. The prompt directed the model to follow a structured chain-of-thought reasoning process, explicitly instructing: “As you analyze each topic, articulate your thought process in a structured, step-by-step manner”, with the key objective of discovering hidden themes: “Actively search for themes that may not be immediately apparent, employing a systematic approach to identify overarching concepts or narratives that connect seemingly disparate topics”. The LLM was further tasked with generating representative cluster names ensuring clarity, conciseness, and minimal overlap, as stated in the prompt: “Use clear, concise, and standardized language for each cluster name. Aim for consistency in terminology across multiple runs and ensure minimal overlap between clusters”. The input consisted of a list of posts and their corresponding topics, and the output required both a detailed chain-of-thought justification for each clustering decision and a structured presentation of the resulting clusters. Specific constraints were applied to ensure quality, reproducibility, and thematic coherence. Notably, no topics were discarded due to rare occurrences—given the relatively small dataset (n = 504), infrequent topics could still represent meaningful aspects of patient experience, such as discussions of specific symptoms or emotional states that are uncommon but deeply relevant to those affected. The full prompt is presented in Listing A3 given in Appendix A.
Theme Assignment to Post (Task 4): In this task, the LLM was instructed to categorize a given post by assigning the most semantically similar theme from a strictly predefined list derived from Task 3. The process followed a structured sequence of instructions as defined in the prompt. First, the LLM interpreted the content of the post, as instructed: “Carefully read and understand the content of the provided post”. Second, the LLM interpreted the meaning of each theme in the predefined list, as directed: “Carefully read and understand the content of the Themes list”. Third, the semantic similarity between the post and each theme was evaluated, as stated: “For each post assess the semantic similarity between the post’s content and the Theme’s meaning”. Fourth, the single most semantically aligned theme was selected and outputted, as explicitly constrained: “select the single theme from the themes list that exhibits the absolute highest degree of semantic similarity with the post’s content”. Finally, if no suitable theme was identified, the model was instructed to assign the label “Unrelated”. The input to the prompt consisted of the post and the predefined list of themes, and the expected output was a single theme label from that list. The full prompt is presented in Listing A4 given in Appendix A.

2.4. Evaluation

In this section, the themes generated by the prompt-based approaches are evaluated across three dimensions: (1) coherence, which measured the degree to which topics within a single theme were semantically related, assessed through both GPT-4-based automated scoring and a human word intrusion task; (2) diversity, which assessed how distinct the themes were from each other, evaluated through GPT-4-based scoring; and (3) document-level faithfulness, which assessed whether extracted themes were faithfully grounded in the content of the underlying source posts, evaluated through GPT-4-based scoring and validated against human annotation. Following the methodology of Stammbach et al. [30], GPT-4 was employed as the primary evaluator across all three dimensions, as their work demonstrated that LLM-based topic evaluation judgements correlated more strongly with human assessments than traditional automated metrics.

2.4.1. Topic Coherence Evaluation

Topic coherence reflects the degree to which the words and topics comprising a theme are semantically related and form a meaningful, interpretable unit [28]. In this paper, coherence was assessed through two complementary measures: an automated GPT-4-based scoring approach and a human word intrusion task, which served as an independent cross-validation measure of human-perceived coherence.

LLM-based:
The prompt used in this evaluation is presented in Listing A5 given in Appendix A. For each set of generated topics obtained through zero-shot prompting and few-shot prompting, GPT-4 was prompted to provide a coherence score on a scale of 1 to 5, where 1 indicates low coherence and 5 indicates high coherence. Since these were prompt-elicited judgements rather than formula-derived statistics, no arithmetic formula applied. To account for run-to-run variability in LLM outputs, each configuration was evaluated across three independent runs and the mean score was computed as

$Average = \frac{S_{1} + S_{2} + S_{3}}{3}$

(1)

where S₁, S₂, and S₃ denote the scores assigned by GPT-4 in runs 1, 2, and 3 respectively. The prompt also requested the LLM to provide a brief justification for the assigned score, thereby capturing qualitative insights into the model’s reasoning.
Human assessment:
A human-centered evaluation was conducted using a word intrusion task to assess the extent to which human annotators perceived the coherence of the generated themes. The task was originally introduced as a method for evaluating topic model quality through human judgment [34]. In this task, each theme was presented as a list of its comprising topics, with one unrelated intruder word inserted, selected randomly from a different cluster. Annotators were asked to identify the intruder word within each theme. Higher accuracy in identifying the intruder was interpreted as an indication of higher human-perceived coherence, as it suggested that the remaining topics formed a clearly recognizable and semantically coherent group. The survey was administered online via Google Forms to nine annotators, each holding a minimum of a bachelor’s degree, aged 25 years or older, and drawn from diverse educational and professional backgrounds. All annotators participated voluntarily, were fully informed of the task prior to participation, and completed the task independently for all themes across both prompt-based configurations—zero-shot and few-shot—resulting in nine judgements per cluster per configuration. Two complementary measures are reported: One is percentage agreement, which captured the proportion of annotators who correctly identified the intruder and served as the primary indicator of human-perceived coherence, calculated for each cluster c as

$P A (c) = \frac{number of annotators who correctly identified the intruder in cluster c}{total number of annotators} \times 100$

(2)

where PA(c) is reported as “Agreement (%)” in Section 3.
This is complemented by Fleiss’ kappa [35] as a supplementary reliability indicator. However, kappa values are subject to a well-documented paradox whereby high observed agreement can yield paradoxically low or even negative kappa coefficients when one response category is highly prevalent [36,37]. Fleiss’ kappa was computed using the fleiss_kappa function from the statsmodels.stats.inter_rater module in Python 3.12.

2.4.2. Topic Diversity Evaluation

Thematic diversity captured the degree to which the extracted themes were distinct from one another, assessing whether the pipeline produced a broad range of non-overlapping themes [30]. As detailed in the prompt in Listing A6 given in Appendix A, GPT-4 was instructed to assign an overall diversity score on a 1–5 scale, where 1 denotes very low diversity and 5 denotes high diversity, along with a qualitative justification. A high diversity score suggested that the model successfully identified a broad range of unique themes present in the corpus rather than producing redundant or overlapping topics. The evaluation was conducted across three independent runs, and the average score was computed following the same procedure described in Section 2.4.1 (Equation (1)).

2.4.3. Document-Level Faithfulness Evaluation

To assess whether extracted themes were faithfully grounded in the content of the underlying source posts, a document-level faithfulness evaluation was conducted across the full corpus of 504 posts. The evaluation employed a two-annotator design on a stratified random sample to establish inter-rater reliability, followed by a single-annotator extension to the remaining posts. For the stratified sample, 126 posts representing 25% of the total corpus were selected at a rate of approximately 14–15 posts per theme to ensure proportional coverage across all extracted themes for both prompt-based configurations. For each sampled post, the assigned theme was presented alongside the original post text. Two independent annotators judged, without reference to each other’s responses, whether the assigned theme faithfully reflected the semantic content of the source post, producing a binary judgement of faithful or not faithful. GPT-4 was additionally prompted to make the same binary judgement for each post using the prompt presented in Listing A7, providing an automated reference point for comparison. Agreement between the two human annotators was quantified using Cohen’s kappa [38], and human–LLM agreement was assessed by computing Cohen’s kappa between each annotator and GPT-4 independently. For the remaining 378 posts, one annotator independently assigned a faithfulness label to each post, and GPT-4 evaluated all remaining posts using the same faithfulness prompt. Posts assigned the label “Unrelated” during the theme assignment stage were excluded from the faithfulness evaluation prior to analysis. Cohen’s kappa was computed between the Annotator and GPT-4 across the full corpus to assess human–LLM agreement at scale using the cohen_kappa_score function from the sklearn.metrics module in Python 3.12. Given the exploratory scope of this work, all results are reported as indicative of general faithfulness trends rather than as statistically definitive conclusions.

3. Results

This section presents the findings of the empirical investigation across two prompt-based configurations: zero-shot and few-shot prompting. The section is organized as follows. First, the themes extracted by each configuration are presented. Second, a multi-layered evaluation is reported across three dimensions—coherence, diversity, and document-level faithfulness—combining GPT-4-based automated scoring with human validation. Finally, an examination of the reproducibility of the pipeline through repeated independent runs to assess thematic stability across both configurations is described. All findings are interpreted as indicative within this specific experimental setting and should not be generalized across datasets, domains, or model architectures without further empirical validation.

3.1. Extracted Themes

3.1.1. Themes Identified via Zero-Shot Prompting

When the zero-shot prompt (Listing A1 in Appendix A) was applied to extract themes from the 504 MS-related posts using GPT-4o-mini, eight distinct themes were identified, as presented in Table 1. The output was structured as a dictionary, where each theme serve as a key associated with a list of related topics that fell under that broader thematic category. For example, the theme “Mental Health and Resilience” encompassed topics such as mental health, emotional well-being, resilience, and coping mechanisms. This theme highlighted how individuals with MS navigate the psychological challenges of living with a chronic illness and the strategies they employ to maintain emotional balance.

3.1.2. Themes Identified via Few-Shot Prompting

When the few-shot prompt (Listing A2 in Appendix A) was applied to the 504 MS-related posts, eight distinct themes were identified, as presented in Table 2. For example, the theme “Physical Limitations and Disability” captured topics such as disability, mobility challenges, and physical limitations, reflecting the physical barriers faced by individuals with MS in relation to movement and functional independence.

3.2. Topic Modeling Evaluation

This section presents the evaluation findings across both prompt-based configurations—zero-shot and few-shot prompting—across three dimensions: coherence, diversity, and document-level faithfulness. For coherence and diversity, GPT-4 evaluation prompts were executed across three independent runs per configuration, and average scores were computed to account for the inherent variability in LLM outputs and ensure result reliability. Summary scores are presented in Table 3, with detailed per-run scores and human evaluation results reported in the subsections that follow. The document-level faithfulness results are reported separately in Section 3.2.3.

3.2.1. Coherence

LLM-based evaluation:
Both configurations achieved high coherence scores, with the few-shot configuration yielding a mean score of 4.9 and the zero-shot configuration achieving a score of 5.0. The evaluation was carried out in three separate runs, and the mean score for each approach was then calculated, as presented in Table 4.
Word intrusion task:
The following results are based on a word intrusion task in which annotators were asked to identify an intruder word inserted among the topics of each theme. Higher accuracy in identifying the intruder reflected stronger human-perceived coherence of the theme. Two complementary measures are reported: percentage agreement and Fleiss’ kappa reported as a supplementary reliability indicator. For the zero-shot configuration, the overall percentage agreement was 79.2%, with three clusters—Chronic Illness Experience, Healthcare Access and Costs, and Personal Growth and Acceptance—achieving perfect agreement (100%). The remaining clusters ranged from 55.6% to 77.8%, with Patient Experience and Treatment recording the lowest agreement at 55.6%, suggesting relatively weaker perceived coherence for this cluster. The inter-rater reliability for the zero-shot configuration, measured using Fleiss’ kappa, was $κ = 0.074$ , classified as slight agreement [39]. Per-cluster agreement rates are presented in Table 5.
For the few-shot configuration, the overall percentage agreement was 87.5%, with Mental Health and Emotional Well-being and Community and Family Support achieving perfect agreement (100%). The remaining clusters ranged from 66.7% to 88.9%, with Coping Strategies and Resilience recording the lowest agreement at 66.7%, suggesting that themes involving overlapping emotional and behavioral dimensions are inherently more difficult for annotators to distinguish, regardless of prompting strategy. The inter-rater reliability for the few-shot configuration, measured using Fleiss’ kappa, was $κ = - 0.016$ , classified as poor agreement [39]. As discussed in Section 2.4.1, this negative value is attributable to the prevalence effect [36,37] rather than genuine annotator disagreement, a statistical artifact that arises when one response category strongly dominates, inflating expected chance agreement and artificially suppressing kappa. Percentage agreement was therefore retained as the primary measure of human-perceived coherence. Per-cluster agreement rates are presented in Table 6.
Notably, the few-shot configuration produced no cluster below 66.7% agreement, whereas the zero-shot configuration recorded a minimum of 55.6% for Patient Experience and Treatment, indicating that few-shot prompting yields more consistently coherent themes across all clusters, not only on average.

3.2.2. Diversity

Both the zero-shot and few-shot configurations achieved a mean diversity score of 4.6, indicating that both prompt strategies produced a broad range of thematically distinct topic clusters within this experimental setting. The evaluation was conducted across three independent runs, and the average score was computed for each configuration, as presented in Table 7.

3.2.3. Document-Level Faithfulness

The document-level faithfulness evaluation results are summarized in Table 8, reported across two scopes: a stratified random sample of 126 posts and the full corpus.

Stratified Sample (n = 126). For the zero-shot configuration, Annotator 1 judged 66.7% of posts as faithful, Annotator 2 judged 77.8% as faithful, and GPT-4 judged 81.0% as faithful. For the few-shot configuration, the corresponding rates were 69.8%, 77.8%, and 81.0% respectively. The difference in individual faithful rates reflects variation in annotator strictness thresholds, rather than disagreement on specific posts. Inter-rater reliability between the two human annotators reached substantial agreement for both configurations (zero-shot:

κ

= 0.688; few-shot:

κ

= 0.756 [39]), confirming the reliability of the human labeling. Given this substantial agreement, the evaluation was extended to the full corpus using a single annotator, whose judgments were considered a reliable base for consensus annotation, compared against GPT-4 faithfulness judgments across the complete corpus. Human–LLM agreement on the stratified sample revealed moderate to substantial convergence. For the zero-shot configuration, Cohen’s

κ

between Annotator 1 and GPT-4 was 0.440 (moderate [39]), and between Annotator 2 and GPT-4, it was 0.661 (substantial). For the few-shot configuration, the corresponding values were

κ

= 0.411 (moderate) and

κ

= 0.613 (substantial).

Full Corpus (n = 494 zero-shot; n = 490 few-shot). Prior to analysis, posts assigned the label "Unrelated" during the theme assignment stage were excluded, yielding 494 posts for the zero-shot configuration and 490 for the few-shot configuration. Building on the substantial inter-rater agreement established on the stratified sample, Annotator 1 independently annotated the remaining posts. For the zero-shot configuration, Annotator 1 judged 82.6% of posts as faithful and GPT-4 judged 87.2% as faithful. For the few-shot configuration, the corresponding rates were 85.9% and 89.0%. Human–LLM agreement across the full corpus reached moderate agreement for both configurations (zero-shot:

κ

= 0.403; few-shot:

κ

= 0.545 [39]). Whilst the human–LLM kappa values on the stratified sample were comparable across configurations (Ann1 vs. GPT-4:

κ

= 0.440 zero-shot;

κ

= 0.411 few-shot), the full corpus results show a more pronounced difference in favor of few-shot, suggesting that the faithfulness advantage of few-shot prompting becomes more evident at scale.

3.3. LLM Reproducibility

To further examine the robustness of the prompt-based pipeline, repeated independent runs were conducted, and thematic stability was analyzed across configurations. The LLM pipeline underwent an iterative refinement phase to evaluate the sensitivity of the prompts to wording changes. Once the final prompt was established, repeated independent runs (n = 2) were performed to assess thematic consistency across both zero-shot and few-shot approaches, presented in Table A1 and Table A2 given in Appendix B. The comparison between the runs summarized in Table 9 indicates that the few-shot setting yielded a higher proportion of identical thematic matches at 50.0%—specifically in core areas such as “Treatment Management” and “Mental Health and Well-being”—compared to the 37.5% observed in the zero-shot approach. While both settings maintained a consistent 37.5% rate of semantic matches, the zero-shot configuration exhibited a greater frequency of thematic shifts (25.0%) compared to the few-shot approach (12.5%). These results suggest that providing the model with contextual examples effectively doubles the reduction of thematic drift, thereby enhancing the standardization of the thematic extraction process.

4. Discussion

This paper presents a preliminary and exploratory methodological investigation of prompt-driven topic modeling conducted using a single social media dataset and a single LLM configuration. The analysis was based on MS-related posts collected from Platform X and was intended to examine methodological behavior on short, noisy, health-related texts. Accordingly, the findings should be interpreted as indicative within this specific experimental setting and should not be construed as evidence of broad generalizability.

The first objective of this study was to empirically compare zero-shot and few-shot prompting strategies for topic modeling of short, noisy, health-related social media texts. The results indicate that both strategies are capable of producing coherent, diverse, and semantically meaningful themes, with each approach demonstrating distinct strengths. The few-shot approach achieved higher alignment with human judgment (87.5% versus 79.2%), greater thematic stability across repeated runs (50.0% identical themes versus 37.5%), and a lower frequency of thematic shifts (12.5% versus 25.0%), suggesting that contextual examples effectively reduce thematic drift and enhance the reproducibility of the extraction process. While zero-shot prompting yielded a higher LLM-based coherence score (5.0 versus 4.9), the difference is negligible in practical terms. Notably, the human word intrusion task revealed lower human-perceived coherence for the zero-shot configuration (79.2% versus 87.5%), suggesting that whilst zero-shot themes score marginally higher on automated metrics, they are less consistently perceived as coherent by human annotators—a distinction that underscores the value of complementing automated evaluation with human judgment. Both configurations achieved identical diversity scores (4.6/5.0), indicating that neither prompting strategy produced redundant or overlapping themes and, notably, that the inclusion of contextual examples did not impact the breadth of themes identified. Both approaches notably independently identified humor as a coping mechanism (Humor as a Coping Mechanism in zero-shot; Humor and Positivity in few-shot) and community and emotional support as central themes in the MS experience. These insights reflect aspects of lived patient experience—including emotional resilience, social connection, and the role of humor in navigating chronic illness—that are unlikely to surface through traditional word co-occurrence approaches such as LDA, which tend to produce symptom-focused or treatment-related topics. This shared capacity to uncover human-centered psychosocial themes, regardless of prompting strategy, represents a key strength of prompt-driven LLM pipelines for health-related social media analysis.

The second objective was to assess topic quality through a multi-layered evaluation framework examining coherence, diversity, and document-level faithfulness. By combining GPT-4-based automated scoring [30] with human word intrusion cross-validation and document-level faithfulness annotation, the framework provides both automated and human-grounded perspectives on thematic quality. The consistency between GPT-4-based coherence scores and human word intrusion agreement rates—few-shot achieved 87.5% human agreement versus 79.2% for zero-shot, with no cluster falling below 66.7% compared to a minimum of 55.6% for zero-shot—provides supplementary cross-validation of the observed coherence trends. The low Fleiss’ kappa values (zero-shot:

κ

= 0.074; few-shot:

κ

= −0.016) are attributable to the well-documented prevalence effect [36,37] rather than genuine annotator disagreement, and percentage agreement was therefore retained as the primary measure of human-perceived coherence. Both configurations achieved identical diversity scores (4.6/5.0) across three independent evaluation runs, confirming that both prompting strategies successfully identified a broad range of thematically distinct clusters with minimal semantic overlap.

Regarding document-level faithfulness, topic-level coherence does not guarantee that extracted themes are faithfully grounded in the source posts—LLM-generated summaries may represent fluent abstractions rather than direct reflections of the underlying content. A two-stage faithfulness evaluation was therefore conducted. On the stratified sample of 126 posts, substantial human–human agreement was established (zero-shot:

κ

= 0.688; few-shot:

κ

= 0.756 [39]), confirming the reliability of the human labeling and justifying single-annotator extension to the full corpus. Across the full corpus, Annotator 1 judged 82.6% of zero-shot posts and 85.9% of few-shot posts as faithful, with GPT-4 judging 87.2% and 89.0% as faithful, respectively, both marginally higher for few-shot configuration. Human–LLM agreement reached moderate levels for both configurations (

κ

= 0.403 zero-shot;

κ

= 0.545 few-shot [39]), with the few-shot configuration showing stronger faithfulness, providing further evidence that contextual examples improve the semantic grounding of extracted themes in addition to their coherence and stability. Notably, whilst human–LLM agreement on the stratified sample was comparable across configurations (Ann1 vs GPT-4:

κ

= 0.440 zero-shot;

κ

= 0.411 few-shot), the full corpus revealed a more pronounced difference (

κ

= 0.403 versus

κ

= 0.545), suggesting that the semantic grounding advantage of few-shot prompting becomes more discernible at scale, a finding that warrants further investigation in future work with larger datasets.

Regarding computational cost, the pipeline was implemented using GPT-4o-mini for topic modeling and GPT-4 for evaluation. GPT-4o-mini was selected for its favorable performance-to-cost ratio relative to larger models. The topic modeling stages—including topic extraction, semantic grouping, and theme assignment across both configurations—incurred an estimated cost of under USD 0.10 in total. The evaluation stage represented the primary cost component, comprising GPT-4-based coherence and diversity scoring (three runs per configuration, estimated at approximately USD 0.61) and GPT-4-based faithfulness evaluation across the full corpus of 504 posts for both configurations (estimated at approximately USD 0.15), yielding a total estimated pipeline cost of approximately USD 0.86. These figures are estimates derived from the dataset size and pipeline structure, as per-run usage logs were not retained. Whether these cost characteristics remain favorable at substantially larger corpus scales is an open empirical question that future work should address systematically. It is noted that this estimate does not account for the time investment associated with prompt design, iterative refinement, and human annotation, which represented non-trivial components of the overall development effort.

Limitations

Several limitations should be considered when interpreting the findings of this paper.

Single dataset and LLM configuration: The analysis was conducted on a single social media dataset comprising 504 posts using one LLM configuration (GPT-4o-mini, temperature = 0.0). The reported findings are therefore indicative within this specific experimental setting and should not be generalized across datasets, domains, languages, or model architectures without further empirical validation.
Language and platform scope: The dataset comprised English-language posts collected from a single platform (Platform X). The findings may not generalize to other languages, platforms, or communication styles, and extending the pipeline to multilingual and multi-platform datasets remains a direction for future work.

5. Conclusions and Future Work

This paper presented a preliminary and exploratory empirical investigation of a prompt-driven LLM pipeline for topic modeling of 504 multiple sclerosis-related posts from Platform X. Two in-context learning strategies—zero-shot and few-shot prompting using GPT-4o-mini—were compared through a multi-layered evaluation framework that examined coherence, diversity, and document-level faithfulness, combining GPT-4-based automated scoring with human validation. Within this experimental setting, few-shot prompting demonstrated stronger overall performance, achieving higher human agreement (87.5% versus 79.2%), greater thematic stability across repeated runs, and stronger faithfulness. While zero-shot prompting yielded a marginally higher automated coherence score (5.0/5.0), human evaluation revealed lower perceived coherence (79.2%), underscoring the value of complementing automated metrics with human judgment. Both configurations achieved identical diversity scores (4.6/5.0) and independently uncovered psychosocial dimensions, including humor as a coping mechanism and community support, that extend beyond clinical data and are unlikely to surface through traditional word co-occurrence approaches. Building on these findings, several directions for future work are identified. First, open-source or alternative-family evaluator models should be explored. Second, the pipeline should be extended to multilingual and multi-platform datasets to evaluate generalizability across diverse patient populations. Finally, integrating topic modeling with sentiment analysis and expert clinical feedback will deepen the interpretability and practical relevance of the extracted themes, informing more empathetic and data-driven healthcare strategies.

Author Contributions

Methodology, Y.A., A.B. and O.A.; formal analysis, Y.A.; supervision, A.B. and O.A. All authors have read and agreed to the published version of the manuscript.

Funding

The project was funded by KAU Endowment (WAQF) at King Abdulaziz University, Jeddah, Saudi Arabia. The authors, therefore, acknowledge with thanks WAQF and the Deanship of Scientific Research (DSR) for technical and financial support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this paper are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Listing A1. Task 1: LLM prompting (zero-shot prompt).

You are an expert data analyst. Let’s analyze the following post step-by-step to identify the most relevant topics.
post: ’p’
First, let’s break down the post into its key components and ideas. What are the main subjects or themes mentioned? Think carefully about the context and any implied meanings.
Now, based on these identified components, what specific topics accurately represent the content of the post? Consider the relationships between the different themes and how they connect to broader categories.
Finally, provide a list of the most relevant topics, formatted as: [topic1, topic2 (if exist), topic 3 (if exist)]. The number of topics should be determined by the model based on the complexity of the post. If after careful consideration, you determine that no relevant topics can be identified, or if the post lacks coherent themes, assign “NA”.
Let’s think step by step.

Listing A2. Task 1: LLM prompting (examples for few-shot prompt).

..
..
..
..
Let’s think step by step.
Here are some examples that might help:
Example 1: Post: I have been taking Ocrevus for 2 years now and I have no complaints Topics: treatment plan, treatment side effects, Ocrevus
Example 2: Post: I just had my first MRI after I started Ocrevus, and it showed no new lesions, but I still feel like my symptoms are getting worse. I’m confused.
Topics: MRI results, brain lesions, treatment effectiveness
Example 3: Post: I don’t think I’d ever stop taking a dmt. But after talking it over with my neurologist for several years, I decided to switch from Ocrevus to Kesimpta this October.
Topics: medication plan
Example 4: Post: I’m really heartbroken to share that I’ve been diagnosed with multiple sclerosis and that it’s only expected to progress.
Topics: Disability
Example 5: Post: I took four naps today and I’m still wiped out.
Topics: Low power, fatigue

Listing A3. Task 3: clustering prompt.

You are an expert in semantic topic clustering and thematic analysis of social media posts. Your task is to analyze a list of posts and, if provided, related topics, and group them into coherent clusters (Themes) based on their underlying semantic relationships, including identifying hidden or less obvious themes. Given the following list of posts: {Posts} and the related Topics: {topics}.
**Instructions:**
1. **Analyze the Posts and Topics:** Carefully read and understand the content of each post provided, and the corresponding list of topics that were previously extracted from each post. Consider the precise meaning of each topic, its potential connotations, and how they might relate to one another. Ensure that each topic is analyzed independently before comparing it to others.
2. **Structured Chain of Thought Reasoning:** As you analyze each topic, articulate your thought process in a structured, step-by-step manner.
3. **Rigorous Hidden Theme Discovery:** Actively search for themes that may not be immediately apparent, employing a systematic approach to identify overarching concepts or narratives that connect seemingly disparate topics. Document any implicit or latent connections you observe.
4. **Generate Standardized Cluster Names:** Use clear, concise, and standardized language for each cluster name (Theme). Aim for consistency in terminology across multiple runs. Prioritize commonly understood terms and ensuring minimal overlap between clusters. Strive for exhaustive coverage of all topics.
5. **Generate Representative Cluster Names:** For each cluster, create a concise and descriptive name that accurately reflects the included topics. Use the most representative topic or a general term that encapsulates the cluster’s content.
**Output Format:**
Please provide the clusters in the following format, including your chain of thought:
**Chain of Thought:**
* “Analyzing topic [topic1], I notice it shares [semantic connection] with [topic2]. This suggests a potential theme related to [theme].” * “Considering [topic3], its implication of [implication] strongly aligns with [theme].” * “I observe a less obvious theme connecting [topic4] and [topic5] through [hidden connection].”
**Cluster Output:**
Cluster Name 1: [topic1, topic2, …], Justification: [Detailed explanation using chain of thought] Cluster Name 2: [topic3, topic4, …], Justification: [Detailed explanation using chain of thought]
**Example Output:**
Cluster Name 1: **Chronic Illness Experience**: [chronic illness, multiple sclerosis (ms), chronic pain], Justification: This cluster encapsulates the shared experiences and challenges faced by individuals living with chronic conditions, particularly focusing on multiple sclerosis and the associated pain.
**Constraints:**
* Ensure that each cluster name is clear, concise, and representative of the grouped topics. * Ensure the reproducability of the clusters names. * Prioritize the identification of hidden themes and deeper semantic relationships. * Aim for thematic coherence and meaningful groupings. * Use the chain-of-thought process to thoroughly justify each cluster assignment.

Listing A4. Task 4: theme assignment to post prompt.

prompt = You are a highly skilled expert in social media content analysis and semantic similarity matching. Your task is to accurately categorize a given post based on its semantic alignment with a *strictly predefined* theme list. **Instructions:** 1. **Analyze Post Content:** Carefully read and understand the content of the provided post. 2. **Understand Themes list:** Carefully read and understand the content of the Themes list(clusters). 3. **Determine Semantic Similarity between the post and the appropiate theme within the given (predifined) theme List:** For each post assess the semantic similarity between the post’s content and the Theme’s meaning. *Crucially, your selection MUST be made ONLY from the themes present in the provided listThemes.* 4. **Assign Theme with Highest Semantic Similarity (From Predefined List ONLY):** Select the SINGLE Theme from Themes list that exhibits the ABSOLUTE HIGHEST degree of semantic similarity with the post’s content. *You MUST NOT generate or select themes outside of this list.* 5. **Handle Unrelated Cases (Within Predefined List):** If none of the major topics in Themes list are a good semantic fit for the post, assign the topic “Unrelated”. **Crucial Constraints:** * **STRICT ADHERENCE TO PREDEFINED LIST:** * You MUST ONLY select topics from the Themes list. Generating or selecting themes outside of this list is strictly prohibited.* * **Semantic Focus:** Prioritize semantic similarity over simple keyword matching. * **No Explanations:** Do not include any additional explanations or formatting when returning the assigned themes. * **Single Theme Output (Highest Similarity):** Return ONLY ONE Theme, the one with the HIGHEST semantic similarity. * **Handle Unrelated:** if no topic from the predfined Theme list is appropriate please assign “Unrelated” as a topic. * **If unable to complete the task, return “Unable to fulfill request”** **Provided post:** post **Themes list (STRICTLY USE ONLY THESE):** Themes **Begin Analysis:**

Listing A5. Coherence prompt.

Given a set of themes and their associated topic words in {Clusters}
1. **Assess Semantic Relatedness:** Critically evaluate the semantic relatedness of the keywords and phrases within each individual cluster.Consider the following questions:
* Do the terms within a cluster share a common underlying concept or theme?
* Are there any terms that appear to be outliers or do not logically fit with the other keywords in the cluster?
2. **Provide a Coherence Score (Quantitative):** derive a rough quantitative score on a scale of 1 to 5 that represents the *average* coherence of the clusters within that set.
* **Scoring Guidelines:**
* **5 (Near-Perfect Coherence):** All clusters within the set exhibit strong internal semantic relatedness, with keywords clearly belonging together.
* **4 (Good Coherence):** Most clusters demonstrate good internal semantic relatedness.
* **3 (Moderate Coherence):** Some clusters show reasonable internal relatedness, while others might contain more ambiguous or less relevant keywords.
* **2 (Low Coherence):** Several clusters exhibit weak internal semantic relatedness, with multiple keywords appearing out of place.
* **1 (Significant Incoherence):** Many or all clusters demonstrate very poor internal semantic relatedness, with keywords appearing largely unrelated.
*Output Format:**
Structure your evaluation as follows: Overall Coherence Score (1–5): [OVERALL_AVERAGE_COHERENCE_SCORE] (a numerical rating)

Listing A6. Diversity prompt.

Given a set of themes and their associated topic words in {Clusters}
1. **Assess Cluster Distinctiveness: Evaluate the thematic diversity represented by the set of clusters. Consider the following questions:
* Are the central concepts of the different clusters within the same set clearly differentiated?
* Does the collection of clusters within a set capture a broad range of distinct themes or are they focused on a narrow set of related concepts
2. **Provide a Diversity Score (Quantitative):** derive a rough quantitative score on a scale of 1 to 5 that represents the thematic diversity of the clusters within that set.
* **Scoring Guidelines:**
* **5 (High Diversity):** The clusters within the set represent highly distinct and varied themes with minimal semantic overlap. The set captures a broad spectrum of concepts.
**4 (Good Diversity):** Most clusters within the set represent distinct themes with generally low semantic overlap. The set covers a reasonably broad range of concepts.
* **3 (Moderate Diversity):** Some clusters represent distinct themes, while others may exhibit noticeable semantic overlap or the set might focus on a moderately narrow range of concepts.
* **2 (Low Diversity):** Several clusters show significant semantic overlap, indicating limited thematic variety. The set likely focuses on a narrow range of interconnected concepts.
* **1 (Very Low Diversity):** The clusters within the set are largely redundant, with substantial semantic overlap, indicating a lack of distinct themes. The set focuses on a very narrow or even singular concept.
**Output Format:**
Structure your evaluation as follows: Overall Diversity Score (1–5): [OVERALL_AVERAGE_DIVERSITY_SCORE] (a numerical rating)

Listing A7. Document-level faithfulness prompt.

You are an expert in media data analysis. Your task is to evaluate the faithfulness of a given ’Assigned Theme’ to the content of a ’post’.
**Post:** “post”
**Assigned Theme:** “assigned_theme”
**Instructions:**
1. Determine if the ’Assigned Theme’ accurately reflects the primary subject matter or core message of the ’post’.
2. Consider if the theme captures the essence of what the post is primarily discussing.
3. Output ONLY one of the following labels: ’faithful’ or ’not faithful’.
**Definition of Faithfulness:**
* **faithful:** The assigned theme clearly and accurately represents the main topic, or experience conveyed in the post.
* **not faithful:** The assigned theme is either irrelevant, too broad, too narrow, or misrepresents the primary content of the post.

Appendix B

Table A1. Thematic stability and semantic alignment: zero-shot approach (runs 1 and 2).

Theme (Run 1)	Theme (Run 2)	Stability Status
Chronic Illness Experience	Chronic Illness Experience	Identical
Mental Health and Resilience	Mental Health and Resilience	Identical
Community and Family Support	Community and Support Systems	Semantic Match
Financial Aspects	Healthcare Access and Costs	Semantic Match
Humor as Coping	Humor as Coping	Identical
Personal Growth and Motivation	Personal Growth and Acceptance	Semantic Match
Self-Care and Well-being	Daily Life Challenges	Thematic Shift
Awareness and Advocacy	Patient Exp. and Treatment	Thematic Shift

Table A2. Thematic stability and semantic alignment: few-shot approach (runs 1 and 2).

Theme (Run 1)	Theme (Run 2)	Stability Status
Chronic Illness Experience	Chronic Illness Experience	Identical
Mental Health and Well-being	Mental Health and Well-being	Identical
Community and Family Support	Community and Family Support	Identical
Treatment Management	Treatment Management	Identical
Coping and Resilience	Coping Strategies and Resilience	Semantic Match
Economic Impact of Illness	Healthcare Access and Challenges	Semantic Match
Finding Joy in Struggles	Humor and Positivity	Semantic Match
Healthcare Access	Physical Limitations and Disability	Thematic Shift

References

Doshi, A.; Chataway, J. Multiple sclerosis, a treatable disease. Clin. Med. 2017, 17, 530–536. [Google Scholar] [CrossRef] [PubMed]
Walton, C.; King, R.; Rechtman, L.; Kaye, W.; Leray, E.; Marrie, R.; Robertson, N.; Rocca, N.L.; Uitdehaag, B.; van der Mei, I.A.; et al. Rising prevalence of multiple sclerosis worldwide: Insights from the Atlas of MS, third edition. Mult. Scler. 2020, 26, 1816–1821. [Google Scholar] [CrossRef] [PubMed]
Tahernia, H.; Esnaasharieh, F.; Amani, H.; Milanifard, M.; Mirakhori, F. Diagnosis and Treatment of MS in Patients Suffering from Various Degrees of the Disease with a Clinical Approach: The Orginal Article. J. Pharm. Negat. Results 2022, 13, 1908–1921. [Google Scholar]
Topcu, G.; Mhizha-Murira, J.R.; Griffiths, H.; Bale, C.; Drummond, A.; Fitzsimmons, D.; Potter, K.J.; Evangelou, N.; das Nair, R. Experiences of receiving a diagnosis of multiple sclerosis: A meta-synthesis of qualitative studies. Disabil. Rehabil. 2023, 45, 772–783. [Google Scholar] [CrossRef]
Amato, M.; Ponziani, G.; Rossi, F.; Liedl, C.; Stefanile, C.; Rossi, L. Quality of life in multiple sclerosis: The impact of depression, fatigue and disability. Mult. Scler. 2001, 7, 340–344. [Google Scholar] [CrossRef]
Pompili, M.; Forte, A.; Palermo, M.; Stefani, H.; Lamis, D.A.; Serafini, G.; Amore, M.; Girardi, P. Suicide risk in multiple sclerosis: A systematic review of current literature. J. Psychosom. Res. 2012, 73, 411–417. [Google Scholar] [CrossRef]
Eizaguirre, M.; Yastremiz, C.; Ciufia, N.; Roman, M.S.; Alonso, R.; Silva, B.A.; Garcea, O.; Cáceres, F.; Vanotti, S. Relevance and Impact of Social Support on Quality of Life for Persons With Multiple Sclerosis. Int. J. MS Care 2022, 25 3, 99–103. [Google Scholar] [CrossRef]
Churchill, R.; Singh, L. The Evolution of Topic Modeling. ACM Comput. Surv. 2022, 54, 215. [Google Scholar] [CrossRef]
Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan Yousef, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2022, 112, 102131. [Google Scholar] [CrossRef]
Corti, L.; Zanetti, M.; Tricella, G.; Bonati, M. Social media analysis of Twitter tweets related to ASD in 2019–2020, with particular attention to COVID-19: Topic modelling and sentiment analysis. J. Big Data 2022, 9, 113. [Google Scholar] [CrossRef] [PubMed]
Lyu, J.C.; Han, E.L.; Luli, G.K. COVID-19 Vaccine-Related Discussion on Twitter: Topic Modeling and Sentiment Analysis. J. Med. Internet Res. 2021, 23, e24435. [Google Scholar] [CrossRef]
Gabarron, E.; Dorronzoro, E.; Reichenpfader, D.; Denecke, K. What do autistic people discuss on Twitter? An approach using BERTopic modelling. In Caring is Sharing – Exploiting the Value in Data for Health and Innovation, Proceedings of MIE 2023; SAGE Publications: Thousand Oaks, CA, USA, 2023; Volume 302, pp. 403–407. [Google Scholar] [CrossRef]
Min, S.; Han, J. Topic Modeling Analysis of Diabetes-Related Health Information during the Coronavirus Disease Pandemic. Healthcare 2023, 11, 1871. [Google Scholar] [CrossRef]
Shankar, R.; Yip, A.W. Sentiment analysis and topic modeling of social media data to explore public discourse on irritable bowel syndrome. Sci. Rep. 2025, 15, 21550. [Google Scholar] [CrossRef]
Prasad, A.; Shalmani, S.A.; He, L.; Wang, Y.; McRoy, S. Identifying Themes in Social Media Discussions of Eating Disorders: A Quantitative Analysis of How Meaningful Guidance and Examples Improve LLM Classification. BioMedInformatics 2025, 5, 40. [Google Scholar] [CrossRef]
Giunti, G.; Claes, M.; Zubiete, E.; Rivera, O.; Gabarron, E. Analysing Sentiment and Topics Related to Multiple Sclerosis on Twitter. In Digital Personalized Health and Medicine; IOS Press: Amsterdam, The Netherlands, 2020; Volume 270, pp. 911–915. [Google Scholar] [CrossRef]
Haag, C.; Steinemann, N.; Ajdacic-Gross, V.; Schlomberg, J.; Ineichen, B.; Stanikić, M.; Dressel, H.; Daniore, P.; Roth, P.; Ammann, S.; et al. Natural language processing analysis of the theories of people with multiple sclerosis about causes of their disease. Commun. Med. 2024, 4, 122. [Google Scholar] [CrossRef]
Ahmed, M.; Tiun, S.; Omar, N.; Sani, N. Short Text Clustering Algorithms, Application and Challenges: A Survey. Appl. Sci. 2023, 13, 342. [Google Scholar] [CrossRef]
Akash, P.S.; Chang, K.C.C. Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs. arXiv 2024, arXiv:2410.03071. [Google Scholar] [CrossRef]
Egger, R.; Yu, J. A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Front. Sociol. 2022, 7, 886498. [Google Scholar] [CrossRef]
Gupta, P.; Ding, B.; Guan, C.; Ding, D. Generative AI: A systematic review using topic modelling techniques. Data Inf. Manag. 2024, 8, 100066. [Google Scholar] [CrossRef]
Janssens, W.; Bogaert, M.; den Poel, D.V. LLM-Assisted Topic Reduction for BERTopic on Social Media Data. arXiv 2025, arXiv:2509.19365. [Google Scholar] [CrossRef]
Wang, H.; Prakash, N.; Hoang, N.; Hee, M.S.; Naseem, U.; Lee, R.K.W. Prompting Large Language Models for Topic Modeling. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 1236–1241. [Google Scholar] [CrossRef]
He, W.; Hou, B.; Zheng, A.; Feng, Y.; Klein, A.; Oconnor, K.; Yang, S.; Shang, T.; Demiris, G.; Gonzalez, G.; et al. Advanced topic modeling with large language models: Analyzing social media content from dementia caregivers. Innov. Aging 2025, 9, S38–S47. [Google Scholar] [CrossRef]
Doi, T.; Isonuma, M.; Yanaka, H. Topic Modeling for Short Texts with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop); Fu, X., Fleisig, E., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 21–33. [Google Scholar] [CrossRef]
Mu, Y.; Dong, C.; Bontcheva, K.; Song, X. Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling. arXiv 2024, arXiv:2403.16248. [Google Scholar] [CrossRef]
Guo, Y.; Liu, J.; Wang, L.; Qin, W.; Hao, S.; Hong, R. A Prompt-Based Topic-Modeling Method for Depression Detection on Low-Resource Data. IEEE Trans. Comput. Soc. Syst. 2024, 11, 1430–1439. [Google Scholar] [CrossRef]
Newman, D.; Lau, J.H.; Grieser, K.; Baldwin, T. Automatic Evaluation of Topic Coherence. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics; Kaplan, R., Burstein, J., Harper, M., Penn, G., Eds.; Association for Computational Linguistics: Los Angeles, CA, USA, 2010; pp. 100–108. [Google Scholar]
Hoyle, A.; Goel, P.; Peskov, D.; Hian-Cheong, A.; Boyd-Graber, J.; Resnik, P. Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence. arXiv 2021, arXiv:2107.02173. [Google Scholar] [CrossRef]
Stammbach, D.; Zouhar, V.; Hoyle, A.; Sachan, M.; Ash, E. Revisiting Automated Topic Model Evaluation with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 9348–9357. [Google Scholar] [CrossRef]
Alansari, A.; Luqman, H. Large Language Models Hallucination: A Comprehensive Survey. arXiv 2026, arXiv:2510.06265. [Google Scholar] [CrossRef]
Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv 2023, arXiv:2205.11916. [Google Scholar] [CrossRef]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903. [Google Scholar] [CrossRef]
Chang, J.; Boyd-Graber, J.; Gerrish, S.; Wang, C.; Blei, D. Reading Tea Leaves: How Humans Interpret Topic Models. Adv. Neural Inf. Process. Syst. 2009, 22, 288–296. [Google Scholar]
Artstein, R.; Poesio, M. Survey Article: Inter-Coder Agreement for Computational Linguistics. Comput. Linguist. 2008, 34, 555–596. [Google Scholar] [CrossRef]
Feinstein, A.; Cicchetti, D.; Feinstein, A.R. Cicchetti DVHigh agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol. 1990, 43, 543–549. [Google Scholar] [CrossRef]
Dettori, J.R.; Norvell, D.C. Kappa and Beyond: Is There Agreement? Glob. Spine J. 2020, 10, 499–501. [Google Scholar] [CrossRef] [PubMed]
Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef]

Figure 1. Research methodology, comprising four stages: data collection, pre-processing, prompt-based topic modeling, and multi-layered evaluation encompassing coherence, diversity, and document-level faithfulness.

Figure 2. Prompt-based topic modeling shared pipeline.

Table 1. Extracted themes and comprising topics: zero-shot prompting.

#	Theme Name	Comprised Topics
1	Chronic Illness Experience	chronic illness, multiple sclerosis (ms), chronic pain, chronic fatigue
2	Mental Health and Resilience	mental health, emotional well-being, resilience, coping mechanisms
3	Patient Experience and Treatment	medication, treatment efficacy, patient experience
4	Community and Support Systems	community support, family dynamics, support systems, emotional support
5	Healthcare Access and Costs	healthcare access, healthcare costs, insurance coverage
6	Humor as a Coping Mechanism	humor, emotional expression
7	Daily Life Challenges	fatigue, daily life challenges, symptoms
8	Personal Growth and Acceptance	personal growth, self-acceptance

Table 2. Extracted themes and comprised topics: few-shot prompting.

#	Theme Name	Comprising Topics
1	Chronic Illness Experience	chronic illness, multiple sclerosis (ms), chronic pain, symptoms, symptom management, physical symptoms
2	Mental Health and Emotional Well-being	mental health, emotional impact, emotional resilience, emotional well-being
3	Community and Family Support	community support, family support, support systems
4	Treatment Management	treatment effectiveness, treatment options, medication management, treatment plan
5	Coping Strategies and Resilience	coping mechanisms, self-care, resilience, emotional well-being
6	Physical Limitations and Disability	disability, mobility challenges, physical limitations
7	Humor and Positivity	humor, positivity, positive mindset
8	Healthcare Access and Systemic Challenges	healthcare access, insurance coverage, healthcare costs

Table 3. Evaluation scores across prompt-based configurations.

Method	Coherence (LLM-Based)	Human Agreement	Diversity (LLM-Based)
Zero-Shot Prompting	5.0	79.2%	4.6
Few-Shot Prompting	4.9	87.5%	4.6

Table 4. GPT-4 coherence evaluation scores across three runs.

Approach	Run 1	Run 2	Run 3	Average
Zero-shot	5.0	5.0	5.0	5.0
Few-shot	4.9	4.8	5.0	4.9

Table 5. Per-cluster human agreement rates: zero-shot prompting.

Cluster	Theme Name	Correct (n/9)	Agreement (%)
1	Chronic Illness Experience	9/9	100.0
2	Mental Health and Resilience	7/9	77.8
3	Patient Experience and Treatment	5/9	55.6
4	Community and Support Systems	6/9	66.7
5	Healthcare Access and Costs	9/9	100.0
6	Humor as a Coping Mechanism	6/9	66.7
7	Daily Life Challenges	6/9	66.7
8	Personal Growth and Acceptance	9/9	100.0
Overall			79.2%
Fleiss’ $κ$			0.074 (Slight) [39]

Table 6. Per-cluster human agreement rates: few-shot prompting.

Cluster	Theme Name	Correct (n/9)	Agreement (%)
1	Chronic Illness Experience	8/9	88.9
2	Mental Health and Emotional Well-being	9/9	100.0
3	Community and Family Support	9/9	100.0
4	Treatment Management	8/9	88.9
5	Coping Strategies and Resilience	6/9	66.7
6	Physical Limitations and Disability	7/9	77.8
7	Humor and Positivity	8/9	88.9
8	Healthcare Access and Systemic Challenges	8/9	88.9
Overall			87.5%
Fleiss’ $κ$			−0.016 (Poor) [39]

Table 7. GPT-4 diversity evaluation scores across three runs.

Approach	Run 1	Run 2	Run 3	Average
Zero-shot	4	5	5	4.6
Few-shot	5	5	4	4.6

Table 8. Document-level faithfulness evaluation results.

Scope	Rater/Pair	Zero-Shot		Few-Shot
Scope	Rater/Pair	Faithful Rate (%)	Cohen’s $κ$	Faithful Rate (%)	Cohen’s $κ$
Stratified Sample (n = 126)	Ann1	66.7	–	69.8	–
	Ann2	77.8	–	77.8	–
	GPT-4	81.0	–	81.0	–
	Ann1 vs. Ann2	–	0.688	–	0.756
	Ann1 vs. GPT-4	–	0.440	–	0.411
	Ann2 vs. GPT-4	–	0.661	–	0.613
Full Corpus (n = 494/490)	Ann1	82.6	–	85.9	–
	GPT-4	87.2	–	89.0	–
	Ann1 vs. GPT-4	–	0.403	–	0.545

Kappa interpretation thresholds follow Landis and Koch [39]. Full corpus: n = 494 evaluable posts for zero-shot; n = 490 for few-shot, following exclusion of unrelated assignments. Ann2 was not included in the full corpus evaluation; single-annotator extension was justified by the substantial human–human agreement established on the stratified sample.

Table 9. Thematic stability and semantic alignment comparison.

Approach	Identical	Semantic Match	Thematic Shift	Total Themes
Zero-Shot (Table A1)	37.5% (3)	37.5% (3)	25.0% (2)	8
Few-Shot (Table A2)	50.0% (4)	37.5% (3)	12.5% (1)	8

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Alamoudi, Y.; Babour, A.; Almatrafi, O. Prompt-Driven LLM Pipeline for Topic Modeling of Multiple Sclerosis Social Media Discussions. Appl. Sci. 2026, 16, 5316. https://doi.org/10.3390/app16115316

AMA Style

Alamoudi Y, Babour A, Almatrafi O. Prompt-Driven LLM Pipeline for Topic Modeling of Multiple Sclerosis Social Media Discussions. Applied Sciences. 2026; 16(11):5316. https://doi.org/10.3390/app16115316

Chicago/Turabian Style

Alamoudi, Yasmeen, Amal Babour, and Omaima Almatrafi. 2026. "Prompt-Driven LLM Pipeline for Topic Modeling of Multiple Sclerosis Social Media Discussions" Applied Sciences 16, no. 11: 5316. https://doi.org/10.3390/app16115316

APA Style

Alamoudi, Y., Babour, A., & Almatrafi, O. (2026). Prompt-Driven LLM Pipeline for Topic Modeling of Multiple Sclerosis Social Media Discussions. Applied Sciences, 16(11), 5316. https://doi.org/10.3390/app16115316

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Prompt-Driven LLM Pipeline for Topic Modeling of Multiple Sclerosis Social Media Discussions

Abstract

1. Introduction

2. Materials and Methods

2.1. Data Collection

2.2. Data Pre-Processing

2.3. Prompt-Based Topic Modeling

2.4. Evaluation

2.4.1. Topic Coherence Evaluation

2.4.2. Topic Diversity Evaluation

2.4.3. Document-Level Faithfulness Evaluation

3. Results

3.1. Extracted Themes

3.1.1. Themes Identified via Zero-Shot Prompting

3.1.2. Themes Identified via Few-Shot Prompting

3.2. Topic Modeling Evaluation

3.2.1. Coherence

3.2.2. Diversity

3.2.3. Document-Level Faithfulness

3.3. LLM Reproducibility

4. Discussion

Limitations

5. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A

Appendix B

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI