Article

TR-GPT-CF: A Topic Refinement Method Using GPT and Coherence Filtering

by Ika Widiastuti * and Hwan-Seung Yong *
Department of Computer Science and Engineering, Ewha Womans University, Seoul 03760, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(4), 1962; https://doi.org/10.3390/app15041962
Submission received: 4 January 2025 / Revised: 10 February 2025 / Accepted: 11 February 2025 / Published: 13 February 2025

Abstract

Traditional topic models are effective at uncovering patterns within large text corpora but often struggle with capturing the contextual nuances necessary for meaningful interpretation. As a result, these models may produce incoherent topics, making it challenging to achieve consistency and clarity in topic interpretation—limitations that hinder their utility for real-world applications requiring reliable insights. To overcome these challenges, we introduce a novel post-extraction topic refinement approach that uses Z-score centroid-based misaligned word detection and hybrid semantic–contextual word replacement with WordNet and GPT to replace misaligned words within topics. Evaluations across multiple datasets reveal that our approach significantly enhances topic coherence, providing a robust solution for more interpretable and semantically coherent topics.

1. Introduction and State of the Art

Topic modeling is an area of natural language processing (NLP) that employs statistical techniques to identify hidden topics or themes in documents [1]. It is widely utilized in various disciplines to aid in the extraction of patterns from large quantities of texts and documents [2]. As a result of its ability to assist the processing of enormous amounts of text data, topic modeling is a beneficial tool in a variety of sectors, including social media analysis, market research, and healthcare [3]. This versatility across domains is one of the most notable advantages of topic modeling [4]. A vast amount of user-generated content is available on the internet and social media, which can be difficult to examine [5]. Topic modeling enables businesses to better understand consumer preferences, opinions, and perceptions of their product by extracting topics of interest for a brand or company [6]. Topic modeling also helps substantially in market research and advertising campaigns since it enables businesses to discover consumer behavioral patterns [7], identify new trends, and customize consumer messaging to certain target markets [8].
In healthcare, topic modeling has been employed to facilitate text classification, where clinical notes are analyzed as collections of topics to enhance the categorization process [9]. Building on existing research findings [10], topic modeling approaches were used to examine and categorize the various public views and feelings on statins. This study identified and organized common themes and topics that arose during statin debates. Key topics identified included side effects, dietary interactions, hesitancy regarding statin use, and the perception of bias in the pharmaceutical sector. The extensive analysis offered by [10] helps illuminate the elements that influence public acceptance and resistance to statins, delivering crucial insights into community health dynamics.
For these applications of topic modeling, the coherence and interpretability of topics are crucial [11]. In the text classification of electronic health records, topic models enable the selection of topics as features for predictive tasks, which increases the interpretability of these classification models [12]. This enhanced interpretability is critical for the clinical domain, particularly for decision making [9]. Effective applications in information retrieval rely on a comprehensive understanding of the result derived from topic modeling [13,14].
The primary topic modeling technique utilized in these applications is Latent Dirichlet Allocation (LDA) [15,16]. LDA is a strong statistical model that identifies latent themes within a corpus by assuming that documents are mixtures of topics and topics are mixtures of words [17,18]. This method provides a brief overview of the document collection as well as guided exploration of the identified topics. A topic is intuitively characterized by a collection of words. Different topics may share words, and a document might be associated with multiple topics. A user can explore the themes to gain an overview of the corpus without reading all the documents and can focus on a specific topic by exploring only the texts that are closely related to it. For instance, a topic with the words “computer”, “model”, “algorithm”, “data”, and “mathematical” may be correlated with documents regarding computation [19,20].
In addition to LDA, numerous additional topic modeling methodologies are commonly employed, each possessing distinct characteristics [21]. Non-negative Matrix Factorization (NMF) provides a linear algebraic method that generally yields more interpretable topics by restricting topic weights to non-negative values [22]. The Correlated Topic Model (CTM) [23,24] enhances LDA by permitting correlations among topics, thus providing a more accurate representation of real-world data characterized by overlapping themes. Furthermore, BERTopic utilizes advanced language models such as BERT [25] to improve the semantic comprehension of text, yielding highly coherent and contextually pertinent topics, thus rendering it exceptionally successful for intricate datasets with nuanced language.
However, despite their wide range of applications, these topic modeling systems frequently suffer from low coherence and inconsistent topic generation [26]. These issues are especially evident in traditional models, which have several major flaws. First, the themes provided frequently lack coherence, making it difficult for users to derive significant conclusions [27]. Additionally, executing a single model numerous times with the same input document can produce different topics [28]. Furthermore, these models fail to capture the contextual nuance of language, which is critical for comprehending the deeper meanings inherent in texts [29]. Finally, the restrictions in interpretability for real-world applications reduce their usefulness, as stakeholders want clear and actionable outputs in order to make informed decisions [9].
These issues could raise problems such as semantic irrelevance, where topics contain words that do not meaningfully relate [30]; lack of thematic unity, where topics are broad or ambiguous and fail to represent a coherent subject [31]; and the inclusion of outliers, where irrelevant or rare words skew topic interpretation. This issue is especially prevalent in models like LDA, where the number of topics must be predefined, and improper specification can lead to either simplistic structures or excessive topic fragmentation. Prior work has demonstrated that inadequate topic selection can result in uninformative and duplicated topics, allowing outlier words to influence topic meaning [32]. Additionally, the identified topics may not necessarily correlate with human evaluations of topic quality and may be perceived as poor from the point of view of an end user [27,33].
Numerous studies [34,35,36,37,38] have been performed in order to overcome these problems, with a substantial emphasis placed on the process of refining the topics that are generated as a result of topic modeling methodologies. Among the five studies, refs. [35,38] are model agnostic, which means that they do not rely on any one topic modeling technique. This model-agnostic feature enables these methods to be versatile and applicable across various contexts utilizing different topic modeling techniques.
Study [34] explores how non-expert users engage with and refine topic models, revealing the gap between user requirements and the capabilities of present interactive topic modeling systems. To learn how non-experts assess, interpret, and update topic models, the researchers performed two user studies—an in-person interview and an online crowdsourced study. They identified numerous topic model modifications that users wanted. Users often sought to add words to explain and accentuate a topic’s theme. To improve topic clarity and relevancy, irrelevant or generic terms were deleted. Changes to the word order were also considered to properly reflect the topic’s concept. Users also consolidated similar topics to remove repetition and split broad topics into more specialized ones to make the model more detailed and useful. The study highlights the necessity of developing topic modeling tools that are more intuitive and correspond more closely with the methods by which non-experts typically evaluate and adjust topics. This approach enhances the usability of topic models while simultaneously improving their quality, aligning them more closely with users’ understanding and needs.
An innovative method proposed by [35] addresses refining the topic model. This approach incorporates word-embedding projections with interactive user input within the context of a visual analytics framework. The method uses word embedding projections to build a visual representation of the themes. These projections are useful for displaying the semantic relationship between distinct words within a topic, providing a clearer grasp of their interconnection. By allowing users to engage with the visual representations of topics, a visual analytics framework enhances this method. This interactivity enables users to change these models to improve the topic depending on their personal knowledge and interpretation. This engagement includes adding, removing or repositioning words within the concept space.
Another study [36] also leveraged user feedback to refine topics. It adopts a mixed-initiative approach in which the user and the system collaborate to refine topic models dynamically. The refining process starts with the initial creation of topics, applying conventional topic modeling methods. Users engage with the model by making changes and offering comments on the topics produced after the first generation of models. The system then analyzes this user feedback to learn their preferences and adjusts its algorithms based on the user preference model. The system has six agents to support various refinement operations, including an agent that merges similar topics by identifying and combining those most similar to one another. Conversely, the Split agent divides a topic into two distinct new topics. The remaining agents remove topics, reinsert small topics, reinsert outliers, and reinsert the worst topics.
Additionally, ref. [37] presents a unique approach to refining topic models by emphasizing key phrases, instead of specific words or whole documents. The system initially extracts key phrases from the documents using an RNN-based encoder–decoder model. LDA is used as an initial topic model to obtain topic words. A function to remove and add key phrase refinement was proposed to identify documents that should be added or removed from the specified topic. Rather than changing the list of top words, the proposed method directly modifies the document topic association by considering the key phrases as a representative overview of the documents.
Unlike the research discussed above, study [38] outlines a novel technique for enhancing topic models customized for short texts. This approach introduces topic refinement, a model-agnostic mechanism that utilizes the functionalities of large language models (LLMs) to enhance topics post-extraction. The method systematically generates prompts to query LLMs for each topic. It gradually chooses a word as the possible intruder word in a topic, while the other words represent that topic. Then, it evaluates whether the word aligns with the semantic expression of the other words. Once alignment is verified, the word is retained; otherwise, a more coherent word is provided as a candidate to substitute for the intruder word.
In this work, we introduce TR-GPT-CF, a novel approach for post-extraction topic refinement to improve coherence and interpretability. This method does not engage in the preliminary modeling of topics but focuses on enhancing topics post-extraction. Our strategy focuses on detecting words that are not semantically related to the other words in the topic, which we call ‘misaligned words’, and then substituting them with alternatives suggested by WordNet and the GPT model. The process starts by extracting topics using a topic model method such as LDA or BERTopic. From these generated topics, we detect the misaligned word using the Z-score of the cosine similarity of each topic word to the topic centroid. Subsequently, the word most similar to the topic centroid is selected as the basis for generating alternative words from WordNet and the GPT model. Words with the highest coherence score are chosen as candidates to replace the misaligned word. Evaluation across several datasets demonstrates that our method significantly improves topic coherence, offering an effective solution for achieving more interpretable and semantically coherent topics.
The remainder of this work is organized as follows: Initially, the proposed topic refinement method is explained, detailing the methods implemented in this study. Subsequently, performance results and a discussion are presented in Section 3, and the final section concludes with an explanation of future directions.

2. Materials and Methods

To overcome the shortcomings of conventional topic models in generating coherent and interpretable topics, we present an innovative topic refinement framework, which, to the best of our knowledge, has not been previously presented in the literature. While this work was inspired by [38], our approach differs significantly. The referenced study relies solely on large language models (LLMs) for topic refinement, whereas our proposed method integrates multiple techniques, as outlined below.
A high-level overview of our framework is illustrated in Figure 1. The three primary phases are preprocessing, topic extraction, and topic refinement. Our proposed method, TR-GPT-CF, focuses on topic refinement, which consists of two main stages: misaligned word detection and misaligned word replacement. Misaligned word detection aims to identify words that are not semantically related to the other words in the topic. Misaligned word replacement aims to generate alternative words that are selected based on their coherence. This framework is model agnostic, allowing it to utilize any topic modeling method and dataset. Because of its flexibility, it can be used for a variety of purposes, including academic research and industry-specific analysis. Being independent of a particular model allows it to adjust to different data requirements and characteristics, increasing its usefulness and scalability across domains.
In this research, we utilize various datasets as shown in Table 1. We employ multiple datasets to evaluate the effectiveness of our refinement method across different contexts.
We sourced certain datasets from standard scikit-learn libraries, specifically AGNews, Yahoo Answers, and 20Newsgroup. Specifically, this experiment utilizes the ‘train’ data from the AGNews and 20Newsgroup datasets and only 20,000 training samples from the Yahoo Answers dataset. The remaining three datasets—TagMyNews from [39], SMS Spam from [40], and Kaggle’s Research Article from [41]—were obtained from their respective sources. These datasets provide a diverse range of topics and formats, enabling comprehensive analysis across different domains. By leveraging their distinct characteristics, we aim to derive meaningful insights and enhance our understanding of the underlying patterns within the data.
In this work, we applied numerous preprocessing standards, including the elimination of stop words and lemmatization, to ensure that our model focused on the most relevant words, thereby enhancing the quality of the topics generated. Further preprocessing techniques such as normalization and tokenization were utilized to refine the dataset before analysis. Given the vast amount of unstructured data on the internet, particularly from social media—where user-generated content [42] and short text predominate—we refined our preprocessing with several specific actions. Since our framework is designed to handle both long and short text, some preprocessing steps followed the method proposed by [43] and implemented by [44].
Building on the aforementioned foundational steps, we further optimized the dataset using the following actions:
  • Removing high-frequency words: Identify and remove the most frequent words in the corpus based on TF-IDF scores. This step prevents the model from focusing on very common words (e.g., “the” and “and”).
  • Eliminating repetitive patterns: Eliminate redundant sequences such as “hahahaha” or “hihihhihi” that contribute unnecessary noise to the corpus. This step uses regular expressions to replace any character repeated three or more times with a single occurrence.
  • Removing non-alphabetic characters: Employ regular expressions to eliminate non-alphabetic characters.
  • Eliminating very short documents: Exclude documents that are excessively brief or that consist solely of nonsensical content, which is unlikely to be meaningful for topic modeling.
The overall pre-processing step is described in Algorithm 1.
Algorithm 1. Pre-processing
Input: Corpus D, stop words SW, minimum word count min_word_count
Output: Preprocessed corpus Dprocessed
For each document dm ∈ D:
     convert dm to lowercase
     remove non-alphabetic characters from dm
     remove repetitive patterns from dm
     tokenize dm into words
     lemmatize each word w
     remove stop words w ∈ SW
end for
Compute high-frequency words
     vectorize D using TF-IDF to obtain BoW
     calculate word frequencies freq(w) for all words w ∈ BoW
     identify the most frequent words HF
     remove high-frequency words w ∈ HF
Filter short or empty documents
     remove dm if |words(dm)| < min_word_count
Return preprocessed corpus Dprocessed
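For concreteness, the following is a minimal Python sketch of Algorithm 1, assuming NLTK (with the stopwords and wordnet resources downloaded) and scikit-learn; the top_n_frequent cutoff for high-frequency words is an illustrative parameter not fixed in the pseudocode.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(corpus, min_word_count=3, top_n_frequent=50):
    """Sketch of Algorithm 1: clean, tokenize, lemmatize, drop stop
    words, then prune high-frequency words and very short documents."""
    sw = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    docs = []
    for d in corpus:
        d = d.lower()
        d = re.sub(r"[^a-z\s]", " ", d)      # keep alphabetic characters only
        d = re.sub(r"(.)\1{2,}", r"\1", d)   # collapse chars repeated 3+ times
        tokens = [lemmatizer.lemmatize(w) for w in d.split() if w not in sw]
        docs.append(tokens)

    # Identify the most frequent words in the corpus via TF-IDF weights.
    vec = TfidfVectorizer()
    X = vec.fit_transform(" ".join(t) for t in docs)
    freqs = X.sum(axis=0).A1
    vocab = vec.get_feature_names_out()
    high_freq = {vocab[i] for i in freqs.argsort()[::-1][:top_n_frequent]}

    processed = []
    for tokens in docs:
        tokens = [w for w in tokens if w not in high_freq]
        if len(tokens) >= min_word_count:    # filter short or empty documents
            processed.append(tokens)
    return processed
```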
Following the preprocessing step outlined previously, we extracted a set of topics from a specific dataset using a base topic model. To facilitate this extraction, various topic modeling techniques were employed in this experiment, as outlined in Table 2, including LDA [15], NMF [45], BERTopic [46], and Gaussian-BAT [47]. Recent advancements in LDA [48] and NMF [49,50] have further improved their effectiveness in current topic modeling applications.
Additionally, our framework is designed to adapt to other techniques, thereby broadening its application across a variety of scientific and industrial fields. This adaptability not only enhances the versatility of our framework but also enables customization of our studies to meet specific requirements and accommodate different datasets.
We then use the extracted topics from such topic modeling techniques as an input in our proposed topic refinement method, TR-GPT-CF, which consists of two main functions: misaligned word detection, which semantically detects words that deviate significantly from the centroid of the topic, and misaligned word replacement, which leverages WordNet [51,52] and GPT-3.5 (model gpt-3.5-turbo-1106) from the OpenAI API [53] to provide contextually appropriate replacements. The integration of these components guarantees that the refined topics exhibit enhanced coherence while preserving their interpretability for practical applications. To quickly understand the method, please see the pseudo-algorithm outlined in Algorithm 2.
Algorithm 2. TR-GPT-CF
Input: A set of topics T = {t1, t2, …, tK}, embedding model M, corpus C, large language model L, WordNet W, Z-score threshold θz, inverse document frequency threshold θf
Output: Refined topics T′ = {t′1, t′2, …, t′K}

1: Initialize the set of refined topics T′ ← Ø
2: For each topic ti ∈ T do
3:     Initialize the refined topic t′i ← ti
4:     Compute the topic centroid c ← mean M(ti) using word embeddings from M
5:     For each word wj ∈ t′i do
6:         Compute the cosine similarity sj between M(wj) and the centroid c
7:         Compute the Z-score zj of the similarities s
8:         Compute the IDF value IDF(wj, C)
9:         if zj < θz and IDF(wj, C) > θf then
10:            mark wj as a misaligned word wmisaligned
11:        end if
12:    end for
13:    For each detected misaligned word wmisaligned do
14:        select the word wc most similar to the centroid c
15:        retrieve hypernyms and hyponyms of wc from WordNet W
16:        prompt L to provide alternatives for wc in the context of ti
17:        combine all candidates from W and L
18:        calculate the coherence score of each candidate
19:        if the best candidate improves the overall coherence score then
20:            replace wmisaligned in t′i with the highest-coherence candidate
21:        else
22:            retain wmisaligned
23:        end if
24:    end for (repeat until no further improvement in coherence is observed)
25:    Add t′i to T′
26: End for
27: Return the set of refined topics T′
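A high-level Python skeleton of Algorithm 2 is sketched below. It is illustrative only: detect_misaligned, closest_to_centroid, wordnet_candidates, gpt_candidates, and coherence_cv are hypothetical stand-ins for the components detailed in Sections 2.1 and 2.2.

```python
def tr_gpt_cf(topics, docs, z_thresh=-1.5, idf_thresh=1.0):
    """Sketch of the TR-GPT-CF loop: detect misaligned words per topic,
    then test candidate replacements and keep only improving swaps."""
    refined = []
    for topic in topics:
        topic = list(topic)
        for w in detect_misaligned(topic, docs, z_thresh, idf_thresh):
            centroid_word = closest_to_centroid(topic)
            candidates = set(wordnet_candidates(centroid_word)) | \
                         set(gpt_candidates(centroid_word, topic))
            best_word, best_score = w, coherence_cv([topic], docs)
            for c in candidates:
                if c in topic:                        # prevent duplicate words
                    continue
                trial = [c if x == w else x for x in topic]
                score = coherence_cv([trial], docs)
                if score > best_score:                # keep only improving swaps
                    best_word, best_score = c, score
            if best_word != w:                        # otherwise retain the word
                topic = [best_word if x == w else x for x in topic]
        refined.append(topic)
    return refined
```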

2.1. Misaligned Word Detection

Prior research indicates that words unrelated to others within a topic are recognized as intruders [33,54]. In [33], intruder identification was conducted by presenting users with a collection of words. The users were instructed to identify the word that was unrelated to the other words in the topic. For instance, “banana” is rapidly identified as the intruder in the collection {lion, tiger, elephant, banana, giraffe, zebra}, as the remaining words all represent animals. Conversely, in a set like {bike, professor, kangaroo, swift, green, Brazil}, which lacks this thematic unity, identifying the outlier becomes more challenging. Reference [55] presents a method for detecting intruder words through the utilization of semantic similarity measures in an expanded corpus. More external data are added to the initial corpus to include a larger vocabulary, word embeddings are created to capture semantic meanings, and documents are grouped together to uncover hidden topics. Each cluster’s centroid in the embedding space signifies its corresponding topic, and the low cosine similarity of intruder words to the topic centroid indicates semantic divergence.
Although the previous paper referred to the irrelevant word as an ‘intruder’, in this work, we prefer to use the term ‘misaligned word’. This new terminology emphasizes the broader applicability and unique methodology of our technique, which extends beyond the constraints of previously established frameworks.
In contrast to the methodologies proposed in previous studies, our TR-GPT-CF mechanism utilizes the bert-base-uncased model of BERT embeddings [56], which are contextualized to capture the specific meaning of a word based on its surrounding words. Recent advancements in BERT [57,58] have further improved its effectiveness in various applications. BERT embeddings provide greater flexibility and depth in representing text, especially for tasks like intruder word detection in complex or ambiguous contexts. Unlike static embeddings, which struggle with polysemy—for example, ‘bank’ as a financial institution versus ‘bank’ as the side of a river—the BERT contextualized approach allows for dynamic representations that adapt to different contexts.
Our work employs the Hugging Face transformer library [59] to apply the bert-base-uncased version of the BERT model, which features 12 layers and 768 hidden units [56]. This model is pre-trained on a large corpus of English text in which all input words are converted to lowercase, making it case insensitive.
In this work, misaligned word detection is designed to analyze the semantic and statistical properties of words within a given topic in order to identify misaligned words. It starts by generating embeddings M for all words in the topic T using the bert-base-uncased model of BERT embeddings and calculates a centroid embedding M(ti), which represents the semantic center of the topic. Cosine similarity s is then computed between each word’s embedding wj and the centroid c to measure how closely each word aligns with the overall topic. Instead of relying solely on cosine similarity to detect outliers as suggested in [55], we utilize the Z-score to standardize the similarity values.
Using only cosine similarity presents challenges in defining a universal threshold to classify a word as a misaligned word. The similarity score range can vary widely across different topics and datasets. For example, in some topics, all words may naturally exhibit low similarity scores due to the nature of the embeddings, complicating the identification of true misaligned words. Additionally, misaligned words often appear as ‘relative’ outliers—that is, their similarity to the centroid is significantly lower than that of other words within the topic. However, cosine similarity alone fails to account for the distribution of similarity scores within a topic.
Cosine similarity alone measures how semantically related each word is to the topic centroid, yet it struggles with relative comparisons across topics. The Z-score addresses this issue by identifying words that significantly deviate from the norm within the context of the topic. This combination enables more reliable and context-aware detection of misaligned words, which are essentially the ‘outliers’ in the similarity distribution.
The Z-score, also known as the standard score, is a statistical measure that quantifies the number of standard deviations a data point is from the mean of a dataset. It is calculated using the following formula [60] as further applied in [61,62]:
Z = (x − μ) / σ        (1)
The numerator (x − μ) calculates the difference between the individual data point (x) and the mean (μ). This determines how far the data point is from the average value of the dataset. The standard deviation (σ) represents the typical amount that data points differ from the mean.
In this experiment, the Z-score is used to quantify the deviation of a specific value from the mean of the similarity scores, normalized by the standard deviation. In this context, the value refers to the cosine similarity of a word to the centroid. By standardizing the similarity scores with the Z-score, we normalize the variability in cosine similarity across different topics. This standardization allows us to apply the same Z-score threshold consistently, regardless of the overall range of similarity values within a given topic. We investigate Z-score thresholds between ±1.5 and ±3.0 to ascertain their effect on the topic coherence metric C_v for identifying misaligned words. We implement the Z-score using the scipy.stats.zscore function from the SciPy Python library (v.1.14.1).
To facilitate understanding of the variability in cosine similarity scores within a topic, Table 3 illustrates the scores for each word compared to the topic centroid, along with their Z-score values, assuming a standard deviation (σ) of 0.20. Word E, with a score of 0.40, exhibits significantly lower similarity to the centroid than other words. Relying exclusively on cosine similarity presents difficulties in establishing an appropriate threshold for misaligned word detection. This challenge is particularly evident when the topic demonstrates low similarities or when the dataset exhibits different behavior.
Using Equation (1), the Z-score value is calculated to measure the extent to which each word’s cosine similarity deviates from the ‘average’ similarity score, relative to the overall distribution of similarities. From Table 3, we observe that most of the words have a Z-score close to zero, indicating that their cosine similarities are near the mean. However, Word E has a Z-score of −1.85, significantly lower than others; if a Z-score threshold of −1.5 is applied, Word E would be flagged as a misaligned word because its Z-score falls well below this threshold.
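As a quick arithmetic check of Equation (1), note that a mean similarity of μ ≈ 0.77 is implied by these values (Table 3 itself is not reproduced here), so for Word E:

Z = (0.40 − 0.77) / 0.20 = −1.85 < −1.5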
For several reasons, the Z-score represents a superior method for identifying misaligned words within topics. Firstly, it provides relativity by comparing each word’s similarity to the other words in the topic, rather than using an arbitrary threshold, such as requiring that the ‘cosine similarity must be > 0.7’. This feature enhances the adaptability of Z-scores to their respective context. Secondly, the flexibility of Z-scores enables the effective application of the same threshold (e.g., −1.5) across various topics, accommodating the unique distribution of similarities in each topic. Lastly, Z-scores are particularly effective in emphasizing misaligned words like Word E, which stand out due to their significant deviations from the norm, regardless of whether the topic naturally exhibits low- or high-similarity values.
To further refine the detection of misaligned words, we also calculate the inverse document frequency (IDF) [63] of each word using a TF-IDF vectorizer to account for word importance. IDF introduces an element of statistical rarity, ensuring that the identified misaligned word is not only semantically distant but is also unusual within the overall corpus. IDF helps emphasize less common, more relevant words while minimizing frequent, less significant ones.
Finally, we apply a threshold condition. We flag a word as a misaligned word if its Z-score falls below a predefined threshold, indicating low alignment with the topic centroid, and its IDF score surpasses another threshold, indicating the word’s uncommonness and potential contextual significance.
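A condensed Python sketch of this detection step is shown below, assuming Hugging Face transformers, SciPy, and scikit-learn. The embedding function mean-pools the last hidden state of each word encoded in isolation, and the threshold defaults are illustrative rather than the paper's tuned values.

```python
import numpy as np
import torch
from scipy.stats import zscore
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_embedding(word):
    """Mean-pooled bert-base-uncased embedding of a single word."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs).last_hidden_state
    return out.mean(dim=1).squeeze().numpy()

def detect_misaligned(topic, corpus_texts, z_thresh=-1.5, idf_thresh=1.0):
    """Flag words whose Z-scored similarity to the topic centroid is low
    and whose IDF marks them as rare in the corpus."""
    embs = np.stack([word_embedding(w) for w in topic])
    centroid = embs.mean(axis=0, keepdims=True)
    sims = cosine_similarity(embs, centroid).ravel()
    z = zscore(sims)
    vec = TfidfVectorizer()
    vec.fit(corpus_texts)
    idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
    return [w for w, zi in zip(topic, z)
            if zi < z_thresh and idf.get(w, 0.0) > idf_thresh]
```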

2.2. Misaligned Word Replacement

This function is designed to improve topic coherence by replacing an identified misaligned word in a topic with a better alternative [38]. It employs a dual approach for generating replacement candidates. First, it uses WordNet to identify synonyms and semantically related words based on the centroid word of the topic. Second, it leverages GPT to generate replacements that are more contextually relevant. These candidates are then combined into a unified set to ensure their uniqueness. The function evaluates each candidate by temporarily replacing the misaligned word in the topic and calculating the resulting topic coherence score. If a candidate produces a higher coherence score than the original topic, it is selected as the best replacement. Additionally, the function prevents redundancy by ensuring that duplicate words are not added to the topic.
WordNet is a large lexical database of the English language that organizes words into sets of synonyms called synsets [51,52]. Each synset contains lemmas that are synonymous or closely related terms. WordNet facilitates the exploration of lexical relationships, providing synonyms, hypernyms, and hyponyms for the centroid word. Hypernyms are more general, while hyponyms are more specific. We utilize it to generate candidates for a given centroid word. It extracts synonyms and lemmas for the centroid word to create a list of potential replacements. By querying WordNet for all synsets associated with the centroid word, we retrieve various semantic contexts for that word.
Consider the following workflow as an illustration: the input consists of the topic words [‘education’, ‘learning’, ‘knowledge’, ‘teaching’], with ‘learning’ identified as the centroid word. WordNet’s synsets for ‘learning’ include [‘learning.n.01’, ‘learning.n.02’], and the lemmas derived from these synsets are [‘learning’, ‘acquisition’, ‘education’, ‘study’]. Therefore, the output for WordNet candidates is [‘learning’, ‘acquisition’, ‘education’, ‘study’].
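The candidate generation just described can be sketched with NLTK's WordNet interface as follows (a minimal sketch; nltk.download("wordnet") is assumed, and the hypernym/hyponym expansion follows the description above):

```python
from nltk.corpus import wordnet as wn

def wordnet_candidates(centroid_word):
    """Collect lemmas from the centroid word's synsets, plus the lemmas
    of their hypernyms and hyponyms, as replacement candidates."""
    candidates = set()
    for syn in wn.synsets(centroid_word):
        for related in [syn] + syn.hypernyms() + syn.hyponyms():
            for lemma in related.lemmas():
                candidates.add(lemma.name().replace("_", " ").lower())
    return sorted(candidates)

# e.g., wordnet_candidates("learning") returns words such as
# 'learning', 'acquisition', 'education', 'study', ...
```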
We extended our search for candidate words by also leveraging GPT-3.5-turbo-1106 in addition to using WordNet. We developed a function that can be invoked within the misaligned word replacement function to generate alternative words by constructing a clear and explicit GPT prompt. This prompt instructs GPT to provide alternative words for the centroid word, ensuring that the generated output is structured as a comma-separated list.
For testing purposes, we used Google Colab’s T4 GPU, which provides free GPU acceleration, making it a convenient and cost-effective environment for running computationally intensive models. However, the framework is not limited to this environment and can run on any computational platform with sufficient resources. We leveraged OpenAI’s API version 1.55.3 [64], setting the temperature parameter to 0 for deterministic responses, to ensure minimal randomness and consistent results, and max_tokens to 100, to prevent overly long responses. This version was chosen because of its improved compatibility and stability, addressing previous integration concerns and ensuring reliable performance for our framework. We used the following prompt to query GPT-3.5 for alternative words:
“Provide alternative words for ‘{centroid_word}’ in the context of the topic: {topic_word}. Please separate words with commas”.
The function sends the prompt to GPT via the OpenAI’s API using the client.chat.completions.create method.
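A minimal sketch of this query, assuming the openai Python package (v1.x) with an API key available in the environment:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def gpt_candidates(centroid_word, topic_words):
    """Query GPT-3.5 for alternative words using the prompt shown above."""
    prompt = (f"Provide alternative words for '{centroid_word}' in the "
              f"context of the topic: {', '.join(topic_words)}. "
              "Please separate words with commas.")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic responses, as in the paper
        max_tokens=100,  # prevent overly long responses
    )
    text = resp.choices[0].message.content
    return [w.strip().lower() for w in text.split(",") if w.strip()]
```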
As previously discussed in this section, we will select the candidate with the highest coherence score to replace the misaligned word, using the topic coherence metric C_v. We utilized Gensim’s CoherenceModel to assess this coherence score of the topics, as it has been widely used in recent studies, including [65], for evaluating topic modeling performance. In this work, we focus solely on topic coherence metric C_v [66], which encompasses a set of measures that describe the quality and interpretability of topics from a human perspective [27,67]. The general formula for calculating the coherence score C of a topic with a set of words W = {w1, w2, …, wN} is as follows:
C = ∑_{i=2}^{N} ∑_{j=1}^{i−1} PMI(w_i, w_j)        (2)
where PMI (wi, wj) is the Pointwise Mutual Information [68] between words wi and wj, defined as follows:
PMI(w_i, w_j) = log [ P(w_i, w_j) / (P(w_i) P(w_j)) ]        (3)
P(wi, wj) is the probability of the co-occurrence of words wi and wj, and P(wi) and P(wj) are the individual probabilities of the words. Calculating the PMI for all word pairs and aggregating the results yields the coherence score, which indicates the overall semantic similarity of the words within the topic. A higher coherence score indicates more interpretable and meaningful topics.
This approach allows us to evaluate how well the identified topics align with human understanding and expectations, thereby improving the efficacy of topic modeling techniques. By focusing on topic coherence, we hope to provide results that are not only statistically sound but are also meaningful and relevant to users.
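A minimal sketch of this scoring step with Gensim is shown below; tokenized_docs is the preprocessed corpus as lists of tokens:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def coherence_cv(topics, tokenized_docs):
    """Compute the C_v coherence of a list of topics (lists of words)."""
    dictionary = Dictionary(tokenized_docs)
    cm = CoherenceModel(topics=topics, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()

# Usage: score a trial replacement against the original topic and keep
# the swap only when coherence_cv improves, as in Algorithm 2.
```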

3. Results and Discussion

This section evaluates the proposed TR-GPT-CF topic refinement method across six datasets (AGNews, TagMyNews, Yahoo Answers, Newsgroup, SMS Spam, and Science Article) and four models (LDA, BERTopic, G-BAT, and NMF). We examine the coherence scores before and after applying the proposed method and benchmark them against baseline results for three shared datasets: AGNews, TagMyNews, and Yahoo Answers. The results demonstrate consistent improvements in topic coherence, highlighting the robustness of the proposed method across diverse datasets and models.
To validate the significance of the improvement in coherence scores, paired t-tests were performed for each model across all datasets as presented in Table 4. The results indicate that the proposed method achieved statistically significant improvements (p < 0.05) for LDA, BERTopic, and G-BAT. For NMF, while the p-value was slightly above 0.05 (p = 0.068), the improvements exhibited a positive trend. The p-value for LDA is 0.042, which is below the 0.05 threshold, indicating that the improvement in coherence scores after refinement is statistically significant.
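The paired t-test can be reproduced with SciPy as sketched below. The before/after values stand for one model's six per-dataset coherence scores; they are illustrative placeholders, not the exact inputs behind Table 4.

```python
from scipy.stats import ttest_rel

# Illustrative per-dataset C_v scores for one model before and after
# refinement (six paired observations, one per dataset).
before = [0.412, 0.336, 0.485, 0.583, 0.390, 0.526]
after  = [0.430, 0.431, 0.503, 0.602, 0.412, 0.544]

t_stat, p_value = ttest_rel(after, before)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.05
```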

3.1. Evaluating Topic Coherence Improvement Across Datasets

3.1.1. Improvement of Topic Coherence in AGNews Dataset

Table 5 shows that the proposed refinement method significantly improved coherence scores for the AGNews dataset across all models. LDA demonstrated the highest percentage improvement at 4.4%, while BERTopic exhibited marginal improvement due to its already high baseline. The modest improvement for BERTopic at 0.45% indicates that there is minimal opportunity for improvement, as the model already produces highly coherent topics. Although this is in line with expectations, it may suggest that lower-performing models benefit more from refinement.
The AGNews dataset is known for having well-structured content with moderately long text and clear topics. Most models show high baseline coherence scores, which could explain why models like BERTopic and NMF show smaller gains. This underscores the significance of dataset structure in influencing refinement outcomes. In contrast, LDA, which traditionally struggles with capturing semantic coherence, benefits significantly from refinement. Similarly, G-BAT, initially one of the weaker models, also shows meaningful improvement, demonstrating the refinement method’s capacity to strengthen underperforming models.

3.1.2. Improvement of Topic Coherence in TagMyNews Dataset

Table 6 shows that the refinement method demonstrated varied impacts across different models, with LDA showing the most significant improvement. Specifically, LDA’s coherence score increased from 0.336 to 0.431, marking a 28.27% rise, the largest among all tested models. This substantial gain highlights the method’s effectiveness in enhancing models that initially exhibit poor coherence. In contrast, BERTopic, an embedding-based model, also benefited from the refinement, though more moderately. Its score improved from 0.539 to 0.572, a 6.12% increase, suggesting that refinement techniques can effectively address challenges in datasets with shorter or noisier text like TagMyNews. Meanwhile, NMF displayed stable performance with its coherence score improving from 0.589 to 0.604, a 2.55% increase. Although modest, this improvement underscores the method’s capability to boost coherence even in models that already have moderately strong baseline performance.
G-BAT exhibited only a marginal improvement in its performance, with its coherence score increasing slightly from 0.646 to 0.650, a mere 0.62% increase. This minimal change suggests that G-BAT may already capture most of the coherence achievable for this dataset, indicating limited scope for further refinement. In a broader context, compared to the AGNews dataset, all models registered lower baseline scores when applied to the TagMyNews dataset. This disparity suggests that TagMyNews presents notable challenges for topic modeling, likely due to factors such as shorter text, noisier content or overlapping topics.

3.1.3. Improvement of Topic Coherence in Yahoo Answers Dataset

Table 7 shows that BERTopic demonstrated strong performance, showing a significant improvement from 0.706 to 0.745, a 5.52% increase. This substantial enhancement highlights the refinement method’s ability to further improve an already high-performing, embedding-based model on a challenging dataset like Yahoo Answers. We observed consistent improvement across all models, with percentage gains ranging from 3.01% for NMF to 5.52% for BERTopic, indicating the robustness and adaptability of the refinement method. Additionally, G-BAT also showcased notable gains, improving from 0.468 to 0.492, a 5.13% increase. This confirms that even models starting with lower baseline coherence can significantly benefit from the refinement process, emphasizing the method’s broad applicability.
On the Yahoo Answers dataset, LDA showed the smallest improvement, with a modest gain of 3.71% in coherence scores, contrasting sharply with TagMyNews, where it improved by 28.27%. This difference suggests that the structure of Yahoo Answers is less amenable to probabilistic models like LDA, limiting the effect of refinement. Despite these challenges, the baseline coherence scores across all models are moderately high, suggesting that the Yahoo Answers dataset generally provides a well-structured topic space. Therefore, there is some constraint on the potential for significant improvements. Yahoo Answers’ likely inclusion of moderately long and well-structured documents offers less of a challenge for topic models compared to more diverse datasets like TagMyNews. However, the presence of overlapping or diverse topics within Yahoo Answers may still restrict further gains in coherence.

3.1.4. Improvement of Topic Coherence in the Newsgroup Dataset

The refinement method exhibited significant but varied impacts across models, as illustrated in Table 8. G-BAT exhibits the most significant improvement, with an extraordinary 40.19% increase in coherence score from 0.209 to 0.293.
This significant achievement emphasizes the efficacy of the refinement procedure, particularly for models that initially encounter difficulties with coherence on this dataset. In contrast, BERTopic continues to demonstrate robust performance, maintaining high coherence scores that improved marginally from 0.823 to 0.839, representing a 1.94% increase. Meanwhile, LDA exhibits a slight improvement, with its coherence score rising from 0.583 to 0.602, representing a 3.26% increase. This enhancement illustrates the refinement method’s effectiveness in probabilistic models, such as LDA, and is particularly relevant in datasets with structured topics, such as Newsgroup.
Regarding the Newsgroup dataset, the refinement method’s impact varied across models. NMF displayed no change in coherence scores, remaining at 0.743 both before and after refinement, suggesting that the refinement method had no measurable effect on this model. This lack of improvement may indicate that NMF already captures the maximum coherence achievable for this dataset. Conversely, both LDA and BERTopic showed only small gains. That LDA and BERTopic did not improve as much as they did on other datasets like TagMyNews suggests that the structured nature of Newsgroup, along with topics that overlap or are similar, may make the refinement process less effective. These factors can make substantial coherence improvements more challenging, highlighting the complexity of enhancing topic model performance in datasets characterized by structured but similar content.

3.1.5. Improvement of Topic Coherence in the SMS Spam Dataset

Table 9 indicates that the refining process has been effective across multiple models. G-BAT exhibited significant improvement, with its coherence score increasing from 0.494 to 0.570, a 15.38% increase. This large enhancement demonstrates the method’s capacity to improve coherence for embedding-based models in short-text datasets. Similarly, NMF showed a significant gain, with its score rising from 0.427 to 0.483, a 13.11% increase, indicating the method’s effectiveness even in matrix-factorization-based models within challenging datasets. On the other hand, LDA and BERTopic had more moderate gains: LDA’s coherence increased 5.64%, while BERTopic experienced a more robust improvement, a 9.09% increase. This improvement indicates that both models benefit from the refinement process, with BERTopic’s stronger performance being due to its embedding-based structure.
The baseline scores for all models on the SMS Spam dataset are relatively low compared to those on datasets like AGNews or Yahoo Answers, indicating that SMS Spam presents unique challenges for topic modeling. This difficulty is likely due to its short and informal text, which makes semantic coherence harder to achieve. Among the models, LDA shows only a modest improvement, achieving the smallest percentage gain, which underscores its limitations in handling short-text datasets. However, these same characteristics allow embedding-based models such as BERTopic and G-BAT to perform well after refinement. This scenario illustrates how the characteristics of datasets can influence various topic modeling methods.

3.1.6. Improvement of Topic Coherence in Science Article Dataset

Table 10 indicates that G-BAT showed the most significant improvement among the models, with its coherence score rising from 0.265 to 0.341, representing a notable increase of 28.68%. This substantial gain underscores the refinement method’s effectiveness for enhancing low-performing models on structured, domain-specific datasets such as Science Article. In contrast, BERTopic, which already had a high baseline coherence, exhibited a modest gain, improving from 0.731 to 0.740, representing a 1.23% increase. This slight improvement indicates the restricted potential for further enhancement given the model’s robust initial performance. Similarly, LDA exhibited a modest improvement, with its coherence score increasing from 0.526 to 0.544, reflecting a 3.42% increase. This improvement illustrates the refinement method’s ability to enhance coherence in probabilistic models, though to a limited extent.
NMF showed moderate progress on the Science Article dataset, with its score marginally increasing from 0.614 to 0.619, a 0.81% increase. This minor change shows that the refinement procedure had little effect, probably because the model was already performing near its optimal level for this dataset. Similarly, aside from G-BAT, the other models only showed small improvement. This suggests that the Science Article dataset, with its well-structured, long, and technical content, probably provides a solid foundation that limits further improvement. This structure may explain why models with high baseline coherence scores, such as NMF and BERTopic, have minimal space for improvement, as the dataset’s inherent characteristics already support a high level of topic coherence.

3.2. Evaluating Topic Coherence Improvements by Candidate Word Replacement: WordNet, GPT, and Combined Approaches

In this section, we explore the improvement in topic coherence across various datasets when employing different candidate word generation techniques: WordNet, GPT, and a combination of both. Each strategy has distinct benefits in aligning topic words more accurately, therefore improving the interpretability of the topics. Through the comparison of different approaches, we aim to underscore the effectiveness of each in facilitating more coherent and meaningful topics.
To improve topic coherence, WordNet and GPT were used to identify eligible words for refining. WordNet provided semantically similar candidates, such as synonyms and hypernyms, based on linguistic hierarchies, while GPT dynamically created contextually relevant alternatives using language model embeddings. This study investigates the use of WordNet and GPT in three different scenarios. First, we utilized only WordNet as a generator to provide alternative words. Second, we used only GPT. Third, we employed a combination of both. This hybrid approach ensured a resilient selection of candidate words for topic refinement by combining contextual adaptability with semantic depth.
From Figure 2, we can observe that each dataset exhibits differences in the effectiveness of WordNet, GPT, and their combination in improving topic coherence. Below is a detailed analysis of these variations. Science Articles benefit from both semantic knowledge and contextual generation; therefore, the combined approach is most effective. When WordNet and GPT are used together in this dataset, they perform slightly better than when used separately. This indicates that their combination is effective in handling more technical information. Due to their heterogeneity, Newsgroup datasets show minimal improvement across all approaches, making them difficult to enhance. The small gain is likely related to the informal nature and diversity of the Newsgroup dataset. AGNews demonstrates limited potential for enhancement, as the topics are already well-structured, providing limited opportunity for additional refinement. The hierarchical organization of WordNet aligns more closely with the structure of the Yahoo Answers dataset compared to GPT, leading to higher enhancement with WordNet. While both WordNet and GPT perform well individually for TagMyNews, their combination does not yield additional benefits. WordNet demonstrates superiority in this context, highlighting its proficiency in datasets characterized by succinct and clearly delineated content. The combined approach is particularly effective for SMS Spam, as it utilizes both semantic understanding and contextual nuance to address brief, spam-related phrases.
Datasets with more structured content, such as Science Article and Yahoo Answers, tend to favor WordNet. In contrast, less structured or short-text datasets, such as SMS Spam and TagMyNews, benefit more from GPT and combined approaches. The combination of WordNet and GPT demonstrates the most significant improvements in datasets that present specific challenges, like SMS Spam. However, some datasets, such as AGNews and TagMyNews, show no significant advantage for the combined approach, as individual methods already achieve high coherence.
The comparative efficiency of WordNet, GPT, and their combined use varies by dataset. WordNet performs consistently, achieving its highest improvement of 8.3% in the SMS Spam dataset, most likely due to its dependence on semantic similarity, which helps create a more organized context. In contrast, GPT alone performed comparably but with slightly lower performance. For example, it achieved only a 3.4% improvement in the Science Articles dataset, versus WordNet’s 4.5%. The combination of WordNet and GPT, however, yields the highest improvement percentages in most cases, such as 10.8% in SMS Spam and 4.6% in Science Articles. This result consistently shows that combining WordNet and GPT leads to greater improvements, implying that semantic relationships (WordNet) and contextual representations (GPT) complement each other effectively. This suggests that combining semantic clarity with contextual fluency creates a more effective mechanism for candidate word replacement, thereby improving overall topic coherence.

3.3. Evaluating Topic Coherence by Qualitative Comparison

This section provides a qualitative comparison of the results obtained from our experiment. In our scenario, each model is configured to extract 10 topics from each dataset, with each topic comprising 10 words. Table 11 compares the extracted topics from several topic models and highlights how our refinement method improves them by replacing misaligned words with more appropriate terms.
The extracted topic column shows the initial topics generated by each model. While these topics represent the general themes, they often include misaligned words that do not fit well within the topic context. After refinement, the topics are improved by replacing misaligned words with contextually appropriate replacements. The last two columns highlight the specific misaligned words identified by the misaligned word detection mechanism and their corresponding replacements. These replacements are derived using WordNet and GPT, as discussed above, ensuring that the new words enhance topic coherence.

3.4. Limitations

This work has several limitations, which provide important directions for future research. First, selecting an appropriate Z-score threshold for detecting misaligned words was challenging. An improperly chosen threshold may result in correctly associated topic words being erroneously classified as misaligned. Second, the effectiveness of word replacement depends on GPT-generated word suggestions, which are influenced by the prompt design used to query the model. Although our method guarantees contextual relevance, further optimizations in prompt engineering for topic coherence could enhance the accuracy of word replacement. Third, the evaluation in this study primarily relies on the C_v coherence metric and manual inspection to assess the interpretability of the topics. Incorporating additional coherence metrics may yield a more thorough evaluation. Fourth, this study leverages a relatively small dataset and applies sampling to larger datasets, such as Yahoo Answers. Consequently, there is a trade-off in computational cost when scaling to larger datasets. The primary trade-offs involve an increased processing time due to Z-score computation, GPT-based replacements, and embedding similarity calculations.

4. Conclusions

This study presents TR-GPT-CF, a novel approach for post-extraction topic refinement. It employs Z-score centroid-based misaligned word detection and a hybrid semantic–contextual approach for word replacement, which utilizes WordNet and GPT. To evaluate the effectiveness of our refinement method, we applied four topic modeling techniques—LDA, NMF, BERTopic, and Gaussian-BAT—across six datasets: AGNews, TagMyNews, Yahoo Answers, Newsgroup, SMS Spam, and Kaggle’s Science Articles. Using these four topic modeling techniques and six datasets, we evaluated the extracted topics by calculating the percentage improvement in coherence before and after applying the refinement method. In addition, we investigated the enhancement of topic coherence across the six datasets using various candidate word generation techniques: WordNet, GPT, and a combination of both. Each strategy has its own benefits in aligning topic words more accurately, making the topics easier to understand, and the comparison shows how well each approach renders topics more coherent and meaningful.
TR-GPT-CF demonstrates enhancements across all datasets. It is highly effective at refining coherence in simpler datasets with less linguistic complexity. Furthermore, it effectively improves coherence for moderately structured datasets, rendering it appropriate for semi-structured data. Building on this foundation, the combination approach of WordNet and GPT consistently provides the most significant improvements in topic coherence across diverse datasets. This is attributed to WordNet’s semantic grounding and GPT’s contextual adaptability. The synergy between the two addresses both semantic precision and contextual fluency, making it robust for both structured and informal datasets. The combined approach is highly recommended for challenging datasets that need both domain knowledge and contextual fluency. Individual approaches such as WordNet or GPT alone may suffice in straightforward datasets where only one aspect—either semantics or context—is critical. This highlights the importance of our work, as the human evaluation of models is both costly and time-intensive, underscoring the value of our efficient, automated solutions.
Our future work will focus on further automating the review process and broadening our methodologies to encompass a wider range of datasets. We will also investigate the use of in-context learning paradigms to generate alternative words as a means to enhance the quality of GPT’s responses.

Author Contributions

Conceptualization, H.-S.Y.; methodology, I.W.; software, I.W.; validation, H.-S.Y. and I.W.; formal analysis, I.W.; investigation, I.W.; data curation, I.W.; writing—original draft preparation, I.W.; writing—review and editing, I.W. and H.-S.Y.; visualization, I.W.; supervision, H.-S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The work is partially supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2022-00143782).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in this article; further inquiries can be directed to the corresponding author.

Acknowledgments

Code assistance and alternative word generation were partially facilitated using OpenAI's ChatGPT 4.0 and the GPT-3.5-turbo-1106 model. All final interpretations, decisions, and evaluations were conducted by the authors, and the content and results were fully reviewed and verified by them.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dinsa, E.F.; Das, M.; Abebe, T.U. A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information. Sci. Rep. 2024, 14, 32051. [Google Scholar] [CrossRef] [PubMed]
  2. Romero, J.D.; Feijoo-Garcia, M.A.; Nanda, G.; Newell, B.; Magana, A.J. Evaluating the Performance of Topic Modeling Techniques with Human Validation to Support Qualitative Analysis. Big Data Cogn. Comput. 2024, 8, 132. [Google Scholar] [CrossRef]
  3. Williams, L.; Anthi, E.; Arman, L.; Burnap, P. Topic Modelling: Going beyond Token Outputs. Big Data Cogn. Comput. 2024, 8, 44. [Google Scholar] [CrossRef]
  4. Taghandiki, K.; Mohammadi, M. Topic Modeling: Exploring the Processes, Tools, Challenges and Applications. Authorea Prepr. 2023. Available online: https://www.authorea.com/users/689415/articles/682028-topic-modeling-exploring-the-processes-tools-challenges-and-applications (accessed on 26 November 2024).
  5. Meddeb, A.; Romdhane, L.B. Using Topic Modeling and Word Embedding for Topic Extraction in Twitter. Procedia Comput. Sci. 2022, 207, 790–799. [Google Scholar] [CrossRef]
  6. Li, H.; Qian, Y.; Jiang, Y.; Liu, Y.; Zhou, F. A novel label-based multimodal topic model for social media analysis. Decis. Support. Syst. 2023, 164, 113863. [Google Scholar] [CrossRef]
  7. Zankadi, H.; Idrissi, A.; Daoudi, N.; Hilal, I. Identifying learners’ topical interests from social media content to enrich their course preferences in MOOCs using topic modeling and NLP techniques. Educ. Inf. Technol. 2023, 28, 5567–5584. [Google Scholar] [CrossRef]
  8. Li, S.; Xie, Z.; Chiu, D.K.W.; Ho, K.K.W. Sentiment Analysis and Topic Modeling Regarding Online Classes on the Reddit Platform: Educators versus Learners. Appl. Sci. 2023, 13, 2250. [Google Scholar] [CrossRef]
  9. Rijcken, E.; Kaymak, U.; Scheepers, F.; Mosteiro, P.; Zervanou, K.; Spruit, M. Topic Modeling for Interpretable Text Classification from EHRs. Front. Big Data 2022, 5, 846930. [Google Scholar] [CrossRef]
  10. Somani, S.; van Buchem, M.M.; Sarraju, A.; Hernandez-Boussard, T.; Rodriguez, F. Artificial Intelligence-Enabled Analysis of Statin-Related Topics and Sentiments on Social Media. JAMA Netw. Open 2023, 6, e239747. [Google Scholar] [CrossRef]
  11. Rahimi, H.; Mimno, D.; Hoover, J.L.; Naacke, H.; Constantin, C.; Amann, B. Contextualized Topic Coherence Metrics. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL, Dubrovnik, Croatia, 2–6 May 2023; pp. 1760–1773. Available online: https://arxiv.org/abs/2305.14587v1 (accessed on 12 December 2024).
  12. Li, Y.; Yang, A.Y.; Marelli, A.; Li, Y. MixEHR-SurG: A joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records. J. Biomed. Inform. 2024, 153, 104638. [Google Scholar] [CrossRef]
  13. Boyd-Graber, J.; Hu, Y.; Mimno, D. Applications of Topic Models. Found. Trends® Inf. Retr. 2017, 11, 143–296. [Google Scholar] [CrossRef]
  14. Chakkarwar, V.A.; Tamane, S.C. Information Retrieval Using Effective Bigram Topic Modeling. In Proceedings of the International Conference on Applications of Machine Intelligence and Data Analytics (ICAMIDA 2022), Aurangabad, India, 22–24 December 2022; pp. 784–791. [CrossRef]
  15. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  16. Ozyurt, O.; Özköse, H.; Ayaz, A. Evaluating the latest trends of Industry 4.0 based on LDA topic model. J. Supercomput. 2024, 80, 19003–19030. [Google Scholar] [CrossRef]
  17. Blei, D.M. Probabilistic topic models. Commun. ACM 2012, 55, 77–84. [Google Scholar] [CrossRef]
  18. Bystrov, V.; Naboka-Krell, V.; Staszewska-Bystrova, A.; Winker, P. Choosing the Number of Topics in LDA Models—A Monte Carlo Comparison of Selection Criteria. J. Mach. Learn. Res. 2024, 25, 1–30. Available online: http://jmlr.org/papers/v25/23-0188.html (accessed on 1 February 2025).
  19. Liu, L.; Tang, L.; Dong, W.; Yao, S.; Zhou, W. An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 2016, 5, 1608. [Google Scholar] [CrossRef]
  20. Chen, W.; Rabhi, F.; Liao, W.; Al-Qudah, I. Leveraging State-of-the-Art Topic Modeling for News Impact Analysis on Financial Markets: A Comparative Study. Electronics 2023, 12, 2605. [Google Scholar] [CrossRef]
  21. Papadia, G.; Pacella, M.; Perrone, M.; Giliberti, V. A Comparison of Different Topic Modeling Methods through a Real Case Study of Italian Customer Care. Algorithms 2023, 16, 94. [Google Scholar] [CrossRef]
  22. Li, P.; Tseng, C.; Zheng, Y.; Chen, J.A.; Huang, L.; Jarman, B.; Needell, D. Guided Semi-Supervised Non-Negative Matrix Factorization. Algorithms 2022, 15, 136. [Google Scholar] [CrossRef]
  23. Blei, D.M.; Lafferty, J.D. A correlated topic model of Science. Ann. Appl. Stat. 2007, 1, 17–35. [Google Scholar] [CrossRef]
  24. Syahrial, S.; Perucha, R.; Afidh, F. Fine-Tuning Topic Modelling: A Coherence-Focused Analysis of Correlated Topic Models. Infolitika J. Data Sci. 2024, 2, 82–87. [Google Scholar] [CrossRef]
  25. Fang, Z.; He, Y.; Procter, R. BERTTM: Leveraging Contextualized Word Embeddings from Pre-trained Language Models for Neural Topic Modeling. arXiv 2023, arXiv:2305.09329. [Google Scholar]
  26. Bewong, M.; Wondoh, J.; Kwashie, S.; Liu, J.; Liu, L.; Li, J.; Islam, M.Z.; Kernnot, D. DATM: A Novel Data Agnostic Topic Modeling Technique with Improved Effectiveness for Both Short and Long Text. IEEE Access 2023, 11, 32826–32841. [Google Scholar] [CrossRef]
  27. Hoyle, A.; Goel, P.; Hian-Cheong, A.; Peskov, D.; Boyd-Graber, J.; Resnik, P. Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence. Adv. Neural Inf. Process. Syst. 2021, 34, 2018–2033. [Google Scholar]
  28. Marani, A.H.; Baumer, E.P.S. A Review of Stability in Topic Modeling: Metrics for Assessing and Techniques for Improving Stability. ACM Comput. Surv. 2023, 56, 108. [Google Scholar] [CrossRef]
  29. Kapoor, S.; Gil, A.; Bhaduri, S.; Mittal, A.; Mulkar, R. Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling. arXiv 2024, arXiv:2409.15626. [Google Scholar]
  30. Geeganage, D.K.; Xu, Y.; Li, Y. A Semantics-enhanced Topic Modelling Technique: Semantic-LDA. ACM Trans. Knowl. Discov. Data 2024, 18, 93. [Google Scholar] [CrossRef]
  31. Li, R.; González-Pizarro, F.; Xing, L.; Murray, G.; Carenini, G. Diversity-Aware Coherence Loss for Improving Neural Topic Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 2, pp. 1710–1722. [Google Scholar] [CrossRef]
  32. Lewis, C.M.; Grossetti, F. A Statistical Approach for Optimal Topic Model Identification. J. Mach. Learn. Res. 2022, 23, 1–20. Available online: http://jmlr.org/papers/v23/19-297.html (accessed on 2 February 2025).
  33. Chang, J.; Gerrish, S.; Wang, C.; Boyd-Graber, J.; Blei, D. Reading Tea Leaves: How Humans Interpret Topic Models. Adv. Neural Inf. Process. Syst. 2009, 22, 288–296. [Google Scholar]
  34. Lee, T.Y.; Smith, A.; Seppi, K.; Elmqvist, N.; Boyd-Graber, J.; Findlater, L. The human touch: How non-expert users perceive, interpret, and fix topic models. Int. J. Hum. Comput. Stud. 2017, 105, 28–42. [Google Scholar] [CrossRef]
  35. El-Assady, M.; Kehlbeck, R.; Collins, C.; Keim, D.; Deussen, O. Semantic concept spaces: Guided topic model refinement using word-embedding projections. IEEE Trans. Vis. Comput. Graph. 2020, 26, 1001–1011. [Google Scholar] [CrossRef]
  36. Sperrle, F.; Schäfer, H.; Keim, D.; El-Assady, M. Learning Contextualized User Preferences for Co-Adaptive Guidance in Mixed-Initiative Topic Model Refinement. Comput. Graph. Forum 2021, 40, 215–226. [Google Scholar] [CrossRef]
  37. Rehman, K.M.H.U.; Wakabayashi, K. Keyphrase-based Refinement Functions for Efficient Improvement on Document-Topic Association in Human-in-the-Loop Topic Models. J. Inf. Process. 2023, 31, 353–364. [Google Scholar] [CrossRef]
  38. Chang, S.; Wang, R.; Ren, P.; Huang, H. Enhanced Short Text Modeling: Leveraging Large Language Models for Topic Refinement. arXiv 2024, arXiv:2403.17706. [Google Scholar]
  39. Nandwani, V. News-Classification: train_data.csv. GitHub Repository. Available online: https://github.com/vijaynandwani/News-Classification/blob/master/train_data.csv (accessed on 5 December 2024).
  40. SMS Spam Collection Dataset. Available online: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset (accessed on 5 December 2024).
  41. Topic Modeling for Research Articles. Available online: https://www.kaggle.com/datasets/blessondensil294/topic-modeling-for-research-articles?select=train.csv (accessed on 5 December 2024).
  42. Anschütz, M.; Eder, T.; Groh, G. Retrieving Users’ Opinions on Social Media with Multimodal Aspect-Based Sentiment Analysis. arXiv 2022, arXiv:2210.15377. [Google Scholar]
  43. Wu, X.; Li, C.; Zhu, Y.; Miao, Y. Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1772–1782. [Google Scholar] [CrossRef]
  44. Wu, X.; Luu, A.T.; Dong, X. Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2748–2760. [Google Scholar] [CrossRef]
  45. Garewal, I.K.; Jha, S.; Mahamuni, C.V. Topic Modeling for Identifying Emerging Trends on Instagram Using Latent Dirichlet Allocation and Non-Negative Matrix Factorization. In Proceedings of the 10th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 14–15 March 2024; pp. 1103–1110. [Google Scholar] [CrossRef]
  46. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar]
  47. Wang, R.; Hu, X.; Zhou, D.; He, Y.; Xiong, Y.; Ye, C.; Xu, H. Neural Topic Modeling with Bidirectional Adversarial Training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 340–350. [Google Scholar] [CrossRef]
  48. Rieger, J.; Jentsch, C.; Rahnenführer, J. RollingLDA: An Update Algorithm of Latent Dirichlet Allocation to Construct Consistent Time Series from Textual Data. In Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2337–2347. [Google Scholar] [CrossRef]
  49. Vendrow, J.; Haddock, J.; Rebrova, E.; Needell, D. On a guided nonnegative matrix factorization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 3265–3269. [Google Scholar] [CrossRef]
  50. Nugumanova, A.; Alzhanov, A.; Mansurova, A.; Rakhymbek, K.; Baiburin, Y. Semantic Non-Negative Matrix Factorization for Term Extraction. Big Data Cogn. Comput. 2024, 8, 72. [Google Scholar] [CrossRef]
  51. Miller, G.A. WordNet. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
  52. Zotova, E.; Cuadros, M.; Rigau, G. Towards the Integration of WordNet into ClinIDMap. 2023. Available online: https://aclanthology.org/2023.gwc-1.42/ (accessed on 3 February 2025).
  53. API Platform|OpenAI. Available online: https://openai.com/api/ (accessed on 17 December 2024).
  54. Wood, J.; Arnold, C.; Wang, W. A Bayesian Topic Model for Human-Evaluated Interpretability. 2022. Available online: https://aclanthology.org/2022.lrec-1.674/ (accessed on 9 February 2025).
  55. Thielmann, A.; Reuter, A.; Seifert, Q.; Bergherr, E.; Säfken, B. Topics in the Haystack: Enhancing Topic Quality through Corpus Expansion. Comput. Linguist. 2024, 50, 619–655. [Google Scholar] [CrossRef]
  56. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. Available online: https://arxiv.org/abs/1810.04805v2 (accessed on 17 December 2024).
  57. Deb, S.; Chanda, A.K. Comparative analysis of contextual and context-free embeddings in disaster prediction from Twitter data. Mach. Learn. Appl. 2022, 7, 100253. [Google Scholar] [CrossRef]
  58. Stankevičius, L.; Lukoševičius, M. Extracting Sentence Embeddings from Pretrained Transformer Models. Appl. Sci. 2024, 14, 8887. [Google Scholar] [CrossRef]
  59. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the EMNLP 2020—Conference on Empirical Methods in Natural Language Processing: Systems Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar] [CrossRef]
  60. Domanski, P.D. Statistical outlier labelling—A Comparative study. In Proceedings of the 7th International Conference on Control, Decision and Information Technologies (CoDIT 2020), Prague, Czech Republic, 29 June–2 July 2020; pp. 439–444. [Google Scholar] [CrossRef]
  61. Yaro, A.S.; Maly, F.; Prazak, P. Outlier Detection in Time-Series Receive Signal Strength Observation Using Z-Score Method with Sn Scale Estimator for Indoor Localization. Appl. Sci. 2023, 13, 3900. [Google Scholar] [CrossRef]
  62. Menéndez-García, L.A.; García-Nieto, P.J.; García-Gonzalo, E.; Lasheras, F.S.; Álvarez-de-Prado, L.; Bernardo-Sánchez, A. Method for the Detection of Functional Outliers Applied to Quality Monitoring Samples in the Vicinity of El Musel Seaport in the Metropolitan Area of Gijón (Northern Spain). Mathematics 2023, 11, 2631. [Google Scholar] [CrossRef]
  63. Choi, J.; Jung, E.; Lim, S.; Rhee, W. Finding Inverse Document Frequency Information in BERT. arXiv 2022, arXiv:2202.12191. [Google Scholar]
  64. Release v1.55.3 Openai/Openai-Python GitHub. Available online: https://github.com/openai/openai-python/releases/tag/v1.55.3 (accessed on 3 February 2025).
  65. Karas, B.; Qu, S.; Xu, Y.; Zhu, Q. Experiments with LDA and Top2Vec for embedded topic discovery on social media data—A case study of cystic fibrosis. Front. Artif. Intell. 2022, 5, 948313. [Google Scholar] [CrossRef]
  66. Röder, M.; Both, A.; Hinneburg, A. Exploring the space of topic coherence measures. In Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM 2015), Shanghai, China, 31 January–6 February 2015; pp. 399–408. [Google Scholar] [CrossRef]
  67. Doogan, C.; Buntine, W. Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021), Mexico City, Mexico, 6–11 June 2021; pp. 3824–3848. [Google Scholar] [CrossRef]
  68. Czyż, P.; Grabowski, F.; Vogt, J.E.; Beerenwinkel, N.; Marx, A. On the Properties and Estimation of Pointwise Mutual Information Profiles. arXiv 2023, arXiv:2310.10240. [Google Scholar]
Figure 1. Comprehensive overview of the proposed topic refinement framework.
Figure 2. Topic coherence improvement by candidate word replacement: WordNet, GPT, and combined approaches.
Table 1. Dataset summary.

| Dataset | Description | Content Type | Average Length (Words) | Size (Documents/Articles) |
|---|---|---|---|---|
| AGNews | News articles across major topics | News articles | 30 | 120,000 |
| SMS Spam | Messages labeled as spam or ham | Short text messages | 10–20 | 5574 |
| TagMyNews | English news articles | News headlines | 15–20 | 32,000 |
| Yahoo Answers | User-generated Q&A | Question and answer pairs | 100 | 1,400,000 |
| 20Newsgroup | Newsgroup posts across 20 topics | Full news posts and threads | 200 | 18,000 |
| Kaggle's Research Article | Research articles for topic modeling exercises | Title and abstract of research articles | 200 | 20,972 |
Table 2. Summary of topic modeling techniques.

| Aspect | LDA | NMF | BERTopic | G-BAT |
|---|---|---|---|---|
| Type of Model | Probabilistic | Matrix decomposition | Neural (embedding + clustering) | Neural (VAE + adversarial) |
| Input Representation | Bag of words | TF-IDF matrix | Contextual embeddings | Pre-trained embeddings |
| Output | Topics as word distributions; document-topic proportions | Topics as word distributions; document-topic matrices | Topics as ranked words based on embeddings and clustering | Topics as latent Gaussian distributions; document embeddings |
| Topic Representation | Topic-word | Topic-word | Cluster centers and their representative words | Latent space clusters |
| Strength | Easy to implement; interpretable | Easy to implement; fast and scalable | Captures semantic relationships; dynamic topic reduction | Captures complex latent patterns; robust through adversarial learning |
| Weakness | Loses word order; struggles with short or sparse text | Loses contextual relationships; requires TF-IDF preprocessing | Computationally expensive; depends on embedding quality | Computationally expensive; complex to train |
| Best Use Cases | Long documents; large datasets | Traditional NLP pipelines; moderate-size datasets | Semantic topic modeling; dynamic topic reduction | Complex patterns in data; short texts with sparse information |
| Application | News categorization; large-scale document analysis | Academic research; product reviews; survey analysis | Social media analysis; short-text classification | Customer feedback analysis; domain-specific document analysis |
Table 3. Cosine similarity and Z-score.

| Word | Cosine Similarity | Z-Score |
|---|---|---|
| Word A | 0.85 | 0.40 |
| Word B | 0.87 | 0.50 |
| Word C | 0.83 | 0.30 |
| Word D | 0.90 | 0.65 |
| Word E | 0.40 | −1.85 |
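The Z-scores in Table 3 presumably follow the standard standardization of each word's cosine similarity to the topic centroid, where μ and σ are the mean and standard deviation of the similarities within the topic:

```latex
% Z-score of word w_i's cosine similarity to the topic centroid c;
% \mu and \sigma are the mean and standard deviation of the
% similarities within the topic.
\[ z_i = \frac{\operatorname{sim}(w_i, c) - \mu}{\sigma} \]
```

With the illustrative values in Table 3, Word E's similarity of 0.40 lies far below the column mean, producing the strongly negative score (−1.85) that flags it as misaligned, while Words A–D remain close to the centroid.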
Table 4. Statistical analysis of coherence score improvement.

| Model | t-Statistic | p-Value |
|---|---|---|
| LDA | −2.723 | 0.042 |
| BERTopic | −3.491 | 0.017 |
| G-BAT | −3.251 | 0.023 |
| NMF | −2.318 | 0.068 |
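Table 4 is consistent with a paired t-test over each model's six per-dataset coherence scores before and after refinement; the negative t-statistics reflect that pre-refinement scores are lower. As a check, the sketch below, assuming SciPy, pairs the LDA scores from Tables 5–10 and reproduces the LDA row of Table 4.

```python
# Minimal sketch of a paired t-test over coherence scores: the six
# LDA before/after values from Tables 5-10 serve as a worked example.
from scipy.stats import ttest_rel

before = [0.591, 0.336, 0.485, 0.583, 0.461, 0.526]
after  = [0.617, 0.431, 0.503, 0.602, 0.487, 0.544]

t_stat, p_value = ttest_rel(before, after)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # t = -2.723, p = 0.042
```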
Table 5. Coherence score for AGNews dataset.

| Model | Before Refinement | After Refinement | Improvement (%) |
|---|---|---|---|
| LDA | 0.591 | 0.617 | 4.40 |
| BERTopic | 0.897 | 0.901 | 0.45 |
| G-BAT | 0.453 | 0.471 | 3.97 |
| NMF | 0.771 | 0.790 | 2.45 |
Table 6. Coherence score for TagMyNews dataset.

| Model | Before Refinement | After Refinement | Improvement (%) |
|---|---|---|---|
| LDA | 0.336 | 0.431 | 28.27 |
| BERTopic | 0.539 | 0.572 | 6.12 |
| G-BAT | 0.646 | 0.650 | 0.62 |
| NMF | 0.589 | 0.604 | 2.55 |
Table 7. Coherence score for Yahoo Answers dataset.

| Model | Before Refinement | After Refinement | Improvement (%) |
|---|---|---|---|
| LDA | 0.485 | 0.503 | 3.71 |
| BERTopic | 0.706 | 0.745 | 5.52 |
| G-BAT | 0.468 | 0.492 | 5.13 |
| NMF | 0.564 | 0.581 | 3.01 |
Table 8. Coherence score for the Newsgroup dataset.

| Model | Before Refinement | After Refinement | Improvement (%) |
|---|---|---|---|
| LDA | 0.583 | 0.602 | 3.26 |
| BERTopic | 0.823 | 0.839 | 1.94 |
| G-BAT | 0.209 | 0.293 | 40.19 |
| NMF | 0.743 | 0.743 | 0.00 |
Table 9. Coherence score for the SMS Spam dataset.

| Model | Before Refinement | After Refinement | Improvement (%) |
|---|---|---|---|
| LDA | 0.461 | 0.487 | 5.64 |
| BERTopic | 0.506 | 0.552 | 9.09 |
| G-BAT | 0.494 | 0.570 | 15.38 |
| NMF | 0.427 | 0.483 | 13.11 |
Table 10. Coherence score for the Science Article dataset.

| Model | Before Refinement | After Refinement | Improvement (%) |
|---|---|---|---|
| LDA | 0.526 | 0.544 | 3.42 |
| BERTopic | 0.731 | 0.740 | 1.23 |
| G-BAT | 0.265 | 0.341 | 28.68 |
| NMF | 0.614 | 0.619 | 0.81 |
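The before/after values in Tables 5–10 are C_v coherence scores. The sketch below shows how such scores can be computed with Gensim's CoherenceModel; the tiny tokenized corpus and topics are illustrative assumptions, not the study's actual data.

```python
# Minimal sketch of C_v coherence scoring with Gensim, assuming a
# tokenized reference corpus; the toy corpus and topics are illustrative.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

texts = [["cricket", "wicket", "test", "australia"],
         ["profit", "sale", "share", "percent"],
         ["election", "president", "senator", "state"]]
dictionary = Dictionary(texts)

topics = [["cricket", "wicket", "test"],
          ["profit", "sale", "share"]]

cm = CoherenceModel(topics=topics, texts=texts,
                    dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())  # single C_v score averaged over topics
```

Computing this score on the same topic lists before and after refinement yields the percentage improvements reported in the tables above.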
Table 11. Extracted and refined topics and their corresponding replacements.

| Dataset | Model | Extracted Topic | Refined Topic | Misaligned Word | Replacement Word |
|---|---|---|---|---|---|
| AGNews | LDA | year, u, sale, percent, share, cut, inc, profit, china, report | sales_event, u, sale, percent, share, cut, inc, profit, china, report | year | sales_event |
| AGNews | NMF | president, bush, state, afp, election, united, Kerry, talk, john, nuclear | president, bush, state, senator, election, united, Kerry, talk, john, nuclear | afp | senator |
| AGNews | BERTopic | tendulkar, test, sachin, cricket, zealand, Australia, wicket, Nagpur, ponting, mcgrath | trial_run, test, sachin, cricket, zealand, Australia, wicket, nagpur, ponting, mcgrath | tendulkar | trial_run |
| AGNews | G-BAT | bond, course, sale, poor, chief, charley, low, bay, coming, pick | bond, course, sale, poor, quest, charley, low, bay, coming, pick | chief | quest |
| TagMyNews | LDA | world, u, year, job, court, star, coach, musical, john, wednesday | world, planet, year, job, court, star, coach, musical, john, earth | u, wednesday | planet, earth |
| TagMyNews | NMF | japan, nuclear, earthquake, plant, crisis, tsunami, radiation, stock, power, quake | japan, nuclear, earthquake, ionizing radiation, crisis, tsunami, radiation, stock, power, quake | plant | ionizing radiation |
| TagMyNews | BERTopic | trial, jury, insider, rajaratnam, guilty, former, blagojevich, prosecutor, lawyer, accused | trial, jury, insider, rajaratnam, guilty, former, prosecuting_officer, prosecutor, lawyer, accused | blagojevich | prosecuting_officer |
| TagMyNews | G-BAT | yankee, south, focus, abidjan, shuttle, stake, Bahrain, wont, coach, nuclear | yankee, south, focus, center, shuttle, stake, Bahrain, wont, coach, nuclear | abidjan | center |
| Yahoo Answers | LDA | range, x, water, b, weight, size, test, running, speed, force | range, x, water, mass, weight, size, test, running, speed, force | b | mass |
| Yahoo Answers | NMF | help, thanks, plz, problem, tried, yahoo, appreciated, site, computer | help, thanks, lend a hand, problem, tried, yahoo, appreciated, site, computer | plz | lend a hand |
| Yahoo Answers | BERTopic | guy, friend, love, girl, relationship, talk, boyfriend, together, he, married | guy, friend, love, girl, relationship, talk, boyfriend, together, he, young_man | married | young_man |
| Yahoo Answers | G-BAT | ability, mac, common, test, time, shes, running, medicine, deal, maybe | ability, mac, common, test, time, trade, running, medicine, deal, maybe | shes | trade |
| Newsgroup | LDA | line, subject, organization, writes, article, like, one, dont, would, get | line, subject, organization, writes, article, like, one, pay_back, would, get | dont | pay_back |
| Newsgroup | NMF | window, file, program, problem, use, application, using, manager, run, server | window, file, program, problem, use, application, using, software, run, server | manager | software |
| Newsgroup | BERTopic | printer, font, print, deskjet, hp, laser, ink, bubblejet, bj, atm | printer, font, print, deskjet, hp, laser, ink, bubblejet, laser printer, atm | bj | laser printer |
| Newsgroup | G-BAT | drive, matthew, file, dead, clipper, ride, pat, drug, tax, manager | drive, matthew, file, dead, repulse, ride, pat, drug, tax, manager | clipper | repulse |
| SMS Spam | LDA | number, urgent, show, prize, send, claim, u, message, contact, sent | number, urgent, show, correspondence, send, claim, u, message, contact, sent | prize | correspondence |
| SMS Spam | NMF | ill, later, sorry, meeting, yeah, aight, tonight, right, meet, thing | ill, later, sorry, meeting, yeah, match, tonight, right, meet, thing | aight | match |
| SMS Spam | BERTopic | lunch, dinner, eat, food, pizza, hungry, weight, eating, lor, menu | lunch, dinner, eat, food, pizza, hungry, weight, eating, selection, menu | lor | selection |
| SMS Spam | G-BAT | abiola, loving, ltgt, player, cool, later, big, waiting, regard, dude | abiola, loving, bed, player, cool, later, big, waiting, regard, dude | ltgt | bed |
| Science Article | LDA | state, system, phase, quantum, transition, field, magnetic, interaction, spin, energy | state, system, phase, quantum, transition, changeover, magnetic, interaction, spin, energy | field | changeover |
| Science Article | NMF | learning, deep, task, training, machine, model, feature, neural, classification, representation | learning, deep, task, training, machine, model, feature, neural, train, representation | classification | train |
| Science Article | BERTopic | logic, program, language, semantic, automaton, proof, calculus, verification | logic, program, language, semantic, reasoning, proof, calculus, verification | automaton | reasoning |
| Science Article | G-BAT | graph, space, constraint, site, integer, logic, frame, patient, diffusion, clustering | graph, space, constraint, site, integer, logic, frame, patient, diffusion, dispersal | clustering | dispersal |
Note: Misaligned words appearing in the Extracted Topic column and the replacement words appearing in the Refined Topic column are listed explicitly in the last two columns.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
