TR-GPT-CF: A Topic Refinement Method Using GPT and Coherence Filtering
Abstract
1. Introduction and State of the Art
2. Materials and Methods
- Removing high-frequency words: Identify and remove the most frequent words in the corpus, based on TF-IDF scores. This procedure prevents the model from focusing on very common words (e.g., “the” and “and”).
- Eliminating repetitive patterns: Eliminate redundant sequences such as “hahahaha” or “hihihhihi” that contribute unnecessary noise to the corpus. This procedure uses regular expressions to replace any character repeated three or more times with a single occurrence.
- Removing non-alphabetic characters: use regular expressions to strip digits, punctuation, and other non-alphabetic characters.
- Eliminating very short documents: exclude documents that are too brief, or that consist solely of nonsensical content, as these are unlikely to be meaningful for topic modeling. A code sketch of these pre-processing steps follows Algorithm 1 below.
Algorithm 1. Pre-processing

Input: Corpus D, stop words SW, minimum word count min_word_count
Output: Preprocessed corpus Dprocessed

1: for each document dm ∈ D do
2:    convert dm to lowercase
3:    remove non-alphabetic characters from dm
4:    remove repetitive patterns from dm
5:    tokenize dm into words
6:    lemmatize each word w
7:    remove stop words w ∈ SW
8: end for
9: // compute high-frequency words
10: vectorize D using TF-IDF to obtain BoW
11: calculate word frequencies freq(w) for all words w ∈ BoW
12: identify the most frequent words HF
13: remove high-frequency words w ∈ HF
14: // filter short or empty documents
15: remove dm if |words(dm)| < min_word_count
16: return preprocessed corpus Dprocessed
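For concreteness, the following is a minimal Python sketch of Algorithm 1. The choice of NLTK for lemmatization and stop words, scikit-learn for TF-IDF, and the cutoffs min_word_count and top_n_high_freq are illustrative assumptions, not the exact configuration used in the experiments.

```python
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(corpus, min_word_count=3, top_n_high_freq=50):
    """Sketch of Algorithm 1: clean, tokenize, lemmatize, and filter a corpus."""
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    cleaned = []
    for doc in corpus:
        doc = doc.lower()
        doc = re.sub(r"[^a-z\s]", " ", doc)      # keep alphabetic characters only
        doc = re.sub(r"(.)\1{2,}", r"\1", doc)   # collapse characters repeated 3+ times
        tokens = [lemmatizer.lemmatize(w) for w in doc.split() if w not in stop_words]
        cleaned.append(tokens)

    # Identify the most frequent words from the TF-IDF vocabulary.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(" ".join(tokens) for tokens in cleaned)
    scores = tfidf.sum(axis=0).A1                # aggregated score per vocabulary word
    vocab = vectorizer.get_feature_names_out()
    high_freq = {vocab[i] for i in scores.argsort()[::-1][:top_n_high_freq]}

    # Remove high-frequency words, then drop documents that became too short.
    processed = [[w for w in tokens if w not in high_freq] for tokens in cleaned]
    return [tokens for tokens in processed if len(tokens) >= min_word_count]
```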
Algorithm 2. TR-GPT-CF

Input: A set of topics T = {t1, t2, …, tK}, embedding model M, corpus C, large language model L, WordNet W, Z-score threshold θz, inverse document frequency threshold θf
Output: Refined topics T′ = {t′1, t′2, …, t′K}

1: initialize the set of refined topics T′ ← Ø
2: for each topic ti ∈ T do
3:    initialize the refined topic t′i ← ti
4:    compute the topic centroid c ← mean(M(ti)) from the word embeddings of ti
5:    for each word wij ∈ t′i do
6:       compute the cosine similarity sij between M(wij) and the centroid c
7:       compute the Z-score zij of sij over the similarities s
8:       compute the IDF value IDF(wij, C)
9:       if zij < θz and IDF(wij, C) > θf then
10:          mark wij as a misaligned word wmisaligned
11:       end if
12:    end for
13:    for each detected misaligned word wmisaligned do
14:       initialize the candidate set ← Ø
15:       select wc, the word most similar to the centroid c
16:       retrieve hypernyms and hyponyms of wc from WordNet W (candidates WK)
17:       prompt L to provide alternatives for wc in the context of ti (candidates LK)
18:       combine all candidates: WK ∪ LK
19:       calculate the coherence score of each candidate replacement
20:       if the best replacement improves the overall coherence score then
21:          replace wmisaligned in t′i with whighest_coherence
22:       else
23:          retain wmisaligned
24:       end if
25:    end for
26:    update the refined topic t′i; repeat until no further improvement in coherence is observed
27: end for
28: return the set of refined topics T′
2.1. Misaligned Word Detection
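As a minimal sketch, the detection rule of Algorithm 2 (lines 5-12) can be written in a few lines of NumPy. Here embed stands for the embedding model M, corpus_tokens for the tokenized corpus C, and the threshold values theta_z and theta_f are illustrative placeholders rather than the paper's tuned settings.

```python
import numpy as np

def detect_misaligned(topic_words, embed, corpus_tokens, theta_z=-1.5, theta_f=2.0):
    """Flag topic words whose centroid similarity is a low outlier (Z-score)
    and whose IDF is high enough that they are not merely common words."""
    vectors = np.array([embed(w) for w in topic_words])
    centroid = vectors.mean(axis=0)

    # Cosine similarity of each word embedding to the topic centroid.
    sims = vectors @ centroid / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    )
    z_scores = (sims - sims.mean()) / sims.std()

    # Smoothed inverse document frequency over the reference corpus.
    n_docs = len(corpus_tokens)
    def idf(word):
        df = sum(1 for doc in corpus_tokens if word in doc)
        return np.log((1 + n_docs) / (1 + df)) + 1

    return [w for w, z in zip(topic_words, z_scores) if z < theta_z and idf(w) > theta_f]
```

With these placeholder thresholds, a word whose similarity Z-score is −1.85 (such as Word E in the example table later in the paper) would be flagged, while the remaining words are retained.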
2.2. Misaligned Word Replacement
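The replacement stage draws candidates from two sources, WordNet relations and GPT suggestions, as in lines 14-18 of Algorithm 2. The sketch below is a hedged illustration: the prompt wording and the model name gpt-4o-mini are assumptions, not the authors' exact prompt or model configuration.

```python
from nltk.corpus import wordnet as wn
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def wordnet_candidates(anchor_word):
    """Hypernym and hyponym lemmas of the most centroid-aligned topic word."""
    candidates = set()
    for synset in wn.synsets(anchor_word):
        for related in synset.hypernyms() + synset.hyponyms():
            candidates.update(lemma.name().lower() for lemma in related.lemmas())
    return candidates

def gpt_candidates(anchor_word, topic_words, n=5):
    """Ask the LLM for in-context alternatives (illustrative prompt)."""
    prompt = (
        f"Given the topic words: {', '.join(topic_words)}. "
        f"Suggest {n} single words related to '{anchor_word}' that fit this topic. "
        f"Answer with a comma-separated list only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return {w.strip().lower() for w in response.choices[0].message.content.split(",")}
```

Each candidate from the combined pool then provisionally replaces the misaligned word, the modified topic is re-scored with a coherence measure such as CV (for example via gensim's CoherenceModel), and the highest-scoring replacement is kept only if it improves on the original topic; otherwise the word is retained, mirroring lines 19-24 of Algorithm 2.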
3. Results and Discussion
3.1. Evaluating Topic Coherence Improvement Across Datasets
3.1.1. Improvement of Topic Coherence in AGNews Dataset
3.1.2. Improvement of Topic Coherence in TagMyNews Dataset
3.1.3. Improvement of Topic Coherence in Yahoo Answers Dataset
3.1.4. Improvement of Topic Coherence in the Newsgroup Dataset
3.1.5. Improvement of Topic Coherence in the SMS Spam Dataset
3.1.6. Improvement of Topic Coherence in Science Article Dataset
3.2. Evaluating Topic Coherence Improvements by Candidate Word Replacement: WordNet, GPT, and Combined Approaches
3.3. Evaluating Topic Coherence by Qualitative Comparison
3.4. Limitations
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Dinsa, E.F.; Das, M.; Abebe, T.U. A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information. Sci. Rep. 2024, 14, 32051.
- Romero, J.D.; Feijoo-Garcia, M.A.; Nanda, G.; Newell, B.; Magana, A.J. Evaluating the Performance of Topic Modeling Techniques with Human Validation to Support Qualitative Analysis. Big Data Cogn. Comput. 2024, 8, 132.
- Williams, L.; Anthi, E.; Arman, L.; Burnap, P. Topic Modelling: Going beyond Token Outputs. Big Data Cogn. Comput. 2024, 8, 44.
- Taghandiki, K.; Mohammadi, M. Topic Modeling: Exploring the Processes, Tools, Challenges and Applications. Authorea Prepr. 2023. Available online: https://www.authorea.com/users/689415/articles/682028-topic-modeling-exploring-the-processes-tools-challenges-and-applications (accessed on 26 November 2024).
- Meddeb, A.; Romdhane, L.B. Using Topic Modeling and Word Embedding for Topic Extraction in Twitter. Procedia Comput. Sci. 2022, 207, 790–799.
- Li, H.; Qian, Y.; Jiang, Y.; Liu, Y.; Zhou, F. A novel label-based multimodal topic model for social media analysis. Decis. Support Syst. 2023, 164, 113863.
- Zankadi, H.; Idrissi, A.; Daoudi, N.; Hilal, I. Identifying learners’ topical interests from social media content to enrich their course preferences in MOOCs using topic modeling and NLP techniques. Educ. Inf. Technol. 2023, 28, 5567–5584.
- Li, S.; Xie, Z.; Chiu, D.K.W.; Ho, K.K.W. Sentiment Analysis and Topic Modeling Regarding Online Classes on the Reddit Platform: Educators versus Learners. Appl. Sci. 2023, 13, 2250.
- Rijcken, E.; Kaymak, U.; Scheepers, F.; Mosteiro, P.; Zervanou, K.; Spruit, M. Topic Modeling for Interpretable Text Classification from EHRs. Front. Big Data 2022, 5, 846930.
- Somani, S.; van Buchem, M.M.; Sarraju, A.; Hernandez-Boussard, T.; Rodriguez, F. Artificial Intelligence-Enabled Analysis of Statin-Related Topics and Sentiments on Social Media. JAMA Netw. Open 2023, 6, e239747.
- Rahimi, H.; Mimno, D.; Hoover, J.L.; Naacke, H.; Constantin, C.; Amann, B. Contextualized Topic Coherence Metrics. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL, Dubrovnik, Croatia, 2–6 May 2023; pp. 1760–1773. Available online: https://arxiv.org/abs/2305.14587v1 (accessed on 12 December 2024).
- Li, Y.; Yang, A.Y.; Marelli, A.; Li, Y. MixEHR-SurG: A joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records. J. Biomed. Inform. 2024, 153, 104638.
- Boyd-Graber, J.; Hu, Y.; Mimno, D. Applications of Topic Models. Found. Trends® Inf. Retr. 2017, 11, 143–296.
- Chakkarwar, V.A.; Tamane, S.C. Information Retrieval Using Effective Bigram Topic Modeling. In Proceedings of the International Conference on Applications of Machine Intelligence and Data Analytics (ICAMIDA 2022), Aurangabad, India, 22–24 December 2022; pp. 784–791.
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
- Ozyurt, O.; Özköse, H.; Ayaz, A. Evaluating the latest trends of Industry 4.0 based on LDA topic model. J. Supercomput. 2024, 80, 19003–19030.
- Blei, D.M. Probabilistic topic models. Commun. ACM 2012, 55, 77–84.
- Bystrov, V.; Naboka-Krell, V.; Staszewska-Bystrova, A.; Winker, P. Choosing the Number of Topics in LDA Models—A Monte Carlo Comparison of Selection Criteria. J. Mach. Learn. Res. 2024, 25, 1–30. Available online: http://jmlr.org/papers/v25/23-0188.html (accessed on 1 February 2025).
- Liu, L.; Tang, L.; Dong, W.; Yao, S.; Zhou, W. An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 2016, 5, 1608.
- Chen, W.; Rabhi, F.; Liao, W.; Al-Qudah, I. Leveraging State-of-the-Art Topic Modeling for News Impact Analysis on Financial Markets: A Comparative Study. Electronics 2023, 12, 2605.
- Papadia, G.; Pacella, M.; Perrone, M.; Giliberti, V. A Comparison of Different Topic Modeling Methods through a Real Case Study of Italian Customer Care. Algorithms 2023, 16, 94.
- Li, P.; Tseng, C.; Zheng, Y.; Chen, J.A.; Huang, L.; Jarman, B.; Needell, D. Guided Semi-Supervised Non-Negative Matrix Factorization. Algorithms 2022, 15, 136.
- Blei, D.M.; Lafferty, J.D. A correlated topic model of Science. Ann. Appl. Stat. 2007, 1, 17–35.
- Syahrial, S.; Perucha, R.; Afidh, F. Fine-Tuning Topic Modelling: A Coherence-Focused Analysis of Correlated Topic Models. Infolitika J. Data Sci. 2024, 2, 82–87.
- Fang, Z.; He, Y.; Procter, R. BERTTM: Leveraging Contextualized Word Embeddings from Pre-trained Language Models for Neural Topic Modeling. arXiv 2023, arXiv:2305.09329.
- Bewong, M.; Wondoh, J.; Kwashie, S.; Liu, J.; Liu, L.; Li, J.; Islam, M.Z.; Kernnot, D. DATM: A Novel Data Agnostic Topic Modeling Technique with Improved Effectiveness for Both Short and Long Text. IEEE Access 2023, 11, 32826–32841.
- Hoyle, A.; Goel, P.; Hian-Cheong, A.; Peskov, D.; Boyd-Graber, J.; Resnik, P. Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence. Adv. Neural Inf. Process. Syst. 2021, 34, 2018–2033.
- Marani, A.H.; Baumer, E.P.S. A Review of Stability in Topic Modeling: Metrics for Assessing and Techniques for Improving Stability. ACM Comput. Surv. 2023, 56, 108.
- Kapoor, S.; Gil, A.; Bhaduri, S.; Mittal, A.; Mulkar, R. Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling. arXiv 2024, arXiv:2409.15626.
- Geeganage, D.K.; Xu, Y.; Li, Y. A Semantics-enhanced Topic Modelling Technique: Semantic-LDA. ACM Trans. Knowl. Discov. Data 2024, 18, 93.
- Li, R.; González-Pizarro, F.; Xing, L.; Murray, G.; Carenini, G. Diversity-Aware Coherence Loss for Improving Neural Topic Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 2, pp. 1710–1722.
- Lewis, C.M.; Grossetti, F. A Statistical Approach for Optimal Topic Model Identification. J. Mach. Learn. Res. 2022, 23, 1–20. Available online: http://jmlr.org/papers/v23/19-297.html (accessed on 2 February 2025).
- Chang, J.; Gerrish, S.; Wang, C.; Boyd-Graber, J.; Blei, D. Reading Tea Leaves: How Humans Interpret Topic Models. Adv. Neural Inf. Process. Syst. 2009, 22, 288–296.
- Lee, T.Y.; Smith, A.; Seppi, K.; Elmqvist, N.; Boyd-Graber, J.; Findlater, L. The human touch: How non-expert users perceive, interpret, and fix topic models. Int. J. Hum. Comput. Stud. 2017, 105, 28–42.
- El-Assady, M.; Kehlbeck, R.; Collins, C.; Keim, D.; Deussen, O. Semantic concept spaces: Guided topic model refinement using word-embedding projections. IEEE Trans. Vis. Comput. Graph. 2020, 26, 1001–1011.
- Sperrle, F.; Schäfer, H.; Keim, D.; El-Assady, M. Learning Contextualized User Preferences for Co-Adaptive Guidance in Mixed-Initiative Topic Model Refinement. Comput. Graph. Forum 2021, 40, 215–226.
- Rehman, K.M.H.U.; Wakabayashi, K. Keyphrase-based Refinement Functions for Efficient Improvement on Document-Topic Association in Human-in-the-Loop Topic Models. J. Inf. Process. 2023, 31, 353–364.
- Chang, S.; Wang, R.; Ren, P.; Huang, H. Enhanced Short Text Modeling: Leveraging Large Language Models for Topic Refinement. arXiv 2024, arXiv:2403.17706.
- News-Classification Training Data (train_data.csv). Available online: https://github.com/vijaynandwani/News-Classification/blob/master/train_data.csv (accessed on 5 December 2024).
- SMS Spam Collection Dataset. Available online: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset (accessed on 5 December 2024).
- Topic Modeling for Research Articles. Available online: https://www.kaggle.com/datasets/blessondensil294/topic-modeling-for-research-articles?select=train.csv (accessed on 5 December 2024).
- Anschütz, M.; Eder, T.; Groh, G. Retrieving Users’ Opinions on Social Media with Multimodal Aspect-Based Sentiment Analysis. arXiv 2022, arXiv:2210.15377.
- Wu, X.; Li, C.; Zhu, Y.; Miao, Y. Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1772–1782.
- Wu, X.; Luu, A.T.; Dong, X. Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2748–2760.
- Garewal, I.K.; Jha, S.; Mahamuni, C.V. Topic Modeling for Identifying Emerging Trends on Instagram Using Latent Dirichlet Allocation and Non-Negative Matrix Factorization. In Proceedings of the 10th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 14–15 March 2024; pp. 1103–1110.
- Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794.
- Wang, R.; Hu, X.; Zhou, D.; He, Y.; Xiong, Y.; Ye, C.; Xu, H. Neural Topic Modeling with Bidirectional Adversarial Training. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 340–350.
- Rieger, J.; Jentsch, C.; Rahnenführer, J. RollingLDA: An Update Algorithm of Latent Dirichlet Allocation to Construct Consistent Time Series from Textual Data. In Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 2337–2347.
- Vendrow, J.; Haddock, J.; Rebrova, E.; Needell, D. On a guided nonnegative matrix factorization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3265–3269.
- Nugumanova, A.; Alzhanov, A.; Mansurova, A.; Rakhymbek, K.; Baiburin, Y. Semantic Non-Negative Matrix Factorization for Term Extraction. Big Data Cogn. Comput. 2024, 8, 72.
- Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41.
- Zotova, E.; Cuadros, M.; Rigau, G. Towards the Integration of WordNet into ClinIDMap. 2023. Available online: https://aclanthology.org/2023.gwc-1.42/ (accessed on 3 February 2025).
- API Platform | OpenAI. Available online: https://openai.com/api/ (accessed on 17 December 2024).
- Wood, J.; Arnold, C.; Wang, W. A Bayesian Topic Model for Human-Evaluated Interpretability. 2022. Available online: https://aclanthology.org/2022.lrec-1.674/ (accessed on 9 February 2025).
- Thielmann, A.; Reuter, A.; Seifert, Q.; Bergherr, E.; Säfken, B. Topics in the Haystack: Enhancing Topic Quality through Corpus Expansion. Comput. Linguist. 2024, 50, 619–655.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. Available online: https://arxiv.org/abs/1810.04805v2 (accessed on 17 December 2024).
- Deb, S.; Chanda, A.K. Comparative analysis of contextual and context-free embeddings in disaster prediction from Twitter data. Mach. Learn. Appl. 2022, 7, 100253.
- Stankevičius, L.; Lukoševičius, M. Extracting Sentence Embeddings from Pretrained Transformer Models. Appl. Sci. 2024, 14, 8887.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Systems Demonstrations (EMNLP 2020), Online, 16–20 November 2020; pp. 38–45.
- Domanski, P.D. Statistical outlier labelling: A comparative study. In Proceedings of the 7th International Conference on Control, Decision and Information Technologies (CoDIT 2020), Prague, Czech Republic, 29 June–2 July 2020; pp. 439–444.
- Yaro, A.S.; Maly, F.; Prazak, P. Outlier Detection in Time-Series Receive Signal Strength Observation Using Z-Score Method with Sn Scale Estimator for Indoor Localization. Appl. Sci. 2023, 13, 3900.
- Menéndez-García, L.A.; García-Nieto, P.J.; García-Gonzalo, E.; Lasheras, F.S.; Álvarez-de-Prado, L.; Bernardo-Sánchez, A. Method for the Detection of Functional Outliers Applied to Quality Monitoring Samples in the Vicinity of El Musel Seaport in the Metropolitan Area of Gijón (Northern Spain). Mathematics 2023, 11, 2631.
- Choi, J.; Jung, E.; Lim, S.; Rhee, W. Finding Inverse Document Frequency Information in BERT. arXiv 2022, arXiv:2202.12191.
- openai/openai-python: Release v1.55.3. Available online: https://github.com/openai/openai-python/releases/tag/v1.55.3 (accessed on 3 February 2025).
- Karas, B.; Qu, S.; Xu, Y.; Zhu, Q. Experiments with LDA and Top2Vec for embedded topic discovery on social media data—A case study of cystic fibrosis. Front. Artif. Intell. 2022, 5, 948313.
- Röder, M.; Both, A.; Hinneburg, A. Exploring the space of topic coherence measures. In Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM 2015), Shanghai, China, 31 January–6 February 2015; pp. 399–408.
- Doogan, C.; Buntine, W. Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021), Mexico City, Mexico, 6–11 June 2021; pp. 3824–3848.
- Czyż, P.; Grabowski, F.; Vogt, J.E.; Beerenwinkel, N.; Marx, A. On the Properties and Estimation of Pointwise Mutual Information Profiles. arXiv 2023, arXiv:2310.10240.
Overview of the datasets used in the experiments:

Dataset | Description | Content Type | Average Length (Words) | Size (Documents)
---|---|---|---|---
AGNews | News articles across major topics | News articles | 30 | 120,000
SMS Spam | Messages labeled as spam or ham | Short text messages | 10–20 | 5574
TagMyNews | English news articles | News headlines | 15–20 | 32,000
Yahoo Answers | User-generated Q&A | Question and answer pairs | 100 | 1,400,000
20Newsgroup | Newsgroup posts across 20 topics | Full news posts and threads | 200 | 18,000
Kaggle’s Research Article | Research articles for topic modeling exercises | Titles and abstracts of research articles | 200 | 20,972
Comparison of topic modeling approaches:

Aspect | LDA | NMF | BERTopic | G-BAT
---|---|---|---|---
Type of Model | Probabilistic | Matrix Decomposition | Neural (Embedding + Clustering) | Neural (VAE + Adversarial)
Input Representation | Bag of Words | TF-IDF Matrix | Contextual Embeddings | Pre-trained Embeddings
Output | | | |
Topic Representation | Topic-word | Topic-word | Cluster centers and their representative words | Latent space clusters
Strength | | | |
Weakness | | | |
Best Use Cases | | | |
Application | | | |
Example of centroid cosine similarities and Z-scores for the words of a single topic:

Word | Cosine Similarity | Z-Score
---|---|---|
Word A | 0.85 | 0.40 |
Word B | 0.87 | 0.50 |
Word C | 0.83 | 0.30 |
Word D | 0.90 | 0.65 |
Word E | 0.40 | −1.85 |
Statistical significance of the coherence improvements per model (t-test):

Model | t-Statistic | p-Value
---|---|---|
LDA | −2.723 | 0.042 |
BERTopic | −3.491 | 0.017 |
G-BAT | −3.251 | 0.023 |
NMF | −2.318 | 0.068 |
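Assuming these statistics come from paired t-tests over matched coherence scores before and after refinement (the usual setup for before/after comparisons), they can be reproduced with SciPy; the score vectors below are illustrative stand-ins, not the paper's data.

```python
from scipy import stats

# Illustrative per-topic coherence scores before and after refinement.
coherence_before = [0.52, 0.58, 0.61, 0.55, 0.60, 0.57]
coherence_after = [0.56, 0.60, 0.62, 0.59, 0.64, 0.58]

t_stat, p_value = stats.ttest_rel(coherence_before, coherence_after)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

The negative t-statistics in the table are consistent with this orientation: when the post-refinement scores exceed the pre-refinement scores, ttest_rel(before, after) yields t < 0.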
Coherence scores before and after refinement on the AGNews dataset:

Model | Before Refinement | After Refinement | Improvement (%)
---|---|---|---|
LDA | 0.591 | 0.617 | 4.40 |
BERTopic | 0.897 | 0.901 | 0.45 |
G-BAT | 0.453 | 0.471 | 3.97 |
NMF | 0.771 | 0.790 | 2.45 |
Coherence scores before and after refinement on the TagMyNews dataset:

Model | Before Refinement | After Refinement | Improvement (%)
---|---|---|---|
LDA | 0.336 | 0.431 | 28.27 |
BERTopic | 0.539 | 0.572 | 6.12 |
G-BAT | 0.646 | 0.650 | 0.62 |
NMF | 0.589 | 0.604 | 2.55 |
Coherence scores before and after refinement on the Yahoo Answers dataset:

Model | Before Refinement | After Refinement | Improvement (%)
---|---|---|---|
LDA | 0.485 | 0.503 | 3.71 |
BERTopic | 0.706 | 0.745 | 5.52 |
G-BAT | 0.468 | 0.492 | 5.13 |
NMF | 0.564 | 0.581 | 3.01 |
Coherence scores before and after refinement on the 20Newsgroup dataset:

Model | Before Refinement | After Refinement | Improvement (%)
---|---|---|---|
LDA | 0.583 | 0.602 | 3.26 |
BERTopic | 0.823 | 0.839 | 1.94 |
G-BAT | 0.209 | 0.293 | 40.19 |
NMF | 0.743 | 0.743 | 0.00 |
Coherence scores before and after refinement on the SMS Spam dataset:

Model | Before Refinement | After Refinement | Improvement (%)
---|---|---|---|
LDA | 0.461 | 0.487 | 5.64 |
BERTopic | 0.506 | 0.552 | 9.09 |
G-BAT | 0.494 | 0.570 | 15.38 |
NMF | 0.427 | 0.483 | 13.11 |
Coherence scores before and after refinement on the Science Article dataset:

Model | Before Refinement | After Refinement | Improvement (%)
---|---|---|---|
LDA | 0.526 | 0.544 | 3.42 |
BERTopic | 0.731 | 0.740 | 1.23 |
G-BAT | 0.265 | 0.341 | 28.68 |
NMF | 0.614 | 0.619 | 0.81 |
Examples of extracted and refined topics, with the detected misaligned words and their replacements:

Dataset | Model | Extracted Topic | Refined Topic | Misaligned Word | Replacement Word
---|---|---|---|---|---
AGNews | LDA | year, u, sale, percent, share, cut, inc, profit, china, report | sales_event, u, sale, percent, share, cut, inc, profit, china, report | year, report | sales_event
AGNews | NMF | president, bush, state, afp, election, united, kerry, talk, john, nuclear | president, bush, state, senator, election, united, kerry, talk, john, nuclear | afp | senator
AGNews | BERTopic | tendulkar, test, sachin, cricket, zealand, australia, wicket, nagpur, ponting, mcgrath | trial_run, test, sachin, cricket, zealand, australia, wicket, nagpur, ponting, mcgrath | tendulkar | trial_run
AGNews | G-BAT | bond, course, sale, poor, chief, charley, low, bay, coming, pick | bond, course, sale, poor, quest, charley, low, bay, coming, pick | chief | quest
TagMyNews | LDA | world, u, year, job, court, star, coach, musical, john, wednesday | world, planet, year, job, court, star, coach, musical, john, earth | u, wednesday | planet, earth
TagMyNews | NMF | japan, nuclear, earthquake, plant, crisis, tsunami, radiation, stock, power, quake | japan, nuclear, earthquake, ionizing radiation, crisis, tsunami, radiation, stock, power, quake | plant | ionizing radiation
TagMyNews | BERTopic | trial, jury, insider, rajaratnam, guilty, former, blagojevich, prosecutor, lawyer, accused | trial, jury, insider, rajaratnam, guilty, former, prosecuting_officer, prosecutor, lawyer, accused | blagojevich | prosecuting_officer
TagMyNews | G-BAT | yankee, south, focus, abidjan, shuttle, stake, bahrain, wont, coach, nuclear | yankee, south, focus, center, shuttle, stake, bahrain, wont, coach, nuclear | abidjan | center
Yahoo Answers | LDA | range, x, water, b, weight, size, test, running, speed, force | range, x, water, mass, weight, size, test, running, speed, force | b | mass
Yahoo Answers | NMF | help, thanks, plz, problem, tried, yahoo, appreciated, site, computer | help, thanks, lend a hand, problem, tried, yahoo, appreciated, site, computer | plz | lend a hand
Yahoo Answers | BERTopic | guy, friend, love, girl, relationship, talk, boyfriend, together, he, married | guy, friend, love, girl, relationship, talk, boyfriend, together, he, young_man | married | young_man
Yahoo Answers | G-BAT | ability, mac, common, test, time, shes, running, medicine, deal, maybe | ability, mac, common, test, time, trade, running, medicine, deal, maybe | shes | trade
Newsgroup | LDA | line, subject, organization, writes, article, like, one, dont, would, get | line, subject, organization, writes, article, like, one, pay_back, would, get | dont | pay_back
Newsgroup | NMF | window, file, program, problem, use, application, using, manager, run, server | window, file, program, problem, use, application, using, software, run, server | manager | software
Newsgroup | BERTopic | printer, font, print, deskjet, hp, laser, ink, bubblejet, bj, atm | printer, font, print, deskjet, hp, laser, ink, bubblejet, laser printer, atm | bj | laser printer
Newsgroup | G-BAT | drive, matthew, file, dead, clipper, ride, pat, drug, tax, manager | drive, matthew, file, dead, repulse, ride, pat, drug, tax, manager | clipper | repulse
SMS Spam | LDA | number, urgent, show, prize, send, claim, u, message, contact, sent | number, urgent, show, correspondence, send, claim, u, message, contact, sent | prize | correspondence
SMS Spam | NMF | ill, later, sorry, meeting, yeah, aight, tonight, right, meet, thing | ill, later, sorry, meeting, yeah, match, tonight, right, meet, thing | aight | match
SMS Spam | BERTopic | lunch, dinner, eat, food, pizza, hungry, weight, eating, lor, menu | lunch, dinner, eat, food, pizza, hungry, weight, eating, selection, menu | lor | selection
SMS Spam | G-BAT | abiola, loving, ltgt, player, cool, later, big, waiting, regard, dude | abiola, loving, bed, player, cool, later, big, waiting, regard, dude | ltgt | bed
Science Article | LDA | state, system, phase, quantum, transition, field, magnetic, interaction, spin, energy | state, system, phase, quantum, transition, changeover, magnetic, interaction, spin, energy | field | changeover
Science Article | NMF | learning, deep, task, training, machine, model, feature, neural, classification, representation | learning, deep, task, training, machine, model, feature, neural, train, representation | classification | train
Science Article | BERTopic | logic, program, language, semantic, automaton, proof, calculus, verification | logic, program, language, semantic, reasoning, proof, calculus, verification | automaton | reasoning
Science Article | G-BAT | graph, space, constraint, site, integer, logic, frame, patient, diffusion, clustering | graph, space, constraint, site, integer, logic, frame, patient, diffusion, dispersal | clustering | dispersal
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).