Article

An Innovative Approach to Topic Clustering for Social Media and Web Data Using AI

by Ioannis Kapantaidakis 1,*, Emmanouil Perakakis 1,2, George Mastorakis 1,2 and Ioannis Kopanakis 1,2

1 Department of Management Science and Technology, Hellenic Mediterranean University, 72100 Agios Nikolaos, Greece
2 Mentionlytics Ltd., 20–22 Wenlock Road, London N1 7GU, UK
* Author to whom correspondence should be addressed.
Computers 2025, 14(4), 142; https://doi.org/10.3390/computers14040142
Submission received: 12 February 2025 / Revised: 30 March 2025 / Accepted: 7 April 2025 / Published: 10 April 2025

Abstract:
The vast amount of social media and web data offers valuable insights for purposes such as brand reputation management, topic research, competitive analysis, product development, and public opinion surveys. However, analysing these data to identify patterns and extract valuable insights is challenging due to the vast number of posts, which can number in the thousands within a single day. One practical approach is topic clustering, which creates clusters of mentions that refer to a specific topic. Following this process will create several manageable clusters, each containing hundreds or thousands of posts. These clusters offer a more meaningful overview of the discussed topics, eliminating the need to categorise each post manually. Several topic detection algorithms can achieve clustering of posts, such as LDA, NMF, BERTopic, etc. The existing algorithms, however, have several important drawbacks, including language constraints and slow or resource-intensive data processing. Moreover, the labels for the clusters typically consist of a few keywords that may not make sense unless one explores the mentions within the cluster. Recently, with the introduction of AI large language models, such as GPT-4, new techniques can be realised for topic clustering to address the aforementioned issues. Our novel approach (AI Mention Clustering) employs LLMs at its core to produce an algorithm for efficient and accurate topic clustering of web and social data. Our solution was tested on social and web data and compared to the popular existing algorithm of BERTopic, demonstrating superior resource efficiency and absolute accuracy of clustered documents. Furthermore, it produces summaries of the clusters that are easily understood by humans instead of just representative keywords. This approach enhances the productivity of social and web data researchers by providing more meaningful and interpretable results.

1. Introduction

Gathering data from social media and the web has become essential for businesses and researchers, offering various applications and benefits.
Brand monitoring has become increasingly important in the digital age, with social media platforms providing a wealth of data for enterprises to analyse and improve their reputation among consumers. Cloud-based big data sentiment analysis applications can be used for brand monitoring and analysis of social media streams, allowing enterprises to detect sentiment in social posts and their influence on consumers [1]. AI-powered social media monitoring platforms can provide intelligent insights for effective online reputation management and competitor monitoring, helping digital marketers better understand customers and improve their brand’s web and social presence [2]. By leveraging these tools, companies can enhance their competitiveness and better meet consumer needs and expectations in the digital landscape.
Moreover, social media monitoring extends beyond business applications. In the healthcare sector, it has been used to track public responses to health threats, such as the COVID-19 pandemic. For example, a study in Poland used social listening tools to analyse coronavirus discussions across various social media platforms [3].
Social media listening platforms have become increasingly popular for product research and development, offering valuable insights into customer preferences, market trends, and product feedback. These AI-driven tools extract actionable information from large amounts of social media data, addressing research questions and helping develop data-backed brand strategies [4].
While the many uses of large social media and web data make them valuable, data volume and velocity pose major obstacles to social media monitoring. The massive amount of user-generated content produced daily across platforms like Facebook, X, Instagram, and YouTube creates difficulties in data storage, processing, and analysis [5]. The high dynamics and real-time aspects make effective capture and analysis difficult [6]. Additionally, new social media are rising (e.g., TikTok, Threads, Bluesky, etc.), making it even more difficult to acquire and process all these heterogeneous data from all the different sources. Media monitoring and social listening tools help greatly with collecting these data; however, they often lack advanced functionality for efficiently processing and analysing large amounts of data [7].
Topic detection refers to the clustering of different pieces of content based on the similarity of the topic they discuss. It focuses on identifying and extracting meaningful topics from large volumes of textual data, particularly news streams and social media content [8,9]. An example of this, applied in news articles, is how Google clusters multiple news sources under a news topic in Google News, so that the reader can see a list of today’s topics easily. Google News employs sophisticated topic clustering algorithms to effectively organise and present news articles [10]. If the reader is interested in more coverage of a particular news topic, they can easily see the different sources, with the news pieces about the topic, and visit the different websites to see more. This makes Google News very easy to read, allowing users to get an overview of today’s news in just a few seconds. This approach helps avoid repetitive browsing through similar materials and visiting multiple news sites’ home pages. Therefore, the clustering process is crucial for assisting users in navigating, summarising, and organising the vast amounts of textual documents available on the internet and news sources [11].
In this paper, the same principle is applied to social and web data gathered from social media listening tools. By clustering these data, users and data analysts will find it much easier to extract the information they seek more quickly and meaningfully.

2. Literature Review

Businesses and researchers often utilize brand monitoring and social listening to retrieve posts from multiple online sources. This is usually triggered by a keyword or a query related to their interests, which could include the name of a brand or specific product, a particular event, a public figure’s name, or a location, among others. These tools typically leverage APIs (application programming interfaces) provided by social media platforms to access and collect publicly available data [12]. The posts and comments collected are typically referred to as “mentions”. Depending on the popularity of the keyword, the retrieved mentions can range from just a few to even millions. Analysing social media mentions presents significant challenges due to the vast volume and dynamic nature of the data. The complexity of social media content requires human interpretation; however, the growing scale necessitates automated analysis techniques [13]. Topic detection algorithms could be very helpful in clustering mentions that refer to the same or similar topics, even from multiple social media sources (e.g., X, Instagram, Facebook, YouTube, etc.), thereby consolidating multiple posts on the same topic into a single cluster. This could save an enormous amount of time for the users of such a system, as they would not need to go through each mention separately; instead, they can quickly and easily get an overview of the topics of mentions. They could then focus on the topic clusters they are most interested in for further analysis, cutting through the noise and clutter.
There are many different approaches to topic detection and clustering. The following sections outline the main categories of these algorithms.

2.1. “Traditional” Topic Detection Algorithms

2.1.1. Bag-of-Words Based

In this category, prominent topic modelling algorithms such as Latent Dirichlet Allocation (LDA) [14], Non-negative Matrix Factorization (NMF) [15], and Latent Semantic Analysis (LSA) [16] assume a bag-of-words representation of text, thereby disregarding word order and semantic relationships. As a result, they provide topics that are less comprehensible and lack interpretability [17,18]. Furthermore, they encounter difficulties in distinguishing words that might have the same meaning (synonymy) or different meanings of the same word (polysemy), which results in mixed or inaccurate topic extraction [19]. Additionally, these algorithms perform ineffectively when processing short texts, such as social media posts, owing to the limited word availability, thereby hindering the discernment of underlying patterns [17].

2.1.2. Embedding-Based

Recent approaches in natural language processing, including BERTopic [20] and Top2Vec [21], use embeddings for text representation that offer enhanced coherence relative to Bag-of-Words based methodologies. Nevertheless, the actual representation of topics is based on Bag-of-Words and does not directly account for context, which might lead to redundancy in the words used to represent each topic. Moreover, resulting topics are presented as keyword lists that frequently lack clarity in interpretation, while certain mathematical inconsistencies within their formulations render them ineffective at eliminating stop words [22].

2.2. Using Large Language Models (LLMs)

In recent years, new methods incorporating LLMs into text clustering and topic analysis have emerged due to their rapid rise. Some studies demonstrate that LLMs can serve as an intelligent guide to improve clustering outcomes, essentially injecting domain knowledge or preferences into the process [23,24]. It has also been shown that LLMs, with appropriate prompting, can serve as an alternative to traditional topic modelling [25]. Furthermore, Miller et al. [26] used LLMs to interpret clusters generated by other methods. Their results showed that an LLM-inclusive clustering approach produced more distinctive and interpretable clusters than LDA or doc2vec, as confirmed by human review.
However, large language models (LLMs) present several challenges when applied to topic detection, particularly for large document collections. A key limitation is the restricted context length (also known as max tokens or token limit), i.e., the maximum number of tokens a model can process at once. For instance, a standard GPT-4 model has a context length of 128,000 tokens [27]. As a result, LLMs can only process a limited amount of text at once, meaning long documents must be split into chunks [28]. This chunking approach, however, potentially compromises the prevailing context, resulting in incorrect topic detection.
Moreover, using LLMs for large-scale text processing can be computationally expensive. Processing large corpora of data requires significant computational resources that incur high costs. For example, using the GPT-4 model to analyse a large dataset, such as a corpus of 10 K social media posts, can exceed USD 10 in input and output tokens. This cost can be prohibitive for many applications, especially when dealing with continuously updated datasets or real-time processing requirements.
Ongoing academic work into novel techniques, including hierarchical summarisation and memory-augmented LLMs [29,30], aims to moderate these obstacles. However, these emerging methodologies remain under development and do not eliminate the challenges associated with processing sizable amounts of documents using LLMs.
This work is distinguished from previous research by combining the strengths of traditional clustering and LLMs while moderating their weaknesses. Unlike existing methods that attempt to prompt an LLM with an entire corpus [25], we first employ a classical unsupervised clustering to split the data into coherent groups. We then apply the LLM exclusively to a small subset of representative documents from each cluster. This minimises the LLM’s context requirements and reduces costs to a fraction of what they would be for processing the full dataset. Yet, it still harnesses its powerful language understanding to generate interpretable summaries. In the next section, we detail the methodology of AI Mention Clustering, which embodies these theoretical innovations.

3. Proposed Solution

We propose a novel approach for topic detection in social media corpora that exploits the power of large language models (LLMs) while minimising the computational cost. Our method applies a clustering-based approach to group semantically similar social media documents (posts) together and then uses LLMs to analyse the clusters but only by sending a small subset of representative documents from each cluster. This allows us to efficiently and cost-effectively process large datasets of social media posts while benefiting from the LLMs’ advanced language understanding capabilities. Specifically, our approach consists of the following steps (Figure 1):
  1. Extract embeddings from web and social media documents (posts).
  2. Cluster the embeddings.
  3. Specify the representative social media documents from each cluster.
  4. Send the representatives to the LLM to extract topics and summarise.
1. Extract Embeddings
In this step, we transform the text of social media documents into sequences of numbers (i.e., vectors) called embeddings. These embeddings capture the semantic meaning of the text, with similar documents having similar vector representations (i.e., close to each other in the vector space). Various techniques and models can be used for embedding generation, from traditional methods like Word2Vec [31] and TF-IDF [32] to more advanced transformer-based models. In the latter category, we can find both free (Sentence-BERT [33]) and commercial (OpenAI’s Ada) models.
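To illustrate the idea, the following sketch uses toy word-frequency vectors in place of a real embedding model (the evaluation in this paper uses OpenAI’s Ada embeddings, and Sentence-BERT is a free alternative): documents about the same topic end up closer together in the vector space.

```python
import math
from collections import Counter

def toy_embed(text, vocab):
    """Toy stand-in for a real embedding model: a word-frequency vector
    over a fixed vocabulary. Real systems use dense semantic embeddings."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "the flight was delayed",
    "my flight got delayed again",
    "great salad at the airport cafe",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
vectors = [toy_embed(d, vocab) for d in docs]

# Documents on the same topic (delayed flights) are closer in vector space
# than documents on unrelated topics.
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

A production pipeline would replace `toy_embed` with a call to an embedding model; the downstream clustering step is unchanged.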
2. Cluster Embeddings
Once we generate the social media document embeddings, we can employ a clustering algorithm like DBSCAN [34], K-means [35], or OPTICS [36]. This step aims to group semantically similar documents together, assuming that documents within the same cluster discuss related topics, and to identify outliers so they can be excluded from further processing.
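A minimal sketch of this step with scikit-learn’s DBSCAN, using small toy vectors in place of real document embeddings (the evaluation later in the paper uses min_samples = 5 and epsilon = 0.24 on 1536-dimensional embeddings); DBSCAN labels outliers as −1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-ins for document embeddings: two dense topic groups and one outlier.
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # topic A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # topic B
    [100.0, 100.0],                        # outlier
])

# eps and min_samples here are illustrative values for this toy data.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(embeddings)

# Label -1 marks outliers, which are excluded from further processing.
clusters = {lab for lab in labels if lab != -1}
print(labels)
```

DBSCAN is a natural fit here because, unlike K-means, it does not require the number of clusters to be specified in advance.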
3. Specify the Representative Documents
Rather than sending all documents within a cluster to the LLM, which can be computationally expensive and cost-inefficient, in this step we select a small number of representative documents from each cluster. These representatives should ideally capture the core themes and discussions within the cluster. Various methodologies can be employed to determine the representatives, including identifying documents proximal to the cluster centroid or determining medoids [37].
The concept of a medoid refers to a representative point within a cluster that minimises the average dissimilarity (or distance) to all other points within the cluster. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the dataset. The formal definition of a medoid is the following:
Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ points in a space with a distance function $d$. A medoid is defined as [37]

$$x_{\text{medoid}} = \arg\min_{y \in X} \sum_{i=1}^{n} d(y, x_i)$$
This step of specifying the medoid representatives for each cluster will significantly reduce the total amount of input data that the LLM will finally process.
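The medoid definition translates directly into code; a sketch assuming Euclidean distance over NumPy arrays:

```python
import numpy as np

def medoid(points):
    """Return the index of the member of `points` that minimises the
    sum of (Euclidean) distances to all other members."""
    pts = np.asarray(points, dtype=float)
    # Pairwise distance matrix: dists[i, j] = d(x_i, x_j).
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return int(np.argmin(dists.sum(axis=1)))

# The central point (0.5, 0.5) minimises the total distance to the corners.
cluster = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
print(medoid(cluster))
```

Unlike a centroid, the result is always an actual document from the cluster, which is what allows it to be sent verbatim to the LLM.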
4. Send Cluster Representatives to LLM
In this final step, each cluster’s representative documents are sent as input to the LLM. The LLM is then prompted to generate a summary of the overall discussion within each cluster, thus providing a cohesive overview of each topic. This process exploits the LLM’s text analysis and synthesis capabilities to produce topic summaries that are both meaningful and comprehensible to humans.
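A sketch of how the representatives might be packaged into a prompt for this step; the function name and message layout are illustrative, and the actual LLM call (e.g., via OpenAI’s chat API) is shown only as a comment:

```python
def build_cluster_prompt(representatives, max_words=30):
    """Assemble the summarisation prompt for one cluster's representatives."""
    instruction = (
        f"write a summary of up to {max_words} words for the following "
        "list of news titles and social media posts"
    )
    posts = "\n".join(f"- {text}" for text in representatives)
    return f"{instruction}:\n{posts}"

reps = [
    "Ryanair flight diverted to Milan after emergency",
    "Passengers stranded in Milan as Ryanair cancels flights",
]
prompt = build_cluster_prompt(reps)
print(prompt)

# The prompt would then be sent to an LLM, e.g.:
# response = client.chat.completions.create(
#     model="gpt-4", messages=[{"role": "user", "content": prompt}])
```

Because only a handful of representatives per cluster are included, the prompt stays far below the model’s context limit regardless of cluster size.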

4. Evaluation

4.1. Qualitative Evaluation

In order to evaluate our AI Mention Clustering, we created a dataset of approximately 10 K multilingual posts about Ryanair from various websites and social media platforms, and we compared our approach with the BERTopic algorithm. Furthermore, a secondary dataset, consisting of about 5000 exclusively English posts, was used for the evaluation (Figure 2 and Figure 3).
To create vector representations of the data, our evaluation used OpenAI’s text-embedding-ada-002 embeddings (dimension 1536). We used the density-based spatial clustering of applications with noise (DBSCAN) approach, with parameters (min_samples = 5, epsilon = 0.24), because the number of clusters in the dataset was unknown beforehand. Medoids, which offer a reliable indicator of central tendency, were chosen as cluster representatives. In practice, we found that using three representative posts per cluster worked well for large clusters, and just one or two for smaller clusters. Lastly, the GPT-4 large language model was used to summarise each cluster using the following prompt: “write a summary of up to 30 words for the following list of news titles and social media posts”.
In Figure 4, we present a summary of the resulting clusters in both datasets. One difference that is easy to detect is the variation between our AI Mention Clustering and BERTopic in the proportion of documents assigned to clusters and in the granularity of the clustering itself. BERTopic clustered more than 50% of the total posts in both datasets, resulting in a larger number of clusters. This indicates overclustering, which potentially introduces noise and fragments topics. In contrast, our approach clustered between 15% and 20% of the total posts, indicating a more refined clustering approach that accurately captures topics without including irrelevant documents in each generated cluster. Another crucial point is that our approach achieves significant performance efficiency through sparing use of large language models. It processes less than 1% of the posts by sending only the representatives, thereby reducing costs and computational demands. This efficiency makes our method a more practical and scalable option for large datasets.
Starting with the English dataset, in Figure 5 we can see from the top three clusters that while both methods concur that a major incident at a Milan airport was important, they differ greatly in how effectively they capture it. Our method recognized this event as the main topic, grouping 499 documents into it. On the other hand, BERTopic also noted this event but not as the main one, and only assigned 229 documents to it, which is less than half of what our approach did. This difference suggests that BERTopic may have missed or misclassified many relevant posts related to this event and incorrectly placed them in less relevant clusters.
Additionally, our method created more precise and easily interpretable topics than BERTopic’s clustering, which produced clusters with generic keywords (e.g., “love”, “know”, “good”, “airplane”, etc.) that reduce interpretability and hinder the identification of underlying topics. It was thus very hard for a human reader to understand what each cluster was about based only on these few keywords; a user would need to look at a number of mentions from within the cluster to understand the actual topic. On the contrary, our approach describes each cluster as a textual summary of its inner mentions. This description is very accurate, and a user can understand the full context of the cluster without needing to read the actual mentions, making it very efficient for analysts to interpret the results.
The performance gap was not limited to the English dataset; it widened significantly on the multilingual dataset (Figure 6). On the Milan airport incident topic (clearly the major topic in the dataset), our method identified 819 relevant documents, showcasing its robust multilingual capabilities. BERTopic, in contrast, found only 232 related documents, a notably smaller portion that placed the topic in third position. This significant difference suggests that BERTopic may fail to capture key information across multilingual data.
Furthermore, another limitation BERTopic exhibited in processing the multilingual dataset is that dominant clusters contain high-frequency words such as “que”, “te”, “por”, “el”, “da”, “ma”, and “pi”. These words lack semantic significance to the underlying topics and are considered stop words, meaning that they should have been excluded. This shows a weakness in the model’s ability to effectively filter noise from multilingual data.
In contrast, our approach produced the same topics as in the English dataset; only the number of assigned documents differed. Note that we requested the summaries in Greek to demonstrate that topic summaries can be generated in a wide range of languages, regardless of the languages actually appearing in the multilingual dataset.
The screenshot provided in Figure 7 illustrates a commercial implementation of our methodology, utilized by the social listening tool Mentionlytics [38], which depicts a ranked cluster ordering based on the number of documents in each cluster. Additionally, informative data and key metrics, such as accumulated engagement, overall reach, sentiment, and corresponding channel sources for documents within each cluster, are also presented.

4.2. Quantitative Evaluation

Besides the qualitative evaluation, a quantitative assessment was required to compare our method with BERTopic. For this purpose, we used the previously described English Ryanair dataset and added three more datasets: EasyJet (another airline), Trello, and Asana (two software products). To offer a more comprehensive review scope and reduce the inherent bias of relying on one source, these datasets differed in size and chronological range and spanned two very different industries (aviation and computer software).
We selected these datasets to represent typical social listening scenarios: two from the airline industry and two from the tech industry, encompassing different time spans and dataset sizes. This variety ensures that our evaluation includes cases of relatively focused conversation (software communities) and broad, sometimes volatile discussions (airline customers and news). It also enables us to test how the approach scales from approximately 5000 to 10,000 documents. All datasets consist of public posts collected through a social listening tool (Mentionlytics) by querying specific keywords, primarily their brand names. Duplicate posts (exact repeats or retweets) were eliminated. We conducted light preprocessing, which involved removing URLs, emojis, and Twitter handles (usernames) to minimize noise in topic modelling.
All four datasets were used for our evaluation. The clustering results from both approaches are described in Table 1, while Figure 8 depicts their source distribution.
As noted in the previous section for the Ryanair dataset, BERTopic also clustered over half of the posts in each of the three new datasets, indicating the possibility of producing noise and fragmented topics. Our method, in comparison, keeps grouping a much smaller percentage of the mentions in each dataset (15–20%), suggesting a more focused and refined clustering technique.
Since our approach outputs human-readable summaries instead of keywords for each cluster, we first used the TF-IDF technique [32] to identify the top 10 most important keywords per cluster (the main keywords from each dataset, i.e., Ryanair, Easyjet, Trello, and Asana, were excluded). The TF-IDF score for a term in a document is obtained by multiplying its TF and IDF scores.
$$\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$

where

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

$$\mathrm{IDF}(t, D) = \log\left(\frac{\text{total number of documents in the corpus } N}{\text{number of documents containing term } t}\right)$$
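These definitions translate directly into a short sketch (note that off-the-shelf implementations, such as scikit-learn’s TfidfVectorizer, use a smoothed IDF variant and will give slightly different values):

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Each document is a list of tokens.
corpus = [
    ["ryanair", "flight", "delayed", "flight"],
    ["flight", "cancelled"],
    ["great", "service"],
]
# "flight" appears in 2 of 3 documents, "ryanair" in only 1,
# so "ryanair" receives the higher TF-IDF score in the first document.
print(tfidf("ryanair", corpus[0], corpus))
print(tfidf("flight", corpus[0], corpus))
```

This rewards terms that are frequent within a cluster but rare across the corpus, which is why the brand keywords themselves (present in nearly every post) were excluded.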
Using the keyword sets found by employing the TF-IDF technique, we calculated two important metrics: topic coherence and topic diversity. These metrics allowed us to quantitatively measure our method’s performance against BERTopic across the four datasets. To evaluate the clustering structure itself, we also calculated the Davies–Bouldin index [39].

4.2.1. Topic Coherence

The topic coherence metric [40] assesses the semantic similarity of words within a given cluster identified by a clustering (or topic modelling) algorithm. Assuming $T = \{w_1, w_2, \ldots, w_n\}$ is a generated topic represented by its top-$n$ most important words, and given a similarity measure $\mathrm{Sim}(w_i, w_j)$, topic coherence is defined as follows:

$$\mathrm{TopicCoherence} = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{Sim}(w_i, w_j)}{\binom{n}{2}}$$
A high coherence score suggests a well-defined and relevant topic since it shows that the words within the topic are closely connected and make intuitive sense together. On the other hand, a low coherence score suggests that the topic is poorly defined or meaningless and that the words are mostly unrelated. For our evaluation, we used the Cv method, which, as described in [41], was found to correlate the most with human interpretation.
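The definition above (average pairwise similarity over all pairs of top words) can be sketched as follows; cosine similarity over toy word vectors stands in here for the Cv measure used in the actual evaluation:

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def topic_coherence(topic_words, embeddings):
    """Average pairwise similarity over all C(n, 2) pairs of top words."""
    pairs = list(combinations(topic_words, 2))
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)

# Toy word vectors: "flight" and "delay" point in similar directions, "salad" does not.
emb = {"flight": [1.0, 0.1], "delay": [0.9, 0.2], "salad": [0.0, 1.0]}
coherent = topic_coherence(["flight", "delay"], emb)
mixed = topic_coherence(["flight", "salad"], emb)
print(coherent, mixed)
```

A topic whose words are semantically related scores close to 1, while a topic mixing unrelated words scores near 0, matching the interpretation given above.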
Table 2 represents the resulting coherence scores of our approach compared to BERTopic across all four datasets. Our approach achieved higher coherence scores in each case, from 5% to 12%. This suggests that our approach produces more semantically coherent topics compared to BERTopic.

4.2.2. Topic Diversity

Topic diversity metrics measure how distinct the generated topics are, ensuring that a clustering method does not output variations on the same topic. A high diversity score indicates that the clustering method identified distinct topics within the dataset, while low diversity scores suggest redundant and potentially recurrent topics. For evaluating topic diversity, we used two approaches: (1) the proportion of the unique keywords to the total number of keywords produced from the computed clusters and (2) the word embedding-based centroid distance [42].
In the second approach, we used a FastText model to obtain embeddings of the keywords that describe each cluster [43]. The diversity score was then calculated as the average cosine distance between cluster centroids over all pairs of clusters (see Algorithm 1).
Despite BERTopic clustering a larger number of posts, Table 3 demonstrates that our approach achieves better topic diversity scores than BERTopic in all four datasets (Ryanair, EasyJet, Trello, and Asana). Although the two approaches’ centroid distances are comparable, our approach’s clustering extracts a larger percentage of unique keywords (between 67% and 78%) than BERTopic (from 48% to 53%). This suggests that our approach produces more distinct and varied topics.
Algorithm 1: Word Embedding-Based Centroid Distance Calculation

  Input: clusters, embedding_model, topk = 10
  distances_array = [ ]
  For each cluster1, cluster2 in combinations(clusters, 2) do:
      centroid1 = [ ]
      centroid2 = [ ]
      For each word1 in cluster1[:topk] do:
          centroid1 = centroid1 + embedding_model[word1]
      For each word2 in cluster2[:topk] do:
          centroid2 = centroid2 + embedding_model[word2]
      centroid1 = centroid1 / length(cluster1[:topk])
      centroid2 = centroid2 / length(cluster2[:topk])
      distances_array.append(distance.cosine(centroid1, centroid2))
  return average(distances_array)

4.2.3. Davies–Bouldin Index

The Davies–Bouldin index (DBI) [39] measures clustering quality by comparing how similar items are within the same cluster with how separated the clusters are from each other. Lower values of the Davies–Bouldin index indicate better clustering quality. Assuming a dataset partitioned into $k$ clusters $X = \{X_1, X_2, \ldots, X_k\}$, the Davies–Bouldin index can be calculated as

$$\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\Delta(X_i) + \Delta(X_j)}{\delta(X_i, X_j)}$$

where $\Delta(X_i)$ is the intracluster distance (compactness) within the cluster $X_i$, and $\delta(X_i, X_j)$ is the intercluster distance (separation) between the clusters $X_i$ and $X_j$.
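In practice the index can be computed with scikit-learn; a short sketch showing that tight, well-separated clusters receive a lower (better) score than a poor partition of the same points:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Two tight, well-separated groups of points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])

good_labels = [0, 0, 1, 1]   # matches the true grouping
bad_labels = [0, 1, 0, 1]    # mixes the two groups together

# Lower DBI = compact, well-separated clusters.
print(davies_bouldin_score(X, good_labels))
print(davies_bouldin_score(X, bad_labels))
```

The correct partition yields a DBI near zero (tiny intracluster spread, large separation), while the mixed partition yields a very large value.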
Table 4 represents the DBI scores of our approach compared to BERTopic across all four datasets. Our approach exhibits significantly lower DBI scores than BERTopic for each dataset tested. This suggests AI Mention Clustering creates more distinct and well-defined clusters compared to BERTopic for these datasets.

5. Discussion and Future Work

This work presents an efficient approach for extracting easily interpretable topics from large social media datasets. By leveraging the power of large language models (LLMs) for natural language processing, we achieve more effective topic modelling than BERTopic on both multilingual and single-language datasets while maintaining cost-effectiveness, since only about 1% of the posts are sent to the LLM for processing.
The demonstrated methodology generates meaningful interpretations of topics from noisy social media data and could offer valuable insights for various applications, including social trend analysis, market research, social media crisis identification, and public opinion monitoring. Additionally, the underlying framework’s adaptability suggests that it could be applied to other NLP tasks beyond topic extraction, like knowledge graph generation, sentiment analysis, and named entity recognition (NER).
Future research includes optimisations in the clustering step of our methodology, including techniques such as dimensionality reduction on embedding representations. Dimensionality reduction techniques are crucial for improving the efficiency and effectiveness of embedding representations. These methods aim to preserve essential information while reducing the dimensionality of high-dimensional data, which is particularly useful for word embeddings and other types of vector representations [44]. Also, parallelisation within clustering algorithms will further enhance the methodology’s capability to process larger volumes of social data rapidly. Parallel clustering algorithms distribute the workload across multiple processors, allowing for simultaneous computation of different parts of the clustering process [45].
Additionally, a comprehensive evaluation of different embedding models, LLMs, and alternative methods for selecting the most representative documents for each cluster could further improve the interpretability and accuracy of the extracted topics. We could also compare our approach to other topic modelling algorithms besides BERTopic, such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and Top2Vec.
In conclusion, while the current approach demonstrates considerable effectiveness and efficiency, ongoing improvements and comparisons with other methodologies will ensure that the solution remains at the forefront of topic modelling in social media and web data analytics. The continued evolution of these techniques promises even greater scalability and adaptability in the future, opening up new possibilities for effective social data analysis.

Author Contributions

Conceptualization, I.K. (Ioannis Kapantaidakis), E.P. and I.K. (Ioannis Kopanakis); methodology, I.K. (Ioannis Kapantaidakis) and I.K. (Ioannis Kopanakis); software, I.K. (Ioannis Kapantaidakis) and E.P.; validation, E.P., G.M. and I.K. (Ioannis Kapantaidakis); data curation, I.K. (Ioannis Kapantaidakis); writing—original draft preparation, E.P. and I.K. (Ioannis Kapantaidakis); writing—review and editing, E.P. and I.K. (Ioannis Kapantaidakis); visualization, E.P.; supervision, I.K. (Ioannis Kopanakis) and E.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data analysed in this study are publicly available posts on social media and the web, as described in the corresponding section. They can be retrieved from the providers’ social media APIs or with a social media monitoring tool, using the keywords and date ranges described.

Conflicts of Interest

The authors declare that E. Perakakis, G. Mastorakis, and I. Kopanakis are cofounders of Mentionlytics Ltd. The data collection for this study was conducted using Mentionlytics, developed by Mentionlytics Ltd. The company had no role in the study design, data analysis, interpretation of results, manuscript preparation, or the decision to publish.

References

  1. Tedeschi, A.; Benedetto, F. A cloud-based big data sentiment analysis application for enterprises’ brand monitoring in social media streams. Proc. IEEE RSI Conf. Robot. Mechatron. 2015, 2, 186–191. [Google Scholar] [CrossRef]
  2. Perakakis, E.; Mastorakis, G.; Kopanakis, I. Social Media Monitoring: An Innovative Intelligent Approach. Designs 2019, 3, 24. [Google Scholar] [CrossRef]
  3. Burzyńska, J.; Bartosiewicz, A.; Rękas, M. The social life of COVID-19: Early insights from social media monitoring data collected in Poland. Health Inform. J. 2020, 26, 3056–3065. [Google Scholar] [CrossRef] [PubMed]
  4. Hayes, J.L.; Britt, B.C.; Evans, W.; Rush, S.W.; Towery, N.A.; Adamson, A.C. Can Social Media Listening Platforms’ Artificial Intelligence Be Trusted? Examining the Accuracy of Crimson Hexagon’s (Now Brandwatch Consumer Research’s) AI-Driven Analyses. J. Advert. 2020, 50, 81–91. [Google Scholar] [CrossRef]
  5. Hussain, Z.; Hussain, M.; Zaheer, K.; Bhutto, Z.A.; Rai, G. Statistical Analysis of Network-Based Issues and Their Impact on Social Computing Practices in Pakistan. J. Comput. Commun. 2016, 4, 23–39. [Google Scholar] [CrossRef]
  6. Shi, L.; Luo, J.; Zhu, C.; Kou, F.; Cheng, G.; Liu, X. A survey on cross-media search based on user intention understanding in social networks. Inf. Fusion 2022, 91, 566–581. [Google Scholar] [CrossRef]
  7. Kitchens, B.; Abbasi, A.; Claggett, J.L. Timely, Granular, and Actionable: Designing a Social Listening Platform for Public Health 3.0. MIS Q. 2024, 48, 899–930. [Google Scholar] [CrossRef]
  8. He, Q.; Lim, E.-P.; Banerjee, A.; Chang, K. Keep It Simple with Time: A Reexamination of Probabilistic Topic Detection Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1795–1808. [Google Scholar] [CrossRef]
  9. Li, C.; Liu, M.; Yu, Y.; Wang, H.; Cai, J. Topic Detection and Tracking Based on Windowed DBSCAN and Parallel KNN. IEEE Access 2020, 9, 3858–3870. [Google Scholar] [CrossRef]
  10. Ahmed, A.; Ho, Q.; Smola, A.J.; Teo, C.H.; Xing, E.; Eisenstein, J. Unified analysis of streaming news. In Proceedings of the 2011 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 21–24 August 2011; ACM: New York, NY, USA, 2011; pp. 1–9. [Google Scholar] [CrossRef]
  11. Lu, Q.; Conrad, J.G.; Al-Kofahi, K.; Keenan, W. Legal document clustering with built-in topic segmentation. In Proceedings of the Fifth International Conference on Statistical Data Analysis Based on the L1-Norm and Related Methods, Shanghai, China, 5–8 July 2011; Elsevier: Amsterdam, The Netherlands, 2011; pp. 383–392. [Google Scholar]
  12. Davis, C.A.; Serrette, B.; Hong, K.; Rudnick, A.; Pentchev, V.; Menczer, F.; Gonçalves, B.; Grabowicz, P.A.; Mckelvey, K.; Chung, K.; et al. OSoMe: The IUNI Observatory on Social Media. PeerJ Comput. Sci. 2016, 2, e87. [Google Scholar] [CrossRef]
  13. Chen, X.; Vorvoreanu, M.; Madhavan, K.P.C. Mining Social Media Data for Understanding Students’ Learning Experiences. IEEE Trans. Learn. Technol. 2014, 7, 246–259. [Google Scholar] [CrossRef]
  14. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar] [CrossRef]
  15. Lee, D.D.; Seung, H.S. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef] [PubMed]
  16. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 417–428. [Google Scholar] [CrossRef]
  17. Zhou, K.; Yang, Q. LDA-PSTR: A Topic Modeling Method for Short Text. In Proceedings of the 2018 International Conference on Big Data Analysis, Beijing, China, 25–27 July 2018; Springer: Singapore, 2018; pp. 339–352. [Google Scholar] [CrossRef]
  18. Kim, H.D.; Zhai, C.; Park, D.H.; Lu, Y. Enriching Text Representation with Frequent Pattern Mining for Probabilistic Topic Modeling. Proc. Am. Soc. Inf. Sci. Technol. 2012, 49, 1–10. [Google Scholar] [CrossRef]
  19. Sriurai, W. Improving Text Categorization By Using A Topic Model. Adv. Comput. Int. J. 2011, 2, 21–27. [Google Scholar] [CrossRef]
  20. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar] [CrossRef]
  21. Angelov, D. Top2Vec: Distributed Representations of Topics. arXiv 2020, arXiv:2008.09470. [Google Scholar] [CrossRef]
  22. Milios, E.; Zhang, X. MPTopic: Improving Topic Modeling via Masked Permuted Pre-training. arXiv 2023, arXiv:2309.01015. [Google Scholar] [CrossRef]
  23. Zhang, Y.; Wang, Z.; Shang, J. ClusterLLM: Large Language Models as a Guide for Text Clustering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 7–11 November 2023; Association for Computational Linguistics: Singapore, 2023; pp. 13903–13920. [Google Scholar] [CrossRef]
  24. Viswanathan, V.; Gashteovski, K.; Lawrence, C.; Wu, T.; Neubig, G. Large Language Models Enable Few-Shot Clustering. arXiv 2023, arXiv:2307.00524. [Google Scholar] [CrossRef]
  25. Mu, Y.; Dong, C.; Bontcheva, K.; Song, X. Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling. arXiv 2024, arXiv:2403.16248. [Google Scholar] [CrossRef]
  26. Miller, J.K.; Alexander, T.J. Human-Interpretable Clustering of Short-Text Using Large Language Models. arXiv 2024, arXiv:2405.07278. [Google Scholar] [CrossRef] [PubMed]
  27. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  28. Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; Van Den Driessche, G.B.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving Language Models by Retrieving from Trillions of Tokens. In Proceedings of the International Conference on Machine Learning (ICML 2022), Baltimore, MD, USA, 17–23 July 2022; PMLR: Baltimore, MD, USA, 2022. [Google Scholar] [CrossRef]
  29. Li, S.; Xu, J. HierMDS: A hierarchical multi-document summarization model with global–local document dependencies. Neural Comput. Appl. 2023, 35, 18553–18570. [Google Scholar] [CrossRef]
  30. Moro, G.; Ragazzi, L.; Valgimigli, L.; Frisoni, G.; Sartori, C.; Marfia, G. Efficient Memory-Enhanced Transformer for Long-Document Summarization in Low-Resource Regimes. Sensors 2022, 23, 3542. [Google Scholar] [CrossRef]
  31. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NeurIPS 2013), Lake Tahoe, NV, USA, 5–10 December 2013; Curran Associates, Inc.: Red Hook, NY, USA, 2013; pp. 3111–3119. [Google Scholar] [CrossRef]
  32. Salton, G.; Buckley, C. Term-weighting Approaches in Automatic Text Retrieval. Inf. Process. Manag. 1988, 24, 513–523. [Google Scholar] [CrossRef]
  33. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 3982–3992. [Google Scholar] [CrossRef]
  34. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; AAAI Press: Portland, OR, USA, 1996; pp. 226–231. [Google Scholar]
  35. MacQueen, J.K. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21–23 June 1967; University of California Press: Berkeley, CA, USA, 1967; Volume 1, pp. 281–297. [Google Scholar]
  36. Ankerst, M.; Breunig, M.M.; Kriegel, H.P.; Sander, J. OPTICS: Ordering Points to Identify the Clustering Structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; ACM: New York, NY, USA, 1999; pp. 49–60. [Google Scholar] [CrossRef]
  37. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; Wiley: Hoboken, NJ, USA, 2005. [Google Scholar]
  38. Mentionlytics [Computer Software]. Available online: https://www.mentionlytics.com (accessed on 15 March 2025).
  39. Davies, D.; Bouldin, D. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227. [Google Scholar] [CrossRef]
  40. Newman, D.; Lau, J.H.; Grieser, K.; Baldwin, T. Automatic Evaluation of Topic Coherence. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), Los Angeles, CA, USA, 1–6 June 2010; Association for Computational Linguistics: Los Angeles, CA, USA, 2010; pp. 100–108. [Google Scholar] [CrossRef]
  41. Röder, M.; Both, A.; Hinneburg, A. Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM 2015), Shanghai, China, 2–6 February 2015; ACM: New York, NY, USA, 2015; pp. 399–408. [Google Scholar] [CrossRef]
  42. Bianchi, F.; Terragni, S.; Hovy, D.; Nozza, D.; Fersini, E. Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 2021 European Chapter of the Association for Computational Linguistics (EACL 2021), Online, 19–23 April 2021; Association for Computational Linguistics: Online, 2021; pp. 84–96. Available online: https://aclanthology.org/2021.eacl-main.9/ (accessed on 8 January 2025).
  43. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. arXiv 2016, arXiv:1607.04606. [Google Scholar] [CrossRef]
  44. Allaoui, M.; Kherfi, M.L.; Cheriet, A. Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study. In Proceedings of the 2020 International Conference on Machine Learning and Data Science, Singapore, 6–8 September 2020; Springer: Cham, Switzerland, 2020; pp. 317–325. [Google Scholar] [CrossRef]
  45. Luo, G.; Luo, X.; Tian, L.; Gooch, T.F.; Qin, K. A Parallel DBSCAN Algorithm Based on Spark. In Proceedings of the 2016 IEEE International Conference on Big Data and Cloud Computing, Beijing, China, 4–6 November 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 548–553. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed solution—AI Mention Clustering.
Figure 2. Source distribution for: (a) multilingual and (b) English dataset.
Figure 3. Language distribution for the multilingual dataset (English language excluded).
Figure 4. Summary of resulting clusters in: (a) multilingual and (b) English dataset.
Figure 5. Top 3 clusters in the English dataset for (a) BERTopic and (b) AI Mention Clustering.
Figure 6. Top 3 clusters in multilingual dataset for (a) BERTopic and (b) AI Mention Clustering.
Figure 7. An implementation of our AI Mention Clustering applied in the social listening tool Mentionlytics [38].
Figure 8. Source distribution for datasets: (a) Ryanair, (b) Easyjet, (c) Trello, and (d) Asana.
Table 1. Dataset description and clustering result summarization.
Dataset | Ryanair | Easyjet | Trello | Asana
Size | 5600 | 5216 | 9470 | 10,004
Date range | 28 September–4 October 2024 | 1–18 December 2024 | 1 December 2024–15 January 2025 | 1 November 2024–10 January 2025
Language | English | English | English | English
AI Mention Clustering: clusters (% of total documents) | 29 (20%) | 57 (19%) | 68 (18%) | 75 (18%)
BERTopic: clusters (% of total documents) | 107 (51%) | 125 (74%) | 156 (60%) | 159 (63%)
Table 2. Topic coherence performance.
Dataset | Ryanair | Easyjet | Trello | Asana
AI Mention Clustering | 0.46 | 0.40 | 0.37 | 0.38
BERTopic | 0.41 | 0.37 | 0.35 | 0.36
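The coherence scores above can be reproduced with a measure such as NPMI-based coherence [40,41]; this excerpt does not pin down the exact variant used, so the sketch below is a hedged, self-contained implementation using document-level co-occurrence, with invented toy documents and keywords.

```python
import math
from itertools import combinations

def npmi_coherence(topics, docs, eps=1e-12):
    """Average NPMI over pairs of top keywords in each topic.

    `topics` is a list of keyword lists (each with at least two words);
    `docs` is a list of tokenized documents. Probabilities are estimated
    from document-level co-occurrence, as in standard NPMI coherence.
    """
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def p(*words):
        return sum(all(w in ds for w in words) for ds in doc_sets) / n

    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic, 2):
            p1, p2, p12 = p(w1), p(w2), p(w1, w2)
            if p12 == 0:
                pair_scores.append(-1.0)  # words never co-occur: minimum NPMI
            else:
                pmi = math.log(p12 / (p1 * p2 + eps))
                pair_scores.append(pmi / (-math.log(p12 + eps)))
        topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores)

# Toy tokenized documents; "flight" and "delay" always co-occur.
docs = [["flight", "delay", "refund"], ["flight", "delay"], ["board", "task"]]
print(round(npmi_coherence([["flight", "delay"]], docs), 2))  # → 1.0
```

Keyword pairs that always co-occur score close to +1, pairs that never co-occur score −1, giving an interpretable scale for comparing topic models.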
Table 3. Topic diversity performance.
Dataset | Ryanair | Easyjet | Trello | Asana
AI Mention Clustering: unique keywords | 78% | 68% | 67% | 68%
AI Mention Clustering: centroid distance | 0.56 | 0.55 | 0.54 | 0.57
BERTopic: unique keywords | 52% | 52% | 48% | 53%
BERTopic: centroid distance | 0.56 | 0.55 | 0.53 | 0.58
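Both diversity measures reported above have short, standard formulations: the proportion of keywords that are unique across topics, and the mean pairwise cosine distance between cluster centroids. The sketch below reconstructs them under those common definitions; the paper’s exact computation may differ, and the example topics and centroids are invented.

```python
import numpy as np

def unique_keyword_ratio(topics):
    """Proportion of keywords appearing exactly once across all topics.

    Higher values mean topics share fewer keywords, i.e. more diversity.
    """
    all_kw = [w for topic in topics for w in topic]
    return len(set(all_kw)) / len(all_kw)

def mean_centroid_distance(centroids):
    """Mean pairwise cosine distance between cluster centroids.

    Higher values indicate centroids pointing in more distinct directions.
    """
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ c.T
    iu = np.triu_indices(len(c), k=1)  # upper triangle: each pair once
    return float(np.mean(1.0 - sims[iu]))

topics = [["flight", "delay", "refund"],
          ["luggage", "fee", "refund"]]  # "refund" appears in both topics
print(unique_keyword_ratio(topics))  # 5 unique keywords out of 6

centroids = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(round(mean_centroid_distance(centroids), 3))  # → 0.529
```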
Table 4. Davies–Bouldin index scores.
Dataset | Ryanair | Easyjet | Trello | Asana
AI Mention Clustering | 1.8460 | 1.7529 | 1.6616 | 1.6689
BERTopic | 3.0871 | 3.1058 | 3.4994 | 3.4099
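The Davies–Bouldin index [39] used above has a compact definition: for each cluster, compute the mean distance of its points to the centroid (its scatter), then average over clusters the worst-case ratio of summed scatters to centroid separation. A minimal NumPy sketch, with invented toy data:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies–Bouldin index: lower means tighter, better-separated clusters."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ids])
    # Scatter S_i: mean distance of cluster i's points to its centroid.
    scatter = np.array([
        np.mean(np.linalg.norm(X[labels == k] - centroids[i], axis=1))
        for i, k in enumerate(ids)
    ])
    n = len(ids)
    worst = []
    for i in range(n):
        # For each cluster, the most "confusable" neighbour dominates.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(n) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

# Two well-separated toy clusters give a low (good) score.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(davies_bouldin(X, labels))
```

Lower values indicate tighter, better-separated clusters, which matches the direction of the comparison in the table above.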
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kapantaidakis, I.; Perakakis, E.; Mastorakis, G.; Kopanakis, I. An Innovative Approach to Topic Clustering for Social Media and Web Data Using AI. Computers 2025, 14, 142. https://doi.org/10.3390/computers14040142


