1. Introduction
Topic modeling is a technique for uncovering the latent topic structure of text. By analyzing lexical co-occurrence patterns in documents, topic models can identify and extract potential topic structures in a document set, which is crucial for understanding and organizing large-scale text data. In the era of information explosion, efficiently automating the processing and parsing of text content is essential. In information retrieval [
1], topic models can improve the performance of search engines by identifying the underlying topics in user queries, thereby increasing the relevance of retrieval results. In recommender systems [
2], topic models can offer users personalized recommendations by analyzing their interests. In addition, topic models also play an important role in the fields of natural language processing [
3], social network analysis [
4], and bioinformatics [
5], assisting researchers in mining and understanding complex data patterns. Through these applications, topic modeling not only enhances the automation of text data processing but also significantly improves the efficiency of information acquisition and knowledge discovery, thereby holding great significance for both academic research and practical applications.
The Latent Dirichlet Allocation (LDA) model [
6] is a widely used statistical model for discovering hidden topics from a collection of documents [
7]. LDA achieves this by assuming that documents are mixtures of probability distributions over multiple topics, each characterized by a specific lexical distribution, and employs Bayesian inference for parameter estimation. Many LDA-based topic models followed, such as MetaLDA [
8] proposed by Zhao et al. MetaLDA is able to use document or word meta-information for topic modeling. Then, Bunk et al. proposed the WELDA [
9] model, which combines word embeddings with Latent Dirichlet Allocation to improve topic quality. However, despite its success in many applications, LDA has certain limitations. Its a priori assumptions, including the independence of documents from topics and the specific lexical distribution of topics, may not accurately reflect the actual data distribution. As a result, these assumptions can impact the accuracy and validity of the model.
With the rise of deep learning, neural topic models have become an important development in topic modeling research. Neural topic models improve the flexibility and performance of topic modeling by combining neural networks and probabilistic models. Among them, the variational autoencoder (VAE) [
10] is widely used in topic modeling. A VAE preserves the generative power of the model by compressing data into a latent space and regenerating the data from that space, while introducing stochasticity in the encoding process. This approach alleviates the restrictive a priori assumptions of Dirichlet-based topic models and enhances topic coherence. Consequently, most recent topic models are built on the VAE architecture. Kumar et al. proposed a topic model that combines reinforcement learning with a variational autoencoder [
11]. Later, Adhya et al. proposed a topic model, CTMKD [
12], that introduces knowledge distillation into the topic modeling process. However, despite the significant advances in topic modeling, the VAE topic model still suffers from a number of shortcomings. The training process of the VAE model involves mapping the data to a low-dimensional topic distribution space and reconstructing the original data from that space. The difficulty in fully decoupling this process and the problem of information loss limit the topic modeling capability of neural topic models.
To solve these problems, we propose the Prompt Topic Model (PTM), which applies prompt learning to topic modeling and classifies the topics of a text using the semantic capture ability of a pre-trained model trained on large amounts of data. PTM dispenses with the Latent Dirichlet Allocation and variational autoencoder architectures used in earlier topic models, thereby overcoming the shortcomings of traditional topic models in terms of both a priori assumptions and model structure.
Furthermore, we introduce an unsupervised prompt word selection method that employs a pre-trained language model to derive document embedding vectors. These vectors are clustered by topic using a predetermined number of topics. Subsequently, a statistical method filters out a sufficient number of prompt candidates from each cluster, removing interfering words to finalize the prompt word set.
In summary, our principal contributions are as follows:
The proposal of the PTM, utilizing prompt learning to enhance topic modeling, effectively addresses the limitations of traditional models concerning a priori assumptions and structural constraints.
The development of an unsupervised, clustering-based prompt word selection method, independent of class name labels, facilitates the extraction of prompt words without label information, thereby supporting unsupervised topic modeling.
Extensive experimentation on three public datasets, with comparative and ablation studies demonstrating that the PTM surpasses traditional topic models in terms of topic consistency and validates the effectiveness of our prompt word selection technique.
The rest of this paper is structured as follows:
Section 2 reviews related work.
Section 3 describes our proposed method.
Section 4 provides a detailed analysis of the dataset, experimental setup, and results. Finally,
Section 5 concludes the paper, discusses the limitations of the model, and outlines future work.
2. Related Work
In recent years, topic modeling has emerged as a prominent research direction garnering significant attention in the domains of text analysis and machine learning. The objective of topic modeling is to automatically uncover the latent topic structure and semantic information embedded within extensive text data, offering substantial assistance in text comprehension and information retrieval. Throughout the progression of topic modeling, scholars have introduced numerous enhanced adaptations of LDA, including the Correlated Topic Model (CTM) [
13]. CTM, a topic model extension devised by Blei et al. in 2005, successfully addresses the limitation of LDA in capturing topic correlations. CTM assumes a relationship among topics within a document and employs a multivariate Gaussian distribution to depict the correlation between them. Consequently, CTM delivers enhanced precision in topic representation and facilitates improved topic interpretation compared to LDA. The Dynamic Topic Model (DTM) [
14], introduced by Blei et al. in 2006, is specifically designed for modeling time-series text data. DTM incorporates the notion that the topic distribution within a document evolves over time, employing temporal variables to capture such dynamics. DTM exhibits exceptional performance in analyzing texts with temporal sequences, including news, social media, and text streaming data. The Biterm Topic Model (BTM) [
15], introduced by Yan et al. in 2013, is a topic model specifically tailored for short texts characterized by incompleteness and noise. BTM focuses on modeling word pairs within the text rather than the conventional document-level modeling. This approach enables BTM to more precisely capture the topical information present in short texts, leading to notable achievements in tasks like short text classification, recommender systems, and social media analytics. In 2012, Holmes et al. introduced the Dirichlet Multinomial Mixture (DMM) [
16] as a method for inferring hidden topics in short texts. The DMM model relies on a relatively simple assumption where each text is sampled from a single potential topic, in contrast to the complex assumption of multiple topics within a document in the LDA model. This simplifying assumption renders the DMM model more suitable for processing short texts in a reasonable manner.
With the emergence of deep learning, neural-network-based topic models have garnered significant attention in research. One prominent model is the variational autoencoder (VAE), which integrates the concepts of autoencoder and variational inference. The VAE is capable of learning a continuous representation of text and conducting topic modeling through generative modeling. VAE topic models have demonstrated notable advancements in capturing text semantics and generating novel samples.
In 2016, Srivastava et al. introduced variational autoencoders (VAEs) to topic modeling and presented the ProdLDA model [
17]. Diverging from the conventional Latent Dirichlet Allocation model, the ProdLDA model adopts expert products instead of mixture models and leverages VAEs for variational inference. This enhancement leads to improved topic consistency and substantial performance gains in topic modeling tasks. In the same year, Miao et al. proposed another VAE-based topic model called the Neural Variational Document Model (NVDM) [
18]. NVDM incorporates continuous hidden variables and is optimized using stochastic gradient variational Bayes. By combining Neural Variational Reasoning with topic modeling, this model introduces fresh concepts and methods to the field of topic modeling.
Building upon the achievements of Generative Adversarial Nets (GANs) [
19], Wang et al. introduced the Adversarial Neural Topic Model (ANTM) in 2017 [
20]. ANTM adopts a GAN-like framework for topic modeling. Through an adversarial training strategy, the model aligns the distribution of possible topics with the prior distribution, thereby augmenting the model’s generalization capability and uncovering more cohesive topics.
In 2018, Tolstikhin et al. introduced Wasserstein Auto-Encoders (WAEs) [
21] as an enhancement to the conventional VAE. WAE employs Wasserstein distance as a metric in the latent space and is applied in topic modeling to mitigate issues like posterior crashes commonly encountered in VAE-based models. This approach enhances the quality and stability of topics.
In 2019, Gui et al. enhanced the Neural Topic Model (NTM) [
22] through a reinforcement learning framework to further elevate its performance. Their approach guides the topic modeling’s learning process by evaluating topic coherence during training and utilizing it as a reward signal. This methodology strengthens NTM’s capability to model topic coherence and enhances the model’s expressiveness and effectiveness.
In 2019, Dieng et al. introduced the Embedded Topic Model (ETM) in the embedding space [
23]. The fundamental concept of ETM is to incorporate word embedding techniques into topic modeling for the improved capture of semantic relationships among words. In contrast to traditional topic models that treat words as discrete symbols, ETM represents words as continuous embedding vectors. By utilizing the distance and similarity between word embedding vectors as indicators of semantic relationships between words, ETM achieves more precise learning of associations between topics and the semantic information conveyed by words.
In 2021, Bianchi et al. introduced the Contextualized Topic Model (CTM) [
24], a neural topic model that combines contextualized representations. CTM effectively enhances the meaning and coherence of the generated topics by connecting contextualized SBERT embeddings with the original BoW representation, forming a new textual representation input into the VAE framework. In 2022, Linha et al. introduced a graph-convolutional topic model that combines graph-convolutional neural networks with topic models [
25]. This model exhibited superior performance in processing short text streams with noise. The graph-convolutional topic model represents text data as graph structures by utilizing the capability of graph-convolutional neural networks to model and infer topics on graphs. This approach enhances the capture of semantic relationships and contextual information between texts, thereby improving the effectiveness and robustness of topic modeling. In the same year, Adhya et al. introduced the Contextualized Topic Model with Negative Sampling (CTM-Neg) [
26], an improved topic model based on the CTM. CTM-Neg introduces a negative sampling mechanism for contextualized topic models to enhance the quality of generated topics. During model training, it perturbs the generated document topic vectors and uses ternary loss to ensure that the model’s output is more similar to the original input when reconstructed with the correct document topic vectors, while the output from perturbed vectors is dissimilar to the original vectors. In 2024, Huang et al. presented a novel dependency-aware neural topic model [
27]. This model considers the dependencies between topics as a generative process, where each topic is generated based on its dependencies on other topics.
Moreover, topic models have found extensive applications in various domains. In the field of text mining, topic models are applied for tasks such as topic discovery and opinion analysis. In the domains of information retrieval and recommendation systems, topic models offer the potential for more accurate text matching and personalized recommendations. In the domain of social media analysis, topic models facilitate the understanding of user interests and topic evolution.
To provide a clear and concise overview of the advancements in topic modeling discussed in this paper, we have summarized the key contributions of each referenced work in
Table 1. This table includes the model, authors, the problem addressed by each study, and the solutions proposed. By presenting this information in a structured format, readers can easily grasp the evolution and improvements in the field of topic modeling.
In conclusion, topic modeling has garnered significant attention in the field of natural language processing and machine learning as a powerful tool for text analysis. It not only uncovers the topic structure within text data but also offers rich semantic information and valuable applications, providing crucial support for research and practical implementation in the field of text understanding and information processing.
3. Proposed Method
In this section, we introduce the method of incorporating prompt learning into the topic modeling task. Specifically, we begin by presenting the overall process of utilizing prompt learning for the text classification task. We then provide detailed explanations of the category-free name prompt extraction method and the prompt filtering method. Abbreviations and their meanings are shown in
Table 2.
This paper assumes a document collection D = {d_1, d_2, …, d_n} consisting of n documents. The objective of this paper is to discover k topics from these documents and assign each document d_i a corresponding topic label y_i, forming the label collection Y = {y_1, y_2, …, y_n}. Subsequently, the words w under each topic label are ranked based on their importance, and the top m words in each topic, denoted as W_y = {w_1, w_2, …, w_m}, are selected as the topic words for that particular topic.
Unlike traditional text classification tasks, we lack the given labeling information Y to aid the model in training or fine-tuning. In the topic modeling task, our only available resource is the document data, which we leverage to accomplish our objective. To enhance the performance of prompt learning in this task, we propose a fully unsupervised module for extracting label words, supplemented by a label word filtering session.
The steps for implementing the Prompt Topic Model (PTM) are detailed in Algorithm 1.
Algorithm 1. The steps for implementing the Prompt Topic Model

Step 1: Extract Embedding Vectors
# Input: document collection D consisting of n documents
# Use a pre-trained language model (Sentence-BERT) to obtain document embeddings
Embeddings = []
for document in D:
    tokens = tokenize(document)
    embedding = SentenceBERT(tokens)
    Embeddings.append(embedding)

Step 2: Clustering
# Apply K-means clustering to group document embeddings into k clusters
k = predefined_number_of_topics
Clusters = KMeans(n_clusters=k).fit(Embeddings)

Step 3: Label Word Selection
# For each cluster, select the top words by TF-IDF score
LabelWords = []
for cluster in Clusters:
    documents_in_cluster = get_documents(cluster)
    tfidf_scores = compute_tfidf(documents_in_cluster)
    top_words = select_top_words(tfidf_scores, top_n=10)
    LabelWords.append(top_words)

Step 4: Construct Prompts
# Construct a prompt template for each document
Prompts = []
for document in D:
    prompt = construct_prompt(document, template="The topic of this text is about [MASK].")
    Prompts.append(prompt)

Step 5: Prompt Learning for Topic Classification
# Initialize the pre-trained language model (BERT)
Model = PreTrainedModel("BERT-large")

# Apply prompt learning to classify topics
PredictedTopics = []
for prompt in Prompts:
    probabilities = Model.predict(prompt)
    topic = select_topic(probabilities, LabelWords)
    PredictedTopics.append(topic)

Step 6: Evaluate Performance
# Compute evaluation metrics (NPMI, Cv) to assess topic consistency
NPMI_scores = compute_NPMI(PredictedTopics, D)
Cv_scores = compute_Cv(PredictedTopics, D)

# Output: final topic classifications and evaluation scores
print("Final Topic Classifications:", PredictedTopics)
print("NPMI Scores:", NPMI_scores)
print("Cv Scores:", Cv_scores)
3.1. Model
Let M denote a pre-trained language model. When employing PLM for prompt learning in topic classification, the model transforms the classification task into a completion task.
Initially, this paper encapsulates the input sequence in a prompt template, which consists of a natural-language text snippet. For instance, to encapsulate the document D = {I have a question about the space shuttle launch system. How does the solid rocket booster contribute to the overall performance of the launch? Can anyone provide some insight into its design and functionality? Thank you!}, the original document is wrapped during categorization as "[CLS] D: The topic of this text is about [MASK].". Here, [CLS] represents a special token utilized in classification tasks by models like BERT, while "[MASK]" serves as a placeholder that the model must predict.
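The wrapping step can be sketched as a simple string template. The template text follows the paper's example; the helper name `construct_prompt` is ours:

```python
# Minimal sketch of the template-wrapping step: the document is embedded in a
# cloze-style template so classification becomes a [MASK]-completion task.
def construct_prompt(document: str) -> str:
    return f"[CLS] {document} The topic of this text is about [MASK]."

doc = "How does the solid rocket booster contribute to the launch?"
prompt = construct_prompt(doc)
print(prompt.endswith("[MASK]."))  # True
```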
Subsequently, we employ the pre-trained language model M to calculate the probability of assigning each word v from the vocabulary to the placeholder "[MASK]". To establish a relationship between word probability and label probability, this paper introduces a prompt selection module F, which maps certain words from the vocabulary to the set of labels Y. These words collectively form a set of prompt words V. Here, V_y denotes the subset of V mapped to a specific label y, where y belongs to the label set Y; in this case, V is the union of all V_y. Subsequently, this paper calculates the probability P(y | x) of assigning a document x to label y. The process of prompt learning for topic categorization uses a summation function g, which aggregates the probabilities of the label words to determine the label probability:

P(y | x) = g({P([MASK] = v | x) : v ∈ V_y})
Figure 1 illustrates the schematic of prompt learning for topic categorization.
In this paper, a predefined set of label word mappings is introduced, namely V_Science = {"Science"} and V_Humanities = {"Humanities"}. If the probability of "science" exceeds that of "humanities", the document is categorized as "science" according to this paper's approach. Consequently, through prompt tuning, this paper effectively maps the input text to the appropriate category labels by leveraging the pre-trained language model M and the summation function g.
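The aggregation from word probabilities to label probabilities can be sketched in a few lines. The probability values below are invented placeholders standing in for a masked language model's [MASK] distribution; only the Science/Humanities label-word sets mirror the example above:

```python
# Hedged sketch of the verbalizer step: sum the [MASK]-token probabilities of
# each label's prompt words (the summation function) and pick the argmax label.
# The probability values here are hypothetical, not real model output.

def label_probability(mask_probs, label_words):
    """Aggregate word probabilities into label probabilities by summation."""
    return {label: sum(mask_probs.get(w, 0.0) for w in words)
            for label, words in label_words.items()}

# Hypothetical [MASK] distribution over a few vocabulary words
mask_probs = {"science": 0.30, "physics": 0.12, "art": 0.08, "history": 0.15}
label_words = {"Science": ["science", "physics"], "Humanities": ["art", "history"]}

scores = label_probability(mask_probs, label_words)
predicted = max(scores, key=scores.get)
print(predicted)  # Science
```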
3.2. Label Word Selection Module
Currently, prompt learning tasks frequently rely on expanding class name labels as labeling prompts. This practice extensively exploits the explicit information conveyed by class names to guide the learning process of the model. However, this approach is not applicable to topic discovery tasks since the data processed in such tasks typically lack explicit class name information. To address this problem, this study introduces an innovative unsupervised method for selecting labeled words that facilitates prompt learning. The method utilizes the clustering results of textual representations to efficiently carry out the task of topic classification. This study’s main contribution lies in the design of an unsupervised label word selection method.
The method starts from a document collection D and first extracts the embedding vectors of the documents using an advanced pre-trained language model, Sentence-BERT. These vectors not only capture the textual information but also encode the deeper semantics of the documents. The acquisition process is as follows. Let document D be composed of a sequence of tokens t_1, t_2, …, t_n. To input these tokens into the BERT model, the special markers [CLS] and [SEP] are added at the beginning and end of the sequence, respectively, resulting in [CLS], t_1, t_2, …, t_n, [SEP]. The sequence with the special tokens added is then encoded using the BERT model to obtain a contextualized representation of each token, which involves propagating the tokenized sequence through the BERT layers:

H = BERT([CLS], t_1, t_2, …, t_n, [SEP])

where H is a matrix in which each row represents the contextualized vector representation of the corresponding token. Subsequently, a pooling strategy is selected to extract the embedding of the entire document from the contextualized representations. Typically, the [CLS] output can be used, or the representations of all tokens can be averaged:

e_D = Pooling(H)

Here, the Pooling function could take the [CLS] vector, apply mean pooling, or use another strategy better suited to the data. In this study, mean pooling is selected. Ultimately, this study obtains a fixed-size vector e_D representing the embedding of the entire document, which is used for the subsequent text clustering task. The K-means clustering algorithm is then applied to group the document vectors into multiple clusters, so that documents with similar semantic features fall into the same category. We chose K-means for our PTM because of its simplicity, efficiency, and effectiveness in handling large datasets, making it particularly well suited to topic modeling tasks that require scalability, ease of implementation, and accurate clustering of document embeddings. This organization facilitates effective comprehension and organization of the topics and contents of the textual data.
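The pooling step can be illustrated with placeholder token matrices; the random vectors below stand in for real BERT outputs, with the 768-dimensional size matching Sentence-BERT:

```python
# Hedged sketch of mean pooling: average the token rows of the contextualized
# matrix H into one fixed-size document vector e_D. The matrices here are
# random placeholders, not real BERT outputs.
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(H: np.ndarray) -> np.ndarray:
    """e_D = Pooling(H), with mean pooling over the token axis."""
    return H.mean(axis=0)

# Three "documents" with different token counts, 768-dim like Sentence-BERT
docs = [rng.normal(size=(n_tokens, 768)) for n_tokens in (5, 9, 7)]
embeddings = np.stack([mean_pool(H) for H in docs])
print(embeddings.shape)  # (3, 768)
```

The stacked matrix is exactly what the clustering step consumes, e.g. `KMeans(n_clusters=k).fit(embeddings)` with scikit-learn.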
Once the clustering process is completed, the statistical features are computed for the documents within each cluster. In this paper, TF-IDF values are chosen as the statistical features, which constitute a standard tool for assessing word importance in information retrieval and text mining. TF-IDF analysis provides a quantitative depiction of word importance within a cluster, thus explicitly reflecting the topic and content of a document collection. Based on these statistical features, this paper selects the top 10 words ranked by TF-IDF from each clustering cluster as the candidate set of label words. These words are chosen based on their high-frequency occurrences within their respective clusters and their strong association with specific topics, accurately reflecting the topic characteristics of the corresponding clusters. Subsequently, we filter out duplicate and irrelevant words across different topics from the finalized candidate set of label words to enhance the classification accuracy of the model. The label words obtained are shown in
Table 3. With this approach, the unsupervised label word selection method presented in this paper offers an innovative solution for topic discovery tasks that lack class name information. The method effectively extracts key information for topic categorization without manual annotation, providing a new perspective for addressing challenging topic discovery tasks.
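A minimal sketch of the per-cluster selection follows, using a toy corpus and a simple smoothed TF-IDF of our own choosing (not necessarily the exact scoring used in the paper, which keeps the top 10 words per cluster):

```python
# Hedged sketch of label-word selection: score words in one cluster's documents
# by an aggregate (smoothed) TF-IDF and keep the top-ranked ones as candidates.
# The tiny corpus is illustrative, not drawn from the paper's datasets.
import math
from collections import Counter

def tfidf_top_words(docs, top_n=3):
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency of each word within the cluster
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = Counter()
    for toks in tokenized:
        tf = Counter(toks)
        for w, c in tf.items():
            # Smoothed IDF, summed over the cluster's documents
            scores[w] += (c / len(toks)) * math.log((1 + n) / (1 + df[w]) + 1)
    return [w for w, _ in scores.most_common(top_n)]

cluster_docs = [
    "rocket launch booster rocket",
    "booster design rocket engine",
    "launch schedule engine test",
]
print(tfidf_top_words(cluster_docs))  # "rocket" ranks first
```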
3.3. Prompt Learning Topic Classification Module
The process of using prompt learning for text classification tasks includes selecting a pre-trained model, constructing a prompt template, and applying prompt learning for text classification.
First, we need to select an appropriate pre-training model. In this paper, we use BERT-large as the pre-trained language model to complete the task. These pre-trained models are deep neural networks trained on large-scale text data, capable of capturing the semantic and contextual information of the text.
Next, we construct prompt templates, which are specific forms of text that contain information about the task and category. Designing a prompt template helps the model to better understand and categorize the text.
Finally, we apply prompt learning for text classification. In this phase, we combine the pre-trained model with the constructed prompt templates to form a text classification model. The model accepts the input text and the corresponding prompts, generating a representation of the text by jointly encoding the prompts with the input text. This joint encoding process is typically implemented using the weights of the pre-trained model. Finally, the generated text representation is input to the classification layer for prediction. The specific implementation of the model has been described in
Section 3.1 and will not be reiterated here.
4. Detailed Analysis of the Dataset, Experimental Setup, and Results
In this section, we present the dataset used for the experiments, specify the experimental parameters, and perform ablation experiments on our proposed PTM to demonstrate its effectiveness. Furthermore, we compare the PTM with various topic modeling approaches to experimentally demonstrate its effectiveness.
4.1. Datasets
We used three datasets with different text lengths in our experiments: 20Newsgroups, M10, and GN. We followed OCTIS [
28] to preprocess these raw datasets.
The 20Newsgroups dataset is a standard dataset commonly used in natural language processing and text mining, especially in document classification and topic modeling tasks. It consists of 16,309 newsgroup documents distributed across 20 different newsgroups, each corresponding to a specific topic. These documents were collected from USENET newsgroups spanning from March 1995 to September 1995. These newsgroups cover a wide range of topics, including politics, religion, technology, and sports. M10 is a subset of the CiteSeerX data and consists of 8355 scientific publications from 10 different fields of study. The GoogleNews (GN) dataset includes 11,109 news articles, headlines, and snippets collected from the Google News website in November 2013.
4.2. Experimental Setup
We use Sentence-BERT as a model for text semantic representation, resulting in a document embedding of 768 dimensions. In the LDA topic model, we set the hyperparameter α to 0.1. In the NVDM topic model and CTM-Neg topic model, we set the number of neurons in the two hidden layers to 100. In this paper, we set the number of topics in the clustering process for the 20Newsgroups dataset, the M10 dataset, and the GN dataset to K = 10, K = 20, and K = 10, respectively.
The experimental environment is shown in
Table 4.
4.3. Compared with Other Methods
4.3.1. Evaluation Indicators
For unsupervised topic models, topic consistency is typically evaluated using two metrics: NPMI (Normalized Pointwise Mutual Information) and Cv. NPMI measures the correlation between word pairs. It is calculated by comparing the probability of a word pair co-occurring to the product of the probabilities of each word occurring independently, normalizing this value to the interval [−1, 1]. Word pairs with higher NPMI co-occur frequently in the corpus, suggesting meaningful relationships within the topic. Cv is calculated based on NPMI and cosine similarity and has been shown to be the metric closest to manual evaluation results.
For two words w_i and w_j, PMI and NPMI are calculated as follows:

PMI(w_i, w_j) = log [ P(w_i, w_j) / (P(w_i) P(w_j)) ]

NPMI(w_i, w_j) = PMI(w_i, w_j) / (−log P(w_i, w_j))

where P(w_i, w_j) is the probability of both words occurring simultaneously, and P(w_i) and P(w_j) are the probabilities of their respective occurrences.
In topic modeling, NPMI is often used to measure the consistency of word pairs within a topic. In practice, computing the NPMI of a topic usually involves the following steps.
Select the highest ranked words in the topic.
For each pair of these words, calculate their NPMI value.
The average NPMI value of all word pairs is calculated as the consistency score for the topic.
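The steps above can be sketched directly from document-level co-occurrence counts. The mini-corpus below is illustrative only; real evaluations typically estimate probabilities from sliding windows over a reference corpus:

```python
# Hedged sketch of topic-level NPMI: estimate word and pair probabilities from
# document co-occurrence counts, then average NPMI over all pairs of the
# topic's top words.
import math
from itertools import combinations

def npmi(w1, w2, docs, eps=1e-12):
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # the pair never co-occurs: NPMI's lower bound
    pmi = math.log(p12 / (p1 * p2))
    return pmi / (-math.log(p12) + eps)

def topic_npmi(top_words, docs):
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)

docs = [{"space", "rocket", "launch"}, {"rocket", "launch"}, {"music", "art"}]
score = topic_npmi(["space", "rocket", "launch"], docs)
print(round(score, 3))  # ≈ 0.579
```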
The calculation of the Cv value is then based on NPMI; in this paper, the word vector of a top word w_i is defined as its vector of NPMI values against the topic's top words:

v(w_i) = ( NPMI(w_i, w_1), NPMI(w_i, w_2), …, NPMI(w_i, w_N) )
Calculating the Cv value for a topic usually involves the following steps:
Select the highest-ranked words in the topic.
For each pair of these words, calculate the cosine similarity of their word vectors.
Take the average cosine similarity over all word pairs as the Cv value for the topic.
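Those steps can be sketched with a hand-made NPMI matrix; the values below are toy numbers rather than scores computed from data, and the full Cv measure of Röder et al. additionally uses boolean sliding windows, which we omit:

```python
# Hedged sketch of a Cv-style score: each word's vector is its row of NPMI
# values against the topic's top words; the topic score is the mean pairwise
# cosine similarity of those rows. The matrix entries are illustrative only.
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy NPMI matrix for 3 top words (hand-made values, not computed from data)
npmi_vectors = [
    [1.0, 0.4, 0.3],
    [0.4, 1.0, 0.5],
    [0.3, 0.5, 1.0],
]

pairs = list(combinations(range(len(npmi_vectors)), 2))
cv = sum(cosine(npmi_vectors[i], npmi_vectors[j]) for i, j in pairs) / len(pairs)
print(round(cv, 2))  # 0.72
```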
4.3.2. Baselines
We selected Latent Dirichlet Allocation (LDA), the Neural Variational Document Model (NVDM), the Contextualized Topic Model (CTM), and the Contextualized Topic Model with Negative Sampling (CTM-Neg) as benchmark models for comparison, owing to their prominence and widespread use in the field. LDA is a representative traditional topic model; NVDM exemplifies topic models based on the VAE architecture; CTM is a high-performing VAE-based model; and CTM-Neg is an improved version of CTM, recognized as one of the best-performing models of recent years within this architecture.
LDA: LDA is a classical probabilistic graphical model that describes word occurrence patterns in a document through the distribution of latent topics. LDA assumes that a document is generated from an underlying set of topics, each defined by a probability distribution over a set of words.
NVDM: NVDM is a generative model based on neural networks and variational inference for learning document representations in a continuous latent space. It combines the advantages of deep learning and probabilistic graphical models to capture the semantic structure of documents in an end-to-end manner.
CTM: CTM is a neural topic model that combines contextualized representations, effectively enhancing the meaning and coherence of the generated topics by connecting contextualized SBERT embeddings with the original BoW representation as a new textual representation input into the VAE framework.
CTM-Neg: CTM-Neg is an improved topic model based on the CTM, introducing a negative sampling mechanism for contextualized topic models to enhance the quality of generated topics. During model training, it perturbs the generated document topic vectors and uses ternary loss to ensure that the model’s output is more similar to the original input when reconstructed with the correct document topic vectors, while the output from perturbed vectors is dissimilar to the original vectors.
4.3.3. Experimental Results and Analysis
Table 5,
Table 6 and
Table 7 present the experimental results of our proposed PTM topic models and each baseline model on the 20Newsgroups dataset, the M10 dataset, and the GN dataset, respectively.
Figure 2 and
Figure 3 visualize the performance of each model on different datasets using bar charts.
This paper rigorously compares PTM to other mainstream topic modeling approaches, with Table 5, Table 6 and Table 7 illustrating the results of this comparison on the NPMI and metrics. The results indicate that PTM excels on both metrics, achieving the highest scores relative to other models such as LDA, NVDM, and CTM. PTM's significant improvement in NPMI over classic LDA is attributed to its superior semantic capture and more effective modeling of the topic discovery task. NVDM, a deep-learning-based topic model, inherently excels at capturing deep semantic information from text; nevertheless, PTM surpasses it on both metrics owing to its more efficient prompt-learning approach. CTM-Neg employs a negative sampling mechanism that strengthens its modeling of the topic discovery task, but its performance remains constrained by the shortcomings of the VAE framework in practical applications, yielding inferior results compared to PTM.
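NPMI, the coherence metric reported in these tables, normalizes pointwise mutual information so that scores fall in [-1, 1]. It can be estimated from document-level co-occurrence counts roughly as follows (a simplified sketch; standard implementations use sliding windows and smoothing, and average over the top words of each topic):

```python
import math

def npmi(word_i, word_j, documents, eps=1e-12):
    """Estimate NPMI for a word pair from document-level co-occurrence:
    NPMI = log(p_ij / (p_i * p_j)) / -log(p_ij).
    `documents` is a list of sets of words; returns -1 for pairs that
    never co-occur."""
    n = len(documents)
    p_i = sum(word_i in d for d in documents) / n
    p_j = sum(word_j in d for d in documents) / n
    p_ij = sum(word_i in d and word_j in d for d in documents) / n
    if p_ij == 0:
        return -1.0
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij + eps)
```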
Further analysis reveals that PTM's strong performance on the topic discovery task is primarily attributable to its direct use of pre-trained language models. Pre-trained models, trained on massive corpora, capture rich contextual information, which facilitates the understanding of complex topics and fine-grained categorization. In contrast, traditional and neural topic models are constrained by their model structures, which hinders further performance gains.
In summary, PTM applies a prompt-learning approach built on a pre-trained language model to the topic discovery task, yielding excellent results on both NPMI and , crucial evaluation metrics for topic models. These results not only confirm the superiority of PTM in topic modeling but also suggest a new direction for subsequent research in this field.
4.4. Ablation Experiment
To assess the contribution of the label word filtering module to the performance of the PTM, this study performed ablation experiments on the three datasets. The experiments removed specific components of the PTM while maintaining the integrity of the others, in order to ascertain the significance of each component; the results are detailed in Table 8. The specific setup is as follows:
w/o Prompt_Selection: with the Prompt_Selection module removed, the model no longer filters out non-essential words after prompt selection and instead uses the set of prompts selected on the basis of statistical features after clustering.
The ablation experiments demonstrated the significant impact of the label word filtering module on the overall performance of the PTM. As shown in Table 8, removing the Prompt_Selection module resulted in notable decreases in both NPMI and scores across all three datasets. Specifically, on the 20Newsgroups dataset, the NPMI score dropped from 0.136 to 0.092, and the score decreased from 0.658 to 0.564. Similar trends were observed on the M10 and GN datasets, where excluding the Prompt_Selection module reduced both metrics, indicating the module's critical role in enhancing model performance.
The ablation experiments thus clarify how the label word screening module affects overall model performance. Label words initially selected after clustering may include terms with unclear meanings or low relevance to the target task, which can interfere with the model's judgment and lead to erroneous inferences. The screening module reduces this noise by manually filtering out terms of unclear meaning, thereby improving the model's recognition accuracy. The performance drop observed when the module is removed indicates that improving the quality of the input features is critical to improving the model.
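As an illustration only, the screening step amounts to dropping candidate label words that fail simple relevance checks; the criteria and names below are hypothetical stand-ins for the manual filtering described above:

```python
def filter_label_words(candidates, stoplist=None, min_length=3):
    """Keep candidate label words that pass two hypothetical checks:
    a minimum length (very short strings are often noise) and absence
    from a stoplist of terms judged to have unclear meaning or low
    relevance to the target task."""
    stoplist = stoplist or set()
    return [w for w in candidates
            if len(w) >= min_length and w not in stoplist]
```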
Furthermore, the label word filtering module not only enhances model performance but also deepens our understanding of the model's decision-making process. Analyzing the filtered-out label words helps us understand the model's sensitivity and preferences for specific tasks, which is crucial for model interpretability, optimization, and tuning.
4.5. Runtime of PTM
To evaluate the efficiency of our Prompt Topic Model (PTM), we measured its runtime on three different datasets: 20Newsgroups, M10, and GoogleNews. Runtime is a critical factor in assessing the practical applicability of our model, particularly when dealing with large-scale text data.
Table 9 summarizes the runtime of PTM on these datasets.
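Runtime figures of this kind can be collected with a monotonic high-resolution timer around a single run (a sketch; the measured function here is a stand-in for the actual training entry point):

```python
import time

def time_run(fn, *args, **kwargs):
    """Run `fn` once and measure its wall-clock runtime with
    time.perf_counter, a monotonic high-resolution clock suited to
    benchmarking. Returns the function's result and the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```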