Article

A Topic Modeling Based on Prompt Learning

Mingjie Qiu, Wenzhong Yang, Fuyuan Wei and Mingliang Chen
1 School of Software, Xinjiang University, Urumqi 830091, China
2 Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi 830017, China
3 School of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3212; https://doi.org/10.3390/electronics13163212
Submission received: 21 July 2024 / Revised: 7 August 2024 / Accepted: 12 August 2024 / Published: 14 August 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Most of the existing topic models are based on the Latent Dirichlet Allocation (LDA) or the variational autoencoder (VAE), but these methods have inherent flaws. The a priori assumptions of LDA on documents may not match the actual distribution of the data, and VAE suffers from information loss during the mapping and reconstruction process, which tends to affect the effectiveness of topic modeling. To this end, we propose a Prompt Topic Model (PTM) utilizing prompt learning for topic modeling, which circumvents the structural limitations of LDA and VAE, thereby overcoming the deficiencies of traditional topic models. Additionally, we develop a prompt word selection method that enhances PTM’s efficiency in performing the topic modeling task. Experimental results demonstrate that the PTM surpasses traditional topic models on three public datasets. Ablation experiments further validate that our proposed prompt word selection method enhances the PTM’s effectiveness in topic modeling.

1. Introduction

Topic modeling is a technique for exploring the underlying topic structure of a text. By analyzing lexical co-occurrence patterns in documents, topic models can identify and extract latent topic structures in a document collection. This is crucial for understanding and organizing large-scale text data. In the era of information explosion, efficiently automating the processing and parsing of text content is essential. In information retrieval [1], topic models can improve the performance of search engines by identifying the underlying topics in user queries, thereby increasing the relevance of retrieval results. In recommender systems [2], topic models can offer users personalized recommendations by analyzing their interests. In addition, topic models also play an important role in the fields of natural language processing [3], social network analysis [4], and bioinformatics [5], assisting researchers in mining and understanding complex data patterns. Through these applications, topic modeling not only enhances the automation of text data processing but also significantly improves the efficiency of information acquisition and knowledge discovery, thereby holding great significance for both academic research and practical applications.
The Latent Dirichlet Allocation (LDA) model [6] is a widely used statistical model for discovering hidden topics from a collection of documents [7]. LDA achieves this by assuming that documents are mixtures of probability distributions over multiple topics, each characterized by a specific lexical distribution, and employs Bayesian inference for parameter estimation. Many LDA-based topic models followed, such as MetaLDA [8] proposed by Zhao et al. MetaLDA is able to use document or word meta-information for topic modeling. Then, Bunk et al. proposed the WELDA [9] model, which combines word embeddings with Latent Dirichlet Allocation to improve topic quality. However, despite its success in many applications, LDA has certain limitations. Its a priori assumptions, including the independence of documents from topics and the specific lexical distribution of topics, may not accurately reflect the actual data distribution. As a result, these assumptions can impact the accuracy and validity of the model.
With the rise of deep learning, neural topic models have become an important development in topic modeling research. Neural topic models improve the flexibility and performance of topic modeling by combining neural networks and probabilistic models. Among them, the variational autoencoder (VAE) [10] is widely used in topic modeling. VAE ensures the generative power of the model by compressing the data into a latent space and then regenerating the data from that space while introducing stochasticity in the encoding process. This approach ameliorates the problem of a priori assumptions in topic models based on the Dirichlet distribution and enhances the coherence of topic models. Consequently, most recent topic models are based on the VAE architecture. Kumar et al. proposed a topic model that combines reinforcement learning with a variational autoencoder [12]. Later, Adhya et al. proposed CTMKD [11], a topic model that introduces knowledge distillation into the topic modeling process. However, despite these significant advances, VAE-based topic models still suffer from a number of shortcomings. The training process of a VAE involves mapping the data to a low-dimensional topic distribution space and reconstructing the original data from that space. The difficulty of fully decoupling this process and the accompanying information loss limit the topic modeling capability of neural topic models.
To solve these problems, we propose the Prompt Topic Model (PTM), which applies prompt learning to topic modeling and classifies the topics of a text using the semantic capture ability of a pre-trained model trained on large amounts of data. This frees the model from the Latent Dirichlet Allocation and variational autoencoder architectures used in earlier topic models and overcomes the shortcomings of traditional topic models with respect to both a priori assumptions and model structure.
Furthermore, we introduce an unsupervised prompt word selection method that employs a pre-trained language model to derive document embedding vectors. These vectors are clustered by topic using a predetermined number of topics. Subsequently, a statistical method filters out a sufficient number of prompt candidates from each cluster, removing interfering words to finalize the prompt word set.
In summary, our principal contributions are as follows:
  • The proposal of the PTM, utilizing prompt learning to enhance topic modeling, effectively addresses the limitations of traditional models concerning a priori assumptions and structural constraints.
  • The development of an unsupervised, clustering-based prompt word selection method, independent of class name labels, facilitates the extraction of prompt words without label information, thereby supporting unsupervised topic modeling.
  • Extensive experimentation on three public datasets, with comparative and ablation studies demonstrating that the PTM surpasses traditional topic models in terms of topic consistency and validates the effectiveness of our prompt word selection technique.
The rest of this paper is structured as follows: Section 2 reviews related work. Section 3 describes our proposed method. Section 4 provides a detailed analysis of the dataset, experimental setup, and results. Finally, Section 5 concludes the paper, discusses the limitations of the model, and outlines future work.

2. Related Work

In recent years, topic modeling has emerged as a prominent research direction garnering significant attention in the domains of text analysis and machine learning. The objective of topic modeling is to automatically uncover the latent topic structure and semantic information embedded within extensive text data, offering substantial assistance in text comprehension and information retrieval. Throughout the progression of topic modeling, scholars have introduced numerous enhanced adaptations of LDA, including the Correlated Topic Model (CTM) [13]. CTM, a topic model extension devised by Blei et al. in 2005, successfully addresses the limitation of LDA in capturing topic correlations. CTM assumes a relationship among topics within a document and employs a multivariate Gaussian distribution to depict the correlation between them. Consequently, CTM delivers enhanced precision in topic representation and facilitates improved topic interpretation compared to LDA. The Dynamic Topic Model (DTM) [14], introduced by Blei et al. in 2006, is specifically designed for modeling time-series text data. DTM incorporates the notion that the topic distribution within a document evolves over time, employing temporal variables to capture such dynamics. DTM exhibits exceptional performance in analyzing texts with temporal sequences, including news, social media, and text streaming data. The Biterm Topic Model (BTM) [15], introduced by Yan et al. in 2013, is a topic model specifically tailored for short texts characterized by incompleteness and noise. BTM focuses on modeling word pairs within the text rather than the conventional document-level modeling. This approach enables BTM to more precisely capture the topical information present in short texts, leading to notable achievements in tasks like short text classification, recommender systems, and social media analytics. In 2012, Holmes et al. introduced the Dirichlet Multinomial Mixture (DMM) [16] as a method for inferring hidden topics in short texts. The DMM model relies on a relatively simple assumption where each text is sampled from a single potential topic, in contrast to the complex assumption of multiple topics within a document in the LDA model. This simplifying assumption renders the DMM model more suitable for processing short texts in a reasonable manner.
With the emergence of deep learning, neural-network-based topic models have garnered significant attention in research. One prominent model is the variational autoencoder (VAE), which integrates the concepts of autoencoder and variational inference. The VAE is capable of learning a continuous representation of text and conducting topic modeling through generative modeling. VAE topic models have demonstrated notable advancements in capturing text semantics and generating novel samples.
In 2016, Srivastava et al. introduced variational autoencoders (VAEs) to topic modeling and presented the ProdLDA model [17]. Diverging from the conventional Latent Dirichlet Allocation model, the ProdLDA model adopts a product of experts instead of a mixture model and leverages VAEs for variational inference. This enhancement leads to improved topic consistency and substantial performance gains in topic modeling tasks. In the same year, Miao et al. proposed another VAE-based topic model called the Neural Variational Document Model (NVDM) [18]. NVDM incorporates continuous hidden variables and is optimized using stochastic gradient variational Bayes. By combining neural variational inference with topic modeling, this model introduces fresh concepts and methods to the field of topic modeling.
Building upon the achievements of Generative Adversarial Nets (GANs) [19], Wang et al. introduced the Adversarial Neural Topic Model (ANTM) in 2017 [20]. ANTM adopts a GAN-like framework for topic modeling. Through an adversarial training strategy, the model aligns the distribution of possible topics with the prior distribution, thereby augmenting the model’s generalization capability and uncovering more cohesive topics.
In 2018, Tolstikhin et al. introduced Wasserstein Auto-Encoders (WAEs) [21] as an enhancement to the conventional VAE. WAE employs Wasserstein distance as a metric in the latent space and is applied in topic modeling to mitigate issues like posterior crashes commonly encountered in VAE-based models. This approach enhances the quality and stability of topics.
In 2019, Gui et al. enhanced the Neural Topic Model (NTM) [22] through a reinforcement learning framework to further elevate its performance. Their approach guides the topic modeling’s learning process by evaluating topic coherence during training and utilizing it as a reward signal. This methodology strengthens NTM’s capability to model topic coherence and enhances the model’s expressiveness and effectiveness.
In 2019, Dieng et al. introduced the Embedded Topic Model (ETM) in the embedding space [23]. The fundamental concept of ETM is to incorporate word embedding techniques into topic modeling for the improved capture of semantic relationships among words. In contrast to traditional topic models that treat words as discrete symbols, ETM represents words as continuous embedding vectors. By utilizing the distance and similarity between word embedding vectors as indicators of semantic relationships between words, ETM achieves more precise learning of associations between topics and the semantic information conveyed by words.
In 2021, Bianchi et al. introduced the Contextualized Topic Model (CTM) [24], a neural topic model that combines contextualized representations. CTM effectively enhances the meaning and coherence of the generated topics by connecting contextualized SBERT embeddings with the original BoW representation, forming a new textual representation input into the VAE framework. In 2022, Van Linh et al. introduced a graph-convolutional topic model that combines graph-convolutional neural networks with topic models [25]. This model exhibited superior performance in processing short text streams with noise. The graph-convolutional topic model represents text data as graph structures by utilizing the capability of graph-convolutional neural networks to model and infer topics on graphs. This approach enhances the capture of semantic relationships and contextual information between texts, thereby improving the effectiveness and robustness of topic modeling. In the same year, Adhya et al. introduced the Contextualized Topic Model with Negative Sampling (CTM-Neg) [26], an improved topic model based on the CTM. CTM-Neg introduces a negative sampling mechanism for contextualized topic models to enhance the quality of generated topics. During model training, it perturbs the generated document topic vectors and uses ternary loss to ensure that the model’s output is more similar to the original input when reconstructed with the correct document topic vectors, while the output from perturbed vectors is dissimilar to the original vectors. In 2024, Huang et al. presented a novel dependency-aware neural topic model [27]. This model considers the dependencies between topics as a generative process, where each topic is generated based on its dependencies on other topics.
Moreover, topic models have found extensive applications in various domains. In the field of text mining, topic models are applied for tasks such as topic discovery and opinion analysis. In the domains of information retrieval and recommendation systems, topic models offer the potential for more accurate text matching and personalized recommendations. In the domain of social media analysis, topic models facilitate the understanding of user interests and topic evolution.
To provide a clear and concise overview of the advancements in topic modeling discussed in this paper, we have summarized the key contributions of each referenced work in Table 1. This table includes the model, authors, the problem addressed by each study, and the solutions proposed. By presenting this information in a structured format, readers can easily grasp the evolution and improvements in the field of topic modeling.
In conclusion, topic modeling has garnered significant attention in the field of natural language processing and machine learning as a powerful tool for text analysis. It not only uncovers the topic structure within text data but also offers rich semantic information and valuable applications, providing crucial support for research and practical implementation in the field of text understanding and information processing.

3. Proposed Method

In this section, we introduce the method of incorporating prompt learning into the topic modeling task. Specifically, we begin by presenting the overall process of utilizing prompt learning for the text classification task. We then provide detailed explanations of the class-name-free prompt extraction method and the prompt filtering method. Abbreviations and their meanings are shown in Table 2.
This paper assumes a document collection D consisting of n documents, where D = {d_1, d_2, ..., d_n}. The objective of this paper is to discover k topics from these documents and assign each document d_i to a corresponding topic label y_i, forming the label collection Y = {y_1, y_2, ..., y_k}. Subsequently, the words w under each topic label y_i are ranked based on their importance, and the top m words in each topic, denoted as W = {w_1, w_2, ..., w_m}, are selected as the topic words for that particular topic.
Unlike traditional text classification tasks, we lack the given labeling information Y to aid the model in training or fine-tuning. In the topic modeling task, our only available resource is the document data, which we leverage to accomplish our objective. To enhance the performance of prompt learning in this task, we propose a fully unsupervised module for extracting label words, supplemented by a label word filtering session.
The steps for implementing the Prompt Topic Model (PTM) are detailed in Algorithm 1.
Algorithm 1. The steps for implementing the Prompt Topic Model
Step 1: Extract Embedding Vectors
Input: Document collection D consisting of n documents
Use pre-trained language model (Sentence-BERT) to obtain document embeddings
Embeddings = []
for document in D:
      tokens = tokenize(document)
      embeddings = SentenceBERT(tokens)
      Embeddings.append(embeddings)
 
Step 2: Clustering
Apply K-means clustering to group document embeddings into k clusters
k = predefined_number_of_topics
Clusters = KMeans(n_clusters=k).fit(Embeddings)
 
Step 3: Label Word Selection
For each cluster, select top words based on TF-IDF scores
LabelWords = []
for cluster in Clusters:
      documents_in_cluster = get_documents(cluster)
      tfidf_scores = compute_tfidf(documents_in_cluster)
      top_words = select_top_words(tfidf_scores, top_n=10)
      LabelWords.append(top_words)
 
Step 4: Construct Prompts
Construct prompt templates for each document
Prompts = []
for document in D:
      prompt = construct_prompt(document, template="The topic of this text is about [MASK].")
      Prompts.append(prompt)
 
Step 5: Prompt Learning for Topic Classification
Initialize pre-trained language model (BERT)
Model = PreTrainedModel("BERT-large")
 
Apply prompt learning to classify topics
PredictedTopics = []
for prompt in Prompts:
      probabilities = Model.predict(prompt)
      topic = select_topic(probabilities, LabelWords)
      PredictedTopics.append(topic)
 
Step 6: Evaluate Performance
Compute evaluation metrics (NPMI, Cv) to assess topic consistency
NPMI_scores = compute_NPMI(PredictedTopics, D)
Cv_scores = compute_Cv(PredictedTopics, D)
 
Output: Final topic classifications and evaluation scores
print("Final Topic Classifications:", PredictedTopics)
print("NPMI Scores:", NPMI_scores)
print("Cv Scores:", Cv_scores)

3.1. Model

Let M denote a pre-trained language model. When employing PLM for prompt learning in topic classification, the model transforms the classification task into a completion task.
Initially, this paper encapsulates the input sequence by employing a prompt template, which consists of a natural language text snippet. For instance, when the document D = {I have a question about the space shuttle launch system. How does the solid rocket booster contribute to the overall performance of the launch? Can anyone provide some insight into its design and functionality? Thank you!} needs to be encapsulated, the original document is wrapped during the categorization process into D_p = "[CLS] D: The topic of this text is about [MASK]." Here, [CLS] represents a special token utilized in classification tasks by models like BERT, while [MASK] serves as a placeholder that requires prediction by the model.
Subsequently, we employ a pre-trained language model M to calculate the probability P_M([MASK] = v | D_p) of assigning the word v from the vocabulary list to the placeholder [MASK]. To establish a relationship between word probability and label probability, this paper introduces a prompt selection module F, which maps certain words from the vocabulary list to a set of labels Y. These words collectively form a set of prompts V. Here, V_y represents the subset of prompts V mapped to a specific label y, where y belongs to the label set Y. In this case, the union of V_y over all y in Y equals V.
Subsequently, this paper calculates the probability P(y | D_p) of assigning the document D_p to label y:
P(y | D_p) = SUM( { P_M([MASK] = v | D_p) : v ∈ V_y } )
The process of prompt learning for topic categorization involves the utilization of a summation function, denoted as SUM, which aggregates the probabilities of the label words to determine the label probabilities. Figure 1 illustrates the schematic of prompt learning for topic categorization.
In this paper, a predefined set of label word mappings is introduced, namely V_1 = {"Science"} and V_2 = {"Humanities"}. If the probability of "science" exceeds that of "humanities", the document is categorized as "science" according to this paper's approach. Consequently, through prompt tuning, this paper effectively maps the input text to the appropriate category labels by leveraging the pre-trained language model M and the summation function SUM.
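To make the label probability computation concrete, the following minimal sketch (not the authors' code) shows how the [MASK] probabilities of a masked language model can be aggregated over label word sets using the Hugging Face transformers library; the label words and the example document are illustrative placeholders rather than the prompt sets actually used by the PTM.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-uncased")
model.eval()

# Hypothetical label word sets V_1 and V_2 (for illustration only).
label_words = {
    "science": ["science", "space", "physics"],
    "humanities": ["humanities", "history", "art"],
}

def classify(document):
    # Wrap the document with the prompt template to form D_p; the tokenizer adds [CLS]/[SEP].
    prompt = f"{document} The topic of this text is about {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Distribution over the vocabulary at the [MASK] position: P_M([MASK] = v | D_p).
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    probs = logits[0, mask_pos].softmax(dim=-1)
    # P(y | D_p): sum of P_M([MASK] = v | D_p) over the label words v in V_y.
    scores = {
        label: sum(probs[tokenizer.convert_tokens_to_ids(w)].item() for w in words)
        for label, words in label_words.items()
    }
    return max(scores, key=scores.get)

print(classify("How does the solid rocket booster contribute to the launch?"))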

3.2. Label Word Selection Module

Currently, prompt learning tasks frequently rely on expanding class name labels as labeling prompts. This practice extensively exploits the explicit information conveyed by class names to guide the learning process of the model. However, this approach is not applicable to topic discovery tasks since the data processed in such tasks typically lack explicit class name information. To address this problem, this study introduces an innovative unsupervised method for selecting labeled words that facilitates prompt learning. The method utilizes the clustering results of textual representations to efficiently carry out the task of topic classification. This study’s main contribution lies in the design of an unsupervised label word selection method.
The method starts from a document collection D and first extracts the embedding vectors of the documents using an advanced pre-trained language model, Sentence-BERT. These vectors not only capture the textual information but also encode the deeper semantics of the documents. The acquisition process is as follows: Let document D be composed of a sequence of tokens w_1, w_2, ..., w_m. To input these tokens into the BERT model, it is necessary to add the special markers [CLS] and [SEP] at the beginning and the end of the sequence, respectively, resulting in [CLS], w_1, w_2, ..., w_m, [SEP]. The sequence with the special tokens added is then encoded using the BERT model to obtain a contextualized representation of each token. This typically involves propagating the tokenized sequence through the BERT layers:
H = BERT([CLS], w_1, w_2, ..., w_m, [SEP])
where H is a matrix in which each row represents the contextualized vector representation of the corresponding token. Subsequently, a pooling strategy is selected to extract the embedding of the entire document from the contextualized representations. Typically, the output of the [CLS] token can be used, or the representations of all tokens can be averaged (mean pooling):
e_D = Pooling(H)
Here, the Pooling function could involve taking the [CLS] vector, mean pooling, or other pooling strategies that may be more appropriate for the data. In this study, mean pooling is selected as the pooling strategy. Ultimately, this study obtains a fixed-size vector e D , representing the embedding of the entire document, which can be used for subsequent text clustering tasks. Subsequently, the K-means clustering algorithm is applied to group the document vectors into multiple clusters, aiming to group documents with similar semantic features into the same category. We chose the K-means clustering algorithm for our PTM because of its simplicity, efficiency, and effectiveness in handling large datasets, making it particularly well-suited for topic modeling tasks that require scalability, ease of implementation, and accurate clustering of document embeddings. This organization facilitates effective comprehension and organization of the topics and contents of the textual data.
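As a concrete illustration of the embedding and clustering step, the following sketch (our assumption, not the paper's implementation) uses the sentence-transformers and scikit-learn packages; the model checkpoint, the toy documents, and the number of clusters are placeholders, whereas the paper itself reports 768-dimensional Sentence-BERT embeddings and dataset-specific topic numbers (e.g., K = 10 or K = 20).

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = [
    "The rocket engine test was delayed by bad weather.",
    "The orchestra performed a new symphony last night.",
    # ... the rest of the document collection D
]

# Sentence-BERT applies mean pooling over token representations internally,
# which corresponds to e_D = Pooling(H) with mean pooling.
encoder = SentenceTransformer("all-mpnet-base-v2")  # assumed checkpoint, 768-dim output
embeddings = encoder.encode(documents)

k = 2  # predefined number of topics (kept small for this toy example)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(embeddings)
cluster_ids = kmeans.labels_  # cluster assignment of each document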
Once the clustering process is completed, the statistical features are computed for the documents within each cluster. In this paper, TF-IDF values are chosen as the statistical features, which constitute a standard tool for assessing word importance in information retrieval and text mining. TF-IDF analysis provides a quantitative depiction of word importance within a cluster, thus explicitly reflecting the topic and content of a document collection. Based on these statistical features, this paper selects the top 10 words ranked by TF-IDF from each cluster as the candidate set of label words. These words are chosen based on their high-frequency occurrences within their respective clusters and their strong association with specific topics, accurately reflecting the topic characteristics of the corresponding clusters. Subsequently, we filter out duplicate and irrelevant words across different topics from the candidate set of label words to enhance the classification accuracy of the model. The resulting label words are shown in Table 3. With this approach, the unsupervised label word selection method presented in this paper offers an innovative solution for topic discovery tasks that lack class name information. The method effectively extracts key information for topic categorization without manual annotation, providing a new perspective for addressing challenging topic discovery tasks.
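The candidate label word selection per cluster can be sketched as follows (again an assumption rather than the authors' code), using scikit-learn's TfidfVectorizer; documents and cluster_ids are assumed to come from the clustering sketch above, and the subsequent removal of duplicate or uninformative words corresponds to the Prompt_Selection step evaluated in the ablation study.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def candidate_label_words(documents, cluster_ids, top_n=10):
    # Fit TF-IDF over the whole collection so scores are comparable across clusters.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)               # shape: documents x vocabulary
    vocab = np.array(vectorizer.get_feature_names_out())
    cluster_ids = np.asarray(cluster_ids)
    candidates = {}
    for c in np.unique(cluster_ids):
        # Average TF-IDF weight of each word over the documents in cluster c.
        mean_scores = np.asarray(tfidf[cluster_ids == c].mean(axis=0)).ravel()
        top_idx = mean_scores.argsort()[::-1][:top_n]
        candidates[int(c)] = vocab[top_idx].tolist()
    return candidates

label_word_candidates = candidate_label_words(documents, cluster_ids)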

3.3. Prompt Learning Topic Classification Module

The process of using prompt learning for text classification tasks includes selecting a pre-trained model, constructing a prompt template, and applying prompt learning for text classification.
First, we need to select an appropriate pre-trained model. In this paper, we use BERT-large as the pre-trained language model to complete the task. Such pre-trained models are deep neural networks trained on large-scale text data, capable of capturing the semantic and contextual information of the text.
Next, we construct prompt templates, which are specific forms of text that contain information about the task and category. Designing a prompt template helps the model to better understand and categorize the text.
Finally, we apply prompt learning for text classification. In this phase, we combine the pre-trained model with the constructed prompt templates to form a text classification model. The model accepts the input text and the corresponding prompts, generating a representation of the text by jointly encoding the prompts with the input text. This joint encoding process is typically implemented using the weights of the pre-trained model. Finally, the generated text representation is input to the classification layer for prediction. The specific implementation of the model has been described in Section 3.1 and will not be reiterated here.
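Table 4 lists OpenPrompt as the prompt learning component; the sketch below shows one plausible way (our assumption, with illustrative label words, and an API that may differ across OpenPrompt versions) to wire together the pre-trained model, the prompt template, and a verbalizer built from the selected label words.

import torch
from openprompt import PromptDataLoader, PromptForClassification
from openprompt.data_utils import InputExample
from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate, ManualVerbalizer

# Load BERT-large together with the matching tokenizer and wrapper class.
plm, tokenizer, model_config, WrapperClass = load_plm("bert", "bert-large-uncased")

# Prompt template: "<document> The topic of this text is about [MASK]."
template = ManualTemplate(
    tokenizer=tokenizer,
    text='{"placeholder":"text_a"} The topic of this text is about {"mask"}.',
)

# Verbalizer built from (illustrative) label words selected per topic cluster.
verbalizer = ManualVerbalizer(
    tokenizer=tokenizer,
    num_classes=2,
    label_words=[["science", "space"], ["humanities", "history"]],
)

model = PromptForClassification(plm=plm, template=template, verbalizer=verbalizer, freeze_plm=True)
model.eval()

dataset = [InputExample(guid=0, text_a="How does the solid rocket booster work?")]
loader = PromptDataLoader(dataset=dataset, template=template, tokenizer=tokenizer,
                          tokenizer_wrapper_class=WrapperClass, max_seq_length=256)

with torch.no_grad():
    for batch in loader:
        logits = model(batch)                  # label scores aggregated over label words
        print(torch.argmax(logits, dim=-1))    # predicted topic index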

4. Detailed Analysis of the Dataset, Experimental Setup, and Results

In this section, we present the datasets used for the experiments, specify the experimental parameters, and perform ablation experiments on our proposed PTM to assess the contribution of its components. Furthermore, we compare the PTM with various topic modeling approaches to experimentally demonstrate its effectiveness.

4.1. Datasets

We used three datasets with different text lengths in our experiments: 20Newsgroups, M10, and GN. We followed OCTIS [28] to preprocess these raw datasets.
The 20Newsgroups dataset is a standard dataset commonly used in natural language processing and text mining, especially in document classification and topic modeling tasks. It consists of 16,309 newsgroup documents distributed across 20 different newsgroups, each corresponding to a specific topic. These documents were collected from USENET newsgroups spanning from March 1995 to September 1995. These newsgroups cover a wide range of topics, including politics, religion, technology, and sports. M10 is a subset of the CiteSeerX data and consists of 8355 scientific publications from 10 different fields of study. The GoogleNews (GN) dataset includes 11,109 news articles, headlines, and snippets collected from the Google News website in November 2013.

4.2. Experimental Setup

We use Sentence-BERT as a model for text semantic representation, resulting in a document embedding of 768 dimensions. In the LDA topic model, we set the hyperparameter α to 0.1. In the NVDM topic model and CTM-Neg topic model, we set the number of neurons in the two hidden layers to 100. In this paper, we set the number of topics in the clustering process for the 20Newsgroups dataset, the M10 dataset, and the GN dataset to K = 10, K = 20, and K = 10, respectively.
The experimental environment is shown in Table 4.

4.3. Compared with Other Methods

4.3.1. Evaluation Indicators

For unsupervised topic models, topic consistency is typically evaluated using the following two metrics: NPMI (Normalized Pointwise Mutual Information) and the C_v coherence measure. NPMI measures the correlation between word pairs. It is calculated by comparing the ratio of the probability of a word pair co-occurring to the product of the probabilities of each word occurring independently and normalizing this value to the interval [−1, 1]. Word pairs with higher NPMI co-occur frequently in the corpus, suggesting meaningful relationships within the topic. C_v is calculated based on NPMI and cosine similarity and has been shown to be the metric closest to manual evaluation results.
For two words w_i and w_j, PMI and NPMI are calculated as follows:
PMI(w_i, w_j) = log [ P(w_i, w_j) / ( P(w_i) P(w_j) ) ]
NPMI(w_i, w_j) = PMI(w_i, w_j) / ( −log P(w_i, w_j) )
where P(w_i, w_j) is the probability of both words occurring simultaneously, and P(w_i) and P(w_j) are the probabilities of their respective occurrences.
In topic modeling, NPMI is often used to measure the consistency of word pairs within a topic. In practice, computing the NPMI of a topic usually involves the following steps (a code sketch covering both NPMI and C_v follows the C_v steps below).
  • Select the N highest ranked words in the topic.
  • For each pair (w_i, w_j) of these N words, calculate their NPMI value.
  • The average NPMI value of all word pairs is calculated as the consistency score for the topic.
The calculation of the C_v value is then based on NPMI, and in this paper, the word vector of w_i is defined as follows:
v(w_i) = ( NPMI(w_i, w_1), NPMI(w_i, w_2), ..., NPMI(w_i, w_N) )
Calculating the C v value for a topic usually involves the following steps:
  • Select the N highest ranked words in the topic.
  • For each pair (w_i, w_j) of these N words, calculate the cosine similarity of their word vectors.
  • Calculate the average of the cosine similarity of all word pairs as the C v value for the topic.
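For completeness, here is a small, self-contained sketch of how topic-level NPMI and a C_v-style score based on cosine similarity of NPMI vectors could be computed for one topic's top-N words; this is our illustration of the steps above, not the evaluation code used in the paper (such scores are typically obtained from a toolkit such as OCTIS).

import math
from itertools import combinations
import numpy as np

def word_probs(top_words, tokenized_docs, eps=1e-12):
    # Document-level occurrence probabilities for single words and word pairs.
    n_docs = len(tokenized_docs)
    p_single = {w: sum(w in d for d in tokenized_docs) / n_docs for w in top_words}
    p_joint = {
        frozenset((wi, wj)): sum((wi in d) and (wj in d) for d in tokenized_docs) / n_docs
        for wi, wj in combinations(top_words, 2)
    }
    return p_single, p_joint, eps

def npmi(wi, wj, p_single, p_joint, eps):
    p_ij = max(p_joint[frozenset((wi, wj))], eps)
    pmi = math.log(p_ij / (max(p_single[wi], eps) * max(p_single[wj], eps)))
    return pmi / max(-math.log(p_ij), eps)

def topic_npmi(top_words, tokenized_docs):
    # Average NPMI over all pairs of the topic's top words.
    p_single, p_joint, eps = word_probs(top_words, tokenized_docs)
    pairs = list(combinations(top_words, 2))
    return sum(npmi(wi, wj, p_single, p_joint, eps) for wi, wj in pairs) / len(pairs)

def topic_cv(top_words, tokenized_docs):
    # Represent each word by its vector of NPMI values against the top words,
    # then average the pairwise cosine similarities of those vectors.
    p_single, p_joint, eps = word_probs(top_words, tokenized_docs)
    def vec(w):
        return np.array([1.0 if u == w else npmi(w, u, p_single, p_joint, eps) for u in top_words])
    sims = [float(np.dot(vec(a), vec(b)) / (np.linalg.norm(vec(a)) * np.linalg.norm(vec(b)) + eps))
            for a, b in combinations(top_words, 2)]
    return sum(sims) / len(sims)

docs = [["space", "rocket", "launch"], ["rocket", "engine", "fuel"], ["music", "concert", "band"]]
print(topic_npmi(["space", "rocket", "launch"], docs), topic_cv(["space", "rocket", "launch"], docs))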

4.3.2. Baselines

We selected Latent Dirichlet Allocation (LDA), the Neural Variational Document Model (NVDM), the Contextualized Topic Model (CTM), and the Contextualized Topic Model with Negative Sampling (CTM-Neg) as benchmark models for comparison due to their prominence and widespread use in the field. LDA is a representative traditional topic model, NVDM exemplifies topic models based on the VAE architecture, CTM is a high-performing VAE-based model, and CTM-Neg is an improved version of CTM, which is recognized as one of the best-performing models in recent years within this architecture.
LDA: LDA is a classical probabilistic graphical model that describes word occurrence patterns in a document through the distribution of latent topics. LDA assumes that a document is generated from an underlying set of topics, each defined by a probability distribution over a set of words.
NVDM: NVDM is a generative model based on neural networks and variational inference for learning document representations in a continuous latent space. It combines the advantages of deep learning and probabilistic graphical models to capture the semantic structure of documents in an end-to-end manner.
CTM: CTM is a neural topic model that combines contextualized representations, effectively enhancing the meaning and coherence of the generated topics by connecting contextualized SBERT embeddings with the original BoW representation as a new textual representation input into the VAE framework.
CTM-Neg: CTM-Neg is an improved topic model based on the CTM, introducing a negative sampling mechanism for contextualized topic models to enhance the quality of generated topics. During model training, it perturbs the generated document topic vectors and uses ternary loss to ensure that the model’s output is more similar to the original input when reconstructed with the correct document topic vectors, while the output from perturbed vectors is dissimilar to the original vectors.

4.3.3. Experimental Results and Analysis

Table 5, Table 6 and Table 7 present the experimental results of our proposed PTM topic models and each baseline model on the 20Newsgroups dataset, the M10 dataset, and the GN dataset, respectively. Figure 2 and Figure 3 visualize the performance of each model on different datasets using bar charts.
This paper rigorously compares the PTM to other mainstream topic modeling approaches, with Table 5, Table 6 and Table 7 illustrating the results of this comparison on the NPMI and C_v metrics. The results indicate that the PTM excels on both metrics, achieving the highest scores on NPMI and C_v relative to other models such as LDA, NVDM, and CTM. PTM’s significant improvement in NPMI over classic LDA is attributed to its superior semantic capture capability and effective modeling of the topic discovery task. NVDM, a deep-learning-based topic modeling approach, inherently excels at capturing deep semantic information from text. However, PTM surpasses NVDM on NPMI and C_v due to a more efficient prompt-learning approach. CTM-Neg employs a negative sampling mechanism that enhances its modeling of the topic discovery task. However, its performance remains constrained by the VAE model’s shortcomings in practical applications, yielding inferior results compared to PTM.
Further analysis reveals that PTM’s excellent performance in topic discovery tasks is primarily attributable to its ability to directly utilize pre-trained language models. Pre-trained models, typically trained on extensive datasets, can capture extensive contextual information, facilitating understanding of complex topics and fine-grained categorization. In contrast, traditional and neural topic models are constrained by their inadequate structural frameworks, which hinder further performance enhancements.
In summary, the PTM utilizes a prompt learning approach with a pre-trained language model for the topic discovery task, yielding excellent results in both NPMI and C_v, crucial evaluation metrics for topic models. These results not only confirm the superiority of PTM in topic modeling but also suggest a new direction for subsequent research in this field.

4.4. Ablation Experiment

To assess the contribution of the label word filtering module of the PTM to its performance, this study performed ablation experiments on three datasets. The results of the ablation experiments are detailed in Table 8, which involved removing specific components of the PTM while maintaining the integrity of the others, to ascertain the significance of each component. The specific setup for the ablation experiment is as follows:
w/o Prompt_Selection: with the Prompt_Selection module removed, the model no longer filters out non-essential words after the candidate prompts are selected, and instead directly uses the set of prompt words selected from the statistical features after clustering.
The ablation experiments demonstrated the significant impact of the label word filtering module on the overall performance of the PTM. As shown in Table 8, the removal of the Prompt_Selection module resulted in notable decreases in both NPMI and C_v scores across all three datasets. Specifically, for the 20NewsGroups dataset, the NPMI score dropped from 0.136 to 0.092, and the C_v score decreased from 0.658 to 0.564. Similar trends were observed for the M10 and GN datasets, where the exclusion of the Prompt_Selection module led to reductions in both metrics, indicating the module’s critical role in enhancing model performance.
Through ablation experiments, this study gained insights into the impact of the label word filtering module on overall model performance. The label word screening module plays a crucial role in enhancing model performance. Label words initially screened out after clustering may contain terms with unknown meanings or low relevance to the target task, potentially interfering with the model’s judgment and leading to erroneous inferences. The label word screening module reduces noise input by manually filtering out terms of unknown meaning, thereby enhancing the model’s recognition accuracy. In the ablation experiments, removing the label word screening module resulted in decreased model performance, indicating that enhancing the quality of input features is critical to model improvement.
Furthermore, the label word filtering module not only enhances model performance but also deepens our understanding of the model’s decision-making process. Analyzing the filtered-out label words enables us to better understand the model’s sensitivity and preferences for specific tasks, crucial for enhancing model interpretability, optimization, and tuning.

4.5. Runtime of PTM

To evaluate the efficiency of our Prompt Topic Model (PTM), we measured its runtime on three different datasets: 20Newsgroups, M10, and GoogleNews. Runtime is a critical factor in assessing the practical applicability of our model, particularly when dealing with large-scale text data. Table 9 summarizes the runtime of PTM on these datasets.

5. Conclusions and Future Work

5.1. Conclusions

This study introduces a novel Prompt Topic Model (PTM) for topic modeling tasks that incorporates a prompt learning approach. The model aims to address the limitations associated with traditional topic models, specifically those related to the a priori assumptions of Dirichlet allocation and the information loss of variational autoencoders. Additionally, this study proposes an unsupervised label word selection method, wherein document embedding vectors obtained from a pre-trained language model are clustered, and a sufficient number of words are then filtered from each cluster as label words. Experimental results demonstrate that the proposed PTM method achieves optimal performance across multiple datasets. These findings suggest that prompt learning can enhance traditional topic modeling and effectively support topic classification tasks via an unsupervised label word selection approach. This research contributes significantly to understanding the underlying topic structure in texts and improving topic modeling performance, offering valuable insights and methodologies for further research in natural language processing.

5.2. Future Work

Despite the promising results, our PTM has certain limitations, primarily its higher computational complexity compared to traditional models. This increased complexity results from the use of advanced embedding techniques and clustering mechanisms, which may lead to longer processing times and higher resource consumption.
Future work will concentrate on optimizing the computational efficiency of the PTM without compromising its performance. We intend to explore alternative clustering algorithms and dimensionality reduction techniques to alleviate the computational burden. Additionally, extending the PTM to support multilingual datasets and to enable real-time topic modeling in dynamic environments constitutes a key area for future research. Furthermore, investigating the integration of PTM with other natural language processing tasks may enhance its versatility and applicability across various domains.

Author Contributions

Conceptualization, M.Q.; methodology, M.Q.; validation, M.Q.; writing—original draft preparation, M.Q.; writing—review and editing, M.Q., W.Y., F.W. and M.C.; supervision, F.W. and M.C.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is a research achievement supported by the “Tianshan Talent” Research Project of Xinjiang (No. 2022TSYCLJ0037), the National Natural Science Foundation of China (No. 62262065), the Science and Technology Program of Xinjiang (No. 2022B01008), the National Key R&D Program of China Major Project (No. 2022ZD0115800), the National Natural Science Foundation of China (No. 62341206), the Central Government Guides Local Projects (No. ZYYD2022C19).

Data Availability Statement

The 20NewsGroups dataset and M10 dataset are available at https://github.com/MIND-Lab/OCTIS, accessed on 14 June 2023. The GoogleNews dataset is available at https://github.com/AdhyaSuman/CTMNeg/tree/master/preprocessed_datasets/GN, accessed on 12 May 2023.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ateyah, S.; Al-Augby, S. Proposed information retrieval systems using LDA topic modeling for answer finding of COVID-19 pandemic: A brief survey of approaches and techniques. In Proceedings of the AIP Conference Proceedings, Baghdad, Iraq, 8–9 December 2021; AIP Publishing: College Park, MD, USA, 2023; Volume 2591. [Google Scholar]
  2. Kawai, M.; Sato, H.; Shiohama, T. Topic model-based recommender systems and their applications to cold-start problems. Expert Syst. Appl. 2022, 202, 117129. [Google Scholar] [CrossRef]
  3. Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2023, 112, 102131. [Google Scholar] [CrossRef]
  4. Shi, L.; Song, G.; Cheng, G.; Liu, X. A user-based aggregation topic model for understanding user’s preference and intention in social network. Neurocomputing 2020, 413, 1–13. [Google Scholar] [CrossRef]
  5. Liu, L.; Tang, L.; Dong, W.; Yao, S.; Zhou, W. An overview of topic modeling and its current applications in bioinformatics. Springerplus 2016, 5, 1608. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  6. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  7. Chauhan, U.; Shah, A. Topic modeling using latent Dirichlet allocation: A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
  8. Zhao, H.; Du, L.; Buntine, W.; Liu, G. MetaLDA: A topic model that efficiently incorporates meta information. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 635–644. [Google Scholar]
  9. Bunk, S.; Krestel, R. Welda: Enhancing topic models by incorporating local word context. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, Fort Worth, TX, USA, 3–7 June 2018; pp. 293–302. [Google Scholar]
  10. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. Stat 2014, 1050, 1. [Google Scholar]
  11. Adhya, S.; Sanyal, D.K. Improving neural topic models with Wasserstein knowledge distillation. In Proceedings of the European Conference on Information Retrieval, Dublin, Ireland, 2–6 April 2023; Springer Nature: Cham, Switzerland, 2023; pp. 321–330. [Google Scholar]
  12. Kumar, A.; Esmaili, N.; Piccardi, M. A reinforced variational autoencoder topic model. In Proceedings of the Neural Information Processing: 28th International Conference, ICONIP 2021, Sanur, Bali, Indonesia, 8–12 December 2021; Proceedings, Part V 28. Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 360–369. [Google Scholar]
  13. Blei, D.M.; Lafferty, J.D. A correlated topic model of Science. Ann. Appl. Stat. 2007, 1, 17–35. [Google Scholar] [CrossRef]
  14. Blei, D.M.; Lafferty, J.D. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 113–120. [Google Scholar]
  15. Yan, X.; Guo, J.; Lan, Y.; Cheng, X. A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, 13–17 May 2013; pp. 1445–1456. [Google Scholar]
  16. Holmes, I.; Harris, K.; Quince, C. Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLoS ONE 2012, 7, e30126. [Google Scholar] [CrossRef] [PubMed]
  17. Srivastava, A.; Sutton, C. Autoencoding Variational Inference for Topic Models. Stat 2017, 1050, 4. [Google Scholar]
  18. Miao, Y.; Yu, L.; Blunsom, P. Neural variational inference for text processing. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1727–1736. [Google Scholar]
  19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27. [Google Scholar]
  20. Wang, R.; Zhou, D.; He, Y. Atm: Adversarial-neural topic model. Inf. Process. Manag. 2019, 56, 102098. [Google Scholar] [CrossRef]
  21. Tolstikhin, I.; Bousquet, O.; Gelly, S.; Schoelkopf, B. Wasserstein Auto-Encoders. Stat 2018, 1050, 12. [Google Scholar]
  22. Gui, L.; Leng, J.; Pergola, G.; Zhou, Y.; Xu, R.; He, Y. Neural topic model with reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3478–3483. [Google Scholar]
  23. Dieng, A.B.; Ruiz, F.J.R.; Blei, D.M. Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguist. 2020, 8, 439–453. [Google Scholar] [CrossRef]
  24. Bianchi, F.; Terragni, S.; Hovy, D. Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv 2020, arXiv:2004.03974. [Google Scholar]
  25. Van Linh, N.; Bach, T.X.; Than, K. A graph convolutional topic model for short and noisy text streams. Neurocomputing 2022, 468, 345–359. [Google Scholar] [CrossRef]
  26. Adhya, S.; Lahiri, A.; Sanyal, D.K.; Das, P.P. Improving Contextualized Topic Models with Negative Sampling. In Proceedings of the 19th International Conference on Natural Language Processing (ICON), New Delhi, India, 15–18 December 2022; pp. 128–138. [Google Scholar]
  27. Huang, H.; Tang, Y.-K.; Shi, X.; Mao, X.-L. Dependency-Aware Neural Topic Model. Inf. Process. Manag. 2024, 61, 103530. [Google Scholar] [CrossRef]
  28. Terragni, S.; Fersini, E.; Galuzzi, B.G.; Tropeano, P.; Candelieri, A. OCTIS: Comparing and optimizing topic models is simple! In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Online, 19–23 April 2021; pp. 263–270. [Google Scholar]
Figure 1. Schematic diagram of the PTM, where (a) is the Label Word Selection Module and (b) is the Prompt Learning Topic Classification Module.
Figure 2. NPMI values for different topic models on the three datasets.
Figure 3. C_v values for different topic models on the three datasets.
Table 1. The key contributions of each referenced work.
Model | Authors | Problem Addressed | Solution Proposed
LDA | Blei et al. [6] | Baseline topic modeling | Latent Dirichlet Allocation
CTM | Blei et al. [13] | Capturing topic correlations | Multivariate Gaussian distribution to depict correlations
DTM | Blei et al. [14] | Modeling time-series text data | Temporal variables for capturing dynamics
BTM | Yan et al. [15] | Handling short texts | Modeling word pairs
DMM | Holmes et al. [16] | Processing short texts | Single topic per document assumption
ProdLDA | Srivastava et al. [17] | Improved topic consistency with expert products | Using VAEs for variational inference
NVDM | Miao et al. [18] | Combining Neural Variational Reasoning with topic modeling | Continuous hidden variables optimized using stochastic gradient variational Bayes
ANTM | Wang et al. [20] | Adversarial training for better topic generalization | GAN-like framework with adversarial training
WAE | Tolstikhin et al. [21] | Mitigating posterior crashes in VAE-based models | Wasserstein distance in latent space
NTM | Gui et al. [22] | Reinforcement learning for topic coherence | Reinforcement learning framework
ETM | Dieng et al. [23] | Incorporating word embeddings into topic modeling | Word embeddings for semantic relationships
CTM | Bianchi et al. [24] | Combining contextualized representations with BoW | Contextualized SBERT embeddings with VAE
CTM-Neg | Adhya et al. [26] | Negative sampling for enhancing topic quality | Negative sampling mechanism during training
Dependency-aware neural topic model | Huang et al. [27] | Considering dependencies between topics | Generative process with topic dependencies
Table 2. Abbreviations and their meanings.
Abbreviation | Meaning
D | Document collection
n | Number of documents in the collection D
k | Number of topics to discover
d_i | Individual document in the collection D
y_i | Topic label assigned to document d_i
Y | Collection of topic labels
w | Words under each topic label
m | Number of top words selected for each topic
W | Collection of top words for each topic
M | Pre-trained language model
[MASK] | Placeholder token in the BERT model
[CLS] | Special token used in BERT to represent the start of a sequence
[SEP] | Special token used in BERT to represent the separation between sequences
SUM | Summation function used in prompt learning
V | Set of prompt words
V_y | Subset of prompts mapped to a specific label y
Table 3. Examples of label words.
Dataset | Label Words
20NewsGroups | color, image, graphic, window, display, file, printer…
20NewsGroups | price, sell, sale, offer, buy, pay, card…
20NewsGroups | car, bike, ride, drive, engine, make, oil…
M10 | market, stock, price, financial, risk, monetary, asset…
M10 | robot, learn, model, system, simulation, learning, base…
M10 | gas, methane, fuel, oil, diesel, natural, emission…
Table 4. The experimental environment.
Component | Description
Operating System | Linux (Ubuntu 18.04.6)
CPU | Intel(R) Xeon(R) Gold 5318Y
GPU | NVIDIA A40
Storage | 10 Terabyte
Editor | PyCharm 2021.3.1
Programming Language | Python 3.8
CUDA Version | CUDA 11.2
Prompt Learning Component | OpenPrompt
Table 5. Comparative experimental results on the 20NewsGroups dataset.
Model | NPMI | C_v
LDA | 0.075 | 0.386
NVDM | 0.080 | 0.448
CTM | 0.106 | 0.529
CTM-Neg | 0.129 | 0.562
PTM | 0.136 | 0.658
Table 6. Comparative experimental results on the M10 dataset.
Model | NPMI | C_v
LDA | −0.032 | 0.382
NVDM | 0.045 | 0.468
CTM | 0.063 | 0.451
CTM-Neg | 0.072 | 0.472
PTM | 0.127 | 0.628
Table 7. Comparative experimental results on the GN dataset.
Model | NPMI | C_v
LDA | −0.164 | 0.325
NVDM | 0.056 | 0.476
CTM | 0.081 | 0.513
CTM-Neg | 0.142 | 0.552
PTM | 0.157 | 0.671
Table 8. Ablation experiments of the PTM on three datasets.
Dataset | Model | NPMI | C_v
20NewsGroups | PTM | 0.136 | 0.658
20NewsGroups | PTM w/o Prompt_Selection | 0.092 | 0.564
M10 | PTM | 0.127 | 0.628
M10 | PTM w/o Prompt_Selection | 0.086 | 0.493
GN | PTM | 0.157 | 0.671
GN | PTM w/o Prompt_Selection | 0.107 | 0.479
Table 9. Runtime of PTM on different datasets.
Dataset | Number of Documents | Runtime
20Newsgroups | 16,309 | 34 min 42 s
M10 | 8355 | 26 min 36 s
GoogleNews | 11,109 | 23 min 36 s
