1. Introduction
The emergence of ChatGPT [1] has significantly impacted dialogue generation with its remarkable performance. The public has widely embraced and actively participated in discussions about this chatbot [2,3] on social media like Twitter (http://twitter.com, accessed on 4 February 2024) and Reddit (http://reddit.com, accessed on 4 February 2024). For instance, Figure 1 shows discussions regarding diverse concerns about ChatGPT posted by Twitter users, with distinct colors marking the words for specific concerns (e.g., yellow means ‘Education’, green denotes ‘Security’, and blue represents ‘Technical Basis’). Also, a vast number of users eagerly interact with this conversational agent and explore its capabilities. Compared to texts from other sources, the review posts and queries about ChatGPT encapsulate the primary concerns more effectively, such as the services it can offer and user experiences. Therefore, mining the latent concerns lying behind these texts is valuable for academia [4] and public daily life.
However, the massive volume of review posts and user-generated queries, the wide range of concerns on this open-domain topic, and the noisy nature (e.g., casual expressions, grammar mistakes) [5] of the user-generated unlabeled text make it challenging to mine high-quality concerns. Considering the characteristics of this extraction task, topic models provide a promising way to extract these public concerns in an unsupervised manner. Furthermore, to ensure the quality of the extracted concerns, we should consider both of the following aspects: (1) Interpretability: extracted concerns should be semantically coherent and agree with human judgment (mined representative words should be semantically coherent and able to represent the meaning of specific concerns, such as the highlighted words in Figure 1). (2) Diversity: a wide range of concerns about ChatGPT should be reflected in the mined results (all the discussed concerns, like ‘Education’, ‘Security’, and ‘Technical Basis’ shown in Figure 1, should be discovered in the extracted results without neglecting any of them).
Neural topic modeling [6] addresses the inefficiency of the Gibbs sampler in traditional topic models [7], serving as a tool for unsupervised open information extraction [8] and offering a promising direction for mining public concerns. Nevertheless, two streams of advanced models, based on the Variational Auto-Encoder (VAE) [9] and the Generative Adversarial Network (GAN) [10], cannot fulfill both of our requirements. The improper prior of the topic distribution used in VAE-based neural topic models [7,11,12,13] often sacrifices interpretability, thereby diminishing the ability to represent specific focuses. Adversarial topic models [10,14] face the topic collapse problem (failing to discover diverse semantic patterns [7,15]), resulting in reduced concern diversity and the loss of crucial information.
Thus, to address the potential issues posed by the aforementioned approaches and mine high-quality public concerns about ChatGPT, the Self-Supervised Neural Topic Model (SSTM) is proposed in this paper. Unlike most existing models, the SSTM formalizes topic modeling as representation learning to mitigate topic collapse [16]. Its principal idea is to build a one-way projection from the document-word distribution to the document-topic distribution, along with a text augmentation scheme, to capture the correlations between words and topics. Specifically, to ensure interpretability, we leverage an invariance regularizer to preserve topic information from augmented pairs and the maximum mean discrepancy to match the statistical distribution to the Dirichlet prior in the topic space. Additionally, a covariance regularizer is utilized to decorrelate the dimensions of topic distributions and alleviate the generation of mixed topics. On the other hand, the SSTM also incorporates a variance regularizer to prevent topic collapse and ensure the diversity of concerns/topics.
The main contributions of our work are as follows:
An end-to-end framework is proposed for mining public concerns about ChatGPT from social media posts and user queries, which is, to the best of our knowledge, the first attempt at this task.
A novel Self-Supervised Neural Topic Model (SSTM) based on representation learning is proposed to mine high-quality concerns/topics.
Experimental results on three publicly available datasets show that the SSTM could discover higher-quality concerns/topics than state-of-the-art approaches in terms of interpretability and diversity.
The practical significance of this research lies in designing a novel Self-Supervised Neural Topic Model for mining public concerns about the advanced AI-dialogue system ChatGPT, one that can extract higher-quality public concerns than existing state-of-the-art approaches. The remainder of the paper is organized as follows: Section 2 presents the related technical work, while Section 3 delves into the detailed architecture of the proposed SSTM. Section 4 presents the experimental results and analysis, including the experimental setup (Section 4.1), the interpretability and diversity evaluation (Section 4.2), hyper-parameter analysis (Section 4.3), human evaluation (Section 4.4), a case study (Section 4.5), data analysis (Section 4.6), and an ablation study (Section 4.7). Section 5 discusses the results. Finally, Section 6 concludes the paper and provides future directions.
3. Research Methodology
The proposed Self-Supervised Neural Topic Model (SSTM), shown in Figure 2, contains three main components: (1) the text augmentation and representation module (left), which defines the text augmentation and representation scheme; given a batch of documents X, for each document $x_i$ it outputs the document-word distributions $\tilde{x}_i^{(1)}$ and $\tilde{x}_i^{(2)}$ of the augmented texts; (2) the topic inference network I (middle), which takes the document-word distributions $\tilde{x}_i^{(1)}$ and $\tilde{x}_i^{(2)}$ as input and transforms them into the document-topic distributions $\theta_i^{(1)}$ and $\theta_i^{(2)}$; (3) the prior matching module (right), which pushes the statistical distribution of the document-topic distributions towards the Dirichlet prior and calculates the representation learning objectives. The functionalities of these components are introduced in the following subsections. Additionally, for clarity of illustration, the symbols used in this paper and their explanations are listed in Table 1.
3.1. Text Augmentation and Representation
Text augmentation plays an essential role in self-supervised representation learning, and its key idea is to keep the augmented text semantically consistent with the original text.
Thus, for each document with tf-idf representation $x_i$, we conduct various random perturbations to generate augmented samples that share the same semantic meaning as $x_i$. To carry out the augmentation, we first select the L words (set to 5 in our experiments) that appear in $x_i$ with the lowest weights. Then, we conduct the following perturbations for each selected word:
Increase weight by 10% with p% probability (p is set to 15 in the experiment);
Decrease weight by 10% with p% probability (p is set to 15 in the experiment);
Set weight to zero with p% probability (p is set to 15 in the experiment).
Then, we represent the augmented texts with the document-word distributions $\tilde{x}_i^{(1)}$ and $\tilde{x}_i^{(2)}$, which are obtained by dividing each perturbed weight vector by the summation of its weights over the vocabulary.
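A minimal NumPy sketch of this augmentation follows; it assumes the three perturbations are applied independently per selected word, which the description above leaves open:

```python
import numpy as np

def augment_tfidf(x, L=5, p=0.15, rng=np.random.default_rng()):
    """One random perturbation pass over a tf-idf vector x (shape [V]).

    Hypothetical re-implementation of Section 3.1: pick the L lowest-weighted
    words present in the document, then with probability p increase the weight
    by 10%, decrease it by 10%, or zero it out.
    """
    x = x.copy()
    nz = np.flatnonzero(x)                      # words that appear in the document
    lowest = nz[np.argsort(x[nz])[:L]]          # L words with the lowest weights
    for w in lowest:
        if rng.random() < p:
            x[w] *= 1.10                        # increase weight by 10%
        if rng.random() < p:
            x[w] *= 0.90                        # decrease weight by 10%
        if rng.random() < p:
            x[w] = 0.0                          # drop the word entirely
    return x / max(x.sum(), 1e-12)              # normalize to a document-word distribution
```

Running the function twice on the same document yields the augmented pair $(\tilde{x}_i^{(1)}, \tilde{x}_i^{(2)})$ fed to the inference network.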
3.2. Topic Inference Network
The topic inference network I aims to capture the correlations between words and topics by building the projection from the document-word distribution to the document-topic distribution. It consists of one V-dimensional document-word distribution layer, two S-dimensional semantic representation layers, and one K-dimensional document-topic distribution layer.
In more detail, for each document-word distribution $\tilde{x}$, the inference network I first transforms it into the S-dimensional semantic space with the transformation defined as:

$$h^{(1)} = \mathrm{SN}(W^{(1)})\,\tilde{x} + b^{(1)} \qquad (1)$$
$$a^{(1)} = \mathrm{HW}(h^{(1)}) \qquad (2)$$
$$h^{(2)} = \mathrm{SN}(W^{(2)})\,a^{(1)} + b^{(2)} \qquad (3)$$
$$a^{(2)} = \mathrm{HW}(h^{(2)}) \qquad (4)$$

where $W^{(1)}$ and $W^{(2)}$ are the weight matrices of the semantic layers, and $b^{(1)}$ and $b^{(2)}$ are the bias terms. $h^{(1)}$, $a^{(1)}$, $h^{(2)}$, and $a^{(2)}$ are the hidden states and activation signals of the two semantic layers. Moreover, $\mathrm{SN}(\cdot)$ denotes spectral normalization [34], and $\mathrm{HW}(\cdot)$ is the Hardswish activation function [35]. Then, the topic inference network projects the output signal $a^{(2)}$ into a K-dimensional document-topic distribution via the formulation below:

$$h^{(t)} = \mathrm{SN}(W^{(t)})\,a^{(2)} + b^{(t)} \qquad (5)$$
$$\theta = \mathrm{softmax}(h^{(t)}) \qquad (6)$$

where $W^{(t)}$ and $b^{(t)}$ are the weight matrix and bias term, $h^{(t)}$ is the hidden state of the document-topic distribution layer, and $\theta$ denotes the inferred topic distribution.
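Under these definitions, the inference network can be sketched in PyTorch as follows; applying spectral normalization through torch's built-in parametrization and the softmax output are our reading of Equations (1)–(6), and the layer sizes are the defaults from Section 4.1.4:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class TopicInferenceNetwork(nn.Module):
    """Sketch of the inference network I in Section 3.2:
    V-dim input layer, two S-dim semantic layers, K-dim output layer."""

    def __init__(self, vocab_size: int, num_topics: int, hidden: int = 400):
        super().__init__()
        self.sem1 = spectral_norm(nn.Linear(vocab_size, hidden))   # Eqs. (1)-(2)
        self.sem2 = spectral_norm(nn.Linear(hidden, hidden))       # Eqs. (3)-(4)
        self.topic = spectral_norm(nn.Linear(hidden, num_topics))  # Eqs. (5)-(6)
        self.act = nn.Hardswish()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a1 = self.act(self.sem1(x))
        a2 = self.act(self.sem2(a1))
        return torch.softmax(self.topic(a2), dim=-1)  # document-topic distribution θ
```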
3.3. Prior Matching in Topic Space
As Wallach et al. [36] argued that multiple peaks of the Dirichlet density are suitable for mining multi-modality among texts, we model topics with the Dirichlet prior in the latent topic space.
To match the topic distributions to the pre-defined Dirichlet prior, we utilize the Maximum Mean Discrepancy (MMD) [37], a widely used statistical distribution matching tool. MMD measures the distance between probability densities [38] by mapping their mean embeddings into a Reproducing Kernel Hilbert Space (RKHS) [39]. Concretely, given a set of inferred topic distributions $\Theta = \{\theta_1, \ldots, \theta_M\}$ sampled from their statistical distribution $Q_\theta$ and a set of random samples $\tilde{\Theta} = \{\tilde{\theta}_1, \ldots, \tilde{\theta}_M\}$ drawn from the Dirichlet prior distribution $P_{\mathrm{Dir}}$, the distance between $Q_\theta$ and $P_{\mathrm{Dir}}$ can be calculated with:

$$\mathrm{MMD}(\Theta, \tilde{\Theta}) = \frac{1}{M(M-1)} \sum_{i \neq j} k(\theta_i, \theta_j) + \frac{1}{M(M-1)} \sum_{i \neq j} k(\tilde{\theta}_i, \tilde{\theta}_j) - \frac{2}{M^2} \sum_{i,j} k(\theta_i, \tilde{\theta}_j) \qquad (7)$$

where $k(\cdot,\cdot)$ denotes the kernel function and M represents the batch size. In this paper, we use the diffusion kernel [40].
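A sketch of this prior matching term is given below. It assumes the simplified form of the information diffusion kernel commonly used on the probability simplex, $k(x, y) = \exp(-\arccos^2(\langle\sqrt{x}, \sqrt{y}\rangle)/t)$; the kernel scale $t$ and the Dirichlet parameter default are assumed hyper-parameters, not values from the paper:

```python
import torch

def diffusion_kernel(a: torch.Tensor, b: torch.Tensor, t: float = 0.1) -> torch.Tensor:
    """Simplified diffusion kernel on the simplex between two batches [M, K]."""
    inner = (a.sqrt() @ b.sqrt().T).clamp(0.0, 1.0 - 1e-7)
    return torch.exp(-torch.acos(inner) ** 2 / t)

def mmd_loss(theta: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Unbiased MMD estimate (Equation (7)) between inferred topic
    distributions and samples from a symmetric Dirichlet(alpha) prior."""
    M, K = theta.shape
    prior = torch.distributions.Dirichlet(torch.full((K,), alpha)).sample((M,))
    k_qq = diffusion_kernel(theta, theta)
    k_pp = diffusion_kernel(prior, prior)
    k_qp = diffusion_kernel(theta, prior)
    off = ~torch.eye(M, dtype=torch.bool)       # exclude i == j terms
    return (k_qq[off].sum() + k_pp[off].sum()) / (M * (M - 1)) - 2.0 * k_qp.mean()
```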
3.4. Training Objective
The principal idea of the proposed SSTM can be summarized as:
Utilizing the topic inference network to build the projection from the document-word distribution to the document-topic distribution and learn high-quality topic distributions;
Matching the statistical distribution of the inferred document-topic distributions to the Dirichlet prior with MMD to capture multiple patterns.
Thus, to make sure that the SSTM can mine high-quality concerns considering both interpretability and diversity, its training objective is designed as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{RL}} + \lambda_{\mathrm{prior}} \mathcal{L}_{\mathrm{prior}} \qquad (8)$$

where $\mathcal{L}_{\mathrm{prior}}$ denotes the prior matching objective, which is calculated with the MMD distance defined in Equation (7), $\mathcal{L}_{\mathrm{RL}}$ is the representation learning objective defined below, and $\lambda_{\mathrm{prior}}$ denotes the coefficient of the prior matching regularizer.
On the other hand, to learn a high-quality projection and provide informative topic distributions, the representation learning objective should consider the requirements below:
Invariance: The inference network should map similar document-word distributions ($\tilde{x}^{(1)}$ and $\tilde{x}^{(2)}$) to similar document-topic distributions ($\theta^{(1)}$ and $\theta^{(2)}$).
Variance: Topic collapse should be prevented as it will map different document-word distributions to the same document-topic distribution and deteriorate topic diversity.
Covariance: Different dimensions of inferred document-topic distributions should be decorrelated, and each dimension should capture an independent semantic meaning lying behind texts.
Given two batches of inferred topic distributions $\Theta^{(1)} = \{\theta_i^{(1)}\}_{i=1}^{M}$ and $\Theta^{(2)} = \{\theta_i^{(2)}\}_{i=1}^{M}$, we first ensure that similar document-word distributions have similar document-topic distributions (invariance) by calculating the averaged squared Euclidean distance between topic distribution pairs. By minimizing this distance between the topic distributions of paired augmented documents, we ensure that semantically similar documents have similar document-topic distributions, which enhances topic interpretability. The invariance objective function $\mathcal{L}_{\mathrm{inv}}$ is defined as:

$$\mathcal{L}_{\mathrm{inv}}(\Theta^{(1)}, \Theta^{(2)}) = \frac{1}{M} \sum_{i=1}^{M} \big\| \theta_i^{(1)} - \theta_i^{(2)} \big\|_2^2 \qquad (9)$$

where M denotes the batch size.
Then, to prevent topic collapse, we incorporate a variance regularization term, and the variance objective function of $\Theta$ is formed as:

$$v(\Theta) = \frac{1}{K} \sum_{k=1}^{K} \max\Big(0,\, 1 - \sqrt{\mathrm{Var}\big(\theta_{\cdot,k}\big) + \epsilon}\Big) \qquad (10)$$

where $\theta_{\cdot,k}$ represents the vector composed of the values at the k-th dimension of all topic distributions in $\Theta$, and $\epsilon$ is a small scalar added for numerical stability. This regularizer enforces a variance of 1 for each dimension within the current batch, helping to prevent all inputs from collapsing onto a single vector. Also, this term maintains the variance of each dimension of the topic distribution and improves topic diversity. The variance objective function over $\Theta^{(1)}$ and $\Theta^{(2)}$ can be summarized as $\mathcal{L}_{\mathrm{var}}$ with the formulation:

$$\mathcal{L}_{\mathrm{var}}(\Theta^{(1)}, \Theta^{(2)}) = v(\Theta^{(1)}) + v(\Theta^{(2)}) \qquad (11)$$

where $v(\cdot)$ follows the formula in Equation (10).
Moreover, to disentangle the semantic meaning of topics, the covariance objective function is designed based on the covariance matrix, which is formulated as:

$$C(\Theta) = \frac{1}{M-1} \sum_{i=1}^{M} \big(\theta_i - \bar{\theta}\big)\big(\theta_i - \bar{\theta}\big)^{\top} \qquad (12)$$

where $\bar{\theta} = \frac{1}{M} \sum_{i=1}^{M} \theta_i$. Aiming at decorrelating the dimensions of the document-topic distributions, we follow VICReg [41] and define the covariance objective function $\mathcal{L}_{\mathrm{cov}}$ with:

$$\mathcal{L}_{\mathrm{cov}}(\Theta^{(1)}, \Theta^{(2)}) = \frac{1}{K} \sum_{k \neq k'} \big[C(\Theta^{(1)})\big]_{k,k'}^{2} + \frac{1}{K} \sum_{k \neq k'} \big[C(\Theta^{(2)})\big]_{k,k'}^{2} \qquad (13)$$

where $C(\cdot)$ follows the calculation in Equation (12). The covariance objective helps the SSTM disentangle the dimensions of the document-topic distributions and improves topic interpretability.
With the definitions of the invariance objective $\mathcal{L}_{\mathrm{inv}}$ (Equation (9)), the variance objective $\mathcal{L}_{\mathrm{var}}$ (Equation (11)), and the covariance objective $\mathcal{L}_{\mathrm{cov}}$ (Equation (13)), the representation learning objective $\mathcal{L}_{\mathrm{RL}}$ can be formulated as:

$$\mathcal{L}_{\mathrm{RL}} = \lambda_{\mathrm{inv}} \mathcal{L}_{\mathrm{inv}} + \lambda_{\mathrm{var}} \mathcal{L}_{\mathrm{var}} + \lambda_{\mathrm{cov}} \mathcal{L}_{\mathrm{cov}} \qquad (14)$$

where $\lambda_{\mathrm{inv}}$, $\lambda_{\mathrm{var}}$, and $\lambda_{\mathrm{cov}}$ are the coefficients of the different regularization terms. The training procedure of the proposed SSTM is shown in Algorithm 1. For detailed configurations, refer to Section 4.1.4.
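A compact PyTorch sketch of Equations (9)–(14) on two batches of document-topic distributions follows; the coefficient defaults and $\epsilon$ here are placeholders, not the paper's reported values:

```python
import torch

def representation_loss(theta1: torch.Tensor, theta2: torch.Tensor,
                        lambda_inv: float = 1.0, lambda_var: float = 1.0,
                        lambda_cov: float = 1.0, eps: float = 1e-4):
    """VICReg-style objective L_RL over two batches of shape [M, K]."""
    M, K = theta1.shape
    # invariance, Eq. (9): mean squared Euclidean distance between pairs
    inv = ((theta1 - theta2) ** 2).sum(dim=1).mean()
    # variance, Eqs. (10)-(11): hinge on the per-dimension standard deviation
    def v(theta):
        std = torch.sqrt(theta.var(dim=0) + eps)
        return torch.relu(1.0 - std).mean()
    var = v(theta1) + v(theta2)
    # covariance, Eqs. (12)-(13): penalize off-diagonal covariance entries
    def c(theta):
        centered = theta - theta.mean(dim=0)
        cov = centered.T @ centered / (M - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / K
    cov = c(theta1) + c(theta2)
    return lambda_inv * inv + lambda_var * var + lambda_cov * cov
```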
Algorithm 1: Self-Supervised Neural Topic Model (training procedure).
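Reusing the components sketched above, one training step can be summarized as follows. This is our interpretation of Algorithm 1, not the authors' exact pseudocode; in particular, whether the MMD term is applied to one or both augmented batches is an assumption, and the model sizes are examples:

```python
model = TopicInferenceNetwork(vocab_size=5000, num_topics=20)  # sizes are examples
opt = torch.optim.Adam(model.parameters())

def train_step(batch_tfidf: torch.Tensor, lambda_prior: float = 1.0) -> float:
    # 1. augment each document twice and normalize (Section 3.1)
    x1 = torch.stack([torch.as_tensor(augment_tfidf(x.numpy()), dtype=torch.float32)
                      for x in batch_tfidf])
    x2 = torch.stack([torch.as_tensor(augment_tfidf(x.numpy()), dtype=torch.float32)
                      for x in batch_tfidf])
    # 2. infer document-topic distributions (Section 3.2)
    theta1, theta2 = model(x1), model(x2)
    # 3. representation learning + prior matching (Sections 3.3 and 3.4)
    loss = representation_loss(theta1, theta2) \
           + lambda_prior * (mmd_loss(theta1) + mmd_loss(theta2))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```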
3.5. Topic Generation
After model training, the inference network I builds the projection from the document-word distribution layer to the document-topic distribution layer, which could be utilized for topic mining.
By feeding in the diagonal matrix $E \in \mathbb{R}^{V \times V}$ (each row represents the one-hot encoding of a word in the vocabulary), we can obtain the correlations between topics and words via the transformation below:

$$\Phi' = I(E) \qquad (15)$$

where $\Phi' \in \mathbb{R}^{V \times K}$ is the correlation matrix between words and topics. By performing column-wise normalization and ranking, we can obtain the topic-word distribution matrix $\Phi$ (the k-th column is the topic-word distribution of the k-th topic), which can be used for topic extraction.
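A corresponding extraction sketch, reusing the TopicInferenceNetwork defined earlier (the id2word mapping is assumed to come from the preprocessing vocabulary):

```python
def top_words_per_topic(model, id2word, topn: int = 10):
    """Topic extraction per Section 3.5: feed the V x V identity matrix
    through the inference network, normalize column-wise, and rank."""
    V = len(id2word)
    with torch.no_grad():
        phi = model(torch.eye(V))             # [V, K] word-topic correlations
    phi = phi / phi.sum(dim=0, keepdim=True)  # column-wise normalization
    topics = []
    for k in range(phi.shape[1]):
        idx = torch.topk(phi[:, k], topn).indices.tolist()
        topics.append([id2word[i] for i in idx])
    return topics
```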
4. Experiments
In this section, we will first describe the experimental setup. Then, the interpretability and diversity evaluation, hyper-parameter analysis, human evaluation, case study, data analysis, and ablation study will be presented.
4.1. Experimental Setup
In this subsection, we will first provide descriptions of datasets and baselines. Following this, evaluation metrics and implementation details will be introduced.
4.1.1. Datasets
To evaluate the performance of the proposed model, we collect and process three publicly available datasets: the Twitter Posts (https://www.kaggle.com/datasets/manishabhatt22/tweets-onchatgpt-chatgpt, accessed on 8 February 2024), Subreddit (https://www.kaggle.com/datasets/armitaraz/chatgpt-reddit, accessed on 8 February 2024), and User Query (https://www.kaggle.com/datasets/noahpersaud/89k-chatgpt-conversations, accessed on 8 February 2024) datasets. In more detail, the Twitter Posts dataset contains 58 K tweets with the hashtag ‘#ChatGPT’ posted on Twitter from 30 November 2022 to 24 February 2023. The Subreddit dataset consists of nearly 18 K reviews regarding ChatGPT posted on the Reddit website. Both datasets encompass diverse user attitudes and opinions on this AI-dialogue system and are suitable for public opinion mining. When collecting these data, we define the scope based on the ChatGPT communities, without considering factors such as the geographical location and age range of the commenters. On the other hand, the User Query dataset contains all the available conversations between users and ChatGPT on chatlogs.net up to 20 April 2023. This dataset is collected from chatlogs.net using a custom web scraper (https://github.com/P1ayer-1/chatlogs.net-scraper, accessed on 20 February 2024), and its content covers user-concerned aspects and expectations of this powerful tool. In our experiments, we conduct an anonymization process, such as removing and replacing personally identifiable information, to protect personal privacy. Additionally, to address data noise, we conduct data preprocessing before performing the experiments. To remove tweet-specific noise (such as hashtags, URLs, numbers, etc.), we first discard the non-English tweets with fastlid (https://pypi.org/project/fastlid/, accessed on 25 February 2024). Then, we employ the preprocessor (https://github.com/s/preprocessor, accessed on 25 February 2024) and twitter_preprocessor (https://github.com/vasisouv/tweets-preprocessor, accessed on 25 February 2024) libraries for cleaning. We also employ enchant (https://github.com/AbiWord/enchant, accessed on 25 February 2024) and spacy (https://github.com/explosion/spaCy, accessed on 25 February 2024) for spell-checking and lemmatization. Finally, stopwords and low-frequency words are removed.
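A condensed sketch of this cleaning pipeline is shown below; the call signatures are our assumptions from the cited projects' documentation (fastlid's return format in particular should be checked), and corpus-level steps such as low-frequency filtering are omitted:

```python
import preprocessor as p   # tweet-preprocessor: strips hashtags, URLs, mentions, etc.
import enchant
import spacy
from fastlid import fastlid

nlp = spacy.load("en_core_web_sm")
spell = enchant.Dict("en_US")

def clean_tweet(text: str):
    lang, _ = fastlid(text)          # assumed (language, score) return format
    if lang != "en":
        return None                  # discard non-English tweets
    text = p.clean(text)             # remove tweet-specific noise
    doc = nlp(text.lower())
    return [t.lemma_ for t in doc
            if t.is_alpha and not t.is_stop and spell.check(t.text)]
```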
Table 2 presents the statistics of the processed datasets. Likewise, we provide word-cloud visualizations of the three datasets in Figure 3.
4.1.2. Baselines
To validate the effectiveness of the SSTM, we choose the following methods as baselines:
LDA [31], a probabilistic graphical model which views each document as generated by a mixture of topics; we employ the Mallet implementation (https://mimno.github.io/Mallet/, accessed on 15 March 2024) with the suggested configurations.
NVDM [32], a VAE-based neural topic model which models topics with a Gaussian prior in the latent topic space; the official implementation (https://github.com/ysmiao/nvdm, accessed on 15 March 2024) is used in the experiments.
GSM [42], a variant of the NVDM which utilizes a Gaussian distribution followed by a softmax transformation to construct the topic distribution; the official implementation (https://github.com/linkstrife/NVDM-GSM, accessed on 15 March 2024) is used.
WLDA [43], a neural topic model built upon the Wasserstein Auto-Encoder, utilizing the Dirichlet distribution as the prior of the topic distribution; the official implementation (https://github.com/awslabs/w-lda, accessed on 17 March 2024) is used.
ATM [14], a neural topic model employing adversarial training; we implement the model according to the recommended configurations.
BAT [10], an extension of ATM which incorporates an encoder network for topic inference; we implement it following the suggestions in the original paper.
BERTopic [13], a topic extraction tool based on the pre-trained language model BERT; the released source code (https://github.com/MaartenGr/BERTopic, accessed on 22 March 2024) is used in our experiments.
CTMNeg [11], a topic mining tool that incorporates contrastive learning and negative sampling to effectively capture the underlying structure and semantics; the official implementation (https://github.com/adhyasuman/ctmneg, accessed on 22 March 2024) is used in this paper.
4.1.3. Evaluation Metrics
As interpretability and diversity are two crucial aspects of public concern extraction, we consider two types of metrics for performance evaluation. To evaluate the interpretability of the extracted concerns, topic coherence [44] (C_P, NPMI, UCI) is employed, as [45] argued that coherence metrics agree with human judgment. All these values are computed with the Palmetto library (https://github.com/dice-group/Palmetto, accessed on 12 April 2024). For evaluating concern diversity, we utilize the Unique Term Rate (UTR), calculated as $\mathrm{UTR} = N_u / (K \times T)$, where $N_u$ denotes the number of unique words among the extracted topic words and $K \times T$ is the total number of extracted words (K topics with top-T words each). Note that higher values indicate a better quality of concern.
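For reference, UTR over a list of extracted topics reduces to a few lines; this sketch assumes the normalizer is the total count of extracted words, as in the formula above:

```python
def unique_term_rate(topics: list[list[str]]) -> float:
    """UTR = unique words across all topics' top-T lists / total extracted words."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

# e.g., K = 2 topics with T = 3 words each, one word repeated:
# unique_term_rate([["ai", "model", "data"], ["ai", "school", "exam"]]) == 5 / 6
```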
4.1.4. Implementation Details
We adopt training settings of 50 epochs and a batch size of 32 for the different topic numbers $K \in \{10, 20, 30, 40\}$. The Dirichlet prior uses a symmetric hyper-parameter $\alpha$, and each semantic representation layer contains 400 units. Furthermore, the coefficient weights $\lambda_{\mathrm{inv}}$ and $\lambda_{\mathrm{cov}}$ are set to 1 and $K$, respectively. Our proposed SSTM is optimized using the ADAM [46] optimizer with learning rate $\eta$. In our experiments, each topic is represented by a list of its top 10 words for the interpretability and diversity evaluation.
Also, our source code relies on the following libraries: PyTorch (https://pytorch.org/, accessed on 8 March 2024), SciPy (https://scipy.org/, accessed on 8 March 2024), NumPy (https://numpy.org/, accessed on 10 March 2024), torch_two_sample (https://torch-two-sample.readthedocs.io/en/latest/, accessed on 10 March 2024), and scikit-learn (https://scikit-learn.org/stable/, accessed on 10 March 2024). All of our experiments are carried out on an Ubuntu machine equipped with a single 3090Ti GPU (CUDA version 11.7).
4.2. Interpretability and Diversity Evaluation
We have conducted a comprehensive set of experiments to assess the performance of the proposed SSTM. To assess the interpretability of the extracted concerns, we first compare the averaged topic coherence to provide a global view. The results are listed in Table 3; each value is obtained by averaging the coherence values over the 10-, 20-, 30-, and 40-topic settings. To clarify the comparison, the optimal result is marked in bold, and the underlined values perform second best. The statistics reveal that the SSTM outperforms all the baselines on the Twitter Posts and User Query datasets. For the Subreddit dataset, the SSTM performs the best in terms of the NPMI and UCI metrics but falls slightly behind CNTM on the C_P metric.
We can observe that approaches like CNTM and BERTopic perform well on the User Query dataset while performing worse on the others. This may be attributed to their difficulty in handling the data sparsity of short social media posts. As queries often carry more word co-occurrence information than Twitter and Reddit posts, CNTM and BERTopic perform well on the User Query dataset. From these observations, we conclude that the compared approaches are better at mining high-quality topics/concerns from long-text corpora.
In Figure 4, we also present the comparison of interpretability (topic coherence) vs. different topic numbers on the Twitter Posts, Subreddit, and User Query datasets. For the Twitter Posts dataset, we can observe that the SSTM surpasses almost all the baselines, except for the 10-topic setting on UCI and the 40-topic setting on C_P. Similarly, it achieves optimal results across all topic settings on the three metrics for the User Query dataset, except for 40 topics on the UCI metric. Moreover, on the Subreddit dataset, the SSTM outperforms most baselines on the NPMI and UCI metrics across different topic numbers but obtains slightly worse results on the C_P metric. To provide a more intuitive comparison of the extracted concerns, Table 4 presents the concern-specific representative words obtained by the SSTM and the baseline approaches. These concerns may be related to ‘education’ and ‘marketing’.
On the other hand, we also provide a comparison of concern diversity across various topic numbers in Table 5. The experimental results demonstrate that our proposed SSTM surpasses the baselines in all topic number settings on the Twitter Posts dataset. It also performs the best on the Subreddit and User Query datasets, except for the 30- and 40-topic settings. Notably, GSM demonstrates favorable performance on both datasets; however, it sacrifices performance on the topic coherence metrics, as depicted in Figure 4.
Generally, such improvements in topic coherence and diversity may be attributed to the incorporation of the Dirichlet prior in the latent space and the invariance and covariance regularizers. These comparisons also indicate that the variance regularizer helps to address the topic collapse issue. Additionally, for the same metrics, approaches typically achieve better results on the User Query dataset, which may be caused by its data characteristics, such as longer text length and richer semantic information.
4.3. Hyper-Parameter Analysis
To explore whether the performance of the SSTM varies with changes in hyper-parameters (e.g., the learning rate $\eta$, the Dirichlet prior $\alpha$, the number of hidden units S, and the probability p in text augmentation), we conduct a hyper-parameter analysis in this subsection.
Specifically, considering practical issues such as experiment duration, we opted to conduct this experiment on the User Query dataset with the topic number set to 10. Likewise, C_P, UCI, and NPMI are utilized to measure the extracted concerns. Extensive experiments demonstrate that the proposed SSTM performs stably when the parameters vary within the specified ranges, and the experimental results are listed in Table 6. Furthermore, to present a more straightforward comparison, we visualize the results in Figure 5 (other parameters follow the default configurations). We discuss the parameter analysis in more detail below.
Learning Rate ($\eta$): To validate whether the performance of the SSTM varies with different learning rates, we conduct parameter analysis experiments on $\eta$ over five settings. The topic coherence results listed in Table 6 show that the proposed SSTM outperforms the baseline approaches under all five learning rate settings, with the best coherence obtained at the setting reported in Table 6. Moreover, the comparison in Figure 5a indicates that the SSTM is insensitive to the choice of the learning rate parameter.
Dirichlet Prior ($\alpha$): As the Dirichlet distribution plays a crucial role in mining multiple patterns for topic models, we also analyze the hyper-parameter $\alpha$ of the Dirichlet density, with its values set to {0.005, 0.01, 0.05, 0.1, 0.2}. Table 6 presents the results of the SSTM with the various settings of $\alpha$. We can observe that the SSTM performs stably across all five settings, with the optimal coherence obtained at the setting reported in Table 6. The visual comparison in Figure 5b reveals the same finding.
Number of Hidden Units (S): Likewise, to investigate the influence of the number of hidden units S on extraction performance, we employ five settings {300, 350, 400, 450, 500} of S and conduct the corresponding analysis. As illustrated in Table 6, the SSTM with different numbers of hidden units outperforms the baseline approaches and performs the best when S is set to 450. Moreover, Figure 5c presents a more intuitive comparison.
Probability of Augmentation (p): To explore the impact of the probability p utilized in the text augmentation operation, we conduct experiments with five different settings of p. As shown by the statistics in Table 6, across variations in the text augmentation probability, the proposed SSTM exhibits remarkable stability and consistently surpasses the competitive baselines, performing the best when p is set to 10. Likewise, these findings are also presented in Figure 5d.
4.4. Human Evaluation
To assess the topic quality of the SSTM and the baseline approaches, we also perform a human evaluation on the User Query dataset with the 10-topic setting. In detail, we recruit two experts to identify out-of-topic words, which are semantically inconsistent with the topic, and count the number of irrelevant words. Finally, we average the counts and present the comparison in Table 7.
From the statistics, we could observe that topics extracted by the SSTM contain fewer out-of-topic words compared with other baselines. This further validates the superior performance of the SSTM in terms of topic coherence.
4.5. Case Study
We also carry out case studies and present concern examples in Figure 6 to validate the effectiveness of the SSTM. From the examples, we can observe that the SSTM uncovers the concerns of Twitter users about “Student Cheating with ChatGPT” and “Impact on the Investment Market”. Likewise, the SSTM mines concerns about “Assistant on Math” and “Solving Programming Puzzles using ChatGPT” from the Subreddit and User Query datasets.
To prove the faithfulness of the extracted concerns, we use Sentence-BERT [47] to design a retrieval scheme: posts are selected according to the cosine similarity between the contextualized embeddings of the topic words and of the texts in the corpus.
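A sketch of this retrieval scheme with the sentence-transformers library follows; the specific checkpoint is an assumption, as the paper only specifies Sentence-BERT [47]:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint choice is an assumption

def retrieve_posts(topic_words: list[str], corpus: list[str], topn: int = 3):
    """Rank corpus posts by cosine similarity to the embedded topic-word list."""
    query = model.encode(" ".join(topic_words), convert_to_tensor=True)
    docs = model.encode(corpus, convert_to_tensor=True)
    scores = util.cos_sim(query, docs)[0]
    best = scores.topk(min(topn, len(corpus))).indices.tolist()
    return [corpus[i] for i in best]
```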
We choose three distinct user concerns from three datasets and present the topic words along with corresponding posts or queries. The concerns in the Twitter Posts and Subreddit datasets primarily revolve around discussions regarding ChatGPT itself and its influence. On the other hand, the concerns in the User Query dataset highlight the specific tasks that users expect ChatGPT to address or complete. The above three results reveal the hot spots of public discussion on ChatGPT from various views.
4.6. Data Analysis
Due to the user privacy protection mechanisms of social media platforms, we cannot access private user information, such as age and gender. However, we conduct a data analysis and present the differences in the concerns extracted from the different platforms (Twitter, Reddit, and Query). In this part, the topic number is set to 10, and we present each mined concern with one keyphrase, as shown in Table 8.
It can be observed that several common concerns are discussed by users from different platforms, such as “academic writing tool” extracted from both Subreddit and User Query. Likewise, there are also some platform-specific user concerns like the “programming assistant” and “student cheating”. These extracted concerns/topics present a comprehensive understanding of public opinion on ChatGPT.
4.7. Ablation Study
As the regularization terms and document representation approaches play key roles in our proposed SSTM, we conduct the corresponding ablation in this subsection to explore their impacts on topic modeling performance.
4.7.1. Regularization Term Ablation
To explore the roles of the invariance, variance, and covariance regularization terms in the SSTM, we add one more dataset (Subreddit) in this subsection and conduct the ablation study on the Subreddit and User Query datasets with the topic number set to 10. The corresponding results are listed in Table 9, where SSTM-w/o-inv, SSTM-w/o-var, and SSTM-w/o-cov denote the variants that exclude the invariance, variance, and covariance terms, respectively.
Here, for a clearer comparison, we retain the results of ETM and BAT, as their performance is quite competitive when the number of topics is set to 10. From the statistics, we observe that the variant without the invariance term performs the worst among the SSTM variants, which may be attributed to the invariance objective's high impact on topic quality. Similarly, we observe that the variant without the covariance term performs slightly worse than the full SSTM. This discrepancy can be attributed to the covariance regularizer, which aids in decorrelating the dimensions of the topic distribution and segregating irrelevant words into different topics for interpretable topic extraction. Additionally, we find a decline in the UTR metric for the variant without the variance term, which reveals that the variance regularizer has a positive impact on topic diversity. On the other hand, though these variants perform worse than the full SSTM, they still surpass the competitive ETM on all the coherence and diversity metrics, which can be ascribed to the usage of the self-supervised learning framework and the Dirichlet prior.
4.7.2. Document Representation Ablation
To examine whether incorporating external semantic knowledge improves modeling ability and to explore the impact of the document representation approach, we conduct this experiment on the Subreddit dataset using a pre-trained language model for representing documents. Concretely, for each document containing the word sequence $\{w_1, \ldots, w_n\}$, we first generate the contextualized word representations $\{e_1, \ldots, e_n\}$. Then, the document is represented by the averaged word-level representation, computed as $e = \frac{1}{n} \sum_{i=1}^{n} e_i$. The detailed comparison is listed in Table 10.
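A sketch of this averaged contextualized representation with Hugging Face Transformers is given below; the checkpoint is an assumption, as the paper does not name the exact pre-trained model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # checkpoint assumed
enc = AutoModel.from_pretrained("bert-base-uncased")

def embed_document(text: str) -> torch.Tensor:
    """Mean-pool contextualized token representations into one document vector."""
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]  # [n_tokens, 768]
    return hidden.mean(dim=0)                        # averaged word-level representation
```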
As shown in Table 10, the SSTM equipped with the pre-trained language model outperforms both the standard SSTM and the competitive LDA across all the coherence and diversity metrics for all topic number settings. For clarity, we only retain LDA as the baseline, as it performs the second best according to Table 3. Notably, this comparison shows that injecting external semantic knowledge can indeed improve the modeling ability of the SSTM.
5. Discussion
This study is motivated by the rapid development of the ChatGPT dialogue system, which has sparked extensive discussions and generated a large volume of text data. Such user-generated texts contain potential public expectations and concerns, and understanding these is crucial for the development of ChatGPT. We collect and process relevant data from three different social media platforms to mine public concerns. Although self-supervised learning has been widely applied in natural language processing and computer vision, there has been limited exploration of topic models using representation learning techniques. Thus, we formulate topic modeling as a representation learning task and propose the Self-Supervised Neural Topic Model (SSTM). To ensure topic quality and diversity, three training objectives (invariance, variance, and covariance) are incorporated into the SSTM. The experimental results demonstrate the superior performance of the SSTM in discovering interpretable concerns/topics. Also, the results show that the SSTM extracts higher-quality concerns/topics from the User Query dataset than from Twitter Posts and Subreddit, which is caused by the data characteristics: queries often contain more words than tweets and posts and thus carry sufficient co-occurrence information, which is crucial for topic modeling and pattern discovery. Thus, we can conclude that our proposed SSTM is better suited to mining high-quality concerns/topics from platforms that provide long-form texts.
Given the effectiveness of the SSTM in mining public concerns, its practical implications could be substantial. For example, it could provide technical support for companies and policymakers to discover public concerns about other topics. Also, we will explore designing new policies (such as privacy enhancement measures and transparency reports) to address the ethical concerns about these AI technologies, especially in sensitive areas like education and healthcare.
6. Conclusions
In this paper, we propose the Self-Supervised Neural Topic Model (SSTM), a novel neural topic modeling approach based on representation learning, for mining high-quality concerns about ChatGPT. In more detail, the SSTM utilizes the Dirichlet prior in the latent topic space to capture multiple semantic patterns. Additionally, to ensure the quality of the extracted concerns, it incorporates three regularization terms (invariance, variance, and covariance) for model training. Extensive experiments on three ChatGPT-related datasets show that our proposal mines higher-quality public concerns about this advanced dialogue system in terms of interpretability and diversity. As proposed, the SSTM only employs bag-of-words information, neglecting the sparse characteristics of review texts and their contextual information; our future work will therefore involve integrating external semantic knowledge (such as word embeddings and pre-trained language models) into the modeling process and designing a knowledge-enhanced topic model for the public concern extraction task.