Article

What Are the Public’s Concerns About ChatGPT? A Novel Self-Supervised Neural Topic Model Tells You

1 School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2 Department of Computer Science, Youjiang Medical University for Nationalities, Baise 533000, China
3 Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing 210023, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(2), 183; https://doi.org/10.3390/math13020183
Submission received: 14 September 2024 / Revised: 19 December 2024 / Accepted: 29 December 2024 / Published: 8 January 2025

Abstract

The recently released ChatGPT, an artificial intelligence conversational agent, has garnered significant attention in academia and real life. A multitude of early ChatGPT users have eagerly explored its capabilities and shared their opinions on social media, providing valuable feedback. Both user queries and social media posts have been instrumental in expressing public concerns regarding this advanced dialogue system. To comprehensively understand these public concerns, a novel Self-Supervised Neural Topic Model (SSTM), which formulates topic modeling as a representation learning procedure, is proposed in this paper. The proposed SSTM utilizes Dirichlet prior matching and three regularization terms for improved modeling performance. Extensive experiments on three publicly available text corpora (Twitter Posts, Subreddit and queries from ChatGPT users) demonstrate the effectiveness of the proposed approach in extracting higher-quality public concerns. Moreover, the SSTM performs competitively across all three datasets regarding topic diversity and coherence metrics. Based on the extracted topics, we could gain valuable insights into the public’s concerns regarding technologies like ChatGPT, enabling us to formulate effective strategies to address these issues.

1. Introduction

The emergence of ChatGPT [1] has significantly impacted dialogue generation with its remarkable performance. The public has widely embraced this chatbot and actively discusses it [2,3] on social media such as Twitter (http://twitter.com, accessed on 4 February 2024) and Reddit (http://reddit.com, accessed on 4 February 2024). For instance, Figure 1 shows posts by Twitter users about diverse concerns regarding ChatGPT. Distinct colors mark the words associated with specific concerns (e.g., yellow means ‘Education’, green denotes ‘Security’, and blue represents ‘Technical Basis’). In addition, a vast number of users eagerly interact with this conversational agent and explore its capabilities. Compared to texts from other sources, review posts and queries about ChatGPT capture the primary concerns, such as the services it can offer and user experiences, more effectively. Therefore, mining the latent concerns behind these texts is valuable for both academia [4] and daily public life.
However, the massive volume of review posts and user-generated queries, the wide range of concerns on this open-domain topic, and the noisy nature (e.g., casual expressions, grammar mistakes) [5] of user-generated unlabeled text make it challenging to mine high-quality concerns. Given these characteristics, topic models provide a promising way to extract public concerns in an unsupervised manner. Furthermore, to ensure the quality of the extracted concerns, we consider two aspects: (1) Interpretability: extracted concerns should be semantically coherent and agree with human judgment (the mined representative words should be semantically coherent and able to represent the meaning of a specific concern, such as the highlighted words in Figure 1). (2) Diversity: the wide range of concerns about ChatGPT should be reflected in the mined results (all discussed concerns, such as ‘Education’, ‘Security’, and ‘Technical Basis’ in Figure 1, should be discovered without neglecting any of them).
Neural topic modeling [6] addresses the inefficiency of the Gibbs sampler in traditional topic models [7], serving as a tool for unsupervised open information extraction [8] and offering a promising direction for mining public concerns. Nevertheless, the two streams of advanced models, based on the Variational Auto-Encoder (VAE) [9] and the Generative Adversarial Network (GAN) [10], cannot fulfill both of our requirements. The improper prior on the topic distribution used in VAE-based neural topic models [7,11,12,13] often sacrifices interpretability, thereby diminishing the ability to represent specific concerns. Adversarial topic models [10,14] face the topic collapse problem (they fail to discover diverse semantic patterns [7,15]), resulting in reduced concern diversity and the loss of crucial information.
Thus, to address the issues posed by the aforementioned approaches and mine high-quality public concerns about ChatGPT, the Self-Supervised Neural Topic Model (SSTM) is proposed in this paper. Unlike most existing models, the SSTM formalizes topic modeling as representation learning to mitigate topic collapse [16]. Its principal idea is to build a one-way projection from the document-word distribution to the document-topic distribution, along with a text augmentation scheme, to capture the correlations between words and topics. Specifically, to ensure interpretability, we leverage an invariance regularizer to preserve topic information across augmented pairs and the maximum mean discrepancy to match the statistical distribution of the inferred topics to the Dirichlet prior in the topic space. Additionally, a covariance regularizer is utilized to decorrelate the dimensions of the topic distributions and alleviate the generation of mixed topics. Finally, the SSTM incorporates a variance regularizer to prevent topic collapse and ensure the diversity of concerns/topics.
The main contributions of our work are as follows:
  • An end-to-end framework is proposed for mining public concerns about ChatGPT from social media posts and user queries, which is, to the best of our knowledge, the first attempt at this task.
  • A novel Self-Supervised Neural Topic Model (SSTM) based on representation learning is proposed to mine high-quality concerns/topics.
  • Experimental results on three publicly available datasets show that the SSTM discovers higher-quality concerns/topics than state-of-the-art approaches in terms of interpretability and diversity.
The practical significance of this research lies in designing a novel Self-Supervised Neural Topic Model for mining public concerns about the advanced AI-dialogue system ChatGPT, which has the potential to extract higher-quality public concerns than existing state-of-the-art approaches. The remainder of the paper is organized as follows: Section 2 presents the related technical work, and Section 3 details the architecture of the proposed SSTM. Section 4 presents the experimental results and analysis, including the experimental setup (Section 4.1), interpretability and diversity evaluation (Section 4.2), hyper-parameter analysis (Section 4.3), human evaluation (Section 4.4), a case study (Section 4.5), data analysis (Section 4.6), and an ablation study (Section 4.7). Section 5 discusses the results. Finally, Section 6 concludes this paper and provides future directions.

2. Related Work

Our work is closely related to two lines of research: self-supervised learning and neural topic modeling.

2.1. Self-Supervised Learning

Self-supervised learning (SSL) is a family of techniques in machine learning that aims to learn high-quality representations from unlabeled data without relying on external supervision and human-labeled annotations. Generally, SSL designs pretext tasks [17] based on the inherent structure and patterns in the datasets to help models to capture high-quality representations [18].
SSL has made significant progress and is widely explored in the computer vision community due to its effectiveness in generating high-quality vision representations. Zbontar et al. [16] incorporated a redundancy reduction mechanism into SSL and proposed the Barlow Twins for vision representation learning. Bijoy et al. [19] developed an automated image cleaner in a few-shot learning scenario based on SSL. Fini et al. [20] incorporated the self-supervised objective into semi-supervised clustering to enforce the cluster centroids as class prototypes. Furthermore, Bardes et al. [21] proposed the VICRegL, which learns good global and local features simultaneously and yields an excellent performance on image segmentation.
Likewise, in the natural language processing (NLP) community, SSL has also been explored and has exhibited significant promise. Yan et al. [22] proposed the Contrastive framework for self-supervised SEntence Representation Transfer (ConSERT) for producing high-quality sentence representations. Zhu et al. [23] utilized SSL for multi-modal sentiment analysis and image–text matching. Mu et al. [24] proposed an SSL-based approach for detecting synonyms in short texts. Luo et al. [25] used the GraphACL scheme to improve the performance of self-supervised graph representation learning. Yi et al. [26] proposed a novel self-supervised deep graph clustering method (R2FGC) to explore the full semantic information of graph-structured data. In addition, SSL has made significant progress in other research on graph representation learning [27,28]. The latest advancements in self-supervised learning have significantly propelled the development of large language models (LLMs) [29,30], enabling them to learn from vast amounts of unlabeled data and improve their performance on various natural language processing tasks. However, no attempts have been made to explore topic modeling using this advanced representation learning technique.

2.2. Neural Topic Modeling

Recently, Neural Topic Models (NTMs) have gained widespread attention as an efficient tool for mining high-quality topics from massive corpora. While traditional topic models like Latent Dirichlet Allocation (LDA) [31] face certain tricky issues, such as computational complexity and lack of scalability, NTMs offer a promising alternative.
The Variational Auto-Encoder (VAE) [9] has emerged as the mainstream framework for neural topic modeling. Inspired by the VAE, Miao et al. [32] proposed the Neural Variational Document Model (NVDM), which models latent topics with multivariate Gaussian priors and utilizes a decoder to reconstruct documents for text modeling. Likewise, Srivastava et al. [33] employed the Logistic-Normal distribution as the prior of the topic space to mimic the Dirichlet distribution and proposed ProdLDA based on the VAE. Along this line, Dieng et al. [12] modeled topics in the embedding space and proposed the Embedded Neural Topic Model, which incorporates the semantic relatedness captured by word embeddings into the modeling process. However, such VAE-based approaches have limitations in mining interpretable topics, as they often use Gaussian and Logistic-Normal distributions as the prior distribution in the latent topic space. These distributions cannot capture the multi-modality among texts and are therefore unsuitable for text modeling.
On the other hand, Generative Adversarial Networks (GANs) have also been explored for neural topic modeling. Wang et al. [14] introduced adversarial learning to topic modeling and proposed the Adversarial-neural Topic Model (ATM), which employs a generator to produce document-word distributions by sampling from a Dirichlet distribution and a discriminator to differentiate real documents from generated ones. However, the ATM cannot readily obtain the topic distribution of unseen documents. Building upon this, Wang et al. [10] proposed the Bidirectional Adversarial neural Topic (BAT) model, which incorporates an encoder to establish a bidirectional mapping between topic-word distributions and document-word distributions. Nevertheless, these adversarial topic models have the following drawbacks: (1) GAN-based approaches often face topic collapse, which may result in a low level of topic diversity. (2) The alternating optimization procedure of adversarial training often makes these models unstable to train.
Although neural topic modeling and self-supervised learning have each been explored and made progress, few attempts have been made to tackle the topic modeling problem using self-supervised learning. The proposed Self-Supervised Neural Topic Model (SSTM) differs from existing models in the following facets: (1) Unlike VAE-based approaches, which utilize improper prior distributions (such as Gaussian and Logistic-Normal distributions) in the latent topic space, the SSTM models topics with the Dirichlet prior to ensure topic interpretability. (2) Unlike adversarial approaches, which often face mode collapse and notoriously unstable training, the SSTM formulates topic modeling as textual representation learning to mitigate topic collapse. (3) The SSTM incorporates variance and covariance regularizers to enhance topic diversity and alleviate mixed topics by decorrelating the different dimensions of the topic space.

3. Research Methodology

The proposed Self-Supervised Neural Topic Model (SSTM), shown in Figure 2, contains three main components: (1) The text augmentation and representation module (left), which defines the text augmentation and representation scheme. Given a batch of documents $X$, for each document $x \in X$, it outputs the document-word distributions of the augmented texts, $x_a$ and $x_b$. (2) The topic inference network $I$ (middle), which takes the document-word distributions $x_a$ and $x_b$ as input and transforms them into the document-topic distributions $\theta_a$ and $\theta_b$. (3) The prior matching module (right), which pushes the statistical distribution of the document-topic distributions toward the Dirichlet prior and calculates the representation learning objectives. The functionality of these components is introduced in the following subsections. For clarity, the symbols used in this paper and their explanations are listed in Table 1.

3.1. Text Augmentation and Representation

Text augmentation plays an essential role in self-supervised representation learning, and its key idea is keeping the semantic consistency between augmented text and original text.
Thus, for each document with tf-idf representation $x$, we apply random perturbations to generate augmented samples that share the same semantic meaning as $x$. To carry out the augmentation, we first select the $L$ words (set to 5 in our experiments) that appear in $x$ with the lowest weights. Then, we apply the following perturbations to each selected word:
  • Increase weight by 10% with p% probability (p is set to 15 in the experiment);
  • Decrease weight by 10% with p% probability (p is set to 15 in the experiment);
  • Set weight to zero with p% probability (p is set to 15 in the experiment).
Then, we represent the augmented texts with the document-word distributions $x_a$ and $x_b$, which are obtained by dividing each perturbed weight by the sum of the weights over the vocabulary.
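The augmentation scheme above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `augment` is hypothetical, and the sketch assumes that the three perturbations are mutually exclusive for a given word (one random draw per word), which the text does not specify.

```python
import numpy as np

def augment(x, num_words=5, p=0.15, rng=None):
    """Return one perturbed copy of a tf-idf vector, normalized to sum to 1."""
    rng = np.random.default_rng() if rng is None else rng
    x_aug = np.asarray(x, dtype=float).copy()
    present = np.flatnonzero(x_aug > 0)  # words that appear in the document
    # The L words with the lowest weights among those present.
    lowest = present[np.argsort(x_aug[present])[:num_words]]
    for idx in lowest:
        u = rng.random()
        if u < p:            # increase weight by 10%
            x_aug[idx] *= 1.10
        elif u < 2 * p:      # decrease weight by 10%
            x_aug[idx] *= 0.90
        elif u < 3 * p:      # drop the word
            x_aug[idx] = 0.0
        # Otherwise the weight is left unchanged (assumed behavior).
    total = x_aug.sum()
    return x_aug / total if total > 0 else x_aug

# Two augmented views of the same document.
rng = np.random.default_rng(0)
x = np.array([0.0, 0.9, 0.1, 0.5, 0.05, 0.3, 0.2, 0.7, 0.0, 0.4])
x_a, x_b = augment(x, rng=rng), augment(x, rng=rng)
```

Because only the lowest-weight words are perturbed, the high-weight words keep their relative proportions, which is what preserves the semantics of the document across the two views.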

3.2. Topic Inference Network

The topic inference network I aims to capture the correlations between words and topics by building the projection from the document-word distribution to the document-topic distribution. It consists of one V-dimensional document-word distribution layer, two S-dimensional semantic representation layers, and one K-dimensional document-topic distribution layer.
In more detail, for each document-word distribution $x \in \{x_a, x_b\}$, the inference network $I$ first transforms it into the $S$-dimensional semantic space with the transformation defined as:
$$h_1 = \mathrm{SN}(W_1 x + b_1) \tag{1}$$
$$o_1 = \mathrm{Hardswish}(h_1) \tag{2}$$
$$h_2 = \mathrm{SN}(W_2 o_1 + b_2) \tag{3}$$
$$o_2 = \mathrm{Hardswish}(h_2) \tag{4}$$
where $W_1 \in \mathbb{R}^{S \times V}$ and $W_2 \in \mathbb{R}^{S \times S}$ are the weight matrices of the semantic layers, and $b_1$ and $b_2$ are bias terms. $h_1$, $h_2$, $o_1$, and $o_2$ are the hidden states and activation signals of the two semantic layers. Moreover, $\mathrm{SN}(\cdot)$ denotes spectral normalization [34], and $\mathrm{Hardswish}(\cdot)$ is the Hardswish activation function [35]. Then, the topic inference network projects the output signal $o_2$ into a $K$-dimensional document-topic distribution via the formulation below:
$$h_t = \mathrm{BN}(W_t o_2 + b_t) \tag{5}$$
$$\theta = \mathrm{softmax}(h_t) \tag{6}$$
where $W_t \in \mathbb{R}^{K \times S}$ and $b_t$ are the weight matrix and bias term, $\mathrm{BN}(\cdot)$ denotes batch normalization, $h_t$ is the hidden state of the document-topic distribution layer, and $\theta \in \{\theta_a, \theta_b\}$ denotes the inferred topic distribution.
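To make the shapes concrete, the forward pass of the inference network can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the released code: spectral normalization is assumed to rescale each weight matrix by its largest singular value, batch normalization is assumed to use batch statistics without learnable scale and shift, and all function names are hypothetical.

```python
import numpy as np

def hardswish(h):
    # Hardswish(h) = h * ReLU6(h + 3) / 6
    return h * np.clip(h + 3.0, 0.0, 6.0) / 6.0

def spectral_normalize(W):
    # Assumption: SN divides the weight matrix by its largest singular value.
    return W / np.linalg.norm(W, ord=2)

def batch_norm(h, eps=1e-5):
    # Assumption: batch statistics only, no learnable scale or shift.
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def softmax(h):
    e = np.exp(h - h.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def infer_topics(X, W1, b1, W2, b2, Wt, bt):
    """Map a batch X of shape (M, V) to topic distributions of shape (M, K)."""
    o1 = hardswish(X @ spectral_normalize(W1).T + b1)   # (M, S)
    o2 = hardswish(o1 @ spectral_normalize(W2).T + b2)  # (M, S)
    return softmax(batch_norm(o2 @ Wt.T + bt))          # (M, K)

# Toy dimensions; the weights are random stand-ins for trained parameters.
rng = np.random.default_rng(0)
M, V, S, K = 8, 50, 16, 5
X = rng.dirichlet(np.ones(V), size=M)
theta = infer_topics(
    X,
    rng.normal(size=(S, V)), np.zeros(S),
    rng.normal(size=(S, S)), np.zeros(S),
    rng.normal(size=(K, S)), np.zeros(K),
)
```

The final softmax guarantees that every row of `theta` lies on the probability simplex, so it can be compared directly with samples from a Dirichlet prior.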

3.3. Prior Matching in Topic Space

As Wallach et al. [36] argued, the multiple peaks of the Dirichlet density are suitable for capturing the multi-modality among texts; we therefore model topics with the Dirichlet prior in the latent topic space.
To match topic distributions to the pre-defined Dirichlet prior, we utilize the Maximum Mean Discrepancy (MMD) [37], a widely used tool for matching statistical distributions. MMD measures the distance between probability densities [38] by mapping their mean embeddings into a Reproducing Kernel Hilbert Space (RKHS) [39]. Concretely, given a set of inferred topic distributions $\Theta = \{\theta_a^1, \ldots, \theta_a^M, \theta_b^1, \ldots, \theta_b^M\}$ sampled from their statistical distribution $P_\Theta$ and a set of random samples $\Theta' = \{\theta'_1, \theta'_2, \ldots, \theta'_{2M}\}$ drawn from the Dirichlet prior distribution $P_{\Theta'}$, the distance between $P_\Theta$ and $P_{\Theta'}$ can be calculated with:
$$D(P_\Theta, P_{\Theta'}) = \frac{1}{2M(2M-1)} \sum_{i \neq j} \left( \kappa(\theta_i, \theta_j) + \kappa(\theta'_i, \theta'_j) \right) - \frac{1}{2M^2} \sum_{i=1}^{2M} \sum_{j=1}^{2M} \kappa(\theta_i, \theta'_j) \tag{7}$$
where $\theta_i$ denotes the $i$-th element of $\Theta$, $\kappa(\cdot, \cdot)$ denotes the kernel function (the diffusion kernel [40] in this paper), and $M$ represents the batch size.
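The MMD estimate above translates into a short NumPy routine. The sketch below is illustrative only: the kernel is one common form of the information diffusion kernel on the simplex, $\kappa(\theta, \theta') = \exp(-\arccos^2(\sum_k \sqrt{\theta_k \theta'_k}) / t)$, with its normalizing constant dropped, and both this exact form and the bandwidth `t` are assumptions, since the paper only cites the kernel.

```python
import numpy as np

def diffusion_kernel(A, B, t=1.0):
    """Assumed kernel form: exp(-arccos^2(sum_k sqrt(a_k * b_k)) / t)."""
    bc = np.clip(np.sqrt(A) @ np.sqrt(B).T, 0.0, 1.0)  # Bhattacharyya coefficient
    return np.exp(-np.arccos(bc) ** 2 / t)

def mmd(theta, theta_prior, kernel=diffusion_kernel):
    """MMD estimate between two sample sets of equal size n = 2M."""
    n = theta.shape[0]
    k_xx = kernel(theta, theta)
    k_yy = kernel(theta_prior, theta_prior)
    k_xy = kernel(theta, theta_prior)
    off = ~np.eye(n, dtype=bool)                 # exclude i == j terms
    within = (k_xx[off].sum() + k_yy[off].sum()) / (n * (n - 1))
    cross = 2.0 * k_xy.sum() / n ** 2            # equals 1 / (2 M^2) times the sum
    return within - cross

rng = np.random.default_rng(0)
prior = rng.dirichlet(np.full(10, 0.1), size=64)   # samples from the prior
other = rng.dirichlet(np.full(10, 0.1), size=64)   # same distribution
flat = np.full((64, 10), 0.1)                      # collapsed, uniform topics
d_same, d_flat = mmd(other, prior), mmd(flat, prior)
```

In this toy comparison, `d_flat` should come out clearly larger than `d_same`, which is the signal the prior matching term uses to pull inferred topic distributions toward the sparse Dirichlet prior.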

3.4. Training Objective

The principal idea of the proposed SSTM could be summarized as:
  • Utilizing the topic inference network to build the projection from the document-word distribution to the document-topic distribution and learn the high-quality topic distribution;
  • Matching the statistical distribution $P_\Theta$ of the inferred document-topic distributions to the Dirichlet prior $P_{\Theta'}$ with MMD to capture multiple patterns.
Thus, to make sure that the SSTM can mine high-quality concerns considering both interpretability and diversity, its training objective is designed as:
$$L = L_R + \beta L_M \tag{8}$$
where $L_M$ is the prior matching objective, calculated with the MMD distance defined in Equation (7), and $\beta$ denotes the coefficient of the prior matching regularizer.
On the other hand, to learn a high-quality projection and provide informative topic distributions, the representation learning objective L R should consider the requirements below:
  • Invariance: The inference network should map similar document-word distributions ($x_a$ and $x_b$) to similar document-topic distributions ($\theta_a$ and $\theta_b$).
  • Variance: Topic collapse should be prevented as it will map different document-word distributions to the same document-topic distribution and deteriorate topic diversity.
  • Covariance: Different dimensions of inferred document-topic distributions should be decorrelated, and each dimension should capture an independent semantic meaning lying behind texts.
Given two batches of inferred topic distributions $\Theta_a = \{\theta_a^1, \ldots, \theta_a^M\}$ and $\Theta_b = \{\theta_b^1, \ldots, \theta_b^M\}$, we should first ensure that similar document-word distributions have similar document-topic distributions (invariance) by calculating the averaged squared Euclidean distance between topic distribution pairs. By minimizing the Euclidean distance between the topic distributions of paired augmented documents, we ensure that semantically similar documents have similar document-topic distributions and enhance topic interpretability. The invariance objective function $L_{inv}$ is defined as:
$$L_{inv} = \frac{1}{M} \sum_{i=1}^{M} \left\| \theta_a^i - \theta_b^i \right\|_2^2 \tag{9}$$
where $M$ denotes the batch size.
Then, to prevent topic collapse, we incorporate a variance regularization term, and the variance objective function of $\Theta_a$ is formed as:
$$L_{var}(\Theta_a) = \frac{1}{K} \sum_{k=1}^{K} \max\left(0, 1 - \varsigma(\Theta_a^k, \epsilon)\right) \tag{10}$$
where $\varsigma(z, \epsilon) = \sqrt{\mathrm{Var}(z) + \epsilon}$, $\Theta_a^k$ represents the vector composed of the $k$-th dimension of all topic distributions in $\Theta_a$, and $\epsilon$ is a small scalar, set to $1 \times 10^{-4}$, for numerical stability. This regularizer pushes the variance of each dimension within the current batch toward 1, helping to prevent all inputs from collapsing onto a single vector. By maintaining the variance of each dimension of the topic distribution, this term also improves topic diversity. The variance objective function over $\Theta_a$ and $\Theta_b$ is summarized as $L_{var}$ with the formulation:
$$L_{var} = \frac{1}{2} \left( L_{var}(\Theta_a) + L_{var}(\Theta_b) \right) \tag{11}$$
where $L_{var}(\Theta_b)$ follows the formula in Equation (10).
Moreover, to disentangle the semantic meaning of topics, the covariance objective function is designed based on the covariance matrix, which is formulated as:
$$C(\Theta_a) = \frac{1}{M-1} \sum_{i=1}^{M} \left( \theta_a^i - \bar{\theta}_a \right) \left( \theta_a^i - \bar{\theta}_a \right)^T \tag{12}$$
where $\bar{\theta}_a = \frac{1}{M} \sum_{i=1}^{M} \theta_a^i$. Aiming at decorrelating the dimensions of document-topic distributions, we follow VICReg [41] and define a covariance objective function $L_{cov}$ with:
$$L_{cov} = \frac{1}{2K} \left( \sum_{i \neq j} [C(\Theta_a)]_{i,j}^2 + \sum_{i \neq j} [C(\Theta_b)]_{i,j}^2 \right) \tag{13}$$
where $C(\Theta_b)$ follows the calculation in Equation (12). The covariance objective helps the SSTM disentangle the dimensions of document-topic distributions and improve topic interpretability.
With the definitions of the invariance objective function $L_{inv}$ (Equation (9)), the variance objective function $L_{var}$ (Equation (11)), and the covariance objective function $L_{cov}$ (Equation (13)), the representation learning objective function $L_R$ can be formulated as:
$$L_R = \lambda L_{var} + \mu L_{cov} + \eta L_{inv} \tag{14}$$
where $\lambda$, $\mu$, and $\eta$ are the coefficients of the different regularization terms. The training procedure of the proposed SSTM is shown in Algorithm 1. For detailed configurations, refer to Section 4.1.4.
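The three regularizers and their weighted sum can be sketched in NumPy as below. This is a hedged illustration, not the authors' code: it assumes the VICReg-style hinge on the standard deviation $\sqrt{\mathrm{Var}(z) + \epsilon}$ and the unbiased ($M - 1$) variance estimator, and the function names are hypothetical.

```python
import numpy as np

def invariance_loss(theta_a, theta_b):
    # Mean squared Euclidean distance between paired topic distributions.
    return np.mean(np.sum((theta_a - theta_b) ** 2, axis=1))

def variance_loss(theta, eps=1e-4):
    # Hinge on the per-dimension standard deviation (assumed unbiased variance).
    std = np.sqrt(theta.var(axis=0, ddof=1) + eps)
    return np.mean(np.maximum(0.0, 1.0 - std))

def covariance_loss(theta_a, theta_b):
    def off_diag_sq(theta):
        c = np.cov(theta, rowvar=False)  # (K, K), normalized by M - 1
        return np.sum(c ** 2) - np.sum(np.diag(c) ** 2)
    k = theta_a.shape[1]
    return (off_diag_sq(theta_a) + off_diag_sq(theta_b)) / (2 * k)

def representation_loss(theta_a, theta_b, lam, mu, eta):
    l_var = 0.5 * (variance_loss(theta_a) + variance_loss(theta_b))
    return (lam * l_var
            + mu * covariance_loss(theta_a, theta_b)
            + eta * invariance_loss(theta_a, theta_b))

# A fully collapsed batch: every document gets the same topic distribution.
K = 4
collapsed = np.tile(np.full(K, 1.0 / K), (8, 1))
rng = np.random.default_rng(0)
spread = rng.dirichlet(np.full(K, 0.5), size=8)
loss_collapsed = representation_loss(collapsed, collapsed, lam=K / 2, mu=1.0, eta=K)
loss_spread = representation_loss(spread, spread, lam=K / 2, mu=1.0, eta=K)
```

For the collapsed batch, the invariance and covariance terms are both zero and only the variance hinge is active, which illustrates why the variance term is the one that counteracts topic collapse. The coefficients used here follow the settings reported in Section 4.1.4.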
Algorithm 1: Self-Supervised neural Topic Model

3.5. Topic Generation

After model training, the inference network I builds the projection from the document-word distribution layer to the document-topic distribution layer, which could be utilized for topic mining.
By feeding the identity matrix $W \in \mathbb{R}^{V \times V}$ (each row is the one-hot encoding of a word in the vocabulary) into the inference network, we can obtain the correlations between topics and words via the transformation below:
$$\Phi = I(W) \tag{15}$$
where $\Phi \in \mathbb{R}^{V \times K}$ is the correlation matrix between words and topics. By normalizing $\Phi$ column-wise and ranking the entries of each column, we obtain the topic-word distribution matrix (its $k$-th column is the topic-word distribution of the $k$-th topic), which can be used for topic extraction.
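The topic generation step can be illustrated with a small NumPy sketch. Here `toy_infer` is a hypothetical stand-in for the trained inference network (a softmax over fixed logits), and `top_words` is an assumed helper name; only the overall procedure (one-hot inputs, column-wise normalization, ranking) follows the text.

```python
import numpy as np

def top_words(infer, vocab, top_n=10):
    """Feed one-hot word vectors through `infer` and rank words per topic."""
    phi = infer(np.eye(len(vocab)))              # (V, K) word-topic correlations
    phi = phi / phi.sum(axis=0, keepdims=True)   # column-wise normalization
    order = np.argsort(-phi, axis=0)[:top_n]     # (top_n, K) word indices
    return [[vocab[v] for v in order[:, k]] for k in range(phi.shape[1])]

# Hypothetical stand-in for a trained network: softmax over fixed logits.
logits = np.array([[3.0, 0.0], [2.0, 0.0], [1.0, 0.0],
                   [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])

def toy_infer(W):
    h = W @ logits
    e = np.exp(h - h.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

vocab = ["w0", "w1", "w2", "w3", "w4", "w5"]
topics = top_words(toy_infer, vocab, top_n=3)
```

After normalization, each column of `phi` sums to 1 and can be read as a distribution over the vocabulary, so taking the largest entries of column $k$ yields the representative words of topic $k$.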

4. Experiments

In this section, we will first describe the experimental setup. Then, interpretability and diversity evaluation, a case study, hyper-parameter analysis, and an ablation study will be presented.

4.1. Experimental Setup

In this subsection, we will first provide descriptions of datasets and baselines. Following this, evaluation metrics and implementation details will be introduced.

4.1.1. Datasets

To evaluate the performance of the proposed model, we collect and process three publicly available datasets: Twitter Posts (https://www.kaggle.com/datasets/manishabhatt22/tweets-onchatgpt-chatgpt, accessed on 8 February 2024), Subreddit (https://www.kaggle.com/datasets/armitaraz/chatgpt-reddit, accessed on 8 February 2024), and User Query (https://www.kaggle.com/datasets/noahpersaud/89k-chatgpt-conversations, accessed on 8 February 2024). In more detail, the Twitter Posts dataset contains 58 K tweets with the hashtag ‘#ChatGPT’ posted on Twitter from 30 November 2022 to 24 February 2023. The Subreddit dataset consists of nearly 18 K reviews regarding ChatGPT posted on the Reddit website. Both datasets encompass diverse user attitudes and opinions on this AI-dialogue system and are suitable for public opinion mining. When collecting these data, we define the scope based on the ChatGPT communities, without considering factors such as the geographical location and age range of the commenters. The User Query dataset, on the other hand, contains all the conversations between users and ChatGPT that were available on chatlogs.net until 20 April 2023. This dataset was collected from chatlogs.net using a custom web scraper (https://github.com/P1ayer-1/chatlogs.net-scraper, accessed on 20 February 2024), and its content reflects the aspects users care about and their expectations of this powerful tool. In our experiment, we anonymize the data, for example by removing and replacing personally identifiable information, to protect personal privacy. Additionally, to address data noise, we preprocess the data before performing experiments. We first discard the non-English tweets with fastlid (https://pypi.org/project/fastlid/, accessed on 25 February 2024). Then, to remove tweet-specific noise (such as hashtags, URLs, and numbers), we employ the preprocessor (https://github.com/s/preprocessor, accessed on 25 February 2024) and twitter_preprocessor (https://github.com/vasisouv/tweets-preprocessor, accessed on 25 February 2024) libraries for cleaning. We also employ enchant (https://github.com/AbiWord/enchant, accessed on 25 February 2024) and spacy (https://github.com/explosion/spaCy, accessed on 25 February 2024) for spell-checking and lemmatization. Stopwords and low-frequency words are removed. Table 2 presents the statistics of the processed datasets, and Figure 3 provides word-cloud visualizations of the three datasets.

4.1.2. Baselines

To validate the effectiveness of the SSTM, we choose the following methods as baselines:
  • LDA [31], a probabilistic graphical model that views each document as generated by a mixture of topics; we employ the Mallet (https://mimno.github.io/Mallet/, accessed on 15 March 2024) implementation with the suggested configurations.
  • NVDM [32], a VAE-based neural topic model that models topics with the Gaussian prior in the latent topic space; the official implementation (https://github.com/ysmiao/nvdm, accessed on 15 March 2024) is used in the experiments.
  • GSM [42], a variant of the NVDM, utilizes a Gaussian distribution followed by a softmax transformation to construct the topic distribution; the official implementation (https://github.com/linkstrife/NVDM-GSM, accessed on 15 March 2024) is used.
  • ProdLDA [33], a neural topic model which uses the Logistic-Normal distribution as prior. The official implementation (https://github.com/akashgit/autoencoding_vi_for_topic_models, accessed on 16 March 2024) is utilized.
  • WLDA [43], a neural topic model built upon the Wasserstein Auto-Encoder, utilizing the Dirichlet distribution as the prior of the topic distribution; the official implementation (https://github.com/awslabs/w-lda, accessed on 17 March 2024) is used.
  • ATM [14], a neural topic model employing adversarial training; we implement the model according to the recommended configurations.
  • BAT [10], an extension of ATM which incorporates an encoder network for topic inference; we implement it with the suggestions in the original paper.
  • ETM [12], a VAE-based approach that leverages topic embedding and word embeddings to capture semantic relationships; the official implementation (https://github.com/lffloyd/embedded-topic-model, accessed on 17 March 2024) is used.
  • CNTM [7], a neural topic modeling approach based on contrastive learning in the latent topic space; the official implementation (https://github.com/SZU-AdvTech-2022/160-Contrastive-Learning-for-Neural-Topic-Model, accessed on 20 March 2024) is utilized.
  • BERTopic [13], a topic extraction tool based on the pre-trained language model (BERT); the released source code is used (https://github.com/MaartenGr/BERTopic, accessed on 22 March 2024) in our experiments.
  • CTMNeg [11], a topic mining tool that incorporates contrastive learning and negative sampling to effectively capture the underlying structure and semantics; the official implementation (https://github.com/adhyasuman/ctmneg, accessed on 22 March 2024) is used in this paper.

4.1.3. Evaluation Metrics

As interpretability and diversity are two crucial aspects of public concern extraction, we consider two types of metrics for performance evaluation. To evaluate the interpretability of the extracted concerns, topic coherence [44] (C_P, NPMI, UCI) is employed, as [45] argued that coherence metrics agree with human judgment. All these values are computed with the Palmetto library (https://github.com/dice-group/Palmetto, accessed on 12 April 2024). To evaluate concern diversity, we utilize the Unique Term Rate (UTR), calculated as $\frac{|UT|}{K \times 10}$, where $|UT|$ denotes the number of unique words among the extracted topic words, $K$ is the number of topics, and 10 is the number of top words per topic. Note that higher values indicate better concern quality.
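As a small worked example of the diversity metric, the Unique Term Rate can be computed as follows. The function name `unique_term_rate` is hypothetical, and the sketch assumes each topic is given as a ranked list of its top words.

```python
def unique_term_rate(topics, top_n=10):
    """Fraction of unique words among the top-n words of all K topics."""
    unique = {word for topic in topics for word in topic[:top_n]}
    return len(unique) / (len(topics) * top_n)

# Two toy topics with two top words each; the word "model" is shared.
toy_topics = [["student", "model"], ["model", "privacy"]]
utr = unique_term_rate(toy_topics, top_n=2)  # 3 unique words / (2 * 2) = 0.75
```

A UTR of 1 means that no word is shared between topics, while lower values indicate that topics overlap and the extracted concerns are less diverse.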

4.1.4. Implementation Details

We train the model for 50 epochs with a batch size of 32 for each topic number $K \in \{10, 20, 30, 40\}$. The Dirichlet hyper-parameter $\alpha$ is set to 0.01, and each semantic representation layer contains 400 units. Furthermore, the coefficients $\mu$, $\lambda$, and $\eta$ are set to 1, $K/2$, and $K$, respectively. Our proposed SSTM is optimized using the Adam [46] optimizer with a learning rate $lr$ of $2 \times 10^{-5}$. In our experiments, each topic is represented by a list of its top 10 words for interpretability and diversity evaluation.
Our source code relies on the following libraries: PyTorch (https://pytorch.org/, accessed on 8 March 2024), SciPy (https://scipy.org/, accessed on 8 March 2024), NumPy (https://numpy.org/, accessed on 10 March 2024), torch_two_sample (https://torch-two-sample.readthedocs.io/en/latest/, accessed on 10 March 2024), and scikit-learn (https://scikit-learn.org/stable/, accessed on 10 March 2024). All of our experiments are carried out on an Ubuntu machine equipped with a single 3090Ti GPU (CUDA version 11.7).

4.2. Interpretability and Diversity Evaluation

We have conducted a comprehensive set of experiments to assess the performance of the proposed SSTM. To assess the interpretability of the extracted concerns, we first compare the averaged topic coherence to provide a global view of the comparisons. The results are listed in Table 3. Each value is obtained by averaging the coherence values over the 10, 20, 30, and 40 topic number settings. For clarity, the best result is marked in bold, and the second-best result is underlined. The results reveal that the SSTM outperforms all the baselines on the Twitter Posts and User Query datasets. On the Subreddit dataset, the SSTM performs the best in terms of the NPMI and UCI metrics but falls slightly behind CNTM on the C_P metric.
We can observe that approaches such as CNTM and BERTopic perform well on the User Query dataset while performing worse on the others. This may be because they cannot deal with the data sparsity of short social media posts. As queries often carry richer word co-occurrence information than Twitter and Reddit posts, CNTM and BERTopic perform well on the User Query dataset. These observations suggest that topic models tend to mine higher-quality topics/concerns from longer texts.
In Figure 4, we also present the comparison of interpretability (topic coherence) across different topic numbers on the Twitter Posts, Subreddit, and User Query datasets. For the Twitter Posts dataset, the SSTM surpasses the baselines in almost all settings, except for the 10-topic setting on the UCI metric and the 40-topic setting on the C_P metric. Similarly, for the User Query dataset, it achieves the best results across all topic settings on the three metrics, except for 40 topics on the UCI metric. On the Subreddit dataset, the SSTM outperforms most baselines on the NPMI and UCI metrics across different topic numbers, but it obtains slightly worse results on the C_P metric. To provide a more intuitive comparison of the extracted concerns, Table 4 presents the concern-specific representative words obtained by the SSTM and the baseline approaches. These concerns may be related to ‘education’ and ‘marketing’.
On the other hand, we also provide a comparison of concern diversity across various topic numbers in Table 5. The experimental results demonstrate that our proposed SSTM surpasses the baselines in all topic number settings on the Twitter Posts dataset. It also performs the best on Subreddit and User Query, except for the 30 and 40 topic settings. Notably, GSM achieves favorable diversity on both of these datasets; however, this comes at the cost of topic coherence, as depicted in Figure 4.
Generally, such improvements in topic coherence and diversity may be attributed to the Dirichlet prior in the latent space together with the invariance and covariance regularizers. These comparisons also suggest that the variance regularizer helps to address the topic collapse issue. Additionally, for the same metrics, approaches typically achieve better results on the User Query dataset, which may be caused by data characteristics such as longer text length and richer semantic information.
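The diversity comparison above relies on the Unique Term Rate. As a minimal sketch, assuming the metric is the fraction of distinct words among the top words of all topics (this definition and the name `unique_term_rate` are assumptions for illustration, not taken from the authors' code), it could be computed as follows:

```python
def unique_term_rate(topics):
    """Fraction of distinct words among the top words of all topics.

    `topics` is a list of topics, each given as a list of its top words.
    Assumed definition: number of unique words / total number of listed words.
    """
    all_words = [word for topic in topics for word in topic]
    if not all_words:
        raise ValueError("topics must contain at least one word")
    return len(set(all_words)) / len(all_words)


# Toy example (hypothetical): two topics that share the word "student".
toy_topics = [
    ["student", "school", "essay"],
    ["student", "market", "company"],
]
print(unique_term_rate(toy_topics))  # 5 distinct words out of 6 listed
```

Under this definition, a value of 1.000 means that no word is shared between topics, while a low value indicates that many topics repeat the same words.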

4.3. Hyper-Parameter Analysis

To explore whether the performance of SSTM exhibits variations with changes in hyper-parameters (e.g., the learning rate lr, the Dirichlet prior α, the number of hidden units S, and the probability p in text augmentation), we conduct the hyper-parameter analysis in this subsection.
Specifically, considering practical issues such as experiment duration, we opted to conduct this experiment on the User Query dataset with the topic number set to 10. Likewise, C_P, UCI, and NPMI are utilized to measure the extracted concerns. Extensive experiments have demonstrated that the proposed SSTM performs stably when the parameters vary within the specified range, and experimental results are listed in Table 6. Furthermore, to present a more straightforward comparison, we visualize the results in Figure 5, and other parameters follow the default configurations. We will discuss the parameter analysis in more detail below.
  • Learning Rate (lr): To check whether the performance of the SSTM varies with the learning rate lr, we conduct experiments with lr set to five values, {$9 \times 10^{-6}$, $1 \times 10^{-5}$, $2 \times 10^{-5}$, $3 \times 10^{-5}$, $4 \times 10^{-5}$}. The topic coherence results, listed in Table 6, show that the proposed SSTM outperforms the baseline approaches under all five learning-rate settings and performs best when lr is set to $2 \times 10^{-5}$. The comparison in Figure 5a further indicates that the SSTM is insensitive to the choice of learning rate.
  • Dirichlet Prior (α): As the Dirichlet distribution plays a crucial role in mining multiple patterns for topic models, we also analyze the hyper-parameter α of the Dirichlet density, with its value set to {0.005, 0.01, 0.05, 0.1, 0.2}. Table 6 presents the results of the SSTM with various settings of α. We can observe that the SSTM performs stably across all five settings and obtains the optimal coherence when α is set to 0.01. The visualization in Figure 5b reveals the same finding.
  • Number of Hidden Units (S): Likewise, to investigate the influence of the number of hidden units S on extraction performance, we employ five different settings {300, 350, 400, 450, 500} of S and conduct the corresponding analysis. As illustrated in Table 6, the SSTM with different numbers of hidden units outperforms the baseline approaches and performs the best when S is set to 450. Moreover, Figure 5c presents a more intuitive comparison.
  • Probability of Augmentation (p): To explore the impact of the probability p used in the text augmentation operation, we conduct experiments with five different settings, $p \in \{5, 10, 15, 20, 25\}$. As shown by the statistics in Table 6, the proposed SSTM remains stable as the augmentation probability varies and consistently surpasses the competitive baselines, performing best when p is set to 10. The same findings are presented in Figure 5d.
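The analysis above varies one hyper-parameter at a time while the others stay fixed. A minimal sketch of such a sweep is given below. The grid values are the ones listed in the bullets (with the learning-rate exponents read as negative). The default configuration is not stated in this section, so `DEFAULTS` uses the best-performing values as placeholders, and all names are hypothetical.

```python
# Values swept in the hyper-parameter analysis (from the bullets above;
# learning rates are read as 9e-6 ... 4e-5).
GRID = {
    "lr": [9e-6, 1e-5, 2e-5, 3e-5, 4e-5],
    "alpha": [0.005, 0.01, 0.05, 0.1, 0.2],
    "hidden_units": [300, 350, 400, 450, 500],
    "aug_prob": [5, 10, 15, 20, 25],
}

# ASSUMPTION: the real defaults are not given in this section; the
# best-performing values reported above are used here as placeholders.
DEFAULTS = {"lr": 2e-5, "alpha": 0.01, "hidden_units": 450, "aug_prob": 10}


def one_at_a_time(grid, defaults):
    """Yield (name, config) pairs that vary a single hyper-parameter."""
    for name, values in grid.items():
        for value in values:
            config = dict(defaults)  # copy so the defaults stay untouched
            config[name] = value
            yield name, config


configs = list(one_at_a_time(GRID, DEFAULTS))
print(len(configs))  # 4 hyper-parameters x 5 values = 20 runs
```

Each generated configuration would then be passed to a training and evaluation routine, which is outside the scope of this sketch.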

4.4. Human Evaluation

To assess the topic quality of the SSTM and baseline approaches, we also perform a human evaluation on the User Query dataset with the topic number set to 10. In detail, we recruit two experts to identify the out-of-topic words, i.e., words that are semantically inconsistent with the topic, and to count them. Finally, we average the counts and present the comparison in Table 7.
From the statistics, we could observe that topics extracted by the SSTM contain fewer out-of-topic words compared with other baselines. This further validates the superior performance of the SSTM in terms of topic coherence.

4.5. Case Study

We also carry out case studies and present concern examples in Figure 6 to validate the effectiveness of the SSTM. From the examples, we can observe that the SSTM uncovers concerns of Twitter users about “Student Cheating with ChatGPT” and “Impact on the Investment Market”. Likewise, the SSTM mines concerns about “Assistant on math” and “Solving Programming Puzzles using ChatGPT” from the Subreddit and User Query datasets.
To verify the faithfulness of the extracted concerns, we use Sentence-BERT [47] to design a retrieval scheme. Posts are selected according to the cosine similarity between the contextualized embeddings of the topic words and those of the texts in the corpus.
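The retrieval step can be sketched as a cosine-similarity ranking. In the minimal example below, the embeddings are toy three-dimensional vectors standing in for sentence-encoder outputs, and the function names `cosine_similarity` and `retrieve_posts` are hypothetical. It illustrates the ranking idea only and is not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def retrieve_posts(topic_embedding, post_embeddings, top_k=3):
    """Return (index, similarity) for the top_k posts closest to the topic."""
    scored = [
        (i, cosine_similarity(topic_embedding, emb))
        for i, emb in enumerate(post_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (hypothetical) in place of real sentence embeddings.
topic_vec = [1.0, 0.0, 0.0]
post_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
best = retrieve_posts(topic_vec, post_vecs, top_k=2)
print([index for index, _ in best])  # indices of the two most similar posts
```

In practice, the topic embedding would be obtained by encoding the topic's representative words, and each post embedding by encoding the post text with the same encoder.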
We choose three distinct user concerns from three datasets and present the topic words along with corresponding posts or queries. The concerns in the Twitter Posts and Subreddit datasets primarily revolve around discussions regarding ChatGPT itself and its influence. On the other hand, the concerns in the User Query dataset highlight the specific tasks that users expect ChatGPT to address or complete. The above three results reveal the hot spots of public discussion on ChatGPT from various views.

4.6. Data Analysis

Due to the user privacy protection mechanisms of social media platforms, we cannot access private user information, such as age and gender. However, we conduct data analysis and present the differences in concerns extracted from different platforms (Twitter, Reddit, and Query). In this part, the topic number is set to 10, and we present each mined concern with one keyphrase, as shown in Table 8.
It can be observed that several common concerns are discussed by users from different platforms, such as “academic writing tool” extracted from both Subreddit and User Query. Likewise, there are also some platform-specific user concerns like the “programming assistant” and “student cheating”. These extracted concerns/topics present a comprehensive understanding of public opinion on ChatGPT.

4.7. Ablation Study

As the regularization terms and document representation approaches play key roles in our proposed SSTM, we conduct the corresponding ablation in this subsection to explore their impacts on topic modeling performance.

4.7.1. Regularization Term Ablation

To explore the roles of invariance, variance, and covariance regularization terms in the SSTM, we add one more dataset (Subreddit) in this subsection and conduct the ablation study on the Subreddit and User Query datasets with the topic number set to 10. The corresponding results are listed in Table 9; $\mathrm{SSTM}_{i}$, $\mathrm{SSTM}_{v}$, and $\mathrm{SSTM}_{c}$ denote the variants that exclude the invariance, variance, and covariance terms, respectively.
Here, for a clearer comparison, we have retained the results of ETM and BAT, as their performance is quite competitive when the number of topics is set to 10. From the statistics, we observe that $\mathrm{SSTM}_{i}$ performs the worst among SSTM variants, and this may be attributed to the invariance objective having a high impact on topic quality. Similarly, we observe that $\mathrm{SSTM}_{c}$ performs slightly worse than the SSTM. This discrepancy can be attributed to the covariance regularizer, which aids in decorrelating dimensions of the topic distribution and segregating irrelevant words into different topics for interpretable topic extraction. Additionally, we find a decline in the UTR metric for $\mathrm{SSTM}_{v}$, which reveals that the variance regularizer has a positive impact on topic diversity. On the other hand, though these variants perform worse than the SSTM, they also surpass the competitive ETM on all the coherence and diversity metrics, which can be ascribed to the usage of the self-supervised learning framework and the Dirichlet prior.
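To illustrate what each ablated term contributes, the sketch below implements invariance, variance, and covariance terms in the general form used by VICReg [41]. The exact formulation, thresholds, and weights of the SSTM are not restated in this section, so the constants `gamma` and `eps`, the unit weights, and all function names are assumptions.

```python
import math


def column_means(z):
    """Per-dimension mean of a batch z (list of equal-length rows)."""
    n = len(z)
    return [sum(row[j] for row in z) / n for j in range(len(z[0]))]


def invariance_term(za, zb):
    """Mean squared Euclidean distance between the two augmented views."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(ra, rb)) for ra, rb in zip(za, zb)
    )
    return total / len(za)


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation (penalizes collapse)."""
    n, d = len(z), len(z[0])  # requires a batch of at least two rows
    means = column_means(z)
    total = 0.0
    for j in range(d):
        var = sum((row[j] - means[j]) ** 2 for row in z) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance_term(z):
    """Sum of squared off-diagonal covariances (decorrelates dimensions)."""
    n, d = len(z), len(z[0])
    means = column_means(z)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(
                    (row[i] - means[i]) * (row[j] - means[j]) for row in z
                ) / (n - 1)
                total += cov ** 2
    return total / d


def representation_loss(za, zb, use_inv=True, use_var=True, use_cov=True):
    """Unit-weight sum; switching a flag off mimics one ablation variant."""
    loss = 0.0
    if use_inv:
        loss += invariance_term(za, zb)
    if use_var:
        loss += variance_term(za) + variance_term(zb)
    if use_cov:
        loss += covariance_term(za) + covariance_term(zb)
    return loss


# Toy batches (hypothetical): identical rows versus rows that differ.
collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
spread = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(variance_term(collapsed) > variance_term(spread))  # collapse costs more
```

Setting `use_var=False` in this sketch removes the only term that penalizes identical outputs for all documents, which mirrors the diversity drop reported for the variant without the variance regularizer.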

4.7.2. Document Representation Ablation

To examine whether incorporating external semantic knowledge improves modeling ability and to explore the impact of the document representation approaches, we conduct this experiment on the Subreddit dataset using a pre-trained language model to represent documents. Concretely, for each document that contains the word sequence $\{w_1, w_2, \ldots, w_n\}$, we first generate the contextualized word representations $\{e_1, e_2, \ldots, e_n\}$. Then, the document is represented with the averaged word-level representations, computed as $\frac{1}{n}\sum_{i=1}^{n} e_i$. The detailed comparison is listed in Table 10.
As shown in Table 10, the SSTM with the pre-trained language model ($\mathrm{SSTM}_{P}$) outperforms the SSTM and the competitive LDA across all the coherence and diversity metrics for all the topic number settings. For clarity, we only retain LDA as the baseline, as it performs the second best according to Table 3. Notably, the comparison indicates that injecting external semantic knowledge can indeed improve the modeling ability of the SSTM.
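The averaging step can be illustrated with a small sketch. The word vectors below are toy values rather than outputs of a real pre-trained language model, and the function name `mean_pool` is hypothetical.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one document vector."""
    if not word_vectors:
        raise ValueError("document has no word vectors")
    n = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[k] for vec in word_vectors) / n for k in range(dim)]


# Three toy 2-dimensional word representations e_1, e_2, e_3.
toy_word_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(toy_word_vectors))  # [3.0, 4.0]
```

Mean pooling keeps the document vector at the same dimensionality as the word vectors regardless of document length, which makes it a convenient fixed-size input for the inference network.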

5. Discussion

This study is motivated by the rapid development of the ChatGPT dialogue system, which has sparked extensive discussions and generated a large volume of text data. Such user-generated texts contain potential public expectations and concerns, and understanding these is crucial for the development of ChatGPT. We collect and process relevant data from three different social media platforms to mine public concerns. Although self-supervised learning has been widely applied in natural language processing and computer vision, there has been limited exploration of topic models using representation learning techniques. Thus, we formulate topic modeling as a representation learning task and propose the Self-Supervised Neural Topic Model (SSTM). To ensure topic quality and diversity, three training objectives (invariance, variance, and covariance) are incorporated into the SSTM. The experimental results demonstrate the superior performance of the SSTM in discovering interpretable concerns/topics. The results also show that the SSTM extracts higher-quality concerns/topics from the User Query dataset than from Twitter Posts and Subreddit, which we attribute to the data characteristics. Specifically, queries often contain more words than tweets and posts and provide sufficient co-occurrence information, which is crucial for topic modeling and pattern discovery. Thus, we conclude that our proposed SSTM tends to mine higher-quality concerns/topics from platforms that provide longer texts.
Given the effectiveness of the SSTM in mining public concerns, it also has practical implications. For example, it could provide technical support for companies and policymakers to discover public concerns about other topics. In addition, we will explore the design of new policies (such as privacy enhancement measures and transparency reports) to address the ethical concerns about these AI technologies, especially in sensitive areas like education and healthcare.

6. Conclusions

In this paper, we propose the Self-Supervised Neural Topic Model, a novel neural topic modeling approach based on representation learning, for mining high-quality concerns about ChatGPT. In more detail, the SSTM utilizes the Dirichlet prior in the latent topic space to capture multiple semantic patterns. Additionally, to ensure the quality of the extracted concerns, it incorporates three regularization terms (invariance, variance, and covariance) for model training. Extensive experiments on three ChatGPT-related datasets show that our proposal can mine higher-quality public concerns about this advanced dialogue system in terms of interpretability and diversity. As proposed, the SSTM employs only bag-of-words information and neglects the sparse characteristics of the review text and the contextual information. Our future work will therefore integrate external semantic knowledge (such as word embeddings and pre-trained language models) into the modeling process and design a knowledge-enhanced topic model for the public concern extraction task.

Author Contributions

Conceptualization, R.W. and Z.H.; methodology, R.W. and X.L.; formal analysis and investigation, P.R. and S.C.; Writing—original draft preparation, X.L.; Writing—review and editing, P.R. and S.C.; funding acquisition, R.W. and Z.H.; visualization, X.L. and H.H.; supervision, G.S. and Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Nos. 62102192, 62162063), the Science and Technology Base and Talent Program of Guangxi under Grant number 2021AC19308 (Project contract number GuikeAD22035110), the Innovation and Entrepreneurship Program of Jiangsu Province (JSSCBS20210530), the fellowship of China Postdoctoral Science Foundation (2022M710071), the Introduction of Talent Research and Research Fund of Nanjing University of Posts and Telecommunications (NY220132), and the Fundamental Research Funds for the Central Universities (aiia-24-01).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We extend our gratitude to the anonymous reviewers for their valuable feedback and beneficial recommendations. We strictly use the datasets referenced in the paper within the scope of their licenses, and we have removed personal information during the data preprocessing stage to protect privacy. Furthermore, we will only apply the data for academic research purposes.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Full Name |
|---|---|
| SSTM | Self-Supervised Neural Topic Model |
| VAE | Variational Auto-Encoder |
| GANs | Generative Adversarial Networks |
| SSL | Self-Supervised Learning |
| NTMs | Neural Topic Models |
| LDA | Latent Dirichlet Allocation |
| NVDM | Neural Variational Document Model |
| BAT | Bidirectional Adversarial-neural Topic |

References

  1. Liu, Y.; Han, T.; Ma, S.; Zhang, J.; Yang, Y.; Tian, J.; He, H.; Li, A.; He, M.; Liu, Z.; et al. Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models. arXiv 2023, arXiv:2304.01852. [Google Scholar]
  2. Abdullah, M.; Madain, A.; Jararweh, Y. ChatGPT: Fundamentals, Applications and Social Impacts. In Proceedings of the Ninth International Conference on Social Networks Analysis, Management and Security, SNAMS 2022, Milan, Italy, 29 November–1 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–8. [Google Scholar] [CrossRef]
  3. Zheng, S.; Yahya, Z.; Wang, L.; Zhang, R.; Hoshyar, A.N. Multiheaded deep learning chatbot for increasing production and marketing. Inf. Process. Manag. 2023, 60, 103446. [Google Scholar] [CrossRef]
  4. Hill-Yardin, E.; Hutchinson, M.; Laycock, R.; Spencer, S. A Chat (GPT) about the future of scientific publishing. Brain Behav. Immun. 2023, 110, 152–154. [Google Scholar] [CrossRef] [PubMed]
  5. Li, Z.; Hu, H.; Wang, H.; Cai, L.; Zhang, H.; Zhang, K. Why does the president tweet this? Discovering reasons and contexts for politicians’ tweets from news articles. Inf. Process. Manag. 2022, 59, 102892. [Google Scholar] [CrossRef]
  6. Zhao, H.; Phung, D.Q.; Huynh, V.; Jin, Y.; Du, L.; Buntine, W.L. Topic Modelling Meets Deep Neural Networks: A Survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Montreal, QC, Canada, 19–27 August 2021; pp. 4713–4720. [Google Scholar] [CrossRef]
  7. Nguyen, T.; Luu, A.T. Contrastive Learning for Neural Topic Model. In Advances in Neural Information Processing Systems 34, Proceedings of the Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, Virtual, 6–14 December 2021; Neural Information Processing Systems Foundation, Inc.: La Jolla, CA, USA, 2021; pp. 11974–11986. [Google Scholar]
  8. Wang, R.; Zhou, D.; He, Y. Open Event Extraction from Online Text using a Generative Adversarial Network. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 282–291. [Google Scholar] [CrossRef]
  9. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  10. Wang, R.; Hu, X.; Zhou, D.; He, Y.; Xiong, Y.; Ye, C.; Xu, H. Neural Topic Modeling with Bidirectional Adversarial Training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 340–350. [Google Scholar] [CrossRef]
  11. Adhya, S.; Lahiri, A.; Sanyal, D.K.; Das, P.P. Improving Contextualized Topic Models with Negative Sampling. In Proceedings of the ICON, International Conference on Cognitive Neuroscience, Barcelona, Spain, 24–28 April 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 128–138. [Google Scholar]
  12. Dieng, A.B.; Ruiz, F.J.R.; Blei, D.M. Topic Modeling in Embedding Spaces. Trans. Assoc. Comput. Linguist. 2020, 8, 439–453. [Google Scholar] [CrossRef]
  13. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar]
  14. Wang, R.; Zhou, D.; He, Y. ATM: Adversarial-neural Topic Model. Inf. Process. Manag. 2019, 56, 102098. [Google Scholar] [CrossRef]
  15. Zhou, X.; Bu, J.; Zhou, S.; Yu, Z.; Zhao, J.; Yan, X. Improving topic disentanglement via contrastive learning. Inf. Process. Manag. 2023, 60, 103164. [Google Scholar] [CrossRef]
  16. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual, 18–24 July 2021; Volume 139, pp. 12310–12320. [Google Scholar]
  17. Misra, I.; van der Maaten, L. Self-Supervised Learning of Pretext-Invariant Representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6706–6716. [Google Scholar] [CrossRef]
  18. Hendrycks, D.; Mazeika, M.; Kadavath, S.; Song, D. Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. In Proceedings of the Neural Information Processing Systems NeurIPS, Vancouver, BC, Canada, 8–14 December 2019; pp. 15637–15648. [Google Scholar]
  19. Bijoy, M.B.; Pebbeti, B.P.; Akondi, S.M.; Fathaah, S.A.; Raut, A.; Pournami, P.N.; Jayaraj, P.B. Deep Cleaner—A Few Shot Image Dataset Cleaner Using Supervised Contrastive Learning. IEEE Access 2023, 11, 18727–18738. [Google Scholar] [CrossRef]
  20. Fini, E.; Astolfi, P.; Alahari, K.; Alameda-Pineda, X.; Mairal, J.; Nabi, M.; Ricci, E. Semi-supervised learning made simple with self-supervised clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 3187–3197. [Google Scholar] [CrossRef]
  21. Bardes, A.; Ponce, J.; LeCun, Y. VICRegL: Self-Supervised Learning of Local Visual Features. In Advances in Neural Information Processing Systems 35, Proceedings of the Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Neural Information Processing Systems Foundation, Inc.: La Jolla, CA, USA, 2022. [Google Scholar]
  22. Yan, Y.; Li, R.; Wang, S.; Zhang, F.; Wu, W.; Xu, W. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Bangkok, Thailand, 1–6 August 2021; Volume 1, pp. 5065–5075. [Google Scholar]
  23. Zhu, H.; Zheng, Z.; Soleymani, M.; Nevatia, R. Self-Supervised Learning for Sentiment Analysis via Image-Text Matching. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Singapore, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1710–1714. [Google Scholar] [CrossRef]
  24. Mu, L.; Jin, P.; Zhang, Y.; Zhong, H.; Zhao, J. Synonym recognition from short texts: A self-supervised learning approach. Expert Syst. Appl. 2023, 224, 119966. [Google Scholar] [CrossRef]
  25. Luo, X.; Ju, W.; Gu, Y.; Mao, Z.; Liu, L.; Yuan, Y.; Zhang, M. Self-supervised Graph-level Representation Learning with Adversarial Contrastive Learning. ACM Trans. Knowl. Discov. Data 2024, 18, 1–23. [Google Scholar] [CrossRef]
  26. Yi, S.; Ju, W.; Qin, Y.; Luo, X.; Liu, L.; Zhou, Y.; Zhang, M. Redundancy-Free Self-Supervised Relational Learning for Graph Clustering. arXiv 2023, arXiv:2309.04694. [Google Scholar] [CrossRef]
  27. Ju, W.; Wang, Y.; Qin, Y.; Mao, Z.; Xiao, Z.; Luo, J.; Yang, J.; Gu, Y.; Wang, D.; Long, Q.; et al. Towards Graph Contrastive Learning: A Survey and Beyond. arXiv 2024, arXiv:2405.11868. [Google Scholar] [CrossRef]
  28. Ju, W.; Gu, Y.; Mao, Z.; Qiao, Z.; Qin, Y.; Luo, X.; Xiong, H.; Zhang, M. GPS: Graph Contrastive Learning via Multi-scale Augmented Views from Adversarial Pooling. arXiv 2024, arXiv:2401.16011. [Google Scholar] [CrossRef]
  29. Yang, J.; Xu, H.; Mirzoyan, S.; Chen, T.; Liu, Z.; Liu, Z.; Ju, W.; Liu, L.; Xiao, Z.; Zhang, M.; et al. Poisoning medical knowledge using large language models. Nat. Mach. Intell. 2024, 6, 1156–1168. [Google Scholar] [CrossRef]
  30. Huang, J.; Chen, L.; Guo, T.; Zeng, F.; Zhao, Y.; Wu, B.; Yuan, Y.; Zhao, H.; Guo, Z.; Zhang, Y.; et al. MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation. arXiv 2024, arXiv:2407.00468. [Google Scholar]
  31. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  32. Miao, Y.; Yu, L.; Blunsom, P. Neural Variational Inference for Text Processing. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, 19–24 June 2016; Volume 48, pp. 1727–1736. [Google Scholar]
  33. Srivastava, A.; Sutton, C. Autoencoding Variational Inference For Topic Models. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]
  34. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  35. Howard, A.; Pang, R.; Adam, H.; Le, Q.V.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.; Tan, M.; Chu, G.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1314–1324. [Google Scholar] [CrossRef]
  36. Wallach, H.M.; Mimno, D.M.; McCallum, A. Rethinking LDA: Why Priors Matter. In Advances in Neural Information Processing Systems 22, Proceedings of the 23rd Annual Conference on Neural Information Processing Systems 2009, Vancouver, BC, Canada, 7–10 December 2009; Curran Associates, Inc.: Newry, UK, 2009; pp. 1973–1981. [Google Scholar]
  37. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A.J. A Kernel Method for the Two-Sample-Problem; MIT Press: Cambridge, MA, USA, 2006; pp. 513–520. [Google Scholar]
  38. Baktashmotlagh, M.; Harandi, M.T.; Salzmann, M. Distribution-Matching Embedding for Visual Domain Adaptation. J. Mach. Learn. Res. 2016, 17, 1–30. [Google Scholar]
  39. Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  40. Lafferty, J.D.; Lebanon, G. Diffusion Kernels on Statistical Manifolds. J. Mach. Learn. Res. 2005, 6, 129–163. [Google Scholar]
  41. Bardes, A.; Ponce, J.; Lecun, Y. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. In Proceedings of the ICLR 2022—International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  42. Miao, Y.; Grefenstette, E.; Blunsom, P. Discovering Discrete Latent Topics with Neural Variational Inference. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017; PMLR: Breckenridge, CO, USA, 2017; Volume 70, pp. 2410–2419. [Google Scholar]
  43. Tolstikhin, I.O.; Bousquet, O.; Gelly, S.; Schölkopf, B. Wasserstein Auto-Encoders. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  44. Röder, M.; Both, A.; Hinneburg, A. Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM 2015, Shanghai, China, 2–6 February 2015; ACM: New York, NY, USA, 2015; pp. 399–408. [Google Scholar] [CrossRef]
  45. Chang, J.D.; Boyd-Graber, J.L.; Gerrish, S.; Wang, C.; Blei, D.M. Reading Tea Leaves: How Humans Interpret Topic Models. In Advances in Neural Information Processing Systems 22, Proceedings of the 23rd Annual Conference on Neural Information Processing Systems 2009, Vancouver, BC, Canada, 7–10 December 2009; Curran Associates, Inc.: Newry, UK, 2009; pp. 288–296. [Google Scholar]
  46. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  47. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3980–3990. [Google Scholar] [CrossRef]
Figure 1. Three tweet examples of ChatGPT which convey different concerns; colored words are representative words of specific concerns, where yellow means ‘Education’, green denotes ‘Security’, and blue represents ‘Technical Basis’.
Figure 2. The overall framework of proposed Self-Supervised Neural Topic Model.
Figure 3. Word-cloud visualizations of three datasets.
Figure 4. The comparison of interpretability (topic coherence) with different topic numbers on Twitter Posts, Subreddit, and User Query datasets.
Figure 5. Performance analysis on various lr, α, S, and p. (We only retain ETM (second best for four topic number settings according to Table 3) and BAT (performs the second best with the topic number set to 10) for clarity).
Figure 6. Concern examples and representative posts/queries.
Table 1. The notations used in this paper and their descriptions.
| Symbol | Description |
|---|---|
| Data Representation and Distribution | |
| $X$ | a batch of documents |
| $x$ | a document represented with tf-idf |
| $x_a$, $x_b$ | augmented documents of $x$ |
| $\theta \in \{\theta_a, \theta_b\}$ | the inferred document-topic distributions for augmented documents |
| $\Theta$ | a set of inferred document-topic distributions |
| $\Theta$ | a set of random samples drawn from the Dirichlet prior |
| $\alpha$ | the hyper-parameter of the Dirichlet prior distribution |
| $P_{\Theta}$ | the statistical distribution of inferred document-word distributions |
| $P_{\Theta}$ | the Dirichlet prior distribution |
| Model Parameter | |
| $I$ | the inference network |
| $K$, $V$, $S$ | topic number, vocabulary size, and the number of units in the semantic representation layer |
| $W_1$, $W_2$, $W_t$ | the weight matrices of different layers in $I$ |
| $b_1$, $b_2$, $b_t$ | the bias terms in $I$ |
| $h_1$, $h_2$, $h_t$ | hidden states of different layers in $I$ |
| $o_1$, $o_2$ | activated signals of hidden layers in $I$ |
| $\mathrm{SN}(\cdot)$, $\mathrm{BN}(\cdot)$ | the spectral normalization and the batch normalization |
| $\mathrm{Hardswish}(\cdot)$ | the Hardswish activation function |
| $\kappa(\cdot, \cdot)$ | the kernel function |
| $L_{inv}$ | the invariance objective |
| $L_{var}$ | the variance objective |
| $L_{cov}$ | the covariance objective |
| $L_{R}$ | the representation learning objective |
| $L_{M}$ | the prior matching objective |
| Parameter of Training Procedure | |
| $E$, $M$, $B$ | number of epochs, batch size, and number of batches |
| $W$ | $V$-dimensional diagonal matrix |
| $\beta$, $\lambda$, $\mu$, $\eta$ | the coefficients of the regularizers |
| $\Phi$ | correlation matrix between words and topics |
| $\Phi$ | topic-word distribution matrix |
| $p$ | probability used in text augmentation |
Table 2. Statistics of the processed datasets.
| Dataset | #Documents | Avg Length | Vocab Size |
|---|---|---|---|
| Twitter Posts | 57,724 | 14.30 | 6437 |
| Subreddit | 17,438 | 25.31 | 4221 |
| User Query | 18,258 | 66.56 | 5055 |
Table 3. Averaged topic coherence with four topic settings [10, 20, 30, 40] and the relative improvement rate. (The relative improvement is computed as $\frac{R_b - R_s}{R_b - R_w}\%$, where $R_b$ is the best result, $R_s$ is the second-best result, and $R_w$ is the worst result.)
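Model | Twitter Posts C_P | Twitter Posts NPMI | Twitter Posts UCI | Twitter Posts Rank | Subreddit C_P | Subreddit NPMI | Subreddit UCI | Subreddit Rank | User Query C_P | User Query NPMI | User Query UCI | User Query Rank
LDA | 0.1796 | 0.0274 | 0.0364 | 2 | 0.1816 | 0.0334 | 0.1967 | 2 | 0.2103 | 0.0395 | 0.3096 | 3
NVDM | 0.0868 | 0.0075 | −0.0357 | 8 | 0.0943 | 0.0184 | 0.1373 | 6 | 0.1166 | 0.0107 | −0.0208 | 8
WLDA | 0.1607 | 0.0131 | −0.4369 | 10 | 0.1892 | 0.0151 | −0.5806 | 8 | 0.2008 | 0.0174 | −0.4538 | 9
GSM | 0.0210 | −0.0209 | −0.6608 | 12 | 0.0224 | 0.0121 | −0.5007 | 11 | 0.0495 | −0.0153 | −0.5682 | 11
ProdLDA | 0.1636 | 0.0196 | −0.2222 | 6 | 0.1711 | 0.0172 | −0.2856 | 7 | 0.1350 | 0.0041 | −0.6050 | 10
ATM | 0.1059 | 0.0050 | −0.2510 | 11 | 0.0992 | 0.0110 | −0.1073 | 12 | 0.1629 | 0.0189 | −0.0815 | 7
BAT | 0.1738 | 0.0223 | −0.0791 | 4 | 0.1847 | 0.0285 | 0.0559 | 4 | 0.2123 | 0.0246 | −0.2201 | 6
ETM | 0.1439 | 0.0188 | −0.2885 | 9 | 0.1434 | 0.0234 | −0.2135 | 5 | 0.2247 | 0.0397 | 0.0985 | 2
CNTM | 0.1589 | 0.0176 | −0.2257 | 7 | 0.2519 | 0.0328 | −0.2509 | 3 | 0.2834 | 0.0394 | −0.0779 | 4
BERTopic | 0.2177 | 0.0201 | −0.6048 | 5 | 0.1117 | 0.0144 | −1.0330 | 10 | 0.2777 | 0.0349 | −0.2977 | 5
CTMNeg | 0.1927 | 0.0285 | −0.0363 | 3 | 0.0387 | 0.0369 | −1.4599 | 9 | 0.0745 | −0.0271 | −1.6668 | 12
SSTM | 0.2196 (1%) | 0.0401 (19%) | 0.1555 (14.6%) | 1 | 0.2321 (8.6%) | 0.0430 (30%) | 0.2401 (2.6%) | 1 | 0.2983 (6%) | 0.0581 (21.6%) | 0.4084 (4.8%) | 1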
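As a worked check of the relative improvement rate defined in the caption of Table 3, the following sketch recomputes the rate for the Twitter Posts columns. The function name `relative_improvement` is hypothetical. The input triples are read from Table 3: the SSTM score as the best result, the strongest baseline as the second-best result, and the weakest baseline (GSM) as the worst result.

```python
def relative_improvement(best: float, second: float, worst: float) -> float:
    """Return (R_b - R_s) / (R_b - R_w), expressed as a percentage."""
    spread = best - worst
    if spread == 0:
        raise ValueError("best and worst results coincide; the rate is undefined")
    return 100.0 * (best - second) / spread


# (best, second best, worst) per metric on Twitter Posts, read from Table 3.
twitter = {
    "C_P": (0.2196, 0.2177, 0.0210),
    "NPMI": (0.0401, 0.0285, -0.0209),
    "UCI": (0.1555, 0.0364, -0.6608),
}
rates = {metric: relative_improvement(*values) for metric, values in twitter.items()}
for metric, rate in rates.items():
    print(f"{metric}: {rate:.1f}%")
```

Under these inputs the computed rates come out at roughly 1%, 19%, and 14.6%, which agrees with the percentages listed for SSTM on Twitter Posts.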
Table 4. Concern (topic) examples extracted by SSTM and the baselines from the User Query dataset. Italics indicate out-of-topic words.
ATM Education | ATM Marketing | BAT Education | BAT Marketing | CNTM Education | CNTM Marketing | BERTopic Education | BERTopic Marketing | SSTM Education | SSTM Marketing
student | product | student | market | student | bank | student | team | student | product
school | month | school | company | school | fund | topic | management | school | business
study | service | experience | month | learn | rate | sentence | business | research | company
university | company | skill | business | academic | banking | question | marketing | paper | team
teacher | platform | learn | investment | course | financial | essay | customer | essay | customer
learn | team | team | water | teacher | market | learn | product | course | market
science | business | research | cost | research | payment | attribution | development | learn | service
subject | marketing | development | home | education | investment | course | platform | study | marketing
course | digital | project | team | university | inflation | thesis | service | skill | sale
teach | client | design | insurance | classroom | invest | language | company | teacher | management
Table 5. The comparison of diversity (Unique Term Rate). The best result is marked in bold, and the second best is marked with an underline.
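Model | Twitter Posts K=10 | Twitter Posts K=20 | Twitter Posts K=30 | Twitter Posts K=40 | Subreddit K=10 | Subreddit K=20 | Subreddit K=30 | Subreddit K=40 | User Query K=10 | User Query K=20 | User Query K=30 | User Query K=40
LDA | 0.960 | 0.900 | 0.913 | 0.887 | 0.920 | 0.975 | 0.960 | 0.887 | 0.960 | 0.960 | 0.973 | 0.948
NVDM | 0.660 | 0.600 | 0.667 | 0.608 | 0.530 | 0.430 | 0.370 | 0.395 | 0.710 | 0.480 | 0.377 | 0.343
WLDA | 0.900 | 0.745 | 0.550 | 0.405 | 0.910 | 0.615 | 0.580 | 0.520 | 0.970 | 0.800 | 0.603 | 0.530
GSM | 1.000 | 1.000 | 0.970 | 0.963 | 1.000 | 0.980 | 0.980 | 0.995 | 1.000 | 1.000 | 1.000 | 0.998
ProdLDA | 1.000 | 0.965 | 0.887 | 0.790 | 0.920 | 0.765 | 0.587 | 0.522 | 0.990 | 0.870 | 0.813 | 0.740
ATM | 0.930 | 0.755 | 0.803 | 0.750 | 0.920 | 0.755 | 0.730 | 0.675 | 0.980 | 0.855 | 0.803 | 0.733
BAT | 0.720 | 0.590 | 0.600 | 0.645 | 0.780 | 0.750 | 0.607 | 0.608 | 0.830 | 0.785 | 0.717 | 0.647
ETM | 0.950 | 0.960 | 0.943 | 0.855 | 0.910 | 0.880 | 0.780 | 0.695 | 0.960 | 0.935 | 0.863 | 0.802
CNTM | 0.990 | 0.530 | 0.417 | 0.355 | 0.950 | 0.910 | 0.870 | 0.785 | 1.000 | 0.890 | 0.787 | 0.703
BERTopic | 0.830 | 0.900 | 0.847 | 0.910 | 0.840 | 0.815 | 0.807 | 0.767 | 0.900 | 0.915 | 0.917 | 0.902
CTMNeg | 0.970 | 0.810 | 0.713 | 0.492 | 0.930 | 0.880 | 0.753 | 0.713 | 0.960 | 0.825 | 0.817 | 0.860
SSTM | 1.000 | 1.000 | 0.973 | 0.978 | 1.000 | 1.000 | 0.967 | 0.922 | 1.000 | 1.000 | 0.990 | 0.958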
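To make the diversity metric in Table 5 concrete, here is a minimal sketch of a unique-term-rate computation. It assumes the metric is the fraction of distinct words among the pooled top words of all topics, which is a common definition of topic diversity and may differ in detail from the one used for the reported numbers. The function name `unique_term_rate` and the example topics are hypothetical.

```python
from typing import List


def unique_term_rate(topics: List[List[str]]) -> float:
    """Fraction of distinct words among the pooled top words of all topics."""
    pooled = [word for topic in topics for word in topic]
    if not pooled:
        raise ValueError("at least one topic word is required")
    return len(set(pooled)) / len(pooled)


# Hypothetical topics with three top words each; "student" and "market" repeat.
topics = [
    ["student", "school", "essay"],
    ["product", "business", "market"],
    ["student", "market", "research"],
]
utr = unique_term_rate(topics)
print(f"{utr:.3f}")
```

A value of 1.000 means that no top word is shared between topics, while lower values indicate redundant topics.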
Table 6. Topic coherence analysis of SSTM with various $lr$, $\alpha$, $S$, and $p$. For clarity, only two baselines are retained: ETM (second best over the four topic number settings according to Table 3) and BAT (second best with the topic number set to 10).
Learning Rate lr
Setting | $9 \times 10^{-6}$ | $1 \times 10^{-5}$ | $2 \times 10^{-5}$ | $3 \times 10^{-5}$ | $4 \times 10^{-5}$ | ETM | BAT
C_P | 0.2920 | 0.2955 | 0.3024 | 0.3081 | 0.3212 | 0.2349 | 0.2509
NPMI | 0.0549 | 0.0615 | 0.0672 | 0.0606 | 0.0600 | 0.0359 | 0.0487
UCI | 0.4597 | 0.5124 | 0.6998 | 0.6009 | 0.5568 | −0.0016 | 0.4192
Dirichlet Prior α
Setting | 0.005 | 0.01 | 0.05 | 0.1 | 0.2 | ETM | BAT
C_P | 0.3236 | 0.3024 | 0.2819 | 0.3105 | 0.2956 | 0.2349 | 0.2509
NPMI | 0.0600 | 0.0672 | 0.0607 | 0.0604 | 0.0548 | 0.0359 | 0.0487
UCI | 0.4603 | 0.6998 | 0.5240 | 0.4507 | 0.4491 | −0.0016 | 0.4192
Hidden Layer S
Setting | 300 | 350 | 400 | 450 | 500 | ETM | BAT
C_P | 0.3190 | 0.2929 | 0.3024 | 0.3360 | 0.3258 | 0.2349 | 0.2509
NPMI | 0.0641 | 0.0640 | 0.0672 | 0.0748 | 0.0642 | 0.0359 | 0.0487
UCI | 0.4730 | 0.5150 | 0.6998 | 0.6624 | 0.5118 | −0.0016 | 0.4192
Augmentation probability p
Setting | 5 | 10 | 15 | 20 | 25 | ETM | BAT
C_P | 0.3181 | 0.3172 | 0.3024 | 0.3161 | 0.3041 | 0.2349 | 0.2509
NPMI | 0.0690 | 0.0716 | 0.0672 | 0.0679 | 0.0677 | 0.0359 | 0.0487
UCI | 0.6574 | 0.6856 | 0.6999 | 0.5489 | 0.6003 | −0.0016 | 0.4192
Table 7. Comparison results of human evaluation; a smaller number indicates better performance.
Model | LDA | WLDA | ATM | BAT | ETM | CNTM | BERTopic | CTMNeg | SSTM
Count | 11.5 | 14 | 16.5 | 13.5 | 19.5 | 10.5 | 18 | 19.5 | 6
Table 8. Concerns mined by SSTM from Twitter Posts, Subreddit and User Query datasets.
Dataset | Summary
Twitter Posts | machine learning, student cheating, search engine, correct solution, music experience, marketing investment, medical information bias, dreams of freedom, surging ChatGPT user, digital marketing content
Subreddit | humanization about ChatGPT, assistant on math, search engine, tech industry cover, plugins of ChatGPT, content creation, engaging replies, academic writing tool, chatbot interaction, freedom and governance
User Query | medical diagnosis, market strategy, response protocol, academic writing tool, research and citation, programming assistant, customer engagement, character development, server application, blog post
Table 9. Comparison of ablation on regularization terms (invariance, variance, and covariance) utilized in SSTM.
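Settings | Subreddit C_P | Subreddit NPMI | Subreddit UCI | Subreddit UTR | User Query C_P | User Query NPMI | User Query UCI | User Query UTR
$SSTM_{i}$ | 0.2493 | 0.0531 | 0.4341 | 1.000 | 0.2947 | 0.0593 | 0.4495 | 1.000
$SSTM_{v}$ | 0.2584 | 0.0555 | 0.4327 | 0.990 | 0.2965 | 0.0620 | 0.4353 | 0.990
$SSTM_{c}$ | 0.2560 | 0.0563 | 0.4874 | 1.000 | 0.2970 | 0.0634 | 0.5021 | 1.000
SSTM | 0.2719 | 0.0575 | 0.5309 | 1.000 | 0.3024 | 0.0672 | 0.6998 | 1.000
ETM | 0.1388 | 0.0277 | −0.1782 | 0.910 | 0.2349 | 0.0359 | −0.0016 | 0.960
BAT | 0.1901 | 0.0286 | 0.0655 | 0.780 | 0.2509 | 0.0487 | 0.4192 | 0.830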
Table 10. Comparison of ablation on document representation approaches.
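Settings | K | C_P | NPMI | UCI | UTR
$SSTM_{P}$ | 10 | 0.2982 | 0.0665 | 0.5488 | 1.000
$SSTM_{P}$ | 20 | 0.2550 | 0.0521 | 0.3661 | 1.000
$SSTM_{P}$ | 30 | 0.2418 | 0.0429 | 0.1936 | 0.970
$SSTM_{P}$ | 40 | 0.2421 | 0.0460 | 0.2335 | 0.950
SSTM | 10 | 0.2719 | 0.0575 | 0.5309 | 1.000
SSTM | 20 | 0.2374 | 0.0463 | 0.3368 | 1.000
SSTM | 30 | 0.2261 | 0.0371 | 0.0410 | 0.967
SSTM | 40 | 0.1933 | 0.0313 | 0.0519 | 0.922
LDA | 10 | 0.2391 | 0.0460 | −2.1031 | 0.920
LDA | 20 | 0.1931 | 0.0389 | −2.1116 | 0.975
LDA | 30 | 0.1580 | 0.0283 | −2.6291 | 0.960
LDA | 40 | 0.1364 | 0.0204 | −2.5751 | 0.887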

MDPI and ACS Style

Wang, R.; Liu, X.; Ren, P.; Chang, S.; Huang, Z.; Huang, H.; Sun, G. What Are the Public’s Concerns About ChatGPT? A Novel Self-Supervised Neural Topic Model Tells You. Mathematics 2025, 13, 183. https://doi.org/10.3390/math13020183
