**1. Introduction**

Due to the rapid development of computing and communication technologies and the widespread use of the internet, people are gradually becoming accustomed to communicating through various online social platforms, such as microblogs, Twitter, webpages, Facebook, etc. These messages over the web and social networking sites contain important information regarding current social situations and trends, people's opinions on different products and services, advertisements, announcements of government policies, etc. An efficient text-processing technique is needed to automatically analyze these huge amounts of messages and extract information from them. In the area of traditional natural language processing, a topic-modeling algorithm is considered an effective technique for the semantic understanding of text documents. Conventional topic models, such as pLSA [1] or LDA [2] and their various variants, are quite effective at extracting latent semantic structures from a text corpus without prior annotations and are widely used in emerging topic detection, document classification, comment summarizing, and event tracking. In these models, documents are viewed as a mixture of topics, while each topic is viewed as a particular distribution over all the words. Statistical tools are used to determine the latent topic distribution of each document, while higher-order word co-occurrence patterns are used to characterize each topic [3]. The efficient capture of document-level word co-occurrence patterns leads to the success of topic modeling.

The messages posted on various social network sites are generally short compared to relatively formal documents such as newspapers or scientific articles. The main characteristics of these short texts are: (1) a limited number of words in one document, (2) the use of new and informal words, (3) meanings and usages of words that may change greatly depending on the posting, (4) spam posts, and (5) the restricted length of posts, such as API restrictions on Twitter. The direct application of traditional topic models to short-text analysis results in poor performance due to the lack of word co-occurrence information in each short text document, originating from the above characteristics of short texts [4]. Earlier research on topic-modeling for short texts with traditional topic models used external, large-scale datasets such as Wikipedia, or related long-text datasets, for a better estimation of word co-occurrences across short texts [5,6]. However, these methods work well only when the external dataset closely matches the original short-text data.

To cope with the problems of short-text topic-modeling by traditional topic models, three main categories of algorithms are found in the literature [7]. A simple solution is to aggregate a number of short texts into a long pseudo-document before training a standard topic model, in order to improve word co-occurrence information. In [8], the tweets of an individual user are aggregated into one document. In [9,10], a short text is viewed as sampled from unobserved, long pseudo-documents, and topics are inferred from them. However, the performance of these methods depends on efficient aggregation and data type. When short texts of different semantic contents are aggregated into a long document, non-semantic word co-occurrence information can produce incoherent topics. In the second category, each short text document is assumed to consist of only a single topic. Based on this assumption, Dirichlet Multinomial Mixture (DMM) model-based topic-modeling methods have been developed for short texts in [11–13]. Although this simple assumption alleviates the data-sparsity problem to some extent, these methods fail to capture multiple topic elements in a document, which makes the model prone to over-fitting. Moreover, "shortness" is subjective and data-dependent; a single-topic assumption might be too strong for some datasets. A Poisson-based DMM model (PDMM) [14] considers a small number of topics associated with each short text instead of only one. The third category of algorithms considers global word co-occurrence patterns for inferring latent topics. According to the usage, two types of models have been developed. In [15], global word co-occurrence is directly used, while in [16], a word co-occurrence network is first constructed using global word co-occurrence, and then latent topics are inferred from this network. In the present work, we explored this category of methods for further improvement in the development of algorithms for extracting interpretable topics from short texts.

Another limitation of the above models for short-text analysis is that context or background information is not used, resulting in the generation of not-so-coherent topics. The statistical information of words in the text document cannot fully capture words that are semantically correlated but rarely co-occur. Recent advances in word embedding [17] provide an effective way of learning semantic word relations from a large corpus, which can help to develop models for generating more interpretable and coherent topics. Word embedding maps the one-hot representation of a word (a vocabulary-length vector of zeroes with a single one) to a lower-dimensional vector space in which semantically similar words lie close to each other. An embedded topic model (ETM), a combination of LDA and word embedding that enjoys the advantages of both the topic model and word embedding, has been proposed in [18]. Traditional topic models with word embedding for documents are explored in several other research works, cited in [19]. In [20], word embedding is combined with LDA to accelerate the inference process, resulting in the enhanced interpretability of topics. For short texts, models incorporating word embedding into DMM are proposed in [21,22]. In [23,24], short texts are merged into long pseudo-documents using word embedding. Word embedding in conjunction with conventional topic models seems to be a better technique for generating coherent topics.

The increasing complexity of inference processes in conventional topic models on large-text data, along with the recent developments of deep neural networks, has led to the emergence of neural-topic models (NTM). These models bring efficacy, scalability, and the ease of leveraging parallel computing facilities, such as GPUs, to probabilistic topic-modeling [25]. Neural-topic models are considered computationally simpler and easier to implement than traditional LDA models, and they are increasingly used in various natural language processing tasks in which conventional topic models are difficult to use. A systematic study on the performances of several neural-topic models has been reported in [26]. Although various neural-topic models have been proposed, and although the reported experimental results on topic generation seem to be better than those of conventional topic models for long and formal texts, little research has been conducted on neural-topic models for the effective analysis of short texts [27]. Most of the research on topic-modeling for short texts is based on extensions of Bayesian probabilistic topic models (BPTM), such as LDA.

The objective of the present research is to explore computationally easy and efficient techniques for improving the interpretability of generated topics from real-world short texts using neural-topic models. Learning context information is the most challenging issue of topic-modeling for short texts, and incorporating pretrained word embedding into a topic model seems to be one of the most efficient ways of explicitly enriching the content information. However, neural-topic models with pretrained word embedding for short-text analysis have not been extensively explored yet, compared to their long-text counterparts. In [28], we presented our preliminary analysis of short-text data (benchmark and real-world) with neural-topic models using pretrained word embedding. We found that although pretrained word embedding enhances the topic coherence of short texts that are similar to long and formal texts, the generated topics were often composed of words having common meanings (which are found in the large external corpus used for pretraining) instead of the particular short-text-specific semantics of the words, which is especially important for real-world datasets. In other words, the learning of topic centroid vectors is influenced by the pretraining text corpus and fails to discover the important words of the particular short text. Our proposal is that this gap can be filled by adding a fine-tuning stage to the training of the topic model with the particular short-text corpus to be analyzed. In this work, we have completed an extensive study to investigate the performance of recent neural-topic models with and without word embedding, and also with the proposed fine-tuning stage, for generating interpretable topics from short texts, in terms of a number of performance metrics, by simulation experiments on several datasets. We have also studied the performance of NTMs with pretrained word embedding and the added fine-tuning stage on classification and clustering tasks. As a result of our experiments, we can confirm that the addition of a fine-tuning stage indeed enhances the topic quality of short texts in general, and generates topics with corpus-specific semantics.

In summary, our contributions in this paper are as follows:


• A proposal for an additional fine-tuning stage, using the target short-text corpus, for neural-topic models with pretrained word embedding.

• An extensive comparative study of recent neural-topic models with and without pretrained word embedding, and with the proposed fine-tuning stage, on several short-text datasets, in terms of a number of topic-quality metrics.

• A performance evaluation of the proposed fine-tuned neural-topic models for classification and clustering tasks.

In the next section, neural-topic models are introduced in brief, followed by a short description of related works on neural-topic models (NTM), especially NTMs for short texts. The following sections contain our proposal, followed by simulation experiments and results. The final section presents the conclusion.

### **2. Neural-Topic Models and Related Works**

The most popular neural-topic models (NTMs) are based on the variational autoencoder (VAE) [29], a deep generative model, and amortized variational inference (AVI) [30]. The basic framework of VAE-based NTMs is described in the next section, in which the generative and inference processes are modeled by a neural network-based decoder and encoder, respectively. Compared to traditional Bayesian probabilistic topic models (BPTM), inference in neural-topic models is computationally simpler, their implementation is easier due to the many existing deep learning frameworks, and NTMs are easily integrated with pretrained word embeddings for prior-knowledge acquisition. Several categories of VAE-based NTMs have been proposed. To name a few, there is the Neural Variational Document Model (NVDM) [31], Neural Variational Latent Dirichlet Allocation (NVLDA) [32], the Dirichlet Variational Autoencoder topic model (DVAE) [33], the Dirichlet Variational Autoencoder (DirVAE) [34], the Gaussian Softmax Model (GSM) [35], and iTM-VAE [36]. This list is not exhaustive and is still growing.

In addition to VAE-based NTMs, there are a few other frameworks for NTMs. In [37], an autoregressive NTM, named DocNADE, has been proposed, and some extensions of DocNADE are subsequently found in the literature. Recently, some attempts have been made to use a GAN (Generative Adversarial Network) framework for topic-modeling [38,39]. Instead of considering a document as a sequence or a bag of words, a graph representation of a corpus of documents can be considered. In [40], a bipartite graph is used, with documents and words as two separate partitions, connected by edges weighted by word occurrences in documents. Ref. [41] uses the framework of Wasserstein autoencoders (WAEs), which minimizes the Wasserstein distance between the documents reconstructed by the decoder and the real documents, similar to a VAE-based NTM. In [42], an NTM based on optimal transport, which directly minimizes the optimal transport distance between the topic distribution learned by an encoder and the word distribution of a document, has been introduced.

### *Neural-Topic Models for Short-Text Analysis*

In order to generate coherent, meaningful, and interpretable topics from short texts by incorporating semantic and contextual information, a few researchers have used NTMs in lieu of conventional topic models. In [43,44], a combination of an NTM and either a recurrent neural network (RNN) or a memory network has been used, in which the topics learned by the NTM are utilized for classification by the RNN or memory network. In both works, the NTM shows better performance than conventional topic models in terms of topic coherence and a classification task. To enhance the distinctness of the multiple topic distributions in a short text, the authors in [27] used Archimedean copulas. In [45], the authors introduced a new NTM with a topic-distribution quantization approach, producing peakier distributions, and also proposed a negative sampling decoder that learns to minimize repetitive topics. As a result, the proposed model outperforms conventional topic models. In [46], the authors aggregated short texts into long documents and incorporated document embedding to provide word co-occurrence information. In [47], a variational autoencoder topic model (VAETM) and a supervised version of it (SVAETM) have been proposed, combining embedded representations of words and entities by employing an external corpus. To enhance contextual information, the authors in [48] proposed a graph neural network as the encoder of the NTM, which accepts a bi-term graph of the words as input and produces the topic distribution of the corpus as the output. Ref. [49] proposed a context-reinforced neural-topic model with the assumption of a few salient topics for each short text, informing the word distributions of the topics using pretrained word embedding.

#### **3. Proposal for Fine-Tuning of Neural-Topic Models for Short-Text Analysis**

From the analysis of current research on neural-topic models for short-text analysis, it seems that incorporating auxiliary information from an external corpus is one of the most popular and effective techniques for dealing with sparsity in short texts. As mentioned in the introduction, in our previous work [28], we found that although pretrained word embedding with a large external corpus helps with generating coherent topics from a short-text corpus, the generated topics lack the semantics expressed by the corpus-specific meaning of words. If the domains of the short-text corpus and the external corpus differ too much, the topic semantics become poor. This fact has also been noted by other researchers [25].

In this work, we propose an additional fine-tuning stage, using the original short-text corpus, along with the pretrained word embedding and a large external corpus. For pretrained word embedding, we decided to use GloVe [50] after some preliminary experiments with two other techniques, namely, Word2Vec and fastText, as GloVe provided consistent results. Here, we have completed an extensive comparative study to evaluate the effect of pretrained embedding with our proposed additional fine-tuning stage, using several short-text corpora and neural-topic models. The proposed study setting is represented in Figure 1.
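As an illustration of the pretraining setup, a GloVe vector file can be aligned to the corpus vocabulary as in the following sketch; the file name and the helper name are hypothetical, and words missing from GloVe keep a small random initialization:

```python
import numpy as np

def load_glove(path, vocab, dim=300):
    """Build a (len(vocab) x dim) embedding matrix from a GloVe text file,
    whose lines contain a token followed by dim float values."""
    emb = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    index = {word: i for i, word in enumerate(vocab)}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in index and len(parts) == dim + 1:
                emb[index[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
    return emb

# e.g., emb = load_glove("glove.6B.300d.txt", vocab)  # vocab: list of corpus words
```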

**Figure 1.** Proposed Study.

Here, pretrained word embedding is denoted as PWE. We have performed three sets of experiments for topic extraction, using only neural-topic models (NTM), neural-topic models with pretrained word embedding (NTM-PWE), and neural-topic models with pretrained word embedding and the proposed fine-tuning step (NTM-PWE/fine-tuning). In all cases, the data corpus is first pre-processed. In NTM-PWE, the word embedding vectors are replaced by the PWE after the model parameters are initialized, and the weights of the PWE are not updated during the training step. In our proposed NTM-PWE/fine-tuning (or simply fine-tuning, as referred to in the text), the weights are gradually updated in the training step after the word embedding vectors are replaced, as in NTM-PWE. In this case, it is possible to update these parameters at the same learning rate that is set for the entire model, but experiments have shown that updating the PWE values at a large learning rate can easily over-fit the training data. Therefore, in the simulation experiments, we have set the learning rate of the word embedding vectors to a smaller value than the learning rate of the entire model, as illustrated in the sketch below.
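The three settings can be made concrete with a minimal PyTorch sketch; the component names, sizes, and learning rates below are illustrative assumptions rather than the exact experimental configuration:

```python
import torch
from torch import nn

vocab_size, embed_dim, num_topics = 5000, 300, 50

# Illustrative NTM components: an embedding layer plus encoder/decoder.
word_embedding = nn.Embedding(vocab_size, embed_dim)
encoder = nn.Sequential(nn.Linear(vocab_size, 256), nn.Softplus(),
                        nn.Linear(256, num_topics))
decoder = nn.Linear(num_topics, vocab_size)

# After initialization, replace the embedding weights with the PWE
# (GloVe vectors aligned to the vocabulary; a random placeholder here).
glove_vectors = torch.randn(vocab_size, embed_dim)
word_embedding.weight.data.copy_(glove_vectors)

# NTM-PWE: the PWE weights are not updated during training.
word_embedding.weight.requires_grad = False

# NTM-PWE/fine-tuning: the PWE weights are updated, but with a smaller
# learning rate than the rest of the model to avoid over-fitting.
word_embedding.weight.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": word_embedding.parameters(), "lr": 1e-4},  # embeddings: small lr
    {"params": list(encoder.parameters()) + list(decoder.parameters()),
     "lr": 1e-3},                                         # rest of the model
])
```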

We have used popular VAE-based neural-topic models with a few similar WAE (Wasserstein autoencoder)-based models, and ten popular benchmark datasets, for our simulation experiments. The performance of each neural-topic model with no word embedding, pretrained word embedding, and additional fine-tuning has been evaluated by the generated topic quality using different evaluation metrics of topic coherence and topic diversity. The neural-topic models, datasets, and evaluation metrics used in this study are described below.

### *3.1. Neural-Topic Models for Evaluation*

In this section, the neural-topic models used in this study are described briefly. Table 1 describes the meaning of the notations used for description of the models.



Figures 2 and 3 describe the generalized architectures of the variational autoencoder (VAE)- and Wasserstein autoencoder (WAE)-based neural-topic models, respectively. In both models, the part of the network that generates *θ* is known as the encoder, which maps the input bag-of-words (BoW) to a latent document–topic vector, and the part that receives *θ* and outputs *p*(*x*) is called the decoder, which maps the document–topic vector to a discrete distribution over the words in the vocabulary. They are called autoencoders because the decoder aims to reconstruct the word distribution of the input. In VAE, *h* is sampled from a Gaussian distribution, and *θ* is created by applying some transformation to it. WAE, on the other hand, uses the Softmax function directly to create *θ*, so no sampling is required. The evidence lower bound (ELBO), the objective function of VAE, is defined below [29]:

$$\begin{split} \mathcal{L}\_d &= \mathbb{E}\_{q(\boldsymbol{\theta} \mid \mathbf{x}\_d)} \left[ \sum\_{n=1}^{N\_d} \log \sum\_{z\_n} p(w\_n \mid \boldsymbol{\beta}\_{z\_n}) \, p(z\_n \mid \boldsymbol{\theta}) \right] \\ &- D\_{KL} \Big[ q(\boldsymbol{\theta} \mid \mathbf{x}\_d) \,\|\, p\big(\boldsymbol{\theta} \mid \boldsymbol{\mu}\_0, \boldsymbol{\sigma}\_0^2\big) \Big] \end{split} \tag{1}$$
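For concreteness, the following is a minimal PyTorch sketch of a VAE-based NTM and the ELBO of Equation (1) under a standard Gaussian prior (*µ*0 = 0, *σ*0 2 = 1); the hidden size, activations, and class names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from torch import nn

class VAENTM(nn.Module):
    """Minimal VAE-based NTM sketch (illustrative names and sizes)."""
    def __init__(self, vocab_size, num_topics, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.fc_mu = nn.Linear(hidden, num_topics)
        self.fc_logvar = nn.Linear(hidden, num_topics)
        self.beta = nn.Linear(num_topics, vocab_size, bias=False)  # topic-word weights

    def forward(self, bow):
        # Encoder: BoW -> Gaussian parameters -> sampled h -> theta
        e = self.body(bow)
        mu, logvar = self.fc_mu(e), self.fc_logvar(e)
        h = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        theta = torch.softmax(h, dim=-1)          # document-topic proportions
        # Decoder: theta -> distribution over the vocabulary (the sum over
        # topics is collapsed into a single softmax, a ProdLDA-style shortcut)
        log_px = F.log_softmax(self.beta(theta), dim=-1)
        return log_px, mu, logvar

def neg_elbo(bow, log_px, mu, logvar):
    recon = -(bow * log_px).sum(dim=-1)           # expected log-likelihood term
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)  # KL to N(0, I)
    return (recon + kl).mean()                    # minimizing this maximizes the ELBO
```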

**Figure 2.** VAE-based model.

**Figure 3.** WAE-based model.

It is empirically known that maximizing this ELBO alone will result in smaller (worse) topic diversity. In order to solve this problem, some NTMs use a regularization term to increase the topic diversity [51]:

$$\begin{aligned} a(\boldsymbol{a}\_{i}, \boldsymbol{a}\_{j}) &= \arccos\left(\frac{|\boldsymbol{a}\_{i} \cdot \boldsymbol{a}\_{j}|}{||\boldsymbol{a}\_{i}|| \cdot ||\boldsymbol{a}\_{j}||}\right) \\ \zeta &= \frac{1}{K^{2}} \sum\_{i}^{K} \sum\_{j}^{K} a\left(\boldsymbol{a}\_{i}, \boldsymbol{a}\_{j}\right) \\ \nu &= \frac{1}{K^{2}} \sum\_{i}^{K} \sum\_{j}^{K} \left(a\left(\boldsymbol{a}\_{i}, \boldsymbol{a}\_{j}\right) - \boldsymbol{\zeta}\right)^{2} \\ \mathcal{J} &= \mathcal{L} + \lambda\left(\boldsymbol{\zeta} - \nu\right) \end{aligned} \tag{2}$$

where *a*(*ai*, *aj*) is the angle between the topic vectors *ai* and *aj*, *ζ* is the mean of these angles, *ν* is their variance, and *λ* is a hyper-parameter that controls the influence of the regularization term; a value of 10, determined empirically, was adopted here. The VAE-based models in this paper use this regularization term.
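A sketch of this regularization term, following Equation (2), is given below; since *J* = *L* + *λ*(*ζ* − *ν*) is maximized, *λ*(*ζ* − *ν*) should be subtracted when a loss is minimized:

```python
import torch

def diversity_regularizer(topic_vectors, eps=1e-7):
    """Topic-diversity term of Equation (2): returns zeta - nu.

    topic_vectors: (K, V) matrix whose rows a_i are the topic vectors.
    """
    a = topic_vectors
    norms = a.norm(dim=-1, keepdim=True).clamp_min(eps)
    cos = (a @ a.T).abs() / (norms @ norms.T)        # |a_i . a_j| / (||a_i|| ||a_j||)
    angles = torch.arccos(cos.clamp(max=1.0 - eps))  # pairwise angles a(a_i, a_j)
    zeta = angles.mean()                             # mean angle
    nu = ((angles - zeta) ** 2).mean()               # variance of the angles
    return zeta - nu

# When minimizing a loss:
#   loss = neg_elbo - lam * diversity_regularizer(topic_vectors)
# with lam = 10, as adopted in this paper.
```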

The particular NTMs used in our study are mentioned in the next subsections.

### 3.1.1. Neural Variational Document Model (NVDM)

NVDM [31] is, to our knowledge, the first VAE-based document model proposed with the encoder implemented by a multilayer perceptron. This model uses the sample *h* from the Gaussian distribution as an input for the decoder, and variational inference is based on minimizing KL divergence. While most of the NTMs proposed after this one transform *h* to treat *θ* as a topic proportion vector, NVDM is a general VAE.

### 3.1.2. Neural Variational Latent Dirichlet Allocation (NVLDA)

NVLDA [32], another variant of NVDM, is a model that uses neural variational inference to reproduce LDA. Here, the Softmax function is used to convert *h* to *θ*. The probability distribution obtained by mapping samples from a Gaussian distribution onto the Softmax basis is called the Logistic–Normal distribution, which is used as a surrogate for the Dirichlet distribution. Additionally, the decoder is *p*(*x*) = *softmax*(*β*) · *θ*. Unlike in NVDM, both the topic proportions and the topic–word distribution here take the form of probability distributions, which makes this model a proper topic model. The Logistic–Normal distribution is defined as follows:

$$h \sim \text{Normal}(\mu, \sigma^2) \tag{3}$$

$$\theta = \text{softmax}(h) \tag{4}$$

### 3.1.3. Product-of-Experts Latent Dirichlet Allocation (ProdLDA)

ProdLDA [32] is an extension of NVLDA in which the decoder is designed following the product-of-experts model, and the topic–word distributions are not normalized.

### 3.1.4. Gaussian Softmax Model (GSM)

GSM [35] converts *h* to *θ* using Gaussian Softmax, as defined below:

$$h \sim \text{Normal}(\mu, \sigma^2) \tag{5}$$

$$\theta = \text{softmax}(W\_1^T h) \tag{6}$$

where *W*<sub>1</sub> ∈ R<sup>*K*×*K*</sup> is a trainable linear transformation used as the connection weights.

### 3.1.5. Gaussian Stick-Breaking Model (GSB)

GSB [35] converts *h* to *θ* by Gaussian Stick-Breaking construction, which is defined as follows:

$$h \sim \text{Normal}(\mu, \sigma^2) \tag{7}$$

$$\eta = \text{sigmoid}(W\_2^T h) \tag{8}$$

$$\theta = f\_{SB}(\eta) \tag{9}$$

where *W*<sub>2</sub> ∈ R<sup>*K*×(*K*−1)</sup> is a trainable parameter matrix used as the connection weights, and the stick-breaking function *f<sub>SB</sub>* is described by Algorithm 1:

### **Algorithm 1** Stick-Breaking Process (*fSB*)

**Input:** return value from the sigmoid function, *η* ∈ R<sup>*K*−1</sup>, where *η<sub>k</sub>* ∈ [0, 1] for all *k*.
**Output:** topic proportion vector *θ* ∈ R<sup>*K*</sup>, where ∑<sub>*k*</sub> *θ<sub>k</sub>* = 1.
1: Assign *η*<sub>1</sub> to the first element of the topic proportion vector: *θ*<sub>1</sub> = *η*<sub>1</sub>
2: **for** *k* = 2, . . . , *K* − 1 **do** *θ<sub>k</sub>* = *η<sub>k</sub>* ∏<sub>*i*=1</sub><sup>*k*−1</sup> (1 − *η<sub>i</sub>*)
3: **end for**
4: *θ<sub>K</sub>* = ∏<sub>*i*=1</sub><sup>*K*−1</sup> (1 − *η<sub>i</sub>*)
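A vectorized sketch of *f<sub>SB</sub>* (assuming PyTorch tensors) is shown below; the telescoping products guarantee that *θ* sums to one:

```python
import torch

def stick_breaking(eta):
    """f_SB of Algorithm 1. eta: (..., K-1) sigmoid outputs in [0, 1];
    returns theta: (..., K) with non-negative entries summing to 1."""
    remaining = torch.cumprod(1.0 - eta, dim=-1)    # prod_{i<=k} (1 - eta_i)
    head = torch.cat([eta[..., :1],                 # theta_1 = eta_1
                      eta[..., 1:] * remaining[..., :-1]], dim=-1)
    return torch.cat([head, remaining[..., -1:]], dim=-1)  # theta_K = leftover stick

eta = torch.sigmoid(torch.randn(4))                 # K - 1 = 4 breaks -> K = 5 topics
theta = stick_breaking(eta)
assert torch.isclose(theta.sum(), torch.tensor(1.0))
```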

3.1.6. Recurrent Stick-Breaking Model (RSB)

RSB [35] converts *h* to *θ* by recurrent Stick-Breaking construction, as defined below. Here, the stick-breaking construction is considered as a sequential draw from a recurrent neural network (RNN).

$$h \sim \text{Normal}(\mu, \sigma^2) \tag{10}$$

$$\eta = \mathbf{f}\_{\text{RNN}}(h) \tag{11}$$

$$\theta = \mathbf{f}\_{\text{SB}}(\eta) \tag{12}$$

where *fRNN* is decomposed as:

$$h\_k = \text{RNN}\_{\text{SB}}(h\_{k-1}) \tag{13}$$

$$\eta\_k = \text{sigmoid}(h\_{k-1}^T z) \tag{14}$$

and *fSB*(*η*) is the same as in GSB.
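One possible realization of *f<sub>RNN</sub>* is sketched below with a GRU cell; the exact recurrence and projection used in [35] differ in detail, so this is only an illustrative assumption:

```python
import torch
from torch import nn

class RecurrentStickBreaking(nn.Module):
    """Sketch of f_RNN: produce the breaking fractions eta_k sequentially."""
    def __init__(self, hidden_dim, max_topics):
        super().__init__()
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.z = nn.Parameter(torch.randn(hidden_dim))  # projection vector
        self.max_topics = max_topics

    def forward(self, h):
        # h: (batch, hidden_dim), the Gaussian sample from the encoder
        etas = []
        for _ in range(self.max_topics - 1):
            etas.append(torch.sigmoid(h @ self.z))      # eta_k = sigmoid(h^T z)
            h = self.rnn(torch.zeros_like(h), h)        # h_k = RNN_SB(h_{k-1})
        return torch.stack(etas, dim=-1)                # (batch, K-1), fed to f_SB
```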

### 3.1.7. Wasserstein Latent Dirichlet Allocation (WLDA)

WLDA [41] is a topic model based on a Wasserstein autoencoder (WAE). Though various probability distributions can be used as the prior distribution of *θ*, in this paper, we use the Dirichlet distribution, which we believe is the most basic. In WAE, two training methods are available, GAN (Generative Adversarial Network)-based training and MMD (Maximum Mean Discrepancy)-based training; in WLDA, MMD is used because the training loss converges more easily. In VAE, the loss function is composed of the KL divergence used as the regularization term for *θ* and the reconstruction error, while in WLDA, MMD is used as the regularization term.

If *P*Θ is the prior distribution of *θ*, and *Q*Θ is the distribution of the fake (encoded) samples, the maximum mean discrepancy (MMD) is defined as:

$$\text{MMD}\_{\mathbf{k}}(Q\_{\Theta}, P\_{\Theta}) = \left\| \int\_{\Theta} \mathbf{k}(\theta, \cdot) dP\_{\Theta}(\theta) - \int\_{\Theta} \mathbf{k}(\theta, \cdot) dQ\_{\Theta}(\theta) \right\|\_{\mathcal{H}\_{\mathbf{k}}} \tag{15}$$

where H<sub>*k*</sub> denotes the reproducing kernel Hilbert space (RKHS) of real-valued functions mapping Θ to R, and *k* is the kernel function, defined as:

$$\mathbf{k}\left(\theta,\theta'\right) = \exp\left(-\arccos^2\left(\sum\_{k=1}^{K} \sqrt{\theta\_k \theta\_k'}\right)\right) \tag{16}$$
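A simple biased, sample-based estimate of this MMD regularizer might look as follows; the Dirichlet prior samples and the softmax outputs standing in for the encoder are illustrative:

```python
import torch

def diffusion_kernel(x, y, eps=1e-7):
    """k(theta, theta') = exp(-arccos^2(sum_k sqrt(theta_k * theta'_k))).
    x: (n, K), y: (m, K) batches on the simplex -> (n, m) kernel matrix."""
    s = (torch.sqrt(x).unsqueeze(1) * torch.sqrt(y).unsqueeze(0)).sum(dim=-1)
    return torch.exp(-torch.arccos(s.clamp(max=1.0 - eps)) ** 2)

def mmd(theta_fake, theta_prior):
    k_pp = diffusion_kernel(theta_prior, theta_prior).mean()
    k_qq = diffusion_kernel(theta_fake, theta_fake).mean()
    k_pq = diffusion_kernel(theta_prior, theta_fake).mean()
    return k_pp + k_qq - 2.0 * k_pq

# Usage sketch: prior draws from a Dirichlet vs. encoder outputs.
prior = torch.distributions.Dirichlet(torch.full((50,), 0.1)).sample((64,))
fake = torch.softmax(torch.randn(64, 50), dim=-1)   # stand-in for the encoder
regularizer = mmd(fake, prior)
```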

### 3.1.8. Neural Sinkhorn Topic Model (NSTM)

NSTM [52] is trained using optimal transport [42], as is WLDA. Since *θ* is assumed to encode *x* into a low-dimensional latent space while preserving sufficient information about *x*, the optimal transport distance between *θ* and *x* is calculated by the Sinkhorn algorithm. The sum of this optimal transport distance and the negative log-likelihood is used as the loss function.
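To illustrate, below is a minimal sketch of the Sinkhorn iterations for the entropic-regularized optimal transport distance between *θ* and the word distribution of a document; the cost matrix here is a random stand-in (NSTM derives it from topic and word embeddings), and the regularization weight and iteration count are illustrative:

```python
import torch

def sinkhorn_distance(a, b, M, eps=0.05, n_iters=100):
    """Entropic-regularized OT distance between distributions a (n,) and
    b (m,) under cost matrix M (n, m), via Sinkhorn's scaling iterations."""
    K = torch.exp(-M / eps)                     # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):                    # alternate the scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # approximate transport plan
    return (plan * M).sum()

theta = torch.softmax(torch.randn(50), dim=0)   # document-topic proportions
x = torch.rand(2000); x = x / x.sum()           # word distribution of the document
M = torch.rand(50, 2000)                        # stand-in cost matrix
dist = sinkhorn_distance(theta, x, M)
```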
