### *3.2. Datasets*

Table 2 presents the details of the benchmark datasets used in this work. The first column gives the name of the dataset, followed by the number of documents (|*D*|), vocabulary size (|*V*|), total number of tokens (∑ *X*), average document length (ave dL), maximum document length (max dL), sparsity, number of classes (C), and the source of the data in the respective columns. In the source column, 1 and 2 represent OCTIS and STTM, respectively.

1. OCTIS: https://aclanthology.org/2021.eacl-demos.31/ (accessed on 14 January 2022). 2. STTM: https://arxiv.org/pdf/1701.00185.pdf (accessed on 14 January 2022).

The first two datasets fall into the category of long documents, while the remaining eight can be considered short-text corpora, as their average document length is quite short compared to that of the long documents.

**Table 2.** Details of the datasets.


The datasets shown in the table are pre-processed. HTML tags and other symbols were removed from each dataset, and all words were lowercased. Stopwords were then removed, and the remaining words were lemmatized. From each dataset, 80% of the documents were used as training data and the rest as test data. These pre-processed corpora were then converted into Bag-of-Words (BoW) representations, whose elements are word frequencies, to be used as input data for the NTMs. However, for the NSTM, the vector corresponding to each document in the BoW is divided by its sum, i.e., converted into a relative-frequency vector, as in the original paper.
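The pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the regular expressions, the stopword list, and the lemmatizer (omitted here; e.g., NLTK's WordNetLemmatizer could be applied per token) are all assumptions.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def preprocess(doc):
    # Strip HTML tags and non-alphabetic symbols, then lowercase.
    doc = re.sub(r"<[^>]+>", " ", doc)
    doc = re.sub(r"[^A-Za-z\s]", " ", doc)
    return doc.lower()

docs = ["<p>Neural topic MODELS learn topics!</p>",
        "Topic models on short texts are sparse."]

# Bag-of-Words with English stopword removal; each element of `bow`
# is the frequency of a vocabulary word in a document.
vectorizer = CountVectorizer(preprocessor=preprocess, stop_words="english")
bow = vectorizer.fit_transform(docs).toarray().astype(float)

# For the NSTM, each document vector is divided by its sum, giving a
# relative-frequency vector.
row_sums = bow.sum(axis=1, keepdims=True)
bow_normalized = bow / np.clip(row_sums, 1.0, None)
```

Each row of `bow_normalized` then sums to one and can be fed to the model as a (normalized) document representation.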

### *3.3. Evaluation of Topic Quality*

It is quite challenging to evaluate the performance of topic models, including NTMs, in terms of the quality of the generated topics. Topics generated by topic models can be viewed as soft clusters of words. Under the constraints of a topic model, each topic is a probability distribution over words that captures the probability of generating each word under that topic; the same holds for NTMs, although for document models that impose even weaker constraints than classical topic models, the output may not take the form of a probability distribution. Either way, a topic here is a topic–word distribution, and each distribution has as many dimensions as the number of lexemes occurring in the corpus. It is very difficult to judge the goodness of a topic by directly comparing such distributions with human-defined topics. Therefore, in practice, analysts inspect a list of N words characteristic of each topic based on the values of the word distribution. In most cases, the list of top-N words with the largest probability values in the word distribution is used.

Various metrics have been proposed to evaluate the quality of the top-N words, along two main directions. One is to check whether the meanings of the words in the top-N list are mutually consistent, which is defined as topic coherence (TC). The other is to measure the diversity of the top-N words across pairs of topics, defined as topic diversity (TD) or topic uniqueness. Topics with high TC may have low TD; in that case, the top-N words of most topics will be nearly identical, which is undesirable. Thus, for human-like interpretability, a good set of topics should have both high TC and high TD.
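One common formulation of TD (the proportion of unique words among the top-N words of all topics, so that a value of 1 means no topic shares a top word with another) can be sketched as below; the exact variant used in a given evaluation may differ, and `topic_diversity` is a hypothetical helper introduced for illustration.

```python
def topic_diversity(topics, n=10):
    """Fraction of unique words among the top-n words of all topics.

    `topics` is a list of top-word lists, one per topic; a TD close
    to 1 means topics rarely share top words.
    """
    top_words = [w for topic in topics for w in topic[:n]]
    return len(set(top_words)) / len(top_words)

topics = [["game", "team", "score", "play"],
          ["market", "stock", "price", "trade"],
          ["game", "player", "score", "win"]]
td = topic_diversity(topics, n=4)  # 10 unique words among 12
```

Here the first and third topics share "game" and "score", which lowers TD even though each topic may be individually coherent.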

### 3.3.1. Topic Coherence (TC)

To compute TC, the coherence of a set of words is estimated from word co-occurrence counts in a reference corpus [53]. The reference can be (1) the training corpus used for topic modeling; (2) a large external corpus (e.g., Wikipedia); or (3) word-embedding vectors trained on a large external corpus (e.g., Wikipedia). The scores may differ depending on the choice. Choice 1 is easy, but the results are affected by the size of the training corpus. Choices 2 and 3 are more popular, although choice 2 is computationally costly. However, if the domain gap between the training corpus and the external corpus is large, the evaluation may be unreliable. In this work, we used the following metrics to compute topic coherence:

• Normalized Point-Wise Mutual Information (NPMI) [54]: NPMI is a measure of the semantic coherence of a group of words. It is considered to have the highest correlation with human ratings, and is defined by the following equation:

$$NPMI(w) = \frac{1}{N(N-1)} \sum\_{j=2}^{N} \sum\_{i=1}^{j-1} \frac{\log \frac{P(w\_i, w\_j)}{P(w\_i)P(w\_j)}}{-\log P(w\_i, w\_j)} \tag{17}$$

where *w* is the list of the top-N words for a topic, and N is usually set to 10. For K topics, the average of NPMI over all topics is used for evaluation;

• Word Embeddings Topic Coherence (WETC) [55]: WETC represents word embedding-based topic coherence, and the pair-wise WETC for a particular topic is defined as:

$$\text{WETC}\_{\text{PW}}(E^{(k)}) = \frac{1}{N(N-1)} \sum\_{j=2}^{N} \sum\_{i=1}^{j-1} \left\langle E\_{i,:}^{(k)}, E\_{j,:}^{(k)} \right\rangle \tag{18}$$

where ⟨·, ·⟩ denotes the inner product. For the calculation of the WETC score, pretrained GloVe [50] weights were used; *E*(*k*) is the sequence of GloVe embedding vectors corresponding to the top-N words of topic *k*, with row *E*(*k*)*i*,: being the embedding of the *i*-th word. All vectors are normalized so that ||*E*(*k*)*i*,:|| = 1, and *N* is taken as 10. WETC*c* (centroid WETC) is defined as follows:

$$\text{WETC}\_{\mathsf{c}}(E^{(k)}) = \frac{1}{N} \sum\_{n=1}^{N} \left\langle E\_{n,:}^{(k)}, t \right\rangle, \qquad t = \frac{\sum\_{n=1}^{N} E\_{n,:}^{(k)}}{\left|\left| \sum\_{n=1}^{N} E\_{n,:}^{(k)} \right|\right|} \tag{19}$$
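The two coherence metrics above can be sketched as follows, assuming document-level co-occurrence counts for NPMI and L2-normalized embedding rows for WETC. This is a minimal illustration: evaluation toolkits (e.g., OCTIS or Palmetto) may use sliding windows or smoothing, so the exact scores can differ from the paper's.

```python
import math
import numpy as np
from itertools import combinations

def npmi(top_words, doc_sets, eps=1e-12):
    """Average NPMI over word pairs (Equation 17), with probabilities
    estimated from document-level co-occurrence counts; `doc_sets` is
    a list of reference documents, each a set of tokens."""
    d = len(doc_sets)
    prob = lambda *ws: sum(all(w in s for w in ws) for s in doc_sets) / d
    scores = []
    for w_i, w_j in combinations(top_words, 2):
        p_ij = prob(w_i, w_j)
        if p_ij == 0.0:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
            continue
        pmi = math.log(p_ij / (prob(w_i) * prob(w_j) + eps))
        scores.append(pmi / (-math.log(p_ij) + eps))
    return sum(scores) / len(scores)

def wetc_pw(E):
    """Pair-wise WETC, following Equation (18) literally. Note the
    1/(N(N-1)) factor over unordered pairs means identical vectors
    score 0.5; some formulations use 2/(N(N-1)) instead."""
    n = E.shape[0]
    total = sum(float(E[i] @ E[j]) for j in range(1, n) for i in range(j))
    return total / (n * (n - 1))

def wetc_c(E):
    """Centroid WETC (Equation 19): mean inner product between each
    top-word embedding and the normalized centroid t."""
    t = E.sum(axis=0)
    t = t / np.linalg.norm(t)
    return float((E @ t).mean())
```

For instance, two words that co-occur in every document containing either of them reach the maximum NPMI of 1, and a topic whose top-word embeddings all point in the same direction reaches a centroid WETC of 1.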
