### *4.1. Experimental Configuration*

For a fair comparison and evaluation, the experimental setting was kept the same for all neural topic models and all datasets. We first ran preliminary experiments and, based on topic coherence and perplexity, fixed the number of topics at *K* = 50: large enough to provide a sufficient number of topics without becoming excessive given the length of short texts. This value also agrees with the settings used in related work. The dimensionality of the word embeddings was fixed at *L* = 300, matching GloVe's Common Crawl-trained word embedding vectors, publicly available at https://nlp.stanford.edu/projects/glove/ (accessed on 14 January 2022), which offer the largest vocabulary coverage.

The other experimental parameters are set as follows: number of units in the encoder's hidden layers: *H*(1) = 500, *H*(2) = 500; dropout rate: *p*dropout = 0.2; minibatch size: 256; maximum number of epochs: 200; learning rate for the encoder network: 0.005; learning rate for the decoder network: 0.001. We employ Adam as the optimizer and Softplus as the activation function of the encoder networks.
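For concreteness, the following is a minimal PyTorch sketch of an encoder with these hyperparameters. It is our illustration, not the authors' implementation; the class name, the diagonal-Gaussian output heads, and the vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class TopicEncoder(nn.Module):
    # Hypothetical sketch of the encoder described above; the Gaussian
    # output heads and vocabulary size are our assumptions.
    def __init__(self, vocab_size, n_topics=50, hidden=500, p_dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden),  # H(1) = 500 units
            nn.Softplus(),                  # activation from the text
            nn.Linear(hidden, hidden),      # H(2) = 500 units
            nn.Softplus(),
            nn.Dropout(p_dropout),          # p_dropout = 0.2
        )
        self.mu = nn.Linear(hidden, n_topics)      # mean of latent topics
        self.logvar = nn.Linear(hidden, n_topics)  # log-variance

    def forward(self, bow):
        h = self.net(bow)
        return self.mu(h), self.logvar(h)

encoder = TopicEncoder(vocab_size=2000)    # K = 50 topics by default
decoder = nn.Linear(50, 2000, bias=False)  # topic-word matrix as decoder

# Separate learning rates for encoder and decoder, as stated in the text;
# minibatch size 256 and 200 epochs would apply in the training loop.
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 0.005},
    {"params": decoder.parameters(), "lr": 0.001},
])
```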

For WLDA, we employ a Dirichlet prior for generating topic proportions and train the model with MMD. For NSTM, the Sinkhorn algorithm is run for at most 2000 updates, with a termination threshold of 0.05 and constant *α*Sinkhorn = 20.
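As a reference point, here is a minimal NumPy sketch of a Sinkhorn iteration with the settings above. Treating *α*Sinkhorn as an inverse regularization temperature and stopping on the row-marginal error are our assumptions; NSTM's actual implementation may parameterize these differently.

```python
import numpy as np

def sinkhorn(a, b, cost, alpha=20.0, max_iter=2000, tol=0.05):
    # Entropy-regularized optimal transport via Sinkhorn updates.
    # `a` (n,) and `b` (m,) are marginal distributions; `cost` is (n, m).
    K = np.exp(-alpha * cost)                 # Gibbs kernel (assumed form)
    u = np.ones_like(a)
    for _ in range(max_iter):                 # at most 2000 updates
        v = b / (K.T @ u)
        u = a / (K @ v)
        err = np.abs(u * (K @ v) - a).sum()   # row-marginal violation
        if err < tol:                         # termination threshold 0.05
            break
    return u[:, None] * K * v[None, :]        # transport plan
```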

### *4.2. Results for Topic Coherence*

Tables 3–12 present the detailed results of the topic coherence (TC) metrics (NPMI and WETC) for each neural topic model and each dataset. Values in bold indicate the best results. We used two versions of GloVe, differing in the size of the training corpus.
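For reference, a pairwise form of WETC can be sketched as below: the mean cosine similarity over all pairs of a topic's top-N words in embedding space. This is a common formulation and our assumption; the paper's exact variant may differ. NPMI is computed analogously from corpus co-occurrence statistics, NPMI(w_i, w_j) = log[P(w_i, w_j)/(P(w_i)P(w_j))] / (−log P(w_i, w_j)).

```python
import numpy as np

def wetc_pairwise(top_words, emb):
    # Pairwise WETC: mean cosine similarity over all pairs of a topic's
    # top-N words. `emb` maps a word to its vector; words not covered by
    # the PWE are skipped (see the coverage discussion below).
    vecs = [emb[w] / np.linalg.norm(emb[w]) for w in top_words if w in emb]
    sims = [float(vecs[i] @ vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return float(np.mean(sims)) if sims else float("nan")
```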

For many datasets, NVLDA-PWE/fine-tuning achieves the highest TC. One challenge of using PWE without fine-tuning is that a large domain gap between the PWE training corpus and the topic modeling corpus has a negative impact. In many cases our proposal produces better results, but not for all datasets or all models. The GoogleNews dataset often achieves its best TC with PWE alone and does not improve with additional fine-tuning, probably because its domain is similar to that of the PWE training data. For a few datasets, the best performance is observed when no pretrained word embedding is used at all; we verified that, for those datasets, the original corpus already contains sufficient word co-occurrence information.

However, we observed that the TC value changes significantly depending on the type of word embedding, suggesting that the quality of the word embeddings can strongly affect the training of the topic model. In particular, whether the unique words of the training corpus are covered by the PWE vocabulary has a significant impact: if the coverage is high, the PWE can be used for evaluation, but if many words are missing, the reliability of the evaluation value is greatly compromised.
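The coverage in question can be computed directly; a minimal sketch (our own helper, not from the paper):

```python
def pwe_coverage(corpus_vocab, pwe_vocab):
    # Fraction of the corpus vocabulary covered by the PWE; a low value
    # makes embedding-based evaluation unreliable, as noted above.
    pwe_vocab = set(pwe_vocab)
    return sum(w in pwe_vocab for w in corpus_vocab) / len(corpus_vocab)
```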

Figure 4 summarizes topic coherence over all neural topic models for the long-text corpora (2 datasets) and the short-text corpora (8 datasets), showing the overall trend. For long texts, the TC scores under PWE/fine-tuning are the same as or slightly worse than the others; one reason is that the long-text corpora used in this study consist of relatively formal documents, whose domain is close to that of the PWE. In contrast, on the short-text corpora, all metrics improve, and the overall trend is none < PWE < PWE/fine-tuning.

**Figure 4.** Summary of TC results.

**Table 3.** Topic Coherence on 20NewsGroups.



**Table 4.** Topic Coherence on BBCNews.

**Table 5.** Topic Coherence on Biomedical.



**Table 6.** Topic Coherence on DBLP.

**Table 7.** Topic Coherence on GoogleNews.



**Table 8.** Topic Coherence on M10.

**Table 9.** Topic Coherence on PascalFlicker.



**Table 10.** Topic Coherence on SearchSnippets.

**Table 11.** Topic Coherence on StackOverflow.



**Table 12.** Topic Coherence on TrecTweet.

### *4.3. Results for Topic Diversity*

Tables 13–22 present the detailed results of the metrics expressing topic diversity (TopicDiversity, InvertedRBO, and MSCD) for each neural topic model and each dataset. Values in bold indicate the best results. InvertedRBO, which weights the top-N words by rank, shows the highest values in almost all cases, from which we can infer that the constructed topics are highly diverse. This result shows that adding a regularization term that maximizes the distance between topic-centroid vectors was useful, yielding highly diverse topics. A sketch of the metric follows below.
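For reference, a truncated form of InvertedRBO can be sketched as follows. The persistence parameter p = 0.9 is a common default and an assumption on our part; the paper does not state it in this section.

```python
import itertools
import numpy as np

def rbo(s, t, p=0.9):
    # Truncated rank-biased overlap between two ranked word lists.
    n = min(len(s), len(t))
    return (1 - p) * sum((p ** (d - 1)) * len(set(s[:d]) & set(t[:d])) / d
                         for d in range(1, n + 1))

def inverted_rbo(topics, p=0.9):
    # 1 minus the average pairwise RBO over all topic pairs;
    # higher values mean more diverse topics.
    pairs = list(itertools.combinations(topics, 2))
    return 1.0 - float(np.mean([rbo(s, t, p) for s, t in pairs]))
```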

Furthermore, WLDA and NSTM show similar results even without this regularization term, indicating that these models learn without compromising topic diversity in their raw form. To check whether the regularization term is working as intended, we added TopicCentroidDistance (TCD) to the tables. Larger values of this metric are better, but the values are almost the same across all cases. Since this metric was evaluated with two different PWEs and the values varied between them, we can infer that the quality of the embedding has a significant impact on the evaluation of the topic model.
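A plausible reconstruction of TCD, based on how it is described here, embeds each topic's top-N words, averages them into a centroid, and reports the mean pairwise cosine distance between centroids; the paper's exact definition may differ.

```python
import numpy as np

def topic_centroid_distance(topics, emb):
    # Illustrative reconstruction of TCD: mean pairwise cosine distance
    # between per-topic centroids of top-N word embeddings.
    centroids = []
    for words in topics:
        vecs = [emb[w] for w in words if w in emb]
        c = np.mean(vecs, axis=0)
        centroids.append(c / np.linalg.norm(c))
    dists = [1.0 - float(centroids[i] @ centroids[j])
             for i in range(len(centroids))
             for j in range(i + 1, len(centroids))]
    return float(np.mean(dists))
```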

Although the TopicDiversity results varied greatly by model and dataset, the scores were sufficiently good in many individual cases. However, as with the NVLDA-PWE/fine-tuning results on Biomedical, there were cases where TC was good but TD was poor. In that case InvertedRBO still shows a good score, but MSCD, which evaluates the entire topic–word distribution, shows a relatively large (i.e., bad) value, indicating that the topics are relatively entangled. Metrics such as TopicDiversity and InvertedRBO, which are based on the top-N words, are useful for evaluating topic diversity, but it is also important to evaluate the entire topic–word distribution.

Figure 5 summarizes the topic diversity results over all neural topic models for the long-text corpora (2 datasets) and the short-text corpora (8 datasets), showing the overall trend. Among the TD-related metrics, the InvertedRBO score is nearly the highest in all cases, indicating sufficient diversity under all conditions. For the other scores, however, performance is slightly worse with PWE and PWE/fine-tuning.

**Figure 5.** Summary of TD results.

**Table 13.** Topic Diversity on 20NewsGroups.


**Table 14.** Topic Diversity on BBCNews.

**Table 15.** Topic Diversity on Biomedical.



**Table 16.** Topic Diversity on DBLP.

**Table 17.** Topic Diversity on GoogleNews.



**Table 18.** Topic Diversity on M10.

**Table 19.** Topic Diversity on PascalFlicker.



**Table 20.** Topic Diversity on SearchSnippets.

**Table 21.** Topic Diversity on StackOverflow.



**Table 22.** Topic Diversity on TrecTweet.

### *4.4. Classification and Clustering Performance*

Tables 23–32 present the classification and clustering performance for each model and each dataset. Values in bold indicate the best results. For the TrecTweet dataset, the classification results could not be obtained, possibly due to a technical problem. Figure 6 presents the average classification and clustering performance of the models over the long- and short-text datasets. Classification was performed with an SVM (Support Vector Machine) using linear and RBF kernels; classification accuracy, precision, recall, and F1 score were used to assess the supervised task, while NMI (Normalized Mutual Information) and Purity were used for the unsupervised (clustering) task.
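A minimal scikit-learn sketch of this evaluation protocol is given below. The 5-fold cross-validation, the argmax cluster assignment, and integer-encoded labels are our assumptions; the paper does not specify its splits or SVM settings here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import normalized_mutual_info_score

def evaluate(theta, labels):
    # `theta` is the (documents x topics) proportion matrix; `labels`
    # are integer-encoded class labels.
    scores = {}
    for kernel in ("linear", "rbf"):   # the two kernels named in the text
        acc = cross_val_score(SVC(kernel=kernel), theta, labels,
                              cv=5, scoring="accuracy").mean()
        scores[f"svm_{kernel}_accuracy"] = acc

    # Clustering: assign each document to its dominant topic.
    clusters = theta.argmax(axis=1)
    scores["nmi"] = normalized_mutual_info_score(labels, clusters)

    # Purity: fraction of documents sharing the majority label
    # of their assigned cluster.
    majority = sum(np.bincount(labels[clusters == c]).max()
                   for c in np.unique(clusters))
    scores["purity"] = majority / len(labels)
    return scores
```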

For classification, VAE-based models such as NVDM and GSM exhibit good performance, while WAE-based models such as WLDA and NSTM perform relatively poorly. NSTM performs well on TC and TD, especially TD, without any added regularization term; however, applying WAE variants to downstream tasks remains a challenge. Considering the overall trend, for long texts, PWE with fine-tuning improves all scores, whereas for short texts, performance is best without embeddings, although fine-tuning still improves the scores over those obtained with the pretrained embedding alone.

For the clustering results, the large NMI and Purity scores for all models and all datasets, for both long and short texts, indicate that documents with the same label are concentrated around the topic-centroid vector, showing that PWE/fine-tuning improves topic cohesion. We can therefore conclude that our PWE/fine-tuning proposal contributes to narrowing the domain gap between the training corpus and the PWE.

**Figure 6.** Classification and clustering performance.

**Table 23.** Classification and Clustering performance on 20NewsGroups.


**Table 24.** Classification and Clustering performance on BBCNews.

**Table 25.** Classification and Clustering performance on Biomedical.



**Table 26.** Classification and Clustering performance on DBLP.

**Table 27.** Classification and Clustering performance on GoogleNews.



**Table 28.** Classification and Clustering performance on M10.

**Table 29.** Classification and Clustering performance on PascalFlicker.



**Table 30.** Classification and Clustering performance on SearchSnippets.

**Table 31.** Classification and Clustering performance on StackOverflow.



**Table 32.** Classification and Clustering performance on TrecTweet.
