**5. Conclusions**

Short-text data are now ubiquitous, generated continuously by various social networking sites, and the importance of analyzing these short messages is growing rapidly. Unlike long texts or documents, short texts suffer from a lack of word co-occurrence information due to their restricted lengths, which makes it difficult for popular topic-model techniques to generate coherent and interpretable topics.

Using pretrained word embeddings in neural-topic models is an effective way to increase the quality of the generated topics, as measured by topic coherence and topic diversity. This holds for both long and short texts, and it also reduces the number of trainable parameters, thus shortening training time. However, to achieve better topic coherence, especially on short texts, and to make the top-N words of a topic more relevant to the actual semantic content of the training corpus, the additional fine-tuning stage proposed in this work is necessary. Our extensive study with several neural-topic models and benchmark datasets justifies this proposal.
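For illustration, the distinction between fixed pretrained embeddings and the fine-tuned variant can be sketched in a few lines of PyTorch; the tensor shapes and variable names below are illustrative placeholders, not our exact implementation:

```python
import torch
import torch.nn as nn

# Placeholder for a real pretrained matrix (vocab_size x embedding_dim),
# e.g., GloVe or word2vec vectors aligned to the corpus vocabulary.
pretrained = torch.randn(5000, 300)

# NTM-PWE: freeze=True keeps the embeddings fixed,
# reducing the number of trainable parameters.
embedding_fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# NTM-PWE/fine-tuning: freeze=False lets the topic model's loss
# continue updating the embeddings toward the training corpus.
embedding_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
```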

However, the use of pretrained word embeddings (PWE) has inherent limitations that may affect the quality of the topics extracted from short texts. The short-text corpus to be analyzed may contain words that are not covered by the vocabulary of the corpus on which the embeddings were pretrained. In this case, NTM-PWE initializes the corresponding vectors with zeros. As the proportion of such out-of-vocabulary words increases, performance is likely to deteriorate. Moreover, in the case of NTM-PWE/fine-tuning, the number of parameter updates required for the loss function to converge may grow, increasing training time. Finally, if the temporal gap between the corpus used for PWE training and the corpus to be analyzed is too large, word meanings may have shifted over time, which can negatively affect the production of interpretable topics.
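A minimal sketch of this zero-vector fallback, assuming a dict-like word-to-vector mapping (the function and variable names here are hypothetical):

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained_vectors, dim=300):
    """Look up each corpus word in the pretrained vectors;
    out-of-vocabulary words keep an all-zero row."""
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    oov = 0
    for idx, word in enumerate(vocab):
        vec = pretrained_vectors.get(word)  # None if the word is OOV
        if vec is not None:
            matrix[idx] = vec
        else:
            oov += 1  # row idx stays zero
    print(f"out-of-vocabulary words: {oov}/{len(vocab)}")
    return matrix
```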

We also observe that the improvement in topic quality after introducing the fine-tuning stage is not uniform across datasets and models. Characterizing the relationship between the architecture of a neural-topic model and the inherent characteristics of a dataset remains difficult and poses a challenge for our study. In this work, we restricted our experiments to publicly available benchmark datasets; we are currently collecting data to evaluate our proposal on real-world datasets.

By incorporating additional training on the original training corpus, on top of word embeddings pretrained on an external corpus, we can improve the purity and NMI of the topics, as evaluated with the class labels of the documents. We can thus construct topics that are better suited to the training corpus. This method can also be expected to improve the performance of downstream tasks, such as classification, for long texts. Even for short texts, downstream-task performance is better than when using pretrained word embeddings without fine-tuning.
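For concreteness, purity and NMI can be computed from per-document topic assignments and class labels as in the generic sketch below (using scikit-learn for NMI; this is not our exact evaluation code):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(true_labels, cluster_assignments):
    """Fraction of documents whose cluster's majority class
    matches their own class label."""
    true_labels = np.asarray(true_labels)
    cluster_assignments = np.asarray(cluster_assignments)
    correct = 0
    for c in np.unique(cluster_assignments):
        members = true_labels[cluster_assignments == c]
        correct += np.bincount(members).max()  # majority-class count
    return correct / len(true_labels)

# Toy example: assign each document to its highest-probability topic,
# then compare the assignments against ground-truth class labels.
labels = np.array([0, 0, 1, 1, 2, 2])
topics = np.array([1, 1, 0, 0, 2, 2])
print("Purity:", purity(labels, topics))                     # 1.0
print("NMI:", normalized_mutual_info_score(labels, topics))  # 1.0
```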

**Author Contributions:** Conceptualization, R.M.; methodology, R.M.; software, R.M.; validation, R.M. and B.C.; investigation, R.M.; data curation, R.M.; writing—original draft preparation, R.M.; writing—review and editing, B.C.; visualization, R.M. and B.C.; supervision, B.C.; project administration, B.C.; funding acquisition, B.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** The authors acknowledge the technical support of the Pattern Recognition and Machine Learning Laboratory, Department of Software, Iwate Prefectural University, Japan.

**Conflicts of Interest:** The authors declare no conflict of interest.
