**Appendix A**

### 1. Topic Modeling

The intuition behind probabilistic topic modeling is illustrated in Figure A1.

LDA and other topic models belong to the larger field of probabilistic modeling [1]. Generative probabilistic modeling treats data as arising from a generative process that includes hidden variables; this process defines a joint probability distribution over both the observed and hidden random variables.
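As a concrete illustration, the generative process described above can be sketched in a few lines of plain Python. The sizes (3 topics, 8 word types, 20 tokens per document) and the Dirichlet concentrations are arbitrary toy values, not parameters from the study:

```python
import random

random.seed(0)

def sample_dirichlet(alpha, dim):
    """Sample a probability vector from a symmetric Dirichlet(alpha)
    by normalizing independent Gamma draws."""
    gammas = [random.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_categorical(probs):
    """Draw an index from a discrete probability distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Toy sizes: 3 topics, 8 word types, documents of 20 tokens.
K, V, doc_len = 3, 8, 20

# Hidden variables: per-topic word distributions (beta_k).
beta = [sample_dirichlet(0.1, V) for _ in range(K)]

def generate_document():
    theta = sample_dirichlet(0.5, K)      # hidden topic proportions
    words = []
    for _ in range(doc_len):
        z = sample_categorical(theta)     # hidden topic assignment
        w = sample_categorical(beta[z])   # observed word
        words.append(w)
    return words

doc = generate_document()
print(doc)  # a list of word indices; only this part is observed
```

The joint distribution over `theta`, the assignments `z`, and the words `w` is exactly what inference later inverts: only `doc` is observed, and everything else must be recovered from it.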

The joint distribution is used to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is also called the posterior distribution.

Structural topic modeling (STM) extends the LDA framework. STM allows for correlations among topics, lets covariate data (including document metadata) influence topic prevalence within documents, and uses document-specific covariate data to define the distributions of word use within a topic [2].
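The covariate-on-prevalence idea can be sketched with a simplified logistic-normal model: a linear predictor in the covariate, plus Gaussian noise, pushed through a softmax. The coefficients in `gamma` are invented for illustration, and the sketch omits details of the real STM prior (e.g., fixing a reference topic's linear predictor to zero):

```python
import math
import random

random.seed(0)

def softmax(xs):
    """Map real-valued scores to a probability vector."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical setup: 3 topics and one binary document covariate
# (e.g., group membership), with illustrative coefficients.
K = 3
gamma = [[0.0, 1.5],    # per-topic [intercept, covariate slope]
         [0.0, -1.5],
         [0.0, 0.0]]

def topic_proportions(covariate, noise_sd=0.5):
    """Simplified logistic-normal prevalence: the covariate shifts the
    mean of each topic's score before the softmax."""
    eta = [g0 + g1 * covariate + random.gauss(0.0, noise_sd)
           for g0, g1 in gamma]
    return softmax(eta)

print(topic_proportions(0))  # proportions for a covariate-0 document
print(topic_proportions(1))  # covariate shifts mass toward topic 0
```

The key contrast with LDA is that the topic proportions are no longer drawn from a single shared Dirichlet: their expected values move with each document's metadata, and the logistic-normal form permits correlations among topics.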

We employed the ldatuning package [3], using the log-likelihood method via Gibbs sampling. Specifically, we used the "Griffiths" [4] and "CaoJuan" [5] metric scores.
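To make the "CaoJuan" metric concrete: it scores a fitted model by the average pairwise cosine similarity of its topic-word distributions, so lower values indicate more distinct topics and the preferred number of topics minimizes it. A minimal stdlib sketch, with invented toy topic-word matrices:

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cao_juan_2009(beta):
    """Average pairwise cosine similarity between topic-word
    distributions; lower means more distinct topics (minimize)."""
    K = len(beta)
    pairs = [(i, j) for i in range(K) for j in range(i + 1, K)]
    return sum(cosine(beta[i], beta[j]) for i, j in pairs) / len(pairs)

# Toy topic-word matrices: well-separated vs. heavily overlapping topics.
separated = [[0.7, 0.1, 0.1, 0.1],
             [0.1, 0.7, 0.1, 0.1],
             [0.1, 0.1, 0.7, 0.1]]
overlapping = [[0.4, 0.3, 0.2, 0.1],
               [0.4, 0.3, 0.2, 0.1],
               [0.3, 0.4, 0.2, 0.1]]

print(cao_juan_2009(separated))    # low: topics are distinct
print(cao_juan_2009(overlapping))  # near 1: topics are redundant
```

The "Griffiths" metric instead compares candidate models by the (approximate) log-likelihood of the corpus under each number of topics, to be maximized; ldatuning plots both curves so the two criteria can be inspected jointly.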

**Figure A3.** Finding the optimal number of topics.

### 2. STM Evaluation

Each model has semantic coherence and exclusivity values associated with each topic. Figure A4 plots these values, labeling each point with its topic number; numerals represent the average for each model, and dots represent topic-specific scores.
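The semantic coherence plotted in Figure A4 is commonly computed in the style of Mimno et al.: for each ordered pair of a topic's top words, take the log of the smoothed co-document frequency over the document frequency, and sum. A stdlib sketch on an invented toy corpus:

```python
import math

def semantic_coherence(top_words, docs):
    """Coherence for one topic: sum over ordered pairs of top words of
    log((co-document frequency + 1) / document frequency). Words that
    co-occur often score near zero; words that never co-occur are
    penalized with large negative terms."""
    doc_sets = [set(d) for d in docs]
    def df(w):
        return sum(1 for d in doc_sets if w in d)
    def co_df(w1, w2):
        return sum(1 for d in doc_sets if w1 in d and w2 in d)
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((co_df(top_words[m], top_words[l]) + 1)
                              / df(top_words[l]))
    return score

# Toy corpus: pet words co-occur; finance words co-occur; the two
# groups never mix.
docs = [["cat", "dog", "pet"], ["cat", "dog"],
        ["stock", "bond"], ["stock", "bond", "market"]]

print(semantic_coherence(["cat", "dog"], docs))    # higher (coherent pair)
print(semantic_coherence(["cat", "stock"], docs))  # lower (never co-occur)
```

Exclusivity complements this by rewarding top words that are probable in one topic but not in others, so model selection trades the two off, as in Figure A4.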

**Figure A4.** Topic model selection in the STM package.
