3.2. Parsimonious Topic Model (PTM)
We first introduce PTM's generative process.

For each document d:
For each word i in document d:
1. Randomly select a topic based on the probability mass function (pmf) $\{u_{dj}\,\alpha_{dj},\ j=1,\dots,M\}$.
2. Given the selected topic j, randomly generate the i-th word $w_{di}$ based on the topic's pmf over the word space, $P(w_{di}=n\mid j)=\beta_{jn}^{\,v_{jn}}\,\beta_{0n}^{\,1-v_{jn}}$.

Here $u_{dj}\in\{0,1\}$ is the topic switch that indicates whether topic j is present in document d. If $u_{dj}=1$, topic j is present in document d and $\alpha_{dj}$ is treated as a model parameter. $\beta_{jn}$ and $\beta_{0n}$ are the topic-specific probability of word n under topic j and the shared probability of word n, respectively.
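The generative process above can be sketched in a few lines of code. In the toy example below, all dimensions and probability values are invented for illustration (they are not from the paper); note that in the full model the mixed word pmf in step 2 sums to one by construction, through the constraint tying $\beta$ and $\beta_0$, so the explicit renormalization here is only needed because the toy values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 5                                # toy sizes: 3 topics, 5 dictionary words

u_d = np.array([1.0, 0.0, 1.0])            # topic switches for document d (topic 1 absent)
alpha_d = np.array([0.4, 0.0, 0.6])        # topic pmf over the *present* topics
v = rng.integers(0, 2, size=(M, N))        # word switches v[j, n]
beta = rng.dirichlet(np.ones(N), size=M)   # toy topic-specific word pmfs
beta0 = np.full(N, 1.0 / N)                # shared word pmf

def sample_word(rng):
    # Step 1: select a topic from the pmf {u_dj * alpha_dj}
    j = rng.choice(M, p=u_d * alpha_d)
    # Step 2: draw the word from topic j's pmf: topic-specific entries where
    # v[j, n] = 1, shared entries where v[j, n] = 0
    p = np.where(v[j] == 1, beta[j], beta0)
    p = p / p.sum()                        # renormalize (needed for toy values only)
    return rng.choice(N, p=p)

doc = [sample_word(rng) for _ in range(8)]
```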
Based on the above data generation, we can get the data likelihood of a document dataset $\mathcal{D}$ under our model $(H,\Theta)$:

$$P(\mathcal{D}\mid H,\Theta)=\prod_{d=1}^{D}\prod_{i=1}^{N_d}\sum_{j=1}^{M}u_{dj}\,\alpha_{dj}\,\beta_{j w_{di}}^{\,v_{j w_{di}}}\,\beta_{0 w_{di}}^{\,1-v_{j w_{di}}} \quad (1)$$

Here $v_{jn}\in\{0,1\}$ is the word switch that indicates if word n is topic-specific under topic j. The model structure parameters, denoted by $H=\{\mathbf{u},\mathbf{v},M\}$, consist of the two kinds of switches and the number of topics, M (the model order). Likewise, the model parameters, given a fixed model structure, are denoted by $\Theta=\{\alpha,\beta,\beta_0\}$. The model structure together with the model parameters constitutes the PTM model. In PTM the parameters are constrained by the following two conditions:
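As a concrete reading of this likelihood, the sketch below evaluates the log-likelihood of a toy corpus by the same sum-product form. The corpus and all parameter values are invented for illustration; the variable names mirror the notation above, not any existing implementation.

```python
import numpy as np

def ptm_log_likelihood(docs, u, alpha, v, beta, beta0):
    """log P(D | H, Theta): for each word, sum over the present topics of
    alpha times the (topic-specific or shared) word probability."""
    ll = 0.0
    for d, words in enumerate(docs):
        for w in words:
            p_word = 0.0
            for j in range(u.shape[1]):
                if u[d, j]:
                    pw = beta[j, w] if v[j, w] else beta0[w]
                    p_word += alpha[d, j] * pw
            ll += np.log(p_word)
    return ll

# Toy corpus: 2 documents over a 3-word dictionary, M = 2 topics
docs = [[0, 1, 1], [2, 0]]
u = np.array([[1, 1], [1, 0]])
alpha = np.array([[0.5, 0.5], [1.0, 0.0]])
v = np.array([[1, 0, 1], [0, 1, 1]])
beta = np.array([[0.6, 0.2, 0.2], [0.1, 0.5, 0.4]])
beta0 = np.array([1/3, 1/3, 1/3])
ll = ptm_log_likelihood(docs, u, alpha, v, beta, beta0)
```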
First, $\alpha_{dj}$ is the probability of topic j in document d, and $u_{dj}$ determines whether or not topic j is present. The probability mass function over the present topics must sum to one. So, we have:

$$\sum_{j=1}^{M}u_{dj}\,\alpha_{dj}=1,\qquad d=1,\dots,D \quad (2)$$

Additionally, the word probability parameters $\beta_{jn}$ and $\beta_{0n}$ must satisfy a pmf constraint for each topic:

$$\sum_{n=1}^{N}\big[v_{jn}\,\beta_{jn}+(1-v_{jn})\,\beta_{0n}\big]=1,\qquad j=1,\dots,M \quad (3)$$
Based on the PTM described above, we must determine the model parameters $\Theta$ and the model structure, H. Assuming the model structure is known, we can estimate the model parameters using the expectation maximization (EM) algorithm: by introducing, as hidden data, random variables that indicate which topic generates each word in each document, we can compute the expected complete data log-likelihood and maximize it subject to the two constraints above. Model selection is more complicated: we need to derive a BIC cost function that balances model complexity against the data likelihood. For the PTM model, a generalized expectation maximization (GEM) [7,8] algorithm was proposed to update the model parameters $\Theta$ and the model structure H iteratively. In the following sections, we give a derivation of BIC and the GEM algorithm for our modified PTM model.
3.3. Derivation of PTM-Customized Bayesian Information Criterion (BIC)
In this section we derive our BIC objective function, which generalizes the PTM BIC objective [3]. A naive BIC objective has the following form:

$$\mathrm{BIC}(H)=-\log P(\mathcal{D}\mid H,\hat{\Theta})+\frac{K}{2}\log D$$

Here $P(\mathcal{D}\mid H,\hat{\Theta})$ is the maximized value of the data likelihood of the model with structure H, where $\hat{\Theta}=\arg\max_{\Theta}P(\mathcal{D}\mid H,\Theta)$ is the collection of parameters that maximizes the likelihood function. Additionally, D is the number of data points (documents) in the dataset and K is the number of free parameters in the model.
However, the Laplace approximation used in deriving this BIC form is only valid under the assumption that the feature space is far smaller than the sample size, and for our topic model, the feature space (word dictionary) is quite large in practice. Moreover, in the naive BIC form, all the parameters incur the same description length penalty (the log of the sample size), but in PTM different types of parameters contribute unequally to the model complexity. So, a new customized BIC is derived for PTM.
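For reference, the naive criterion in the scale used in this section (negative log-likelihood plus half the parameter count times the log sample size, i.e., the negative-log-posterior form rather than the doubled version found in some texts) is trivial to compute:

```python
import math

def naive_bic(log_lik, num_params, num_docs):
    # Naive BIC: every parameter pays the same penalty, (1/2) log(sample size),
    # regardless of how much data actually constrains it.
    return -log_lik + 0.5 * num_params * math.log(num_docs)
```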
The Bayesian approach to model selection is to maximize the posterior probability of the model H given the dataset $\mathcal{D}$. When applying Bayes' theorem to calculate the posterior probability, we get:

$$P(H\mid\mathcal{D})=\frac{P(\mathcal{D}\mid H)\,P(H)}{P(\mathcal{D})}\propto P(H)\int P(\mathcal{D}\mid H,\Theta)\,\pi(\Theta\mid H)\,d\Theta$$

Here we define:

$$I=\int P(\mathcal{D}\mid H,\Theta)\,\pi(\Theta\mid H)\,d\Theta$$

where $\pi(\Theta\mid H)$ is the prior distribution of the parameters given the model structure H. Then, we need to use Laplace's method to approximate I, given the knowledge that for a large sample size the integrand $P(\mathcal{D}\mid H,\Theta)\,\pi(\Theta\mid H)$ peaks around its maximum point (the posterior mode $\hat{\Theta}$). We can rewrite I as:

$$I=\int\exp\{Q(\Theta)\}\,d\Theta,\qquad Q(\Theta)=\log P(\mathcal{D}\mid H,\Theta)+\log\pi(\Theta\mid H)$$
We can now expand $Q(\Theta)$ around the posterior mode $\hat{\Theta}$ using a Taylor series expansion:

$$Q(\Theta)\approx Q(\hat{\Theta})+(\Theta-\hat{\Theta})^{T}g+\frac{1}{2}(\Theta-\hat{\Theta})^{T}\hat{A}\,(\Theta-\hat{\Theta})$$

where $g=\nabla Q(\Theta)\big|_{\Theta=\hat{\Theta}}$ and $\hat{A}$ is the Hessian matrix (i.e., $\hat{A}=\nabla^{2}Q(\Theta)\big|_{\Theta=\hat{\Theta}}$). Since Q attains its maximum at $\hat{\Theta}$, $g=0$ and $\hat{A}$ is negative definite. We can thus approximate I as follows:

$$I\approx\exp\{Q(\hat{\Theta})\}\int\exp\Big\{\frac{1}{2}(\Theta-\hat{\Theta})^{T}\hat{A}\,(\Theta-\hat{\Theta})\Big\}\,d\Theta$$
With the above form of the approximation, the integrand is a scaled Gaussian distribution with mean $\hat{\Theta}$ and covariance $(-\hat{A})^{-1}$. Thus:

$$\int\exp\Big\{\frac{1}{2}(\Theta-\hat{\Theta})^{T}\hat{A}\,(\Theta-\hat{\Theta})\Big\}\,d\Theta=(2\pi)^{k/2}\,\big|-\hat{A}\big|^{-1/2}$$

where k is the number of parameters in $\Theta$. So we have the approximation of I:

$$I\approx P(\mathcal{D}\mid H,\hat{\Theta})\,\pi(\hat{\Theta}\mid H)\,(2\pi)^{k/2}\,\big|-\hat{A}\big|^{-1/2}$$
BIC is the negative log model posterior:

$$\mathrm{BIC}(H)=-\log P(H\mid\mathcal{D})\approx-\log P(\mathcal{D}\mid H,\hat{\Theta})-\log\pi(\hat{\Theta}\mid H)-\frac{k}{2}\log 2\pi+\frac{1}{2}\log\big|-\hat{A}\big|-\log P(H)$$

Note that $\pi(\hat{\Theta}\mid H)$, the prior of the parameters given the structure H, can be assumed to be a uniform distribution (i.e., a constant), so this term can be neglected, as can the constant term $\frac{k}{2}\log 2\pi$. The term $P(\mathcal{D}\mid H,\hat{\Theta})$ is the data likelihood, and k is the total number of model parameters. Now we need to approximately calculate $\frac{1}{2}\log|-\hat{A}|$ and $-\log P(H)$.
To do so, we assume that $-\hat{A}$ is a diagonal matrix. We thus obtain:

$$\frac{1}{2}\log\big|-\hat{A}\big|\approx\sum_{d=1}^{D}\frac{M_d-1}{2}\log N_d+\sum_{j=1}^{M}\frac{V_j}{2}\log\tilde{N}_j+\frac{N_0}{2}\log N_{\mathrm{tot}}$$

The terms on the right represent the costs of the model parameters $\alpha_{dj}$, $\beta_{jn}$, and $\beta_{0n}$, respectively, where $M_d=\sum_j u_{dj}$ is the number of topics present in document d, $N_d$ is the number of words in document d, $V_j=\sum_n v_{jn}$ is the number of topic-specific words under topic j, and $N_0$ is the number of words that use the shared distribution under at least one topic. Note that in the naive BIC form, each parameter pays the same cost, $\frac{1}{2}\log D$. Here we instead use the effective sample size: the effective sample size of $\alpha_{dj}$ is $N_d$, parameter $\beta_{jn}$ has sample size $\tilde{N}_j=\sum_d u_{dj}N_d$ (the number of words in documents where topic j is present), and the parameter $\beta_{0n}$ has sample size $N_{\mathrm{tot}}=\sum_d N_d$ (the total number of words in the dataset).
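The parameter-cost bookkeeping can be tallied directly from the switch matrices. The sketch below uses an invented toy corpus layout (documents, counts, and switch values are all made up); `Nj` plays the role of the effective sample size for topic j's word parameters and `N_total` for the shared ones.

```python
import numpy as np

Nd = np.array([120, 80, 200])                  # words per document (toy values)
u = np.array([[1, 1, 0],                       # topic switches u[d, j]
              [1, 0, 1],
              [0, 1, 1]])
v = np.array([[1, 0, 1, 1],                    # word switches v[j, n]
              [0, 0, 1, 0],
              [1, 1, 1, 0]])

N_total = Nd.sum()                             # total words in the corpus
Nj = (u * Nd[:, None]).sum(axis=0)             # words in docs where topic j is present

# cost of alpha: (M_d - 1)/2 * log(N_d) per document
alpha_cost = 0.5 * ((u.sum(axis=1) - 1) * np.log(Nd)).sum()
# cost of beta: (1/2) log(Nj) per topic-specific word probability
beta_cost = 0.5 * (v.sum(axis=1) * np.log(Nj)).sum()
# cost of beta0: one (1/2) log(N_total) per word shared under at least one topic
shared_words = (v.min(axis=0) == 0)
beta0_cost = 0.5 * shared_words.sum() * np.log(N_total)
```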
Another term to be estimated is $-\log P(H)$. For the topic switches $\mathbf{u}$, in each document d, suppose the number of present topics $M_d=\sum_j u_{dj}$ follows a uniform distribution over $\{1,\dots,M\}$, and the switch configuration also follows a uniform distribution over all $\binom{M}{M_d}$ switch configurations with that many topics. We then obtain:

$$-\log P(\mathbf{u})=\sum_{d=1}^{D}\Big[\log M+\log\binom{M}{M_d}\Big]$$
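This two-stage uniform prior is easy to evaluate in code. A minimal sketch (the toy switch rows below are invented):

```python
import math

def topic_switch_cost(u_rows, M):
    # -log P(u): first pay log(M) for the (uniform) number of present topics,
    # then log C(M, Md) for the particular configuration with Md topics open.
    cost = 0.0
    for row in u_rows:
        Md = sum(row)
        cost += math.log(M) + math.log(math.comb(M, Md))
    return cost
```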
For the word switches $\mathbf{v}$ we propose here a probability model that can jointly estimate $-\log P(\mathbf{v})$ and the corresponding parameter cost of $\beta$ and $\beta_0$. For each word n, we define three types of configurations of the word switches $\{v_{jn},\ j=1,\dots,M\}$: (1) the word is topic-specific under every topic (i.e., $v_{jn}=1\ \forall j$); (2) the word is not topic-specific under any topic (i.e., $v_{jn}=0\ \forall j$); (3) some, but not all, components use the shared distribution (i.e., $0<\sum_j v_{jn}<M$). For cases 1 and 2, there is only one possible configuration of the word switches related to a word (all open or all closed), so the probability associated with this configuration is 1; for case 3 there are $2^{M}-2$ possible configurations, which we assume are equally likely under this case. We can then estimate $-\log P(\mathbf{v})$ plus the two parameter-cost terms for $\beta$ and $\beta_0$ in the BIC. That is, we have:

$$-\log P(\mathbf{v})+\mathrm{cost}(\beta)+\mathrm{cost}(\beta_0)=\sum_{n\in S_1}\sum_{j=1}^{M}\frac{1}{2}\log\tilde{N}_j+\sum_{n\in S_2}\frac{1}{2}\log N_{\mathrm{tot}}+\sum_{n\in S_3}\Big[\log\big(2^{M}-2\big)+\sum_{j:\,v_{jn}=1}\frac{1}{2}\log\tilde{N}_j+\frac{1}{2}\log N_{\mathrm{tot}}\Big]$$

Here, $S_1=\{n:v_{jn}=1\ \forall j\}$, $S_2=\{n:v_{jn}=0\ \forall j\}$, and $S_3$ is the set of remaining (mixed) words, and $\tilde{N}_j=\sum_d u_{dj}N_d$ and $N_{\mathrm{tot}}=\sum_d N_d$ are the effective sample sizes of $\beta_{jn}$ and $\beta_{0n}$.
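Per word, the three cases translate into a short cost computation. This is a sketch under the reconstruction above, with `Nj[j]` and `N_total` standing in for the effective sample sizes of the topic-specific and shared word parameters (all test values are invented):

```python
import math

def word_switch_cost(v_col, Nj, N_total):
    # Joint description cost of one word's switch column plus its
    # word-probability parameters, split by the three cases.
    M = len(v_col)
    if all(v_col):                      # case 1: topic-specific under every topic
        return 0.5 * sum(math.log(n) for n in Nj)
    if not any(v_col):                  # case 2: shared everywhere
        return 0.5 * math.log(N_total)
    # case 3: mixed; pay for the configuration choice plus both parameter kinds
    cost = math.log(2 ** M - 2)
    cost += 0.5 * sum(math.log(Nj[j]) for j in range(M) if v_col[j])
    cost += 0.5 * math.log(N_total)
    return cost
```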
Based on the derivation above, the BIC cost function for our modified PTM model is:

$$\mathrm{BIC}(H)=-\log P(\mathcal{D}\mid H,\hat{\Theta})+\sum_{d=1}^{D}\Big[\frac{M_d-1}{2}\log N_d+\log M+\log\binom{M}{M_d}\Big]+\sum_{n\in S_1}\sum_{j=1}^{M}\frac{1}{2}\log\tilde{N}_j+\sum_{n\in S_2}\frac{1}{2}\log N_{\mathrm{tot}}+\sum_{n\in S_3}\Big[\log\big(2^{M}-2\big)+\sum_{j:\,v_{jn}=1}\frac{1}{2}\log\tilde{N}_j+\frac{1}{2}\log N_{\mathrm{tot}}\Big] \quad (16)$$

where $M_d=\sum_j u_{dj}$, $N_d$ is the number of words in document d, $\tilde{N}_j=\sum_d u_{dj}N_d$, $N_{\mathrm{tot}}=\sum_d N_d$, and $S_1$, $S_2$, $S_3$ partition the dictionary into the all-topic-specific, all-shared, and mixed words, respectively.
3.4. Generalized Expectation Maximization (GEM) Algorithm
The EM [9] algorithm is a popular method for maximizing the data log-likelihood. In unsupervised learning tasks we only have the data points $\mathcal{D}$, without any "labels" for them, so during the maximum likelihood estimation (MLE) process we introduce "label" random variables, called latent variables. The EM algorithm can be described as follows:
E-step: With the parameters fixed, we compute the expectation of the latent variables Z, which gives the class information for each data point. Using the expectation of the latent variables we can compute the expectation of the complete data log-likelihood:

$$Q\big(\Theta;\Theta^{(t)}\big)=\mathbb{E}_{Z\mid\mathcal{D},\Theta^{(t)}}\big[\log P(\mathcal{D},Z\mid H,\Theta)\big] \quad (17)$$
M-step: We update the parameters $\Theta$ to find the maximum value of the expected complete data log-likelihood. By performing the E-step and M-step iteratively, the expected complete data log-likelihood strictly increases and typically converges to a local optimum of the incomplete data log-likelihood (or of the expected BIC cost function). However, note that in PTM there are not only the model parameters $\Theta$ but also the model structure H over which we need to optimize. The original EM algorithm cannot be applied here because one cannot get jointly optimal closed-form estimates of both the model parameters $\Theta$ and the structure parameters H. However, a generalized expectation maximization (GEM) [7,8] algorithm is proposed, which alternately optimizes the expected BIC over $\Theta$ given fixed H, and then over subsets of the structure parameters H given fixed $\Theta$.
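The alternation just described can be written as a short driver loop. The sketch below takes the four subroutines as arguments, since their concrete forms are only specified later in this section; the control flow (descend in the expected BIC, stop when it no longer decreases) is the point here, and none of this is the authors' implementation.

```python
def gem(e_step, m_step_params, m_step_structure, expected_bic,
        params, structure, max_iters=50):
    # Alternate: E-step, parameter update given structure, structure update
    # given parameters; stop once the expected BIC stops decreasing.
    best = float("inf")
    for _ in range(max_iters):
        resp = e_step(params, structure)
        params = m_step_params(resp, structure)
        structure = m_step_structure(params, structure)
        bic = expected_bic(params, structure, resp)
        if bic >= best:
            break
        best = bic
    return params, structure
```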
Our GEM algorithm is specified for fixed model order, M. First we introduce the hidden data Z: $z_{di}$ is an M-dimensional binary vector, with a "1" indicating the topic of origin for the word $w_{di}$. For example, if the element $z_{dij}=1$ and the other elements of $z_{di}$ are all equal to zero, topic j is the topic of origin for word $w_{di}$.
Our GEM algorithm strictly descends in the BIC cost function, Equation (16). It consists of an E-step followed by a generalized M-step that minimizes the expected BIC cost function. These steps are given as follows. In the E-step, we first compute the expectation of the hidden data Z:

$$\gamma_{dij}\equiv\mathbb{E}[z_{dij}]=\frac{u_{dj}\,\alpha_{dj}\,\beta_{j w_{di}}^{\,v_{j w_{di}}}\,\beta_{0 w_{di}}^{\,1-v_{j w_{di}}}}{\sum_{j'=1}^{M}u_{dj'}\,\alpha_{dj'}\,\beta_{j' w_{di}}^{\,v_{j' w_{di}}}\,\beta_{0 w_{di}}^{\,1-v_{j' w_{di}}}}$$

With the expectation of the hidden data, we can compute the expected complete data log-likelihood using Equation (17). By replacing the data log-likelihood term in BIC with the expected complete data log-likelihood, we get the expected complete data BIC.
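For a single word, this E-step is one product per topic followed by normalization. A minimal sketch with invented toy values (names mirror the notation above):

```python
import numpy as np

def responsibilities(w, u_d, alpha_d, v, beta, beta0):
    # gamma[j] = posterior probability that topic j generated word w in doc d:
    # proportional to u * alpha * (topic-specific or shared word probability)
    num = u_d * alpha_d * np.where(v[:, w] == 1, beta[:, w], beta0[w])
    return num / num.sum()

u_d = np.array([1.0, 1.0, 0.0])            # topic 2 is absent from this document
alpha_d = np.array([0.3, 0.7, 0.0])
v = np.array([[1, 0], [0, 1], [1, 1]])
beta = np.array([[0.8, 0.2], [0.4, 0.6], [0.5, 0.5]])
beta0 = np.array([0.5, 0.5])
gamma = responsibilities(0, u_d, alpha_d, v, beta, beta0)
```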
In the generalized M-step, based on the expected complete data BIC computed in the E-step, we update the model structure H and the model parameters $\Theta$. First we optimize the model parameters given a fixed model structure; then we optimize the model structure given fixed model parameters, with both steps taken to minimize the expected BIC. These steps are alternated until convergence.
When updating the model parameters, note that the only term in BIC that depends on the model parameters is the data likelihood term, so we can simply maximize the expected complete data log-likelihood computed in the E-step. Taking the two constraints (Equations (2) and (3)) into consideration, we have the Lagrangian objective function:

$$\mathcal{L}=\mathbb{E}\big[\log P(\mathcal{D},Z\mid H,\Theta)\big]+\sum_{d=1}^{D}\lambda_d\Big(1-\sum_{j=1}^{M}u_{dj}\,\alpha_{dj}\Big)+\sum_{j=1}^{M}\mu_j\Big(1-\sum_{n=1}^{N}\big[v_{jn}\,\beta_{jn}+(1-v_{jn})\,\beta_{0n}\big]\Big)$$

where $\lambda_d$ and $\mu_j$ are Lagrange multipliers. By computing the partial derivative with respect to each parameter type and setting those derivatives to zero, we get the optimized model parameters, satisfying the necessary optimality conditions:

$$\hat{\alpha}_{dj}=\frac{1}{N_d}\sum_{i=1}^{N_d}\gamma_{dij},\qquad\text{for }u_{dj}=1$$

$$\hat{\beta}_{jn}=\frac{1}{\mu_j}\sum_{d=1}^{D}u_{dj}\sum_{i:\,w_{di}=n}\gamma_{dij},\qquad\text{for }v_{jn}=1 \quad (21)$$
We compute $\mu_j$ by multiplying both sides of Equation (21) by $v_{jn}$, summing over all n, and applying the distribution constraint on topic j. This gives:

$$\mu_j=\frac{\sum_{n=1}^{N}v_{jn}\sum_{d=1}^{D}u_{dj}\sum_{i:\,w_{di}=n}\gamma_{dij}}{1-\sum_{n=1}^{N}(1-v_{jn})\,\beta_{0n}}$$
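Putting the update equations together, the M-step over $\Theta$ reduces to responsibility-weighted counting; the multiplier $\mu_j$ enters only through the probability mass left over for the topic-specific words. The following is a toy sketch of that bookkeeping (data and responsibilities are invented; this is not the authors' implementation):

```python
import numpy as np

def m_step(docs, gamma, u, v, beta0):
    # Closed-form updates: alpha from per-document responsibility sums,
    # beta from responsibility-weighted word counts scaled by 1/mu_j.
    M, N = v.shape
    alpha = np.zeros((len(docs), M))
    counts = np.zeros((M, N))              # expected topic-word counts
    for d, words in enumerate(docs):
        g = np.asarray(gamma[d])           # shape (len(words), M)
        alpha[d] = u[d] * g.sum(axis=0) / len(words)
        for i, w in enumerate(words):
            counts[:, w] += g[i]
    beta = np.zeros((M, N))
    for j in range(M):
        mass = 1.0 - ((1 - v[j]) * beta0).sum()   # mass left for specific words
        denom = (v[j] * counts[j]).sum()
        if denom > 0:
            beta[j] = v[j] * counts[j] * mass / denom
    return alpha, beta

# One toy document with two words and per-word responsibilities over M = 3 topics
docs = [[0, 1]]
gamma = [[[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]]]
u = np.array([[1, 1, 0]])
v = np.array([[1, 0], [0, 1], [1, 1]])
beta0 = np.array([0.5, 0.5])
alpha, beta = m_step(docs, gamma, u, v, beta0)
```

After the update, topic 0's mixed pmf (topic-specific entries where open, shared entries where closed) sums to one, which is exactly the constraint the multiplier enforces.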
For the shared parameters, we only estimate them once, via global frequency counts at initialization, and hold them fixed during the GEM algorithm. That is, we set:

$$\beta_{0n}=\frac{\sum_{d=1}^{D}\sum_{i=1}^{N_d}\mathbb{1}[w_{di}=n]}{\sum_{d=1}^{D}N_d}$$
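The global-frequency initialization is a one-liner over the corpus. A minimal sketch (the toy corpus below is invented):

```python
from collections import Counter

def init_beta0(docs, N):
    # Shared word pmf from global frequency counts; computed once and then
    # held fixed throughout GEM, as described above.
    c = Counter(w for doc in docs for w in doc)
    total = sum(c.values())
    return [c[n] / total for n in range(N)]
```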
When updating the model structure, we implement an iterative loop in which all the topic switches $u_{dj}$ are visited one by one. If flipping the current switch reduces BIC, we accept the change; otherwise, we keep the switch unchanged. Note that in updating the word switches $v_{jn}$, we update all the word switches for a single word jointly, to see whether it is optimal to close all the switches related to that word (i.e., to specify that the word is completely uninformative). This process is repeated over all the switches until there is no decrease in BIC or until we reach a pre-defined maximum number of iterations. We first update the topic switches u until convergence; then we update the word switches v until convergence. Then we go back to the E-step.

Note that when updating the word switches, for each search over all the switches related to one word, we have three configurations: all switches closed, all switches open, or some closed and some open. We compute the minimized BIC for each configuration and then choose the configuration with the lowest BIC value.
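The switch-search loop for, e.g., the topic switches can be sketched generically: flip one switch at a time, keep the flip only if the objective drops, and sweep until nothing changes. Here `bic` is an arbitrary stand-in objective, and the sketch deliberately ignores the per-word joint moves and the at-least-one-topic bookkeeping described above.

```python
def greedy_switch_search(switches, bic, max_sweeps=10):
    # switches: list of lists of 0/1 flags; bic: callable scoring a configuration.
    for _ in range(max_sweeps):
        changed = False
        for row in switches:
            for j in range(len(row)):
                old = bic(switches)
                row[j] ^= 1               # tentative flip
                if bic(switches) < old:
                    changed = True        # accept: objective decreased
                else:
                    row[j] ^= 1           # revert
        if not changed:
            break
    return switches

# With a toy objective that just counts open switches, the search closes them all:
result = greedy_switch_search([[1, 0], [1, 1]], lambda s: sum(map(sum, s)))
```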