In this section, we present two pioneering models, generalized Dirichlet Multinomial Principal Component Analysis (GDMPCA) and Beta-Liouville Multinomial Principal Component Analysis (BLMPCA), which were designed to revolutionize text classification and sentiment analysis. At the core of our approaches is the integration of generalized Dirichlet and Beta-Liouville distributions, respectively, into the PCA framework. This integration is pivotal, as it allows for a more nuanced representation of text data, capturing the inherent sparsity and thematic structures more effectively than traditional methods.
The GDMPCA model leverages the flexibility of the generalized Dirichlet distribution to model the variability and co-occurrence of terms within documents, enhancing the model’s ability to discern subtle thematic differences. On the other hand, the BLMPCA model utilizes the Beta-Liouville distribution to precisely capture the polytopic nature of texts, facilitating a deeper understanding of sentiment and thematic distributions. Both models employ variational Bayesian inference, offering a robust mathematical framework that significantly improves computational efficiency and scalability. This approach not only aids in handling large datasets with ease but also ensures that the models remain computationally viable without sacrificing accuracy.
To elucidate the architecture of our proposed models, we delve into the algorithmic underpinnings, detailing the iterative processes that underlie the variational Bayesian inference technique. This includes a comprehensive discussion of the optimization strategies employed to enhance convergence rates and ensure the stability of the models across varied datasets. Moreover, we provide a comparative analysis, drawing parallels and highlighting distinctions between our models and existing text analysis methodologies. This comparison underscores the superior performance of GDMPCA and BLMPCA in terms of accuracy, adaptability, and computational efficiency, as evidenced by an extensive empirical evaluation on diverse real-world datasets.
Our exposition on the practical implications of these models reveals their broad applicability across numerous domains, from automated content categorization to nuanced sentiment analysis in social media texts. The innovative aspects of the GDMPCA and BLMPCA models, coupled with their empirical validation, underscore their potential to set a new standard in text analysis, offering researchers and practitioners alike powerful tools for uncovering insights from textual data.
3.1. Generalized Dirichlet Multinomial PCA
Bouguila [40] demonstrated that, when mixture models are used, the generalized Dirichlet (GD) distribution is a reasonable alternative to the Dirichlet distribution for clustering count data. As we mentioned previously, the GD distribution, like the Dirichlet distribution, is a conjugate prior to the multinomial distribution; furthermore, the GD has a more general covariance matrix [40]. Therefore, the variational Bayes approach is utilized to develop an extension of the MPCA model incorporating the generalized Dirichlet assumption. GDMPCA is anticipated to perform effectively because the Dirichlet distribution is a specific instance of the GD [41].
Like MPCA, GDMPCA is a fully generative model applied to a corpus. It considers a collection of M documents represented as the corpus. Each document consists of a sequence of words, and each word is represented as a binary indicator vector over a vocabulary of V words: the component corresponding to the j-th vocabulary word equals 1 if that word is selected and 0 otherwise [42]. The GDMPCA model then describes the generation of each word in the document through a series of steps involving c, a binary topic-indicator vector: if the i-th topic is chosen for a word, the i-th component of c equals 1, and otherwise it equals 0, where c is drawn according to m, the vector of topic proportions across the document.
The multinomial probability of each word is conditioned on the topic indicator c. The distribution of the topic proportions m is a d-variate generalized Dirichlet distribution characterized by the parameter set $(\alpha_1, \beta_1, \ldots, \alpha_d, \beta_d)$, with probability density function [42]:
\[
p(m_1, \ldots, m_d) = \prod_{i=1}^{d} \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i)\,\Gamma(\beta_i)}\, m_i^{\alpha_i - 1} \Bigl(1 - \sum_{k=1}^{i} m_k\Bigr)^{\gamma_i},
\]
where $\gamma_i = \beta_i - \alpha_{i+1} - \beta_{i+1}$ for $i = 1, \ldots, d-1$ and $\gamma_d = \beta_d - 1$. The GD distribution simplifies to a Dirichlet distribution when $\beta_i = \alpha_{i+1} + \beta_{i+1}$ for $i = 1, \ldots, d-1$.
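To make the generative process concrete, the following minimal sketch samples one document under this model, using the stick-breaking construction of the generalized Dirichlet (topic proportions built from independent Beta draws). The inputs `alpha`, `beta` (GD parameters) and `topic_word` (per-topic word probabilities) are hypothetical toy values, and the sketch illustrates the generative story rather than reproducing our implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gd(alpha, beta, rng):
    """Sample topic proportions from a generalized Dirichlet distribution
    via its stick-breaking construction: m_i = u_i * prod_{k<i}(1 - u_k),
    with u_k ~ Beta(alpha_k, beta_k)."""
    u = rng.beta(alpha, beta)                       # d independent Beta draws
    stick = np.cumprod(np.concatenate(([1.0], 1.0 - u[:-1])))
    m = u * stick                                   # first d proportions
    return np.append(m, 1.0 - m.sum())              # remaining mass -> last topic

def sample_document(alpha, beta, topic_word, n_words, rng):
    """Generate one document: draw topic proportions m ~ GD(alpha, beta),
    then for each word draw a topic indicator and a word from that topic."""
    m = sample_gd(alpha, beta, rng)                 # document topic proportions
    topics = rng.choice(len(m), size=n_words, p=m)
    words = [rng.choice(topic_word.shape[1], p=topic_word[z]) for z in topics]
    return topics, np.array(words)

# Hypothetical toy parameters: 3 latent topics (d = 2), vocabulary of 5 words.
alpha = np.array([2.0, 1.5])
beta = np.array([1.0, 2.0])
topic_word = rng.dirichlet(np.ones(5), size=3)      # rows: p(word | topic)
topics, words = sample_document(alpha, beta, topic_word, n_words=20, rng=rng)
```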
The mean, the variance, and the covariance of the GD distribution are as follows [41]:
\[
E[m_i] = \frac{\alpha_i}{\alpha_i + \beta_i} \prod_{k=1}^{i-1} \frac{\beta_k}{\alpha_k + \beta_k}, \qquad
\operatorname{Var}(m_i) = E[m_i] \left( \frac{\alpha_i + 1}{\alpha_i + \beta_i + 1} \prod_{k=1}^{i-1} \frac{\beta_k + 1}{\alpha_k + \beta_k + 1} - E[m_i] \right),
\]
and the covariance between $m_i$ and $m_j$ (for $i < j$) is given by
\[
\operatorname{Cov}(m_i, m_j) = E[m_j] \left( \frac{\alpha_i}{\alpha_i + \beta_i + 1} \prod_{k=1}^{i-1} \frac{\beta_k + 1}{\alpha_k + \beta_k + 1} - E[m_i] \right).
\]
The covariance matrix of the GD distribution offers greater flexibility than that of the Dirichlet distribution, owing to its more general structure. The extra set of parameters provides additional degrees of freedom, which enables the GD distribution to model real-world data more accurately. Indeed, the GD distribution fits count data better than the commonly used Dirichlet distribution [43]. The Dirichlet and GD distributions are both members of the exponential family (Appendix A), and both are conjugate priors to the multinomial distribution. As a result, we can use the following method to learn the model.
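To make this conjugacy concrete, and assuming the standard GD parameterization given above, observing multinomial counts $n_1, \ldots, n_{d+1}$ over the $d+1$ components updates a $\mathrm{GD}(\alpha_1, \beta_1, \ldots, \alpha_d, \beta_d)$ prior into a GD posterior of the same form with
\[
\alpha_i' = \alpha_i + n_i, \qquad \beta_i' = \beta_i + \sum_{l=i+1}^{d+1} n_l, \qquad i = 1, \ldots, d.
\]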
The likelihood for the GDMPCA is given as follows:
Hence, when hidden variables are assigned GD priors, and given a defined universe of words, we use an empirical prior derived from the observed proportions of words in the universe, denoted by f, where $f_j$ is the observed proportion of the j-th word. The prior is then structured as follows:
where the factor of 2 reflects the small prior sample size.
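As a simple illustration of this empirical prior, and assuming a hypothetical document-term count matrix `counts` of shape (documents, vocabulary), the observed word proportions f can be scaled by the prior sample size of 2 as follows:

```python
import numpy as np

def empirical_word_prior(counts, prior_sample_size=2.0):
    """Empirical prior over the vocabulary: observed word proportions f
    scaled by a small prior sample size (the factor of 2 in the text)."""
    f = counts.sum(axis=0).astype(float)   # total count of each word in the universe
    f /= f.sum()                           # observed proportions f_j
    return prior_sample_size * f           # prior pseudo-counts 2 * f_j

# Toy 4-document, 6-word corpus (hypothetical counts).
counts = np.array([[1, 0, 2, 0, 1, 0],
                   [0, 1, 1, 1, 0, 0],
                   [2, 0, 0, 1, 1, 1],
                   [0, 2, 1, 0, 0, 1]])
prior = empirical_word_prior(counts)
```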
First, we calculate the parameters of the GD distribution utilizing the Hessian matrix as described in Appendix B.1.2, following Equations (19) and (20). To find the optimal variational parameters, we minimize the Kullback–Leibler (KL) divergence between the variational distribution and the true posterior distribution. This is achieved through an iterative fixed-point method. We specify the variational parameters as follows:
As an alternative to the true posterior distribution, we determine the variational parameters through a detailed optimization process outlined subsequently. To simplify, Jensen's inequality is applied to establish a lower bound on the log likelihood, which allows certain parameters to be disregarded [44]:
Consequently, Jensen's inequality provides a lower bound on the log likelihood for any given variational distribution.
If the right-hand side of Equation (22) is denoted as the lower bound, the discrepancy between the left- and right-hand sides of this equation is the KL divergence between the variational distribution and the true posterior probabilities. This re-establishes the importance of the variational parameters, leading to the following expression:
As demonstrated in Equation (23), maximizing the lower bound with respect to the variational parameters is equivalent to minimizing the Kullback–Leibler (KL) divergence between the variational posterior and the true posterior. By factorizing the variational distributions, we can write the lower bound as follows:
After that, we can expand Equation (A7) in terms of the model parameters and the variational parameters (A13). In order to find the first variational parameter, we proceed to maximize the bound with respect to it, so we have the following equations:
and therefore, we have
Setting the above equation to zero leads to
Next, we maximize Equation (A13) with respect to the remaining variational parameters. Collecting the terms that contain them gives
Setting the derivative of the above equation to zero leads to the following updated parameters:
The challenge of deriving empirical Bayes estimates for the model parameters is tackled by utilizing the variational lower bound as a substitute for the marginal log probability, with the variational parameters held fixed. The empirical Bayes estimates are then determined by maximizing this lower bound with respect to the model parameters. Until now, our discussion has centered on the log probability of a single document; the overall variational lower bound is computed as the sum of the individual lower bounds from each document. In the M-step, this bound is maximized with respect to the model parameters. Consequently, the entire process is akin to performing coordinate ascent, as outlined in Equation (31). We formulate the update equation for the first model parameter by isolating the relevant terms and incorporating Lagrange multipliers to maximize the bound with respect to it:
To derive the update equation for the second model parameter, we take the derivative of the variational lower bound with respect to it and set this derivative to zero. This step ensures that we find the point where the lower bound is maximized with respect to that parameter.
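For intuition, in the closely related MPCA/LDA setting the Lagrange-multiplier solution for the topic-word probabilities reduces to accumulating the expected word counts given by the variational word-topic responsibilities and renormalizing each topic. The sketch below shows that generic form under this assumption; the inputs `phi` (per-document responsibility arrays) and `docs` (per-document word-index arrays) are hypothetical, and the exact update for our models is the one derived above.

```python
import numpy as np

def m_step_topic_word(phi, docs, n_topics, vocab_size, eps=1e-12):
    """Generic M-step for topic-word probabilities: accumulate expected
    counts E[word j assigned to topic i] and renormalize each topic row.

    phi  : list of arrays, phi[d][n, i] = q(topic i | word n of document d)
    docs : list of integer arrays, docs[d][n] = vocabulary index of word n
    """
    expected = np.zeros((n_topics, vocab_size))
    for phi_d, words_d in zip(phi, docs):
        for i in range(n_topics):
            # add responsibility mass of topic i to the observed word slots
            np.add.at(expected[i], words_d, phi_d[:, i])
    expected += eps                                    # avoid all-zero rows
    return expected / expected.sum(axis=1, keepdims=True)
```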
The updates above converge to a local maximum of the lower bound, which is optimal over all factorized (product-form) approximations of the joint probability. This approach ensures that the variational parameters are adjusted to approximate the true posterior distributions as closely as possible within the constraints of the model.
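Structurally, the whole procedure is the following coordinate-ascent loop, written here as a sketch with the E-step, M-step, and lower-bound computations passed in as callables; these stand in for the updates derived above rather than prescribing a particular implementation.

```python
def variational_em(corpus, params, e_step, m_step, lower_bound,
                   tol=1e-4, max_iter=100):
    """Coordinate ascent on the variational lower bound: alternate per-document
    E-steps and empirical Bayes M-steps until the bound stops improving."""
    previous = None
    for _ in range(max_iter):
        var_params = e_step(corpus, params)        # update variational parameters
        params = m_step(corpus, var_params)        # update model parameters
        bound = lower_bound(corpus, params, var_params)
        if previous is not None and abs(bound - previous) <= tol * max(1.0, abs(previous)):
            break                                  # relative improvement below tolerance
        previous = bound
    return params, var_params
```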
Collapsed Gibbs Sampling Method
Starting from the generative formulation of the GD-based model described above, the joint distribution can be written in the following manner:
Here, the first factor is the GD document prior distribution, governed by its hyperparameter, while the second factor, with its own hyperparameters, represents the GD corpus prior distribution. Bayesian inference seeks to approximate the posterior distribution of the hidden variables z by integrating out the parameters, which can be expressed as follows:
Crucially, the joint distribution can be expressed as a product of Gamma functions, as highlighted in prior research [12,45,46]. This expression facilitates the computation of expectations under the true posterior distribution:
Employing the GD prior results in the posterior calculation as outlined below:
This leads to a posterior probability normalization as follows:
The sequence from Equation (38) to Equation (40) delineates the complete collapsed Gibbs sampling procedure, encapsulated as follows:
The implementation of collapsed Gibbs sampling in our GD-centric model facilitates sampling directly from the true posterior distribution p, as indicated in Equation (41). This sampling technique is more accurate than those employed in variational inference models, which approximate the distribution from which samples are drawn [46,47]. Hence, our model's precision is expected to be superior.
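The per-token sampling step can be sketched as follows. The document-side factor is the GD posterior-mean (predictive) probability of each topic given the document's remaining counts; for simplicity of illustration, the word-side factor assumes a symmetric Dirichlet prior `eta` on the topic-word distributions, whereas in the model above it would come from the GD corpus prior. All array names are hypothetical.

```python
import numpy as np

def gd_predictive(alpha, beta, doc_counts):
    """Posterior-mean (predictive) topic probabilities under a GD prior.
    alpha, beta : GD parameters of length K - 1 (for K topics).
    doc_counts  : remaining topic counts in the document, length K."""
    d = len(alpha)
    tail = np.cumsum(doc_counts[::-1])[::-1]        # tail[i] = counts of topics >= i
    a = alpha + doc_counts[:d]                      # updated alpha_i
    b = beta + tail[1:d + 1]                        # updated beta_i (later topics)
    frac = b / (a + b)
    probs = np.empty(d + 1)
    probs[:d] = (a / (a + b)) * np.concatenate(([1.0], np.cumprod(frac[:-1])))
    probs[d] = np.prod(frac)                        # probability of the last topic
    return probs

def resample_token(w, z, doc_counts, topic_word_counts, alpha, beta, eta, rng):
    """One collapsed Gibbs update for a token with word id w and current topic z.
    Counts are decremented so that they exclude this token, a new topic is
    sampled from the conditional distribution, and counts are restored."""
    doc_counts[z] -= 1
    topic_word_counts[z, w] -= 1
    n_topics, vocab = topic_word_counts.shape
    word_term = (topic_word_counts[:, w] + eta) / (topic_word_counts.sum(axis=1) + eta * vocab)
    p = gd_predictive(alpha, beta, doc_counts) * word_term
    z_new = rng.choice(n_topics, p=p / p.sum())
    doc_counts[z_new] += 1
    topic_word_counts[z_new, w] += 1
    return z_new
```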
Upon the completion of the sampling phase, parameter estimation is conducted using the methodologies discussed.
3.2. Beta-Liouville Multinomial PCA
For the Beta-Liouville Multinomial PCA (BLMPCA) model, we define a corpus as a collection of documents under the same assumptions described in the GDMPCA section. The BLMPCA model generates every single word of a document through the following steps, where c is a binary topic-indicator vector:
In the model described, each topic is represented by a binary variable: the i-th component of c equals 1 if the i-th topic is chosen for the n-th word, and 0 if it is not. The vector c is thus a binary vector representing the topic assignment for a given word. The vector m captures the distribution of topic proportions across the document, ensuring that the proportions across all topics sum to 1.
A chosen topic is associated with a multinomial prior w over the vocabulary, which describes the probability of the j-th word being selected given that the i-th topic is chosen. This formulation allows each word in the document to be drawn randomly from the vocabulary conditioned on the assigned topic. The resulting word probability is therefore a multinomial probability that conditions on the topic assignments and the topic-word distributions, effectively modeling the likelihood of each word in the document given the topic assignments.
Additionally, the prior over m is a d-variate Beta-Liouville distribution with parameters $(\alpha_1, \ldots, \alpha_d, \alpha, \beta)$. The probability density function of this Beta-Liouville distribution encapsulates the prior beliefs about the distribution of topics across documents, accommodating complex dependencies among topics and allowing for flexibility in modeling topic prevalence and co-occurrence within the corpus.
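To illustrate the role of this prior, the sketch below draws a vector of topic proportions from a Beta-Liouville distribution using its standard construction as a Dirichlet vector rescaled by an independent Beta variable; the argument names are generic placeholders for the BL parameter set.

```python
import numpy as np

def sample_beta_liouville(a, alpha, beta, rng):
    """Sample topic proportions from a Beta-Liouville distribution:
    y ~ Dirichlet(a), r ~ Beta(alpha, beta), and the proportions are
    (r * y_1, ..., r * y_d, 1 - r), so the full vector sums to one."""
    y = rng.dirichlet(a)
    r = rng.beta(alpha, beta)
    return np.append(r * y, 1.0 - r)

rng = np.random.default_rng(1)
m = sample_beta_liouville(a=np.array([2.0, 1.0, 3.0]), alpha=4.0, beta=2.0, rng=rng)
# m is a 4-dimensional probability vector over topics for one document.
```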
The Dirichlet distribution is a special case of the BL distribution when $\alpha = \sum_{l=1}^{d} \alpha_l$ [42,45]. The mean, the variance, and the covariance of a BL distribution are as follows [45]:
\[
E[m_l] = \frac{\alpha}{\alpha + \beta}\,\frac{\alpha_l}{\sum_{k=1}^{d} \alpha_k}, \qquad
\operatorname{Var}(m_l) = \frac{\alpha(\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)}\,\frac{\alpha_l(\alpha_l + 1)}{\sum_{k} \alpha_k \,\bigl(\sum_{k} \alpha_k + 1\bigr)} - E[m_l]^2,
\]
and the covariance between $m_l$ and $m_{l'}$ is given by
\[
\operatorname{Cov}(m_l, m_{l'}) = \alpha_l \alpha_{l'} \left( \frac{\alpha(\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)\,\sum_{k} \alpha_k \,\bigl(\sum_{k} \alpha_k + 1\bigr)} - \frac{\alpha^2}{(\alpha + \beta)^2 \bigl(\sum_{k} \alpha_k\bigr)^2} \right).
\]
The earlier equations illustrate that the covariance structure of the Beta-Liouville distribution offers a broader scope than that of the Dirichlet distribution. For the parameter estimation of BLMPCA, the corresponding parameter is first estimated using the Beta-Liouville prior placed on m [19]. The likelihood model for the BLMPCA is given as follows:
For the Beta-Liouville priors, we have the following:
In the following step, we estimate the parameters using the Beta-Liouville prior and the Hessian matrix (Appendix C). As explained in Section 3.1, we estimate the model parameters and the variational parameters according to Equations (21), (22) and (A7). To find the first variational parameter, we proceed to maximize the bound with respect to it, so we have the following equations:
To find the next variational parameter, we proceed to maximize the bound with respect to it:
The next step is to optimize Equation (49) to find the update equations for the variational parameters; we once more separate the terms containing the variational Beta-Liouville parameters. Selecting the terms containing the variational Beta-Liouville variables, we have
and
Setting Equations (52)–(54) to zero, we obtain the following updated parameters:
We address the challenge of deriving empirical Bayes estimates for the model parameters by utilizing the variational lower bound as a substitute for the marginal log likelihood. This approach fixes the variational parameters at the values determined through variational inference. We then optimize this lower bound to obtain the empirical Bayes estimates of the model parameters.
To estimate the model parameters, we formulate the necessary update equations. Maximizing Equation (52) with respect to the first model parameter results in the following equation:
Taking the derivative with respect to this parameter and setting it to zero yields the result given in Appendix C.1:
Beta-Liouville Parameter
The objective of this subsection is to determine estimates of the model's parameters using variational inference techniques [48].
The derivative of the above equation with respect to the BL parameters is given by
From the equations presented earlier, it is evident that the derivative in Equation (52) with respect to each of the BL parameters depends not only on that parameter's own value but also on its interactions with the others. Consequently, we utilize the Newton–Raphson method to address this optimization problem. To implement the Newton–Raphson method effectively, it is essential to first calculate the Hessian matrix over the parameter space, as illustrated below [49]:
The Hessian matrix shown above is very similar to the Hessian matrix of the Dirichlet parameters in the MPCA model and of the generalized Dirichlet parameters in GDMPCA. In fact, the matrix can be divided into two completely separate blocks, one involving the parameters $\alpha_1, \ldots, \alpha_d$ and the other involving $\alpha$ and $\beta$. The parameter derivation for each of the two blocks is then identical to the Newton–Raphson scheme used for MPCA and GDMPCA.
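A minimal sketch of the resulting Newton–Raphson scheme, exploiting this block structure, is given below: the two parameter groups are updated independently by solving their own linear systems. The gradient and Hessian callables are placeholders for the derivatives derived above.

```python
import numpy as np

def newton_raphson_blocks(a, ab, grad_a, hess_a, grad_ab, hess_ab,
                          n_iter=50, tol=1e-6, step=1.0):
    """Block Newton-Raphson for the Beta-Liouville parameters.

    a  : array of Dirichlet-part parameters (alpha_1, ..., alpha_d)
    ab : array of the remaining parameters (alpha, beta)
    Each block is updated with its own gradient and Hessian, since the
    full Hessian is block-diagonal across the two groups.
    """
    for _ in range(n_iter):
        delta_a = np.linalg.solve(hess_a(a), grad_a(a))
        delta_ab = np.linalg.solve(hess_ab(ab), grad_ab(ab))
        a_new = a - step * delta_a
        ab_new = ab - step * delta_ab
        converged = max(np.abs(a_new - a).max(), np.abs(ab_new - ab).max()) < tol
        a, ab = a_new, ab_new
        if converged:
            break
    return a, ab
```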
3.3. Inference via Collapsed Gibbs Sampling
The collapsed Gibbs sampler (CGS) contributes to the inference by estimating posterior distributions through a network of conditional probabilities, which are determined by sampling the hidden variables. Compared to the traditional Gibbs sampler, which operates in the combined space of latent variables and model parameters, the CGS offers significantly faster estimation. The CGS operates within the collapsed space of latent variables, where the model parameters are marginalized out of the joint distribution. This marginalization leads to the marginal joint distribution, which is defined as follows:
Using Equation (63), the method calculates the conditional probability of each latent variable given the current state of all other variables, excluding the variable being sampled [50]. The collapsed Gibbs sampler then determines the topic assignments for the observed words by employing the conditional probability of the latent variables, where the exclusion notation indicates counts or variables with the current variable removed [50]. This conditional probability is defined as follows [51]:
The sampling mechanism of the collapsed Gibbs approach can be summarized as an expectation problem:
The collapsed Gibbs sampling Beta-Liouville multinomial procedure consists of two phases for assigning documents to clusters. First, each document is assigned a random cluster for initialization. Then, over a specified number of iterations, each document is reassigned to a cluster based on the Beta-Liouville distribution.
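The two phases can be sketched as follows, with a hypothetical `conditional_cluster_probs` routine standing in for the Beta-Liouville conditional probability derived below.

```python
import numpy as np

def collapsed_gibbs_clusters(n_docs, n_clusters, conditional_cluster_probs,
                             n_iter=100, seed=0):
    """Phase 1: random cluster initialization. Phase 2: repeatedly resample
    each document's cluster from its conditional probability given all
    other assignments."""
    rng = np.random.default_rng(seed)
    assignments = rng.integers(0, n_clusters, size=n_docs)    # random init
    for _ in range(n_iter):
        for d in range(n_docs):
            p = conditional_cluster_probs(d, assignments)      # excludes document d
            assignments[d] = rng.choice(n_clusters, p=p / p.sum())
    return assignments
```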
The goal is to use a network of conditional probabilities for the individual classes to sample the latent variables from the joint distribution. The assumption of conjugacy allows the integral in Equation (63) to be evaluated.
The likelihood of the multinomial distribution and the probability density function of the Beta-Liouville distribution can be expressed as follows:
By integrating the probability density function of the Beta-Liouville distribution over its parameter and incorporating the updated parameters derived from the remaining integral in Equation (69), we can express it as a ratio of Gamma functions. The following shows the updated parameters, where the count terms correspond to the relevant variables [45,51]:
Equation (67) is then equivalent to
Here, the hyperparameters correspond to the Beta-Liouville distribution, while the count term represents the number of documents in cluster k.
After the sampling process, parameter estimation is performed. Subsequently, the empirical likelihood method [47] is utilized to validate the results on a held-out dataset. Ultimately, this process leads to the estimation of the class-conditional probability within the framework of collapsed Gibbs sampling:
The parameters are then computed as follows:
where S is the number of samples.
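Assuming the estimates are obtained by averaging the per-sample values over S retained Gibbs samples (a common choice), the computation is simply:

```python
import numpy as np

def average_over_samples(samples):
    """Average parameter estimates over S retained Gibbs samples.
    `samples` is an array of shape (S, ...), one entry per sample."""
    samples = np.asarray(samples, dtype=float)
    S = samples.shape[0]              # number of retained samples
    return samples.sum(axis=0) / S    # Monte Carlo estimate of the parameters
```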