Abstract
The Dirichlet distribution, as a multivariate generalization of the beta distribution, is especially important for modeling categorical distributions. Hence, its applications range widely, from modeling the cell probabilities of contingency tables to modeling income inequality. It is also commonly used as the conjugate prior of the multinomial distribution in Bayesian statistics. In this study, the parameters of a bivariate Dirichlet distribution are estimated by entropy formalism. As alternatives to maximum likelihood and the method of moments, two methods based on the principle of maximum entropy are used, namely the ordinary entropy method and the parameter space expansion method. It is shown that in estimating the parameters of the bivariate Dirichlet distribution, the ordinary entropy method and the parameter space expansion method give the same results as the method of maximum likelihood. Thus, we emphasize that these two methods can be used as alternatives in modeling bivariate and multivariate Dirichlet distributions.
Keywords:
Dirichlet distribution; principle of maximum entropy; ordinary entropy method; parameter space expansion method; method of moments; maximum likelihood estimation
MSC:
62H12; 94A17; 1F66; 54C70
1. Introduction
In statistics, the method of moments and maximum likelihood are used frequently, details of which can be found in [1,2]. For a long time, their asymptotic properties have been studied in detail [3]. Since the asymptotic distributions of estimators found by these two methods are normal, they have been proven to be very powerful tools for parameter estimation. However, nowadays, alternative estimation methods based on entropy maximization are applied increasingly frequently.
In 1948, Shannon [4] defined entropy as a numerical measure of the uncertainty, or conversely the information content, associated with a probability distribution. For a random variable X with probability density function f(x), it is mathematically expressed as

H(X) = -\int f(x) \ln f(x) \, dx \quad (1)

for continuous X, where H(X) can be considered the mean value of -\ln f(X). For discrete probability distributions, the integration operator in (1) is simply replaced by the summation operator. Rényi (1961) provided a generalization of Shannon entropy [5]. Rényi entropy is also called α-class entropy. For a discrete case with probabilities p_i, it is defined as

H_\alpha = \frac{1}{1-\alpha} \ln \left( \sum_i p_i^{\alpha} \right), \quad \alpha > 0, \ \alpha \neq 1.
By L'Hôpital's rule,

\lim_{\alpha \to 1} H_\alpha = -\sum_i p_i \ln p_i.
Therefore, Shannon entropy can be evaluated as a special case of Rényi entropy. Another generalization of Shannon entropy was introduced by Constantino Tsallis (1988) [5]. Tsallis entropy is also known as q-class entropy [6]. It is defined as

S_q = \frac{1}{q-1} \left( 1 - \sum_i p_i^{q} \right).
By L'Hôpital's rule,

\lim_{q \to 1} S_q = -\sum_i p_i \ln p_i.
In other words, Tsallis entropy approaches Shannon entropy as q → 1, just as Rényi entropy does as α → 1. Note that for continuous distributions, the summation signs in the defining equations are replaced by integration signs.
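As a quick numerical illustration (not taken from the paper; the distribution p below is an arbitrary example), both generalized entropies can be checked against their Shannon limit:

```python
import math

def shannon(p):
    # Shannon entropy: H = -sum p_i ln p_i
    return -sum(pi * math.log(pi) for pi in p)

def renyi(p, alpha):
    # Renyi entropy: H_alpha = ln(sum p_i^alpha) / (1 - alpha)
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

def tsallis(p, q):
    # Tsallis entropy: S_q = (1 - sum p_i^q) / (q - 1)
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

p = [0.5, 0.3, 0.2]  # arbitrary example distribution
# Both generalizations approach Shannon entropy as their index tends to 1
print(shannon(p))
print(renyi(p, 1.000001))
print(tsallis(p, 1.000001))
```

Evaluating either generalized entropy at an index very close to 1 reproduces the Shannon value to several decimal places.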
Kullback (1959) used entropy and relative entropy as the two key concepts in multivariate statistical analysis [7]. Asymptotic distributions of various entropy measures can be found in [8]. Pardo emphasizes that entropy and relative entropy formulas can be derived as special cases of divergence measures [9].
Entropy-Based Parameter Estimation in Hydrology is the first book to focus on parameter estimation using entropy for a number of distributions frequently used in hydrology [10], including uniform, exponential, normal, two-parameter lognormal, extreme value type I, Weibull, gamma, Pearson, and two-parameter Pareto distributions, among others. Singh also applies entropy theory to some problems of hydraulic and environmental engineering [11,12,13].
The principle of maximum entropy (POME), described by Jaynes as giving "the least biased estimate possible on the given information", can be stated mathematically as follows [14]: Given m linearly independent constraints of the form

C_i = \int g_i(x) f(x) \, dx, \quad i = 1, \dots, m, \quad (6)

where the g_i(x) are some functions whose averages over f(x) are specified, the maximum of the entropy, subject to the conditions in Equation (6), is given by the distribution

f(x) = \exp\left( -\lambda_0 - \sum_{i=1}^{m} \lambda_i g_i(x) \right), \quad (7)

where \lambda_0, \lambda_1, \dots, \lambda_m are Lagrange multipliers, which can be determined from Equations (6) and (7) together with the normalization condition \int f(x) \, dx = 1.
The general procedure for entropy-based parameter estimation involves (1) defining given information in terms of constraints, (2) maximizing entropy subject to given information, and (3) relating parameters to the given information. In this procedure, Lagrange multipliers are related to the constraints on one hand and to the distribution parameters on the other. One can eliminate the Lagrange multipliers and obtain parameter estimations as well.
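As an illustration of this procedure (Jaynes's classic die example, not part of the paper's derivation), suppose the only given information is the mean E[X] = 4.5 of a die with faces 1–6. By POME, the least biased pmf has the exponential form of Equation (7) with a single Lagrange multiplier, which can be found by bisection:

```python
import math

def maxent_die(target_mean, faces=range(1, 7)):
    """Maximum entropy pmf on the given faces subject to a mean constraint.

    By POME the solution is p_i = exp(-lam * i) / Z; we solve for lam by
    bisection, using the fact that the mean is decreasing in lam.
    """
    def mean_for(lam):
        w = [math.exp(-lam * i) for i in faces]
        z = sum(w)
        return sum(i * wi for i, wi in zip(faces, w)) / z

    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean_for(mid) > target_mean:
            lo = mid  # mean too large -> need larger lam
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(-lam * i) for i in faces]
    z = sum(w)
    return [wi / z for wi in w]

p = maxent_die(4.5)
print(p)  # probability tilts toward the higher faces
```

The resulting pmf is the exponential-family distribution predicted by Equation (7): steps (1)–(3) correspond to fixing the constraint, solving for the multiplier, and reading off the distribution.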
The parameter space expansion method was developed by Singh and Rajagopal (1986). It differs from the ordinary entropy method in that it employs an enlarged parameter space and maximizes entropy with respect to both the parameters and the Lagrange multipliers [15]. The method works as follows: for the given distribution, the constraints are first defined, and the POME formulation is obtained in terms of the parameters to be estimated and the Lagrange multipliers. After the maximization procedure, the parameter estimates are obtained.
Entropy-based models have been used intensively for parameter estimation in recent years. For example, Song et al. examined two entropy-based methods, both using the POME, for the estimation of the parameters of the four-parameter exponential gamma distribution [16]. Hao and Singh applied two entropy-based methods, also using the POME, for the estimation of the parameters of the extended Burr XII distribution [17]. Singh and Deng revisited the four-parameter kappa distribution, presented an entropy-based method for estimating its parameters, and compared its performance with that of maximum likelihood estimation, the method of moments, and L-moments [18]. Gao and Han used the maximum entropy method to construct a concrete solution to a special nonlinear expectation problem in a special parameter space and analyzed the convergence of the maximum entropy solution [19].
The objective of the present paper is to apply ordinary entropy and parameter space expansion to estimate the parameters of a bivariate Dirichlet distribution as an alternative to the known methods, and then to compare them with those estimated by the maximum likelihood method and method of moments.
2. Dirichlet Distribution
The beta distribution plays an important role in Bayesian statistics, especially in modeling the parameters of the Bernoulli distribution [20]. The Dirichlet distribution is a multivariate generalization of the beta distribution. Thus, the Dirichlet distribution and the generalized Dirichlet distribution can both be used as a conjugate prior for a multinomial distribution [21].
Let X = (X_1, \dots, X_k) be a random vector with k components, with x_i \geq 0 for i = 1, \dots, k and x_1 + \cdots + x_k \leq 1. Also, let \alpha = (\alpha_1, \dots, \alpha_{k+1}), where \alpha_i > 0 for each i. The probability density function (pdf) of the Dirichlet distribution is given as

f(x_1, \dots, x_k) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k+1} \Gamma(\alpha_i)} \left( \prod_{i=1}^{k} x_i^{\alpha_i - 1} \right) \left( 1 - \sum_{i=1}^{k} x_i \right)^{\alpha_{k+1} - 1}, \quad (8)

where \alpha_0 = \alpha_1 + \cdots + \alpha_{k+1} and \Gamma(\cdot) is Euler's gamma function, given by the formula \Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt or, for a positive integer n, \Gamma(n) = (n-1)!.
It can be noted that the marginals of this Dirichlet distribution are beta distributions [22], namely X_i \sim \mathrm{Beta}(\alpha_i, \alpha_0 - \alpha_i). The moments are given by

E[X_i] = \frac{\alpha_i}{\alpha_0}, \quad \mathrm{Var}(X_i) = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)}, \quad \mathrm{Cov}(X_i, X_j) = -\frac{\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)}, \ i \neq j.
For further properties, one may refer to [22,23,24].
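These moment formulas can be verified empirically; the sketch below assumes numpy is available and uses arbitrarily chosen parameters:

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])  # arbitrary example parameters
a0 = alpha.sum()

# Closed-form Dirichlet moments: E[X_i] and Var(X_i)
mean = alpha / a0
var = alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1.0))

# Empirical check against a large Dirichlet sample
rng = np.random.default_rng(0)
sample = rng.dirichlet(alpha, size=200_000)
print(mean, sample.mean(axis=0))
print(var, sample.var(axis=0))
```

The sample means and variances agree with the closed-form expressions to within sampling error.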
3. Ordinary Entropy Method
In the ordinary entropy method, there are three steps in parameter estimation: (1) specification of appropriate constraints, (2) derivation of the entropy function of the distribution, and (3) derivation of the relations between parameters and constraints.
3.1. Specification of Constraints
Taking the natural logarithm of Equation (8), we obtain

\ln f(x) = \ln \Gamma(\alpha_0) - \sum_{i=1}^{k+1} \ln \Gamma(\alpha_i) + \sum_{i=1}^{k} (\alpha_i - 1) \ln x_i + (\alpha_{k+1} - 1) \ln \left( 1 - \sum_{i=1}^{k} x_i \right).
3.2. Construction of the Partition Function and Zeroth Lagrange Multiplier
The least biased pdf, consistent with Equations (15)–(17) and given by POME, takes the following form:

f(x) = \exp\left( -\lambda_0 - \sum_{i=1}^{k} \lambda_i \ln x_i - \lambda_{k+1} \ln \left( 1 - \sum_{i=1}^{k} x_i \right) \right), \quad (18)

where \lambda_0, \lambda_1, \dots, \lambda_{k+1} are Lagrange multipliers. Substituting (18) into (15) yields
The zeroth Lagrange multiplier is obtained from Equation (21) as
The zeroth Lagrange multiplier is also obtained from (20) as
3.3. Relation between Lagrange Multipliers and Constraints
Continuing in this way until the last term,
Furthermore,
Similarly, for the k-th term,
where \psi(\cdot) is the digamma function, which is defined as \psi(z) = \frac{d}{dz} \ln \Gamma(z) = \frac{\Gamma'(z)}{\Gamma(z)} [25].
Proceeding in the same way until the (k+1)-th term,
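The digamma function appearing in these relations is available in scipy; a quick finite-difference check (illustrative, with an arbitrary evaluation point) confirms that it is the derivative of ln Γ:

```python
import math
from scipy.special import digamma

# psi(z) is the derivative of ln Gamma(z); check via a central difference
z, h = 2.5, 1e-6
numeric = (math.lgamma(z + h) - math.lgamma(z - h)) / (2 * h)
print(digamma(z), numeric)  # the two values agree closely
```

As a further sanity check, digamma(1) equals the negative of the Euler–Mascheroni constant.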
3.4. Relation between Lagrange Multipliers and Parameters
3.5. Relation between Parameters and Constraints
The parameters of the Dirichlet distribution are related to the Lagrange multipliers. In turn, these parameters are related to the known constraints by Equations (31)–(34). By eliminating the Lagrange multipliers from these sets of equations, we obtain an alternative presentation, as shown below:
3.6. Distribution Entropy
4. Parameter Space Expansion Method
4.1. Specification of Constraints
4.2. Derivation of the Entropy Function
4.3. Relation between Parameters and Constraints
5. Two Other Parameter Estimation Methods
5.1. Method of Moments
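A standard moment-based construction for the Dirichlet distribution (shown here as an illustrative sketch with assumed parameter values) matches the sample means and the variance of the first component, using E[X_i] = α_i/α_0 and Var(X_1) = α_1(α_0 − α_1)/(α_0²(α_0 + 1)), which give α̂_0 = m̄_1(1 − m̄_1)/s_1² − 1 and α̂_i = m̄_i α̂_0:

```python
import numpy as np

def dirichlet_moment_estimates(x):
    """Moment estimates for a Dirichlet sample.

    x has shape (n, k); the implicit (k+1)-th component is 1 minus the
    sum of the others. Solves the variance equation of the first
    component for alpha_0, then scales the sample means.
    """
    m = x.mean(axis=0)
    v1 = x[:, 0].var(ddof=1)
    a0 = m[0] * (1.0 - m[0]) / v1 - 1.0        # alpha_0 from Var(X_1)
    return np.append(m, 1.0 - m.sum()) * a0    # alpha_i = m_i * alpha_0

rng = np.random.default_rng(1)
full = rng.dirichlet([2.0, 3.0, 5.0], size=50_000)  # arbitrary true parameters
print(dirichlet_moment_estimates(full[:, :2]))  # close to (2, 3, 5)
```

These estimates are cheap to compute and, as noted later, serve well as starting values for the nonlinear likelihood equations.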
5.2. Method of Maximum Likelihood Estimation
The likelihood function L for a random sample of size n from the bivariate Dirichlet distribution is

L(\alpha_1, \alpha_2, \alpha_3) = \prod_{j=1}^{n} \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)} x_{1j}^{\alpha_1 - 1} x_{2j}^{\alpha_2 - 1} (1 - x_{1j} - x_{2j})^{\alpha_3 - 1},

where \alpha_0 = \alpha_1 + \alpha_2 + \alpha_3. Then the log-likelihood function, \ln L, is

\ln L = n \ln \Gamma(\alpha_0) - n \sum_{i=1}^{3} \ln \Gamma(\alpha_i) + (\alpha_1 - 1) \sum_{j=1}^{n} \ln x_{1j} + (\alpha_2 - 1) \sum_{j=1}^{n} \ln x_{2j} + (\alpha_3 - 1) \sum_{j=1}^{n} \ln (1 - x_{1j} - x_{2j}).
Differentiating Equation (59) with respect to the parameters \alpha_1, \alpha_2, and \alpha_3, respectively, and equating each derivative to zero yield the following likelihood equations:

\psi(\alpha_i) - \psi(\alpha_0) = \frac{1}{n} \sum_{j=1}^{n} \ln x_{ij}, \quad i = 1, 2, 3,

where x_{3j} = 1 - x_{1j} - x_{2j}.
These results are the same as those found by the ordinary entropy method and parameter space expansion method.
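The likelihood equations have no closed-form solution, but they can be solved numerically; the sketch below (assuming scipy is available; the data and starting values are illustrative) applies a root finder to ψ(α_i) − ψ(α_0) = (1/n) Σ_j ln x_{ij}:

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.special import digamma

rng = np.random.default_rng(2)
x = rng.dirichlet([2.0, 3.0, 5.0], size=20_000)  # arbitrary true parameters
logx_mean = np.log(x).mean(axis=0)               # (1/n) sum_j ln x_ij

def likelihood_equations(alpha):
    # psi(alpha_i) - psi(alpha_0) = mean of ln x_i, for each component i
    return digamma(alpha) - digamma(alpha.sum()) - logx_mean

mle = fsolve(likelihood_equations, np.ones(3))  # start from (1, 1, 1)
print(mle)  # close to (2, 3, 5)
```

In practice, starting the root finder from moment estimates rather than (1, 1, 1) speeds up and stabilizes convergence, which is the strategy used in the simulations below.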
The maximum likelihood (ML) method provides single point estimates of the model parameters, overlooking the residual uncertainty inherent in the estimation process. The Bayesian method, by contrast, yields posterior probability distributions over the entire parameter space by combining the observed data with prior distributions [26]. Broadly speaking, compared with ML estimation, Bayesian parameter estimation can yield a robust and stable estimate because it incorporates the accompanying uncertainty into the estimation process, which is particularly valuable when the amount of observed data is limited [27]. The Dirichlet distribution, being a member of the exponential family, possesses a conjugate prior. Nevertheless, due to the intricate form of the resulting posterior distribution, its practical utility is limited, and Bayesian estimation for the Dirichlet distribution generally lacks analytical tractability. To address this, Ma employed an approximation approach, modeling the parameter distribution of the Dirichlet distribution with a multivariate Gaussian distribution within the expectation propagation (EP) framework [28]. Furthermore, some studies in reliability engineering estimate parameters and determine quantiles by applying the maximum likelihood method, such as [29].
6. Simulation and Comparison of Parameter Estimation Methods
Simulation from the Dirichlet distribution can be performed in two steps: by the probability integral transform, for any continuous random variable with distribution function F, the transformed variable F(X) is uniformly distributed on (0, 1). Then, by the inverse distribution function of the gamma distribution, one may simulate as many independent gamma variates as needed. In other words, one first simulates independent gamma variates Y_i \sim \mathrm{Gamma}(\alpha_i, 1) and calculates X_i = Y_i / (Y_1 + \cdots + Y_k), i = 1, \dots, k. Then the random vector (X_1, \dots, X_k) follows the Dirichlet distribution with parameter vector (\alpha_1, \dots, \alpha_k) [30]. This procedure can easily be carried out even in Microsoft Excel.
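The two-step procedure can be sketched in Python (using numpy's gamma generator in place of Excel's inverse gamma function; the parameters are illustrative):

```python
import numpy as np

def dirichlet_via_gamma(alpha, size, rng=None):
    """Simulate Dirichlet vectors by normalizing independent gamma variates.

    Draw Y_i ~ Gamma(alpha_i, 1) independently and set X_i = Y_i / sum(Y).
    """
    rng = rng or np.random.default_rng()
    y = rng.gamma(shape=alpha, size=(size, len(alpha)))
    return y / y.sum(axis=1, keepdims=True)

x = dirichlet_via_gamma([2.0, 3.0, 5.0], size=100_000,
                        rng=np.random.default_rng(3))
print(x.sum(axis=1)[:3])  # each row sums to 1
print(x.mean(axis=0))     # close to alpha / alpha_0 = (0.2, 0.3, 0.5)
```

Each simulated row lies on the probability simplex, and the empirical means match the theoretical values α_i/α_0.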
In the present study, we first simulated 1000 pairs from the Dirichlet distribution for arbitrarily chosen parameter values. We obtained estimates by all four methods, but since the maximum likelihood estimators and the estimators obtained by the ordinary entropy method and the parameter space expansion method are all the same, the comparison is between the moment estimates and the rest. We then repeated this experiment 5000 times. The summary statistics are shown in Table 1:
Table 1.
Results of some simulations (1000 runs, 5000 runs).
Note that the maximum likelihood estimates (MLEs) are obtained by Excel Solver. In general, the moment estimates and the MLEs are close to each other. Absolute percentage errors (APEs) are calculated by the following formula:

\mathrm{APE} = \frac{|\hat{\theta} - \theta|}{\theta} \times 100.
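For concreteness, the APE of an estimate relative to the true parameter value can be computed with a one-line helper (illustrative, not the paper's spreadsheet code):

```python
def ape(estimate, true_value):
    # absolute percentage error of an estimate relative to the true value
    return abs(estimate - true_value) / true_value * 100.0

print(round(ape(2.1, 2.0), 6))  # 5.0
```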
It can then be inferred that no single method dominates all the time; i.e., there are some instances in which the moment estimators perform better and others in which the maximum likelihood estimators do. In any case, it is evident from the table that increasing the number of simulations considerably increases precision.
Note that, by the central limit theorem, moment estimators are expected to be distributed normally for a large number of observations (or simulations) since a moment estimator considers a sum of random observations (or the sum of some power of these random observations). Maximum likelihood estimators also have the asymptotic normality property with lower variances. For the Dirichlet distribution, we found that the entropy estimators mentioned above and maximum likelihood estimators are identical. Therefore, entropy estimators also have the asymptotic normality property.
Finally, we note that the selection of the parameter values is quite arbitrary and serves only to illustrate that maximum likelihood estimators (and maximum entropy estimators) can perform better, i.e., show lower sampling variability than moment estimators. Actually, this was not the case all of the time in the simulations. Since, in our study, the initial estimates for maximum likelihood (and maximum entropy) are provided by the method of moments, and since these initial estimates are close enough to the actual parameters, a great improvement in sampling variability may not be achieved. This is probably due to the nature of nonlinear estimation. To obtain a better picture, it may be meaningful first to simulate a random vector several times, then calculate moment estimates, and then, based on these initial estimates, move on to the maximum likelihood estimates.
7. Conclusions
In the present study, parameter estimates of the Dirichlet distribution are obtained by four methods. For a Dirichlet distribution with three parameters, the parameter estimates found by the entropy methods considered here and by maximum likelihood are almost the same. Maximum likelihood estimators are consistent, most efficient, sufficient, tend to normality as the sample size increases, and are invariant under functional transformations [31]. Therefore, the parameter estimators found by the entropy methodology have the same appealing properties. Since a sample moment tends to concentrate around the corresponding population moment for larger samples, sample moments can also be used to estimate population moments [1]. In general, moment estimators are asymptotically normally distributed and consistent; however, their variance may be larger than that of estimators derived by other methods [32]. It may therefore be a good idea to start nonlinear estimation, whether by maximum likelihood or by entropy maximization, from initial moment estimates. In the present study, we started with moment estimates of a Dirichlet distribution with arbitrarily selected parameters to demonstrate that better parameter estimates (i.e., estimates with both lower bias and lower sampling variability) can be achieved. The simulation part of this work can be extended by selecting the parameters randomly in further simulations for broader generalizations.
Author Contributions
All authors made the same contributions. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
All results presented in the article were produced from model simulations. Therefore, there are no data to be made available. Researchers who wish to replicate the study should use Microsoft Excel and the parameters described in the article. With those parameters, researchers can use modeling simulations to replicate the tables and figures presented in the article.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Mood, A.M.; Graybill, F.A.; Boes, D. Introduction to the Theory of Statistics; McGraw-Hill: New York, NY, USA, 1974.
- Casella, G.; Berger, R.L. Statistical Inference, 2nd ed.; Duxbury Advanced Series; Cengage Learning: Pacific Grove, CA, USA, 2002.
- Dasgupta, A. Asymptotic Theory of Statistics and Probability; Springer: New York, NY, USA, 2002.
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
- Rényi, A. On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Volume 1, pp. 547–561.
- Ullah, A. Entropy, divergence and distance measures with econometric applications. J. Stat. Plan. Inference 1996, 49, 137–162.
- Kullback, S. Information Theory and Statistics; Dover Publications: New York, NY, USA, 1978.
- Esteban, M.D.; Morales, D. A summary on entropy statistics. Kybernetika 1995, 1, 337–346.
- Pardo, L. Statistical Inference Based on Divergence Measures; Chapman & Hall/CRC: New York, NY, USA, 2006.
- Singh, V.P. Entropy-Based Parameter Estimation in Hydrology; Kluwer Academic Publishers: Boston, MA, USA, 1998.
- Singh, V.P. Entropy Theory and Its Application in Environmental and Water Engineering; John Wiley and Sons: West Sussex, UK, 2013.
- Singh, V.P. Entropy Theory in Hydraulic Engineering: An Introduction; ASCE Press: Reston, VA, USA, 2015.
- Singh, V.P. Entropy Theory in Hydrologic Science and Engineering; McGraw-Hill Education: New York, NY, USA, 2014.
- Jaynes, E.T. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 2002.
- Singh, V.P.; Rajagopal, A.K. A new method of parameter estimation for hydrologic frequency analysis. Hydrol. Sci. Technol. 1986, 3, 33–40.
- Song, S.; Song, X.; Kang, Y. Entropy-based parameter estimation for the four-parameter exponential gamma distribution. Entropy 2017, 19, 189.
- Hao, Z.; Singh, V.P. Entropy-based parameter estimation for extended Burr XII distribution. Stoch. Environ. Res. Risk Assess. 2009, 23, 1113–1122.
- Singh, V.P.; Deng, Z.Q. Entropy-based parameter estimation for kappa distribution. J. Hydrol. Eng. 2003, 8, 81–92.
- Gao, L.; Han, D. Methods of moment and maximum entropy for solving nonlinear expectation. Mathematics 2019, 7, 45.
- DeGroot, M.; Schervish, M. Probability and Statistics, 4th ed.; Addison-Wesley: Boston, MA, USA, 2002.
- Press, S.J. Applied Multivariate Analysis: Using Bayesian and Frequentist Methods of Inference; Dover Publications: Mineola, NY, USA, 1981.
- Lin, J. On the Dirichlet Distribution. Master's Thesis, Queen's University, Kingston, ON, Canada, 2016.
- Robin, K.S. A generalization of the Dirichlet distribution. J. Stat. Softw. 2010, 33, 1–18.
- Bilodeau, M.; Brenner, D. Theory of Multivariate Statistics; Springer: New York, NY, USA, 1999.
- Abramowitz, M.; Stegun, I.A. Handbook of Mathematical Functions; Dover Publications: Washington, DC, USA, 1964.
- Bishop, C. Pattern Recognition and Machine Learning. J. Electron. Imaging 2006, 4, 049901.
- Ma, Z.; Rana, P.K.; Taghia, J.; Flierl, M.; Leijon, A. Bayesian estimation of Dirichlet mixture model with variational inference. Pattern Recognit. 2014, 47, 3143–3157.
- Ma, Z. Bayesian estimation of the Dirichlet distribution with expectation propagation. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; pp. 689–693.
- Zhuang, L.; Xu, A.; Wang, X.L. A prognostic driven predictive maintenance framework based on Bayesian deep learning. Reliab. Eng. Syst. Saf. 2023, 234, 109181.
- Devroye, L. Non-Uniform Random Variate Generation; Springer Science+Business Media: New York, NY, USA, 1986.
- Keeping, E.S. Introduction to Statistical Inference; Dover Publications: New York, NY, USA, 1995; pp. 126–127.
- Hines, W.W.; Montgomery, D.C.; Goldsman, D.M.; Borror, C.M. Probability and Statistics in Engineering, 4th ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2008; pp. 222–225.
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).