Families of Alpha- Beta- and Gamma- Divergences: Flexible and Robust Measures of Similarities
Abstract
1. Introduction
- $D(p\,\|\,q) \ge 0$, with equality if and only if $p = q$ (nonnegativity and positive definiteness),
- $D(p\,\|\,q) = D(q\,\|\,p)$ (symmetry),
- $D(p\,\|\,z) \le D(p\,\|\,q) + D(q\,\|\,z)$ (subadditivity/triangle inequality). A brief numerical illustration of these conditions is given below.
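The following minimal numerical sketch (not taken from the paper; the helper functions kl and hellinger and the test vectors are ad hoc) illustrates these conditions: the Kullback–Leibler divergence is nonnegative but violates symmetry and, in general, the triangle inequality, whereas the Hellinger distance satisfies all three and is therefore a true metric.

```python
# Minimal sketch: check the metric conditions numerically for two classical measures.
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence D_KL(p || q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def hellinger(p, q):
    """Discrete Hellinger distance (a true metric)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.6, 0.3])
z = np.array([0.3, 0.4, 0.3])

print("KL nonnegative:        ", kl(p, q) >= 0)                                 # True
print("KL symmetric:          ", np.isclose(kl(p, q), kl(q, p)))                # False
print("KL triangle inequality:", kl(p, z) <= kl(p, q) + kl(q, z))               # holds here, but not in general
print("Hellinger symmetric:   ", np.isclose(hellinger(p, q), hellinger(q, p)))  # True
```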
2. Family of Alpha-Divergences
2.1. Asymmetric Alpha-Divergences
- Nonnegativity: The Csiszár–Morimoto f-divergence is always nonnegative, and equal to zero if and only if the probability densities $p(x)$ and $q(x)$ coincide. This follows immediately from Jensen's inequality (for normalized densities): $D_f(p\,\|\,q) = \int q(x)\, f\!\left(\tfrac{p(x)}{q(x)}\right)\mathrm{d}x \ \ge\ f\!\left(\int q(x)\,\tfrac{p(x)}{q(x)}\,\mathrm{d}x\right) = f(1) = 0.$
- Generalized entropy: The f-divergence corresponds to a generalized f-entropy.
- Convexity: For any $0 \le \lambda \le 1$, the f-divergence is jointly convex in its arguments: $D_f\big(\lambda p_1 + (1-\lambda) p_2 \,\|\, \lambda q_1 + (1-\lambda) q_2\big) \le \lambda\, D_f(p_1\,\|\,q_1) + (1-\lambda)\, D_f(p_2\,\|\,q_2)$.
- Scaling: For any positive constant $c > 0$ we have $D_{cf}(p\,\|\,q) = c\, D_f(p\,\|\,q)$, so the generating function $f$ is determined only up to a positive scaling factor.
- Symmetricity: For an arbitrary Csiszár–Morimoto f-divergence, it is possible to construct a symmetric divergence for $f_{\mathrm{sym}}(u) = f(u) + u\, f(1/u)$, which gives $D_{f_{\mathrm{sym}}}(p\,\|\,q) = D_f(p\,\|\,q) + D_f(q\,\|\,p)$.
- Convexity: $D_A^{(\alpha)}(p\,\|\,q)$ is convex with respect to both $p$ and $q$.
- Strict Positivity: $D_A^{(\alpha)}(p\,\|\,q) \ge 0$, and $D_A^{(\alpha)}(p\,\|\,q) = 0$ if and only if $p = q$.
- Continuity: The Alpha-divergence is a continuous function of the real variable α over the whole range, including the singular points α = 0 and α = 1, where it is defined by its limit values.
- Duality: $D_A^{(\alpha)}(p\,\|\,q) = D_A^{(1-\alpha)}(q\,\|\,p)$ (this identity is verified numerically in the sketch following this list).
- Exclusive/Inclusive properties [42]:
- For $\alpha \le 0$, the estimation of $q(x)$ that approximates $p(x)$ is exclusive, that is, $\hat q(x) \le p(x)$ for all $x$. This means that the minimization of $D_A^{(\alpha)}(p\,\|\,q)$ with respect to $q$ will force $\hat q(x)$ to be an exclusive approximation, i.e., the mass of $\hat q(x)$ will lie within $p(x)$ (see details and graphical illustrations in [42]).
- For $\alpha \ge 1$, the estimation of $q(x)$ that approximates $p(x)$ is inclusive, that is, $\hat q(x) \ge p(x)$ for all $x$. In other words, the mass of $\hat q(x)$ includes all the mass of $p(x)$.
- Zero-forcing and zero-avoiding properties [42]: Here, we treat the case where $p(x)$ and $q(x)$ are not necessarily mutually absolutely continuous. In such a case the divergence may diverge to ∞. However, the following two properties hold:
- For $\alpha \le 0$, the estimation of $q(x)$ that approximates $p(x)$ is zero-forcing (coercive), that is, $p(x) = 0$ forces $\hat q(x) = 0$.
- For $\alpha \ge 1$, the estimation of $q(x)$ that approximates $p(x)$ is zero-avoiding, that is, $p(x) > 0$ implies $\hat q(x) > 0$.
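The duality and the limiting behaviour of the Alpha-divergence can be checked with the short numerical sketch below. It assumes one common parameterization for positive (not necessarily normalized) measures, $D_A^{(\alpha)}(p\,\|\,q) = \frac{1}{\alpha(\alpha-1)}\sum_i\big(p_i^\alpha q_i^{1-\alpha} - \alpha p_i + (\alpha-1) q_i\big)$; the paper's own normalization may differ, and the function name alpha_div is ad hoc.

```python
# Hedged sketch: one common parameterization of the Alpha-divergence for
# positive measures; verifies the duality D_alpha(p||q) = D_{1-alpha}(q||p)
# and the generalized Kullback-Leibler limits at alpha -> 1 and alpha -> 0.
import numpy as np

def alpha_div(p, q, alpha, eps=1e-12):
    p, q = np.asarray(p, float), np.asarray(q, float)
    if abs(alpha) < eps:        # limit alpha -> 0: reverse generalized KL
        return float(np.sum(q * np.log(q / p) - q + p))
    if abs(alpha - 1.0) < eps:  # limit alpha -> 1: generalized KL
        return float(np.sum(p * np.log(p / q) - p + q))
    c = 1.0 / (alpha * (alpha - 1.0))
    return float(c * np.sum(p**alpha * q**(1 - alpha) - alpha * p + (alpha - 1) * q))

p = np.array([0.5, 1.5, 2.0])   # unnormalized positive measures
q = np.array([1.0, 2.0, 0.5])

for a in (-0.5, 0.5, 2.0):
    d, dual = alpha_div(p, q, a), alpha_div(q, p, 1 - a)
    print(f"alpha={a:5}: D={d:.6f}  dual={dual:.6f}  equal={np.isclose(d, dual)}")

# Continuity at the singular point alpha = 1 (the limit is the generalized KL divergence).
print(np.isclose(alpha_div(p, q, 1 - 1e-7), alpha_div(p, q, 1.0), atol=1e-5))
```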
2.2. Alpha-Rényi Divergence
2.3. Extended Family of Alpha-Divergences
Table: members of the extended family of Alpha-divergences together with their corresponding Csiszár generating functions $f(u)$ (individual rows not reproduced here).
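To make the role of the Csiszár generating function concrete, the hedged sketch below evaluates $D_f(p\,\|\,q) = \sum_i q_i f(p_i/q_i)$ for a few classical generators; these are standard textbook choices rather than a reproduction of the table above, and the names in the dictionary are illustrative only.

```python
# Hedged sketch: Csiszar-Morimoto f-divergence D_f(p||q) = sum_i q_i f(p_i/q_i)
# with a few classical generating functions f (each convex with f(1) = 0).
import numpy as np

def f_divergence(p, q, f):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

generators = {
    "generalized KL":    lambda u: u * np.log(u) - u + 1,
    "Pearson chi^2":     lambda u: (u - 1) ** 2,
    "squared Hellinger": lambda u: (np.sqrt(u) - 1) ** 2,
    "total variation":   lambda u: 0.5 * np.abs(u - 1),
}

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
for name, f in generators.items():
    print(f"{name:18s} D_f = {f_divergence(p, q, f):.4f}")
```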
2.4. Symmetrized Alpha-Divergences
- Symmetric Jensen–Shannon divergence [62,64]
- Arithmetic-Geometric divergence [39]
- Symmetric Chi-square divergence [54] (standard forms of these three symmetrized divergences are sketched numerically below)
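A hedged numerical sketch of these three symmetrized divergences, using their standard textbook forms (the paper's scaling constants may differ; all function names are ad hoc):

```python
# Hedged sketch: standard forms of three symmetrized divergences.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def jensen_shannon(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def arithmetic_geometric(p, q):
    # Kullback-Leibler-type divergence between the arithmetic and geometric means
    return float(np.sum(0.5 * (p + q) * np.log(0.5 * (p + q) / np.sqrt(p * q))))

def symmetric_chi_square(p, q):
    # chi^2(p||q) + chi^2(q||p) = sum (p - q)^2 (p + q) / (p q)
    return float(np.sum((p - q) ** 2 * (p + q) / (p * q)))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
print("Jensen-Shannon      :", jensen_shannon(p, q))
print("Arithmetic-Geometric:", arithmetic_geometric(p, q))
print("Symmetric Chi-square:", symmetric_chi_square(p, q))
```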
3. Family of Beta-Divergences
- Convexity: The Bregman divergence $D_\phi(p\,\|\,q)$ is always convex in the first argument $p$, but is often not convex in the second argument $q$.
- Nonnegativity: The Bregman divergence is nonnegative, $D_\phi(p\,\|\,q) \ge 0$, with zero if and only if $p = q$.
- Linearity: Any positive linear combination of Bregman divergences is also a Bregman divergence, i.e., $D_{c_1\phi_1 + c_2\phi_2}(p\,\|\,q) = c_1 D_{\phi_1}(p\,\|\,q) + c_2 D_{\phi_2}(p\,\|\,q)$ for $c_1, c_2 > 0$.
- The three-point property generalizes the “Law of Cosines”: $D_\phi(p\,\|\,r) = D_\phi(p\,\|\,q) + D_\phi(q\,\|\,r) - \langle p - q,\ \nabla\phi(r) - \nabla\phi(q)\rangle$.
- Generalized Pythagoras Theorem: If $q$ is the Bregman projection of $r$ onto a convex set $\Omega$ and $p \in \Omega$, then $D_\phi(p\,\|\,r) \ge D_\phi(p\,\|\,q) + D_\phi(q\,\|\,r)$, with equality when $\Omega$ is an affine set. A small numerical sketch of the Bregman construction follows this list.
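The hedged sketch below illustrates the Bregman construction $D_\phi(p\,\|\,q) = \phi(p) - \phi(q) - \langle\nabla\phi(q),\, p - q\rangle$ with two classical generators, showing that the squared Euclidean distance and the generalized Kullback–Leibler divergence both belong to this family; the generator pairs are illustrative choices.

```python
# Hedged sketch: Bregman divergence D_phi(p||q) = phi(p) - phi(q) - <grad phi(q), p - q>
# instantiated with two classical generators.
import numpy as np

def bregman(p, q, phi, grad_phi):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(phi(p) - phi(q) - np.dot(grad_phi(q), p - q))

# phi(x) = 0.5 ||x||^2       ->  squared Euclidean distance 0.5 ||p - q||^2
sq_norm = (lambda x: 0.5 * np.dot(x, x), lambda x: x)
# phi(x) = sum x log x - x   ->  generalized KL divergence sum p log(p/q) - p + q
neg_entropy = (lambda x: float(np.sum(x * np.log(x) - x)), lambda x: np.log(x))

p = np.array([0.5, 1.5, 2.0])
q = np.array([1.0, 1.0, 1.0])
print("Euclidean generator:", bregman(p, q, *sq_norm))      # equals 0.5 * sum((p - q)**2)
print("Entropy generator  :", bregman(p, q, *neg_entropy))  # equals sum(p*log(p/q) - p + q)
```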
3.1. Generation of Family of Beta-divergences Directly from Family of Alpha-Divergences
Table: Alpha-divergences and the Beta-divergences generated from them (individual rows not reproduced here).
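The Beta-divergence family referred to above can be sketched numerically as follows. The sketch uses one common parameterization, $d_\beta(p,q) = \frac{1}{\beta(\beta-1)}\sum_i\big(p_i^\beta + (\beta-1)q_i^\beta - \beta p_i q_i^{\beta-1}\big)$, in which β = 2 gives the squared Euclidean distance (up to a factor 1/2), β → 1 the generalized Kullback–Leibler divergence and β → 0 the Itakura–Saito divergence; the paper's own indexing of β may be shifted.

```python
# Hedged sketch: one common parameterization of the Beta-divergence and its
# classical limiting cases (Itakura-Saito, generalized KL, squared Euclidean).
import numpy as np

def beta_div(p, q, beta, eps=1e-12):
    p, q = np.asarray(p, float), np.asarray(q, float)
    if abs(beta) < eps:         # beta -> 0: Itakura-Saito divergence
        r = p / q
        return float(np.sum(r - np.log(r) - 1.0))
    if abs(beta - 1.0) < eps:   # beta -> 1: generalized Kullback-Leibler divergence
        return float(np.sum(p * np.log(p / q) - p + q))
    c = 1.0 / (beta * (beta - 1.0))
    return float(c * np.sum(p**beta + (beta - 1) * q**beta - beta * p * q**(beta - 1)))

p = np.array([0.5, 1.5, 2.0])
q = np.array([1.0, 2.0, 0.5])
for b in (0.0, 0.5, 1.0, 2.0):
    print(f"beta={b}: {beta_div(p, q, b):.6f}")
print("beta = 2 equals half the squared Euclidean distance:",
      np.isclose(beta_div(p, q, 2.0), 0.5 * np.sum((p - q) ** 2)))
```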
4. Family of Gamma-Divergences Generated from Beta- and Alpha-Divergences
- $D_{A\Gamma}(p\,\|\,q) \ge 0$. The equality holds if and only if $p = c\,q$ for a positive constant $c$.
- It is scale invariant for any value of γ, that is, $D_{A\Gamma}(c_1 p\,\|\,c_2 q) = D_{A\Gamma}(p\,\|\,q)$, for arbitrary positive scaling constants $c_1$ and $c_2$.
- The Alpha-Gamma divergence is equivalent to the normalized Alpha-Rényi divergence (25).
- It can be expressed via a generalized weighted mean.
- As α → 1, the Alpha-Gamma-divergence becomes the Kullback–Leibler divergence of the normalized densities $\bar p = p/\int p\,\mathrm{d}x$ and $\bar q = q/\int q\,\mathrm{d}x$.
- For α → 0, the Alpha-Gamma-divergence can be expressed by the reverse Kullback–Leibler divergence $D_{KL}(\bar q\,\|\,\bar p)$. A short numerical check of the scale-invariance and nonnegativity properties follows.
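A hedged numerical check of the scale-invariance and nonnegativity properties listed above. It uses one natural logarithmic form of the Alpha-Gamma divergence, $D_{A\Gamma}(p\,\|\,q) = \frac{1}{\alpha(\alpha-1)}\ln\frac{\sum_i p_i^\alpha q_i^{1-\alpha}}{(\sum_i p_i)^\alpha(\sum_i q_i)^{1-\alpha}}$; the paper's exact normalization may differ.

```python
# Hedged sketch: a scale-invariant log-form Alpha-Gamma divergence consistent
# with the properties above (nonnegativity, scale invariance, KL limit at alpha -> 1).
import numpy as np

def alpha_gamma_div(p, q, alpha):
    p, q = np.asarray(p, float), np.asarray(q, float)
    num = np.sum(p**alpha * q**(1 - alpha))
    den = np.sum(p)**alpha * np.sum(q)**(1 - alpha)
    return float(np.log(num / den) / (alpha * (alpha - 1.0)))

p = np.array([0.5, 1.5, 2.0])
q = np.array([1.0, 2.0, 0.5])
a = 0.7
d = alpha_gamma_div(p, q, a)
d_scaled = alpha_gamma_div(3.0 * p, 0.2 * q, a)    # arbitrary positive rescalings
print("scale invariant:", np.isclose(d, d_scaled))  # True
print("nonnegative:    ", d >= 0)                   # True
```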
Table: Alpha-divergences and the corresponding robust Alpha-Gamma-divergences (rows not reproduced here).
Table: divergences and their corresponding Gamma-divergences (rows not reproduced here).
- $D_\Gamma^{(\gamma)}(p\,\|\,q) \ge 0$. The equality holds if and only if $p = c\,q$ for a positive constant $c$.
- It is scale invariant, that is, $D_\Gamma^{(\gamma)}(c_1 p\,\|\,c_2 q) = D_\Gamma^{(\gamma)}(p\,\|\,q)$, for arbitrary positive scaling constants $c_1$ and $c_2$.
- As γ → 0, the Gamma-divergence becomes the Kullback–Leibler divergence of the normalized densities $\bar p$ and $\bar q$.
- For particular values of γ, the Gamma-divergence can be expressed in simpler closed forms.
- $D_\Gamma(p\,\|\,q) \ge 0$. The equality holds if and only if $p = c\,q$ for a positive constant $c$, in particular, for $p = q$.
- It is scale invariant, that is, $D_\Gamma(c_1 p\,\|\,c_2 q) = D_\Gamma(p\,\|\,q)$ for arbitrary positive scaling constants $c_1$ and $c_2$.
- For γ → 0, it reduces to a special form of the symmetric Kullback–Leibler divergence (also called the J-divergence).
- For a particular value of γ, we obtain a simple divergence expressed by weighted arithmetic means; for the discrete Beta-Gamma divergence (or simply the Gamma divergence), the analogous expression holds with sums in place of integrals.
- For γ = 1, the asymmetric Gamma-divergence (which in this case equals the symmetric Gamma-divergence) reduces to the Cauchy–Schwarz divergence introduced by Principe [83]. A numerical sketch of these properties follows this list.
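Scale invariance and the Cauchy–Schwarz special case can be checked numerically with the hedged sketch below, which uses the Gamma-divergence in the form of Fujisawa and Eguchi [35]; the paper's own scaling constants may differ, so the γ = 1 case is compared with the Cauchy–Schwarz divergence only up to a constant factor.

```python
# Hedged sketch: Gamma-divergence in the Fujisawa-Eguchi form; checks scale
# invariance and proportionality to the Cauchy-Schwarz divergence at gamma = 1.
import numpy as np

def gamma_div(p, q, gamma):
    p, q = np.asarray(p, float), np.asarray(q, float)
    num = np.sum(p**(1 + gamma)) * np.sum(q**(1 + gamma))**gamma
    den = np.sum(p * q**gamma)**(1 + gamma)
    return float(np.log(num / den) / (gamma * (1 + gamma)))

def cauchy_schwarz_div(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.log(np.sum(p * q)**2 / (np.sum(p**2) * np.sum(q**2))))

p = np.array([0.5, 1.5, 2.0])
q = np.array([1.0, 2.0, 0.5])
print("scale invariant:", np.isclose(gamma_div(p, q, 0.5),
                                     gamma_div(4.0 * p, 0.1 * q, 0.5)))   # True
print("gamma = 1 is half the Cauchy-Schwarz divergence:",
      np.isclose(gamma_div(p, q, 1.0), 0.5 * cauchy_schwarz_div(p, q)))   # True
```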
5. Relationships for Asymmetric Divergences and their Unified Representation
| Divergence name | Formula |
| --- | --- |
| Alpha | |
| Beta | |
| Gamma | |
| Alpha-Rényi | |
| Bregman | see Equation (57) |
| Csiszár–Morimoto | see Equation (57) |
Duality
5.1. Conclusions and Discussion
References
- Amari, S. Differential-Geometrical Methods in Statistics; Springer Verlag: Berlin, Germany, 1985. [Google Scholar]
- Amari, S. Dualistic geometry of the manifold of higher-order neurons. Neural Network. 1991, 4, 443–451. [Google Scholar] [CrossRef]
- Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000. [Google Scholar]
- Amari, S. Integration of stochastic models by minimizing alpha-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef] [PubMed]
- Amari, S. Information geometry and its applications: Convex function and dually flat manifold. In Emerging Trends in Visual Computing; Nielsen, F., Ed.; Springer: New York, NY, USA, 2009; pp. 75–102.
- Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. [Google Scholar] [CrossRef]
- Amari, S.; Cichocki, A. Information geometry of divergence functions. Bull. Pol. Acad. Sci. 2010. (in print). [Google Scholar] [CrossRef]
- Murata, N.; Takenouchi, T.; Kanamori, T.; Eguchi, S. Information geometry of U-Boost and Bregman divergence. Neural Comput. 2004, 16, 1437–1481. [Google Scholar] [CrossRef] [PubMed]
- Fujimoto, Y.; Murata, N. A modified EM Algorithm for mixture models based on Bregman divergence. Ann. Inst. Stat. Math. 2007, 59, 57–75. [Google Scholar] [CrossRef]
- Zhu, H.; Rohwer, R. Bayesian Invariant measurements of generalization. Neural Process. Lett. 1995, 2, 28–31. [Google Scholar] [CrossRef]
- Zhu, H.; Rohwer, R. Measurements of generalisation based on information geometry. In Mathematics of Neural Networks: Model Algorithms and Applications; Ellacott, S.W., Mason, J.C., Anderson, I.J., Eds.; Kluwer: Norwell, MA, USA, 1997; pp. 394–398. [Google Scholar]
- Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 56, 2882–2903. [Google Scholar] [CrossRef]
- Boissonnat, J.D.; Nielsen, F.; Nock, R. Bregman Voronoi diagrams. Discrete and Computational Geometry (Springer) 2010. (in print). [Google Scholar] [CrossRef]
- Yamano, T. A generalization of the Kullback-Leibler divergence and its properties. J. Math. Phys. 2009, 50, 85–95. [Google Scholar] [CrossRef]
- Minami, M.; Eguchi, S. Robust blind source separation by Beta-divergence. Neural Comput. 2002, 14, 1859–1886. [Google Scholar]
- Bregman, L. The relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
- Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl. 1963, 8, 85–108. [Google Scholar]
- Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273. [Google Scholar] [CrossRef]
- Csiszár, I. Information measures: A critical survey. In Transactions of the 7th Prague Conference, Prague, Czech Republic, 18–23 August 1974; Reidel: Dordrecht, Netherlands, 1977; pp. 83–86. [Google Scholar]
- Ali, M.; Silvey, S. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142. [Google Scholar]
- Hein, M.; Bousquet, O. Hilbertian metrics and positive definite kernels on probability measures. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 6–8 January 2005; Ghahramani, Z., Cowell, R., Eds.; AISTATS. 2005; 10, pp. 136–143. [Google Scholar]
- Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195. [Google Scholar] [CrossRef] [PubMed]
- Zhang, J. Referential duality and representational duality on statistical manifolds. In Proceedings of the Second International Symposium on Information Geometry and its Applications, University of Tokyo, Tokyo, Japan, 12–16 December 2005; University of Tokyo: Tokyo, Japan, 2006; pp. 58–67. [Google Scholar]
- Zhang, J. A note on curvature of a-connections of a statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 161–170. [Google Scholar] [CrossRef]
- Zhang, J.; Matsuzoe, H. Dualistic differential geometry associated with a convex function. In Springer Series of Advances in Mechanics and Mathematics; 2008; Springer: New York, NY, USA; pp. 58–67. [Google Scholar]
- Lafferty, J. Additive models, boosting, and inference for generalized divergences. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 7–9 July 1999; ACM: New York, NY, USA, 1999. [Google Scholar]
- Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
- Villmann, T.; Haase, S. Divergence based vector quantization using Fréchet derivatives. Neural Comput. 2010. (submitted for publication). [Google Scholar]
- Villmann, T.; Haase, S.; Schleif, F.M.; Hammer, B. Divergence based online learning in vector quantization. In Proceedings of the International Conference on Artifial Intelligence and Soft Computing (ICAISC2010), LNAI, Zakopane, Poland, 13–17 June 2010.
- Cichocki, A.; Zdunek, R.; Phan, A.H.; Amari, S. Nonnegative Matrix and Tensor Factorizations; John Wiley & Sons Ltd: Chichester, UK, 2009. [Google Scholar]
- Cichocki, A.; Zdunek, R.; Amari, S. Csiszár’s divergences for nonnegative matrix factorization: Family of new algorithms. Springer, LNCS-3889 2006, 3889, 32–39. [Google Scholar]
- Cichocki, A.; Amari, S.; Zdunek, R.; Kompass, R.; Hori, G.; He, Z. Extended SMART algorithms for Nonnegative Matrix Factorization. Springer, LNAI-4029 2006, 4029, 548–562. [Google Scholar]
- Cichocki, A.; Zdunek, R.; Choi, S.; Plemmons, R.; Amari, S. Nonnegative tensor factorization using Alpha and Beta divergences. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, May 2007; Volume III, pp. 1393–1396.
- Cichocki, A.; Zdunek, R.; Choi, S.; Plemmons, R.; Amari, S.I. Novel multi-layer nonnegative tensor factorization with sparsity constraints. Springer, LNCS-4432 2007, 4432, 271–280. [Google Scholar]
- Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
- Liese, F.; Vajda, I. Convex Statistical Distances. Teubner-Texte zur Mathematik Teubner Texts in Mathematics 1987, 95, 1–85. [Google Scholar]
- Eguchi, S.; Kato, S. Entropy and divergence associated with power function and the statistical application. Entropy 2010, 12, 262–274. [Google Scholar] [CrossRef]
- Taneja, I. On generalized entropies with applications. In Lectures in Applied Mathematics and Informatics; Ricciardi, L., Ed.; Manchester University Press: Manchester, UK, 1990; pp. 107–169. [Google Scholar]
- Taneja, I. New developments in generalized information measures. In Advances in Imaging and Electron Physics; Hawkes, P., Ed.; Elsevier: Amsterdam, Netherlands, 1995; Volume 91, pp. 37–135. [Google Scholar]
- Gorban, A.N.; Gorban, P.A.; Judge, G. Entropy: The Markov ordering approach. Entropy 2010, 12, 1145–1193. [Google Scholar] [CrossRef]
- Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations. Ann. Math. Statist. 1952, 23, 493–507. [Google Scholar] [CrossRef]
- Minka, T. Divergence measures and message passing. Microsoft Research Technical Report (MSR-TR-2005) 2005. [Google Scholar]
- Taneja, I. On measures of information and inaccuracy. J. Statist. Phys. 1976, 14, 203–270. [Google Scholar]
- Cressie, N.; Read, T. Goodness-of-Fit Statistics for Discrete Multivariate Data; Springer: New York, NY, USA, 1988. [Google Scholar]
- Cichocki, A.; Lee, H.; Kim, Y.D.; Choi, S. Nonnegative matrix factorization with Alpha-divergence. Pattern. Recognit. Lett. 2008, 29, 1433–1440. [Google Scholar] [CrossRef]
- Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Statist. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
- Havrda, J.; Charvát, F. Quantification method of classification processes: Concept of structural a-entropy. Kybernetika 1967, 3, 30–35. [Google Scholar]
- Cressie, N.; Read, T. Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B 1984, 46, 440–464. [Google Scholar]
- Vajda, I. Theory of Statistical Inference and Information; Kluwer Academic Press: Amsterdam, The Netherlands, 1989. [Google Scholar]
- Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math. 1909, 136, 210–271. [Google Scholar]
- Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jap. 1963, 12, 328–331. [Google Scholar] [CrossRef]
- Österreicher, F. Csiszár’s f-divergences-basic properties. Technical report; In Research Report Collection; Victoria University: Melbourne, Australia, 2002. [Google Scholar]
- Harremoës, P.; Vajda, I. Joint range of f-divergences. In Accepted for presentation at ISIT 2010, Austin, TX, USA, 13–18 June 2010.
- Dragomir, S. Inequalities for Csiszár f-Divergence in Information Theory; Victoria University: Melbourne, Australia, 2000; (edited monograph). [Google Scholar]
- Rényi, A. On the foundation of information theory. Rev. Inst. Int. Stat. 1965, 33, 1–4. [Google Scholar] [CrossRef]
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA; Volume 1, pp. 547–561.
- Rényi, A. Probability Theory; North-Holland: Amsterdam, The Netherlands, 1970. [Google Scholar]
- Harremoës, P. Interpretations of Rényi entropies and divergences. Physica A 2006, 365, 57–62. [Google Scholar] [CrossRef]
- Harremoës, P. Joint range of Rényi entropies. Kybernetika 2009, 45, 901–911. [Google Scholar]
- Hero, A.; Ma, B.; Michel, O.; Gorman, J. Applications of entropic spanning graphs. IEEE Signal Process. Mag. 2002, 19, 85–95. [Google Scholar] [CrossRef]
- Topsøe, F. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory 2000, 46, 1602–1609. [Google Scholar] [CrossRef]
- Burbea, J.; Rao, C. Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. J. Multi. Analysis 1982, 12, 575–596. [Google Scholar] [CrossRef]
- Burbea, J.; Rao, C. On the convexity of some divergence measures based on entropy functions. IEEE Trans. Inf. Theory 1982, IT-28, 489–495. [Google Scholar] [CrossRef]
- Sibson, R. Information radius. Probability Theory and Related Fields 1969, 14, 149–160. [Google Scholar] [CrossRef]
- Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. Lon., Ser. A 1946, 186, 453–461. [Google Scholar] [CrossRef]
- Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Statist. 1951, 22, 79–86. [Google Scholar] [CrossRef]
- Basu, A.; Harris, I.R.; Hjort, N.; Jones, M. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
- Mollah, M.; Minami, M.; Eguchi, S. Exploring latent structure of mixture ICA models by the minimum Beta-divergence method. Neural Comput. 2006, 16, 166–190. [Google Scholar] [CrossRef]
- Mollah, M.; Eguchi, S.; Minami, M. Robust prewhitening for ICA by minimizing Beta-divergence and its application to FastICA. Neural Process. Lett. 2007, 25, 91–110. [Google Scholar] [CrossRef]
- Kompass, R. A Generalized divergence measure for Nonnegative Matrix Factorization. Neural Comput. 2006, 19, 780–791. [Google Scholar] [CrossRef] [PubMed]
- Mollah, M.; Sultana, N.; Minami, M.; Eguchi, S. Robust extraction of local structures by the minimum of Beta-divergence method. Neural Netw. 2010, 23, 226–238. [Google Scholar] [CrossRef] [PubMed]
- Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In Proceedings of the International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June 2009.
- Cichocki, A.; Phan, A. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE (invited paper) 2009, E92-A (3), 708–721. [Google Scholar] [CrossRef]
- Cichocki, A.; Phan, A.; Caiafa, C. Flexible HALS algorithms for sparse non-negative matrix/tensor factorization. In Proceedings of the 18th IEEE workshops on Machine Learning for Signal Processing, Cancun, Mexico, 16–19 October 2008.
- Dhillon, I.; Sra, S. Generalized Nonnegative Matrix Approximations with Bregman Divergences. In Neural Information Processing Systems; MIT Press: Vancouver, Canada, 2005; pp. 283–290. [Google Scholar]
- Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura-Saito divergence with application to music analysis. Neural Comput. 2009, 21, 793–830. [Google Scholar] [CrossRef] [PubMed]
- Itakura, F.; Saito, F. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 1968; pp. 17–20.
- Eggermont, P.; LaRiccia, V. On EM-like algorithms for minimum distance estimation. Technical report; In Mathematical Sciences; University of Delaware: Newark, DE, USA, 1998. [Google Scholar]
- Févotte, C.; Cemgil, A.T. Nonnegative matrix factorizations as probabilistic inference in composite models. In Proceedings of the 17th European Signal Processing Conference (EUSIPCO-09), Glasgow, Scotland, UK, 24–28 August 2009.
- Banerjee, A.; Dhillon, I.; Ghosh, J.; Merugu, S.; Modha, D. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; ACM Press: New York, NY, USA, 2004; pp. 509–514. [Google Scholar]
- Lafferty, J. Additive models, boosting, and inference for generalized divergences. In Proceedings of the 12th Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 7–9 July 1999; ACM Press: New York, USA, 1999; pp. 125–133. [Google Scholar]
- Frigyik, B.A.; Srivastava, S.; Gupta, M. Functional Bregman divergence and Bayesian estimation of distributions. IEEE Trans. Inf. Theory 2008, 54, 5130–5139. [Google Scholar] [CrossRef]
- Principe, J. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: Berlin, Germany, 2010. [Google Scholar]
- Choi, H.; Choi, S.; Katake, A.; Choe, Y. Learning alpha-integration with partially-labeled data. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2010), Dallas, TX, USA, 14–19 March 2010.
- Jones, M.; Hjort, N.; Harris, I.R.; Basu, A. A comparison of related density-based minimum divergence estimators. Biometrika 1998, 85, 865–873. [Google Scholar] [CrossRef]
Appendix A. Divergences Derived from Tsallis α-entropy
Appendix B. Entropies and Divergences
Appendix C. Tsallis and Rényi Entropies
© 2010 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).