Gaussian Mean Field Regularizes by Limiting Learned Information
Abstract
1. Introduction
2. Regularization through the Mean Field
2.1. Fixed-Variance Gaussian Mean Field Inference
2.2. Generalization Error vs. Limited Information
2.3. Learned-Variance Gaussian Mean Field Inference
2.4. Supervised and Unsupervised Learning
2.5. Flexible Variational Distributions
3. Related Work
3.1. Regularization in Neural Networks
3.2. Information Bottlenecks
3.3. Information Estimation with Neural Networks
4. Experiments
4.1. Supervised Learning
4.2. Unsupervised Learning
4.2.1. Varying Model Capacity and Priors
4.2.2. Varying Training Set Size
4.2.3. Varying Model Size
4.2.4. Qualitative Reconstruction
5. Discussion
5.1. Choosing the Capacity
5.2. Role of Learning Dynamics and Architecture
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
Appendix A. Capacity in Learned-Variance Gaussian Mean Field Inference
[Table residue from Appendix A: rows for capacity values 1, 10, and 100; remaining columns not recoverable.]