Hidden Markov Neural Networks
Abstract
1. Introduction
2. Hidden Markov Neural Networks
2.1. Filtering Algorithm for HMNN
2.2. Sequential Reparameterization Trick
Algorithm 1 Approximate filtering recursion
Set: the approximate filtering distribution at time 0 to the prior on the weights
for t = 1, …, T do
    Initialize: the variational parameters θ_t
    repeat
        Estimate the gradient ∇_{θ_t} with (11) evaluated in θ_t
        Update the parameters: θ_t with a gradient step
    until maximum number of iterations
    Set: the approximate filtering distribution at time t to the fitted variational approximation q_{θ_t}
end for
Return: the sequence of approximate filtering distributions q_{θ_1}, …, q_{θ_T}
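For concreteness, the recursion can be sketched in Python with PyTorch, assuming a mean-field Gaussian variational family and a Gaussian random-walk transition kernel on the weights; the toy observation model, the hyperparameter values, and the function names (log_likelihood, filter_step) are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of the approximate filtering recursion, assuming a mean-field Gaussian
# variational family and a Gaussian random-walk kernel w_t | w_{t-1} ~ N(w_{t-1}, tau^2 I).
# The toy observation model and all names here are illustrative, not the paper's code.
import torch
import torch.nn.functional as F

def log_likelihood(w, x, y):
    # Toy observation model: linear "network" with unit-variance Gaussian noise.
    return -0.5 * ((x @ w - y) ** 2).sum()

def filter_step(mu_prev, rho_prev, x, y, tau=0.1, n_iters=200, n_mc=5, lr=1e-2):
    """One time step: fit the variational parameters to the approximate filtering distribution."""
    mu = mu_prev.clone().requires_grad_(True)    # variational means
    rho = rho_prev.clone().requires_grad_(True)  # softplus-parameterised standard deviations
    opt = torch.optim.Adam([mu, rho], lr=lr)
    # Predictive for this step: previous approximation pushed through the random-walk kernel.
    prior_mu = mu_prev.detach()
    prior_sigma = torch.sqrt(F.softplus(rho_prev.detach()) ** 2 + tau ** 2)
    for _ in range(n_iters):
        opt.zero_grad()
        sigma = F.softplus(rho)
        loss = 0.0
        for _ in range(n_mc):                    # Monte Carlo estimate of the negative ELBO
            eps = torch.randn_like(mu)           # reparameterization trick
            w = mu + sigma * eps
            log_q = torch.distributions.Normal(mu, sigma).log_prob(w).sum()
            log_prior = torch.distributions.Normal(prior_mu, prior_sigma).log_prob(w).sum()
            loss = loss + (log_q - log_prior - log_likelihood(w, x, y)) / n_mc
        loss.backward()
        opt.step()
    return mu.detach(), rho.detach()

# Usage: stream of (x_t, y_t) batches; keep the fitted approximations at every time step.
torch.manual_seed(0)
d = 3
mu_t, rho_t = torch.zeros(d), torch.log(torch.expm1(torch.ones(d)))  # softplus^{-1}(1)
history = []
for t in range(5):
    x_t = torch.randn(20, d)
    y_t = x_t @ torch.ones(d) + 0.1 * torch.randn(20)
    mu_t, rho_t = filter_step(mu_t, rho_t, x_t, y_t)
    history.append((mu_t, rho_t))
```

The gradient step over θ_t corresponds to the inner repeat loop of Algorithm 1; the outer loop simply carries the fitted approximation forward as the prior for the next time step.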
2.3. Gaussian Case
2.4. Performance and Uncertainty Quantification
3. Related Work
3.1. Combining NNs and HMMs
3.2. Bayesian DropConnect and DropOut
3.3. Bayesian Filtering
3.4. Continual Learning
4. Experiments
4.1. Variational DropConnect
4.2. Illustration: Two Moons Dataset
4.3. Concept Drift: Logistic Regression
4.4. Concept Drift: Evolving Classifier on MNIST
- We define two labellers: the first names each digit with its MNIST label; the second labels each digit with its MNIST label shifted by one unit, i.e., 0 is classified as 1, 1 is classified as 2, …, 9 is classified as 0.
- We consider 19 time steps, where each time step t is associated with a probability of using the first labeller and a portion of the MNIST dataset.
- At each time step t, we randomly label each digit in the current portion with one of the two labellers according to that probability (see the sketch after this list).
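As an illustration of this labelling protocol, the following sketch generates such a drifting stream, assuming a simple linear schedule for the probability of the first labeller; the schedule, the stand-in data, and the helper names are hypothetical and only meant to make the setup concrete.

```python
import numpy as np

def shifted_labeller(y):
    """Second labeller: MNIST label shifted by one unit (9 wraps to 0)."""
    return (y + 1) % 10

def make_drifting_labels(y_true, p_identity, rng):
    """Label each digit with the identity labeller w.p. p_identity, else with the shifted one."""
    use_identity = rng.random(len(y_true)) < p_identity
    return np.where(use_identity, y_true, shifted_labeller(y_true))

# Usage: 19 time steps, each with its own portion of MNIST and its own probability.
rng = np.random.default_rng(0)
T = 19
p_schedule = np.linspace(1.0, 0.0, T)                           # illustrative drift schedule
portions = [rng.integers(0, 10, size=100) for _ in range(T)]    # stand-in for MNIST labels
streams = [make_drifting_labels(y_t, p_t, rng) for y_t, p_t in zip(portions, p_schedule)]
```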
4.5. One-Step-Ahead Prediction for Flag Waving
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| HMM | Hidden Markov Model |
| HMNN | Hidden Markov Neural Network |
| NN | Neural Network |
| ELBO | Evidence Lower Bound |
Appendix A. Formulae Derivation
Appendix B. Experiments
- Appendix B.1 treats the experiment on variational DropConnect;
- Appendix B.2 covers the two moons dataset;
- Appendix B.3 describes the application to the evolving classifier;
- Appendix B.4 considers the video texture of a waving flag.
Appendix B.1. Variational DropConnect
Appendix B.2. Two Moons Dataset
Appendix B.3. Evolving Classifier on MNIST
- We preprocessed the data by dividing each pixel by 126. We then defined two labellers: the first names each digit with its MNIST label; the second labels each digit with its MNIST label shifted by one unit, i.e., 0 is classified as 1, 1 is classified as 2, …, 9 is classified as 0.
- We considered 19 time steps, where each time step t is associated with a probability of using the first labeller, which evolves over the time steps, and a portion of the MNIST dataset.
- At each time step t, we randomly labelled each digit in the current portion with one of the two labellers according to that probability.
- Sequential Bayes by Backprop. At each time step t, we trained for 100 epochs on the current portion of the data. The Bayesian NN at time t was initialized with the estimates from the previous time step.
- Bayes by Backprop on the whole dataset. We trained a Bayesian NN with Bayes by Backprop for 100 epochs on the whole dataset (no time dimension; the sample size was 190,000).
- Elastic Weight Consolidation. At each time step t, we trained for 100 epochs on the current portion of the data. The tuning parameter was chosen from a grid of values up to 10,000 through the validation set and the metric (A6). We found that Adam worked better, and hence we trained with it. Recall that this method is not Bayesian, but it is a well-known baseline for continual learning (a sketch of the EWC penalty is given after this list).
- Variational Continual Learning. At each time step t, we trained for 100 epochs on the current portion of the data. We chose the learning rate by grid search through validation and the metric (A6). The training was pursued without the use of a coreset because we were not comparing rehearsal methods.
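As a reference point for the Elastic Weight Consolidation baseline, the following sketch shows the penalised loss it optimises at each time step with a diagonal Fisher estimate; the network interface, the value of the regularisation weight, and the helper names are assumptions for illustration, not the exact configuration used in the experiments.

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader):
    """Diagonal Fisher information estimate at the current parameters."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in data_loader:
        model.zero_grad()
        loss = F.nll_loss(F.log_softmax(model(x), dim=1), y)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2 / len(data_loader)
    return fisher

def ewc_loss(model, x, y, old_params, fisher, lam=1000.0):
    """Cross-entropy on the current data plus the quadratic EWC penalty."""
    loss = F.cross_entropy(model(x), y)
    for n, p in model.named_parameters():
        loss = loss + 0.5 * lam * (fisher[n] * (p - old_params[n]) ** 2).sum()
    return loss

# Usage sketch: after training on the portion at time t, snapshot the anchors
#   old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# and recompute the Fisher estimate before moving to time t + 1.
```

In the sequential setting, the anchor parameters and the Fisher estimate are refreshed at the end of each time step and carried over to the next, which is what makes the penalty resist forgetting across the drifting stream.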
Appendix B.4. One-Step-Ahead Prediction for Flag Waving
- Sequential Bayes by Backprop. At each time step t, we trained on the current sliding window for 150 epochs. The parameters were chosen with grid search; we found that other choices did not improve the performance. The Bayesian neural network at time t was initialized with the estimates at time t − 1.
- Sequential DropConnect. At each time step t, we trained on the current sliding window for 150 epochs. The learning rate was chosen with grid search using the validation score. The neural network at time t was initialized with the estimates at time t − 1. This is not a Bayesian method.
- LSTM. We chose a window size of 36 and the learning rate with grid search using the validation score. This is not a Bayesian method.
- Trivial predictor. We predicted frame t with frame t − 1. We decided to include this trivial baseline because it is an indicator of overfitting on the current window (a sliding-window sketch follows this list).
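The evaluation loop shared by these baselines can be summarised as a sliding-window, one-step-ahead prediction error; the sketch below assumes greyscale frames stored in a NumPy array and uses the trivial predictor as an example, with all names, the window size, and the error metric chosen for illustration rather than taken from the paper.

```python
import numpy as np

def one_step_ahead_error(frames, predict, window=36):
    """Mean per-pixel absolute error of one-step-ahead predictions over the video."""
    errors = []
    for t in range(window, len(frames) - 1):
        past = frames[t - window:t + 1]        # sliding window ending at frame t
        pred = predict(past)                   # model's guess for frame t + 1
        errors.append(np.abs(pred - frames[t + 1]).mean())
    return float(np.mean(errors))

def trivial_predictor(past):
    """Predict the next frame with the most recent frame in the window."""
    return past[-1]

# Usage on a stand-in video: 200 random 32x32 greyscale frames.
frames = np.random.default_rng(0).random((200, 32, 32))
print(one_step_ahead_error(frames, trivial_predictor))
```

A learned model is plugged in through the same predict interface, after being refitted (or updated) on each sliding window before scoring the next frame.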
References
- Rabiner, L.R.; Juang, B.H. An introduction to hidden Markov models. IEEE ASSP Mag. 1986, 3, 4–16.
- Krogh, A.; Larsson, B.; Von Heijne, G.; Sonnhammer, E.L. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 2001, 305, 567–580.
- Ghahramani, Z.; Jordan, M.I. Factorial Hidden Markov Models. Mach. Learn. 1997, 29, 245–273.
- Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
- Graves, A. Practical variational inference for neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; pp. 2348–2356.
- Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114.
- Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural network. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 1613–1622.
- Wang, L.; Zhang, X.; Su, H.; Zhu, J. A comprehensive survey of continual learning: Theory, method and application. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5362–5383.
- Kurle, R.; Cseke, B.; Klushyn, A.; van der Smagt, P.; Günnemann, S. Continual Learning with Bayesian Neural Networks for Non-Stationary Data. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
- Nguyen, C.V.; Li, Y.; Bui, T.D.; Turner, R.E. Variational continual learning. arXiv 2017, arXiv:1710.10628.
- Ritter, H.; Botev, A.; Barber, D. Online structured Laplace approximations for overcoming catastrophic forgetting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 3738–3748.
- Chopin, N.; Papaspiliopoulos, O. An Introduction to Sequential Monte Carlo; Springer: Cham, Switzerland, 2020.
- Rebeschini, P.; Van Handel, R. Can local particle filters beat the curse of dimensionality? Ann. Appl. Probab. 2015, 25, 2809–2866.
- Rimella, L.; Whiteley, N. Exploiting locality in high-dimensional Factorial hidden Markov models. J. Mach. Learn. Res. 2022, 23, 1–34.
- Duffield, S.; Power, S.; Rimella, L. A state-space perspective on modelling and inference for online skill rating. J. R. Stat. Soc. Ser. C Appl. Stat. 2024, 73, 1262–1282.
- Sorenson, H.W.; Stubberud, A.R. Non-linear filtering by approximation of the a posteriori density. Int. J. Control 1968, 8, 33–51.
- Minka, T.P. A Family of Algorithms for Approximate Bayesian Inference. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2001.
- Wan, L.; Zeiler, M.; Zhang, S.; Le Cun, Y.; Fergus, R. Regularization of neural networks using DropConnect. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1058–1066.
- Franzini, M.; Lee, K.F.; Waibel, A. Connectionist Viterbi training: A new hybrid method for continuous speech recognition. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, USA, 3–6 April 1990; pp. 425–428.
- Bengio, Y.; Cardin, R.; De Mori, R.; Normandin, Y. A hybrid coder for hidden Markov models using a recurrent neural network. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, USA, 3–6 April 1990; pp. 537–540.
- Bengio, Y.; De Mori, R.; Flammia, G.; Kompe, R. Global optimization of a neural network-hidden Markov model hybrid. In Proceedings of the IJCNN-91-Seattle International Joint Conference on Neural Networks, Seattle, WA, USA, 8–12 July 1991; Volume 2, pp. 789–794.
- Krogh, A.; Riis, S.K. Hidden neural networks. Neural Comput. 1999, 11, 541–563.
- Johnson, M.J.; Duvenaud, D.; Wiltschko, A.B.; Datta, S.R.; Adams, R.P. Composing graphical models with neural networks for structured representations and fast inference. arXiv 2016, arXiv:1603.06277.
- Karl, M.; Soelch, M.; Bayer, J.; Van der Smagt, P. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv 2016, arXiv:1605.06432.
- Krishnan, R.; Shalit, U.; Sontag, D. Structured inference networks for nonlinear state space models. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
- Aitchison, L.; Pouget, A.; Latham, P.E. Probabilistic synapses. arXiv 2014, arXiv:1410.1029.
- Chang, P.G.; Durán-Martín, G.; Shestopaloff, A.Y.; Jones, M.; Murphy, K. Low-rank extended Kalman filtering for online learning of neural networks from streaming data. arXiv 2023, arXiv:2305.19535.
- Jones, M.; Scott, T.R.; Ren, M.; ElSayed, G.; Hermann, K.; Mayo, D.; Mozer, M. Learning in Temporally Structured Environments. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
- Campbell, A.; Shi, Y.; Rainforth, T.; Doucet, A. Online variational filtering and parameter learning. Adv. Neural Inf. Process. Syst. 2021, 34, 18633–18645.
- Mobiny, A.; Yuan, P.; Moulik, S.K.; Garg, N.; Wu, C.C.; Van Nguyen, H. DropConnect is effective in modeling uncertainty of Bayesian deep networks. Sci. Rep. 2021, 11, 5458.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
- Salehin, I.; Kang, D.K. A review on dropout regularization approaches for deep neural networks within the scholarly domain. Electronics 2023, 12, 3106.
- Kingma, D.P.; Salimans, T.; Welling, M. Variational dropout and the local reparameterization trick. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2575–2583.
- Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1050–1059.
- Puskorius, G.V.; Feldkamp, L.A. Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Trans. Neural Netw. 1994, 5, 279–297.
- Puskorius, G.V.; Feldkamp, L.A. Parameter-based Kalman filter training: Theory and implementation. In Kalman Filtering and Neural Networks; Wiley: Hoboken, NJ, USA, 2001; pp. 23–67.
- Feldkamp, L.A.; Prokhorov, D.V.; Feldkamp, T.M. Simple and conditioned adaptive behavior from Kalman filter trained recurrent networks. Neural Netw. 2003, 16, 683–689.
- Ollivier, Y. Online natural gradient as a Kalman filter. Electron. J. Stat. 2018, 12, 2930–2961.
- Aitchison, L. Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods. arXiv 2018, arXiv:1807.07540.
- Khan, M.E.; Swaroop, S. Knowledge-Adaptation Priors. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 19757–19770.
- Liu, L.; Jiang, X.; Zheng, F.; Chen, H.; Qi, G.J.; Huang, H.; Shao, L. A Bayesian federated learning framework with online Laplace approximation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1–16.
- Sliwa, J.; Schneider, F.; Bosch, N.; Kristiadi, A.; Hennig, P. Efficient Weight-Space Laplace-Gaussian Filtering and Smoothing for Sequential Deep Learning. arXiv 2024, arXiv:2410.06800.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Chan, A.B.; Vasconcelos, N. Classifying video with kernel dynamic textures. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–6.
- Boots, B.; Gordon, G.J.; Siddiqi, S.M. A constraint generation approach to learning stable linear dynamical systems. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–11 December 2008; pp. 1329–1336.
- Basharat, A.; Shah, M. Time series prediction by chaotic modeling of nonlinear dynamical systems. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 1941–1948.
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
- Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
- Venkatraman, A.; Hebert, M.; Bagnell, J.A. Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015.
- Yang, Y.; Pati, D.; Bhattacharya, A. α-Variational Inference with Statistical Guarantees. arXiv 2017, arXiv:1710.03266.
- Chérief-Abdellatif, B.E. Convergence Rates of Variational Inference in Sparse Deep Learning. arXiv 2019, arXiv:1908.04847.
| Notation | Meaning |
|---|---|
|  | hidden weights of an NN and observed data at time t |
|  | Markov transition kernel of the weights of the NN |
|  | probability distribution of the data given the weights |
| v | a weight of the NN, also used to represent marginal quantities |
|  | filtering distribution and its approximation at time t |
|  | prediction operator and correction operator at time t |
|  | operator that minimizes the KL-divergence on the class |
|  | variational approximation and its parameters |
|  | transformation function in the reparameterization trick |
|  | scale mixture of Gaussians: mixing probability of weight v |
|  | scale mixture of Gaussians: mean of weight v at time t |
|  | scale mixture of Gaussians: standard deviation of weight v at time t |
|  | scale mixture of Gaussians: mixing probability of the transition kernel |
|  | scale mixture of Gaussians: stationary mean of the transition kernel |
|  | scale mixture of Gaussians: mean scaling of the transition kernel |
|  | scale mixture of Gaussians: big-jump standard deviation of the transition kernel |
|  | scale mixture of Gaussians: small-jump standard deviation of the transition kernel |
| Parameter Value | Accuracy |
|---|---|
| [7] | |
| Model | Accuracy |
|---|---|
| BBP (T = 1) | |
| VCL | |
| EWC | |
| BBP | |
| Kurle et al. [9] | |
| HMNN | |
| Model | |
|---|---|
| Trivial Predictor | 0.2162 |
| LSTM | 0.2080 |
| DropConnect | 0.2063 |
| BBP | 0.1932 |
| HMNN | 0.1891 |