Understanding Machine Learning Principles: Learning, Inference, Generalization, and Computational Learning Theory
Abstract
1. Introduction
2. Learning and Inference Methods
2.1. Scientific Reasoning
2.1.1. Deductive Reasoning
2.1.2. Inductive Reasoning
2.1.3. Abductive Reasoning
2.1.4. Analogical Reasoning
2.1.5. Case-Based Reasoning
2.1.6. Ontologies
2.2. Supervised, Unsupervised, and Reinforcement Learning
2.2.1. Supervised Learning
2.2.2. Unsupervised Learning
- Hebbian learning: A local process involving two neurons and a synapse, in which the weight adjustment is proportional to the correlation between the pre- and postsynaptic activities. It underpins models for PCA and associative memory (a minimal weight-update sketch follows this list).
- Competitive learning: Neurons compete to be the most responsive to a given input. The SOM, a form of competitive learning, is closely related to clustering techniques.
- Boltzmann machine: Employs stochastic training via simulated annealing, inspired by thermodynamics, to learn in an unsupervised manner.
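To make the Hebbian rule above concrete, the following minimal sketch (in Python; the function name and parameter values are illustrative assumptions, not taken from the original text) implements Oja's normalized variant of Hebbian learning, whose weight vector converges, for a suitably small learning rate, to the first principal component of centered data:

```python
import numpy as np

def oja_first_pc(X, lr=0.01, epochs=50, seed=0):
    """Oja's normalized Hebbian rule: w converges to the first principal
    component of the (centered) data X for a suitably small learning rate."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x                     # postsynaptic activity
            w += lr * y * (x - y * w)     # Hebbian term y*x with a decay term y^2*w
    return w / np.linalg.norm(w)

# Usage: the result should align with the leading eigenvector of X's covariance.
X = np.random.default_rng(1).normal(size=(500, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)
w = oja_first_pc(X)
```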
2.2.3. Reinforcement Learning
2.3. Semi-Supervised Learning and Active Learning
2.4. Transfer Learning
2.5. Other Learning Methods
2.5.1. Ordinal Regression and Ranking
2.5.2. Manifold Learning
- Locally Linear Embedding (LLE) [55]: Captures the global structure of nonlinear manifolds, such as face-image datasets.
- Laplacian eigenmaps [56]: Preserves local structure using a graph-based representation of the data (a minimal sketch follows this list).
- Orthogonal neighborhood-preserving projections [57]: A linear extension of LLE that preserves local geometric relationships.
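As a rough illustration of the graph-based idea behind Laplacian eigenmaps [56], the sketch below (NumPy only; the function name, heat-kernel bandwidth heuristic, and default values are illustrative assumptions) builds a symmetric k-nearest-neighbor graph and embeds the data using the smallest nontrivial eigenvectors of the unnormalized graph Laplacian:

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=10, n_components=2):
    """Embed X into n_components dimensions via the graph Laplacian."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    n = X.shape[0]
    W = np.zeros((n, n))
    sigma2 = np.median(D) ** 2                      # heat-kernel bandwidth (heuristic)
    for i in range(n):
        idx = np.argsort(D[i])[1:n_neighbors + 1]   # nearest neighbors, self excluded
        W[i, idx] = np.exp(-D[i, idx] ** 2 / sigma2)
    W = np.maximum(W, W.T)                          # symmetrize the adjacency matrix
    L = np.diag(W.sum(axis=1)) - W                  # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    return vecs[:, 1:n_components + 1]              # drop the trivial constant eigenvector

# Usage: a noisy circle in 2-D is mapped to a 2-D embedding that preserves locality.
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
X = np.column_stack([np.cos(t), np.sin(t)])
X += 0.05 * np.random.default_rng(0).normal(size=X.shape)
Y = laplacian_eigenmaps(X, n_neighbors=8)
```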
2.5.3. Multi-Task Learning
2.5.4. Imitation Learning
2.5.5. Curriculum Learning
2.5.6. Multiview Learning
2.5.7. Multilabel Learning
2.5.8. Multiple-Instance Learning
2.5.9. Parametric, Semiparametric, and Nonparametric Classifications
2.5.10. Learning from Imbalanced Data
2.5.11. Zero-Shot Learning
3. Criterion Functions
Robust Learning
4. Learning and Generalization
4.1. Generalization Error
4.2. Generalization by Stopping Criterion
4.3. Generalization by Regularization
4.4. Data Augmentation
4.5. Dropout
4.6. Fault Tolerance and Generalization
4.7. Sparsity Versus Stability
5. Model Selection
5.1. Occam’s Razor
5.2. Cross-Validation
5.3. Complexity Criteria
6. Bias and Variance
7. Overparameterization and Double Descent
8. Neural Networks as Universal Machines
8.1. Boolean Function Approximation
Binary Radial Basis Function
8.2. Linear Separability and Nonlinear Separability
8.3. Universal Function Approximation
Capacity of a Neural Network Architecture
8.4. Turing Machines
Turing Machine Computations
8.5. Winner-Takes-All
9. Introduction to Computational Learning Theory
No-Free-Lunch Theorem
10. Probably Approximately Correct (PAC) Learning
Sample Complexity
11. Vapnik–Chervonenkis Dimension
Machine Teaching and Teaching Dimension
12. Rademacher Complexity
13. Empirical Risk Minimization Principle
13.1. Generalization Error by VC-Theory
13.2. Generalization Error by Rademacher Bound
13.3. Fundamental Theorem of Learning Theory
- $\mathcal{H}$ exhibits uniform convergence.
- Any ERM rule can agnostically learn $\mathcal{H}$.
- $\mathcal{H}$ is agnostically PAC learnable.
- $\mathcal{H}$ has a finite VC-dimension (a standard quantitative form of this equivalence is sketched below).
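As a quantitative complement to this equivalence, a standard statement of the agnostic-PAC sample complexity in terms of the VC-dimension $d$ is sketched below; the notation $m_{\mathcal{H}}(\epsilon,\delta)$ and the absolute constants $C_1, C_2$ are assumptions of this sketch rather than quantities from the original text.

```latex
% Agnostic PAC sample complexity of a hypothesis class \mathcal{H} with VC-dimension d:
% for some absolute constants C_1 \le C_2,
\[
  C_1 \, \frac{d + \log(1/\delta)}{\epsilon^{2}}
  \;\le\; m_{\mathcal{H}}(\epsilon, \delta)
  \;\le\; C_2 \, \frac{d + \log(1/\delta)}{\epsilon^{2}} .
\]
% With at least the upper-bound number of i.i.d. samples, any ERM rule returns, with
% probability at least 1-\delta, a hypothesis whose risk exceeds the best risk in
% \mathcal{H} by at most \epsilon.
```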
14. Conclusions and Future Directions
14.1. Future Directions
14.1.1. Analyzing Transfer Learning
14.1.2. Explaining Double Descent
14.1.3. Exploring SGD in Deep Learning Setting
14.1.4. Understanding Data Augmentation
14.1.5. Delving into Transformers
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
ADMM | alternating direction method of multipliers |
AI | artificial intelligence |
AIC | Akaike information criterion |
ANOVA | analysis of variance |
ANN | artificial neural network |
BIC | Bayesian information criterion |
BP | backpropagation |
CCA | canonical correlation analysis |
CNN | convolutional neural network |
ERM | empirical risk minimization |
FPE | final prediction error |
GAN | generative adversarial network |
GBDT | gradient-boosted decision trees |
IALE | imitation-active learning ensemble |
KL | Kullback–Leibler |
k-NN | k-nearest neighbors |
LASSO | least absolute shrinkage and selection operator |
LLE | locally linear embedding |
LMS | least mean squares |
LS | least squares |
LSM | liquid-state machines |
LSTM | long short-term memory |
LTG | linear threshold gate |
LTL | learning to learn |
MAD | median of absolute deviations |
MAP | maximum a posteriori |
MDL | minimum description length |
MLE | maximum likelihood estimation |
MLP | multilayer perceptrons |
MSE | mean squared error |
NMF | nonnegative matrix factorization |
OOD | out-of-distribution |
PAC | probably approximately correct |
PCA | principal component analysis |
pdf | probability density function |
POMDP | partially observable Markov decision process |
RBF | radial basis function |
ReLU | rectified linear unit |
RNN | recurrent neural network |
SGD | stochastic gradient descent |
SMOTE | synthetic minority oversampling technique |
SNN | spiking neural network |
SNRF | signal-to-noise ratio figure |
SOM | self-organizing map |
SRM | structural risk minimization |
STDP | spike-timing dependent plasticity |
SVD | singular value decomposition |
SVM | support vector machine |
UMM | universal memcomputing machines |
VC | Vapnik–Chervonenkis |
VLSI | very large scale integration |
WTA | winner-takes-all |
References
- McCulloch, W.S.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133. [Google Scholar] [CrossRef]
- Rosenblatt, F. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386–408. [Google Scholar] [CrossRef] [PubMed]
- Du, K.-L.; Swamy, M.N.S. Neural Networks and Statistical Learning, 2nd ed.; Springer: London, UK, 2019. [Google Scholar]
- Li, G.; Deng, L.; Tang, H.; Pan, G.; Tian, Y.; Roy, K.; Maass, W. Brain-inspired computing: A systematic survey and future trends. Proc. IEEE 2024, 112, 544–584. [Google Scholar] [CrossRef]
- Du, K.-L.; Jiang, B.; Lu, J.; Hua, J.; Swamy, M.N.S. Exploring kernel machines and support vector machines: Principles, techniques, and future directions. Mathematics 2024, 12, 3935. [Google Scholar] [CrossRef]
- Du, K.-L. Clustering: A neural network approach. Neural Netw. 2010, 23, 89–107. [Google Scholar] [CrossRef]
- Du, K.-L.; Swamy, M.N.S. Neural Networks in a Softcomputing Framework; Springer: London, UK, 2006. [Google Scholar]
- Du, K.-L.; Swamy, M.N.S. Search and Optimization by Metaheuristics: Techniques and Algorithms Inspired by Nature; Springer: New York, NY, USA, 2016. [Google Scholar]
- Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998. [Google Scholar]
- Sarbo, J.J.; Cozijn, R. Belief in reasoning. Cogn. Syst. Res. 2019, 55, 245–256. [Google Scholar] [CrossRef]
- Tecuci, G.; Kaiser, L.; Marcu, D.; Uttamsingh, C.; Boicu, M. Evidence-based reasoning in intelligence analysis: Structured methodology and system. Comput. Sci. Eng. 2018, 20, 9–21. [Google Scholar] [CrossRef]
- Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 1983, 13, 834–846. [Google Scholar] [CrossRef]
- Schultz, W. Predictive reward signal of dopamine neurons. J. Neurophysiol. 1998, 80, 1–27. [Google Scholar] [CrossRef]
- Belkin, M.; Niyogi, P.; Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 2006, 7, 2399–2434. [Google Scholar]
- Fedorov, V.V. Theory of Optimal Experiments; Academic Press: San Diego, CA, USA, 1972. [Google Scholar]
- Sugiyama, M.; Nakajima, S. Pool-based active learning in approximate linear regression. Mach. Learn. 2009, 75, 249–274. [Google Scholar] [CrossRef]
- Freund, Y.; Seung, H.S.; Shamir, E.; Tishby, N. Selective sampling using the query by committee algorithm. Mach. Learn. 1997, 28, 133–168. [Google Scholar] [CrossRef]
- Wu, D. Pool-based sequential active learning for regression. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1348–1359. [Google Scholar] [CrossRef] [PubMed]
- MacKay, D. Information-based objective functions for active data selection. Neural Comput. 1992, 4, 590–604. [Google Scholar] [CrossRef]
- Sugiyama, M.; Ogawa, H. Incremental active learning for optimal generalization. Neural Comput. 2000, 12, 2909–2940. [Google Scholar] [CrossRef]
- Hoi, S.C.H.; Jin, R.; Lyu, M.R. Batch mode active learning with applications to text categorization and image retrieval. IEEE Trans. Knowl. Data Eng. 2009, 21, 1233–1248. [Google Scholar] [CrossRef]
- Zhang, L.; Gao, X. Transfer Adaptation Learning: A Decade Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 23–44. [Google Scholar] [CrossRef]
- Yang, L.; Hanneke, S.; Carbonell, J. A theory of transfer learning with applications to active learning. Mach. Learn. 2013, 90, 161–189. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25; Curran Associates, Inc.: New York, NY, USA, 2012; pp. 1097–1105. [Google Scholar]
- Lampinen, A.K.; Ganguli, S. An analytic theory of generalization dynamics and transfer learning in deep linear networks. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Ammar, H.B.; Eaton, E.; Luna, J.M.; Ruvolo, P. Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI), Buenos Aires, Argentina, 25–31 July 2015; pp. 3345–3349. [Google Scholar]
- Taylor, M.E.; Stone, P.; Liu, Y. Value functions for RL-based behavior transfer: A comparative study. In Proceedings of the 20th National Conference on Artificial Intelligence, AAAI, Pittsburgh, PA, USA, 9 July 2005; pp. 880–885. [Google Scholar]
- Silva, F.; Costa, A. A survey on transfer learning for multiagent reinforcement learning systems. J. Artif. Intell. Res. 2019, 64, 645–703. [Google Scholar] [CrossRef]
- Bard, N.; Foerster, J.N.; Chandar, S.; Burch, N.; Lanctot, M.; Song, H.F.; Parisotto, E.; Dumoulin, V.; Moitra, S.; Hughes, E.; et al. The Hanabi challenge: A new frontier for AI research. Artif. Intell. 2020, 280, 103216. [Google Scholar] [CrossRef]
- Barrett, S.; Stone, P. Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 2010–2016. [Google Scholar]
- Smith, M.O.; Anthony, T.; Wellman, M.P. Strategic Knowledge Transfer. J. Mach. Learn. Res. 2023, 24, 1–96. [Google Scholar]
- Hu, X.; Zhang, X. Optimal parameter-transfer learning by semiparametric model averaging. J. Mach. Learn. Res. 2023, 24, 1–53. [Google Scholar]
- Bastani, H. Predicting with proxies: Transfer learning in high dimension. Manag. Sci. 2021, 67, 2964–2984. [Google Scholar] [CrossRef]
- Li, S.; Cai, T.T.; Li, H. Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. J. Roy. Stat. Soc. B 2021, 84, 149–173. [Google Scholar] [CrossRef]
- Tian, Y.; Feng, Y. Transfer learning under high-dimensional generalized linear models. J. Am. Stat. Assoc. 2023, 118, 2684–2697. [Google Scholar] [CrossRef]
- Li, S.; Cai, T.T.; Li, H. Transfer learning in large-scale Gaussian graphical models with false discovery rate control. J. Am. Stat. Assoc. 2023, 118, 2171–2183. [Google Scholar] [CrossRef]
- Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359. [Google Scholar] [CrossRef]
- Muandet, K.; Balduzzi, D.; Scholkopf, B. Domain generalization via invariant feature representation. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; Volume 28, pp. I.10–I.18. [Google Scholar]
- Blanchard, G.; Deshmukh, A.A.; Dogan, U.; Lee, G.; Scott, C. Domain generalization by marginal transfer learning. J. Mach. Learn. Res. 2021, 22, 1–55. [Google Scholar]
- Blanchard, G.; Lee, G.; Scott, C. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2011; Volume 24, pp. 2178–2186. [Google Scholar]
- Thrun, S. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 1996; pp. 640–646. [Google Scholar]
- Baxter, J. A model of inductive bias learning. J. Artif. Intell. Res. 2000, 12, 149–198. [Google Scholar] [CrossRef]
- Denevi, G.; Ciliberto, C.; Stamos, D.; Pontil, M. Incremental learning-to-learn with statistical guarantees. In Proceedings of the Uncertainty in Artificial Intelligence (UAI), Monterey, CA, USA, 6–10 August 2018; pp. 457–466. [Google Scholar]
- Maurer, A. Transfer bounds for linear feature learning. Mach. Learn. 2009, 75, 327–350. [Google Scholar] [CrossRef]
- Pentina, A.; Lampert, C. A PAC-Bayesian bound for lifelong learning. In Proceedings of the International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; Volume 32, pp. 991–999. [Google Scholar]
- Maurer, A.; Pontil, M.; Romera-Paredes, B. Sparse coding for multitask and transfer learning. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; Volume 28, pp. 343–351. [Google Scholar]
- Li, Y.; Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2018; pp. 8168–8177. [Google Scholar]
- McCullagh, P. Regression models for ordinal data. J. Roy. Stat. Soc. B 1980, 42, 109–142. [Google Scholar] [CrossRef]
- Chapelle, O.; Chang, Y. Yahoo! learning to rank challenge overview. In Proceedings of the JMLR Workshop and Conference Proceedings: Workshop on Yahoo! Learning to Rank Challenge, San Francisco, CA, USA, 28 June 2011; Volume 14, pp. 1–24. [Google Scholar]
- Herbrich, R.; Graepel, T.; Obermayer, K. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers; Bartlett, P.J., Scholkopf, B., Schuurmans, D., Smola, A.J., Eds.; MIT Press: Cambridge, MA, USA, 2000; pp. 115–132. [Google Scholar]
- Burges, C.J.C. From RankNet to LambdaRank to LambdaMART: An Overview; Technical Report MSR-TR-2010-82; Microsoft Research: Redmond, WA, USA, 2010. [Google Scholar]
- Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
- Freund, Y.; Iyer, R.; Schapire, R.E.; Singer, Y. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res. 2003, 4, 933–969. [Google Scholar]
- Du, K.-L.; Swamy, M.N.S.; Wang, Z.-Q.; Mow, W.H. Matrix factorization techniques in machine learning, signal processing and statistics. Mathematics 2023, 11, 2674. [Google Scholar] [CrossRef]
- Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323–2326. [Google Scholar] [CrossRef]
- Belkin, M.; Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003, 15, 1373–1396. [Google Scholar] [CrossRef]
- Kokiopoulou, E.; Saad, Y. Orthogonal neighborhood preserving projections: A projection-based dimensionality reduction technique. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 2143–2156. [Google Scholar] [CrossRef]
- Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75. [Google Scholar] [CrossRef]
- Evgeniou, T.; Michelli, C.A.; Pontil, M. Learning multiple tasks with kernel methods. J. Mach. Learn. Res. 2005, 6, 615–637. [Google Scholar]
- Yang, X.; Kim, S.; Xing, E.P. Heterogeneous multitask learning with joint sparsity constraints. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2009; pp. 2151–2159. [Google Scholar]
- Kaelbling, L.P. Learning to achieve goals. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Chambery, France, 28 August–3 September 1993; pp. 1094–1099. [Google Scholar]
- Ruder, S. An overview of multi-task learning in deep neural networks. arXiv 2017, arXiv:1706.05098. [Google Scholar]
- Jong, N.K.; Stone, P. State abstraction discovery from irrelevant state variables. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, UK, 30 July–5 August 2005; pp. 752–757. [Google Scholar]
- Walsh, T.J.; Li, L.; Littman, M.L. Transferring state abstractions between MDPs. In Proceedings of the ICML-06 Workshop on Structural Knowledge Transfer for Machine Learning, Pittsburgh, PA, USA, 25 June 2006. [Google Scholar]
- Foster, D.; Dayan, P. Structure in the space of value functions. Mach. Learn. 2002, 49, 325–346. [Google Scholar] [CrossRef]
- Konidaris, G.; Barto, A. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006; pp. 489–496. [Google Scholar]
- Snel, M.; Whiteson, S. Learning potential functions and their representations for multi-task reinforcement learning. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Paris, France, 5–9 May 2014; pp. 637–681. [Google Scholar]
- Czarnecki, W.; Jayakumar, S.; Jaderberg, M.; Hasenclever, L.; Teh, Y.W.; Heess, N.; Osindero, S.; Pascanu, R. Mix & match agent curricula for reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 1087–1095. [Google Scholar]
- Pomerleau, D.A. Alvinn: An Autonomous Land Vehicle in a Neural Network; Technical Report; Carnegie-Mellon University: Pittsburgh, PA, USA, 1989. [Google Scholar]
- Arora, S.; Doshi, P. A survey of inverse reinforcement learning: Challenges, methods, and progress. Artif. Intell. 2021, 297, 103500. [Google Scholar] [CrossRef]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2014; Volume 27, pp. 2672–2680. [Google Scholar]
- Grill, J.-B.; Strub, F.; Altche, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent: A new approach to self-supervised learning. In Proceedings of the Conference on Neural Information Processing Systems, Virtually, 6–12 December 2020; pp. 21271–21284. [Google Scholar]
- Syed, U.; Schapire, R.E. A reduction from apprenticeship learning to classification. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2010; Volume 23, pp. 2253–2261. [Google Scholar]
- Ross, S.; Gordon, G.; Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 627–635. [Google Scholar]
- Cohen, M.K.; Hutter, M.; Nanda, N. Fully General Online Imitation Learning. J. Mach. Learn. Res. 2022, 23, 1–30. [Google Scholar]
- Loffler, C.; Mutschler, C. IALE: Imitating active learner ensembles. J. Mach. Learn. Res. 2022, 23, 1–29. [Google Scholar]
- Schmidhuber, J. Curious model-building control systems. In Proceedings of the IEEE International Joint Conference on Neural Networks, Seoul, Republic of Korea, 17–21 June 1991; pp. 1458–1463. [Google Scholar]
- Schmidhuber, J. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Trans. Auton. Ment. Develop. 2010, 2, 230–247. [Google Scholar] [CrossRef]
- Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 41–48. [Google Scholar]
- Gong, C.; Tao, D.; Liu, W.; Liu, L.; Yang, J. Label propagation via teaching-to-learn and learning-to-teach. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 1452–1465. [Google Scholar] [CrossRef]
- Zaremba, W.; Sutskever, I. Learning to execute. arXiv 2014, arXiv:1410.4615. [Google Scholar]
- Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. Planning and acting in partially observable stochastic domains. Artif. Intell. 1998, 101, 99–134. [Google Scholar] [CrossRef]
- Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2016; pp. 1471–1479. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
- Langford, J. Efficient exploration in reinforcement learning. In Encyclopedia Machine Learning; Springer: Berlin/Heidelberg, Germany, 2011; pp. 309–311. [Google Scholar]
- Sukhbaatar, S.; Lin, Z.; Kostrikov, I.; Synnaeve, G.; Szlam, A.; Fergus, R. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv 2017, arXiv:1703.05407. [Google Scholar]
- Held, D.; Geng, X.; Florensa, C.; Abbeel, P. Automatic goal generation for reinforcement learning agents. arXiv 2017, arXiv:1705.06366. [Google Scholar]
- Matiisen, T.; Oliver, A.; Cohen, T.; Schulman, J. Teacher–student curriculum learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 3732–3740. [Google Scholar] [CrossRef] [PubMed]
- Graves, A.; Bellemare, M.G.; Menick, J.; Munos, R.; Kavukcuoglu, K. Automated curriculum learning for neural networks. arXiv 2017, arXiv:1704.03003. [Google Scholar]
- Du, K.-L.; Swamy, M.N.S. Wireless Communication Systems; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
- Dasgupta, S.; Littman, M.; McAllester, D. PAC generalization bounds for co-training. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2002; Volume 14, pp. 375–382. [Google Scholar]
- Hotelling, H. Relations between two sets of variates. Biometrika 1936, 28, 321–377. [Google Scholar] [CrossRef]
- Kettenring, J. Canonical analysis of several sets of variables. Biometrika 1971, 58, 433–451. [Google Scholar] [CrossRef]
- Tucker, L.R. The extension of factor analysis to three-dimensional matrices. In Contributions to Mathematical Psychology; Holt, Rinehardt & Winston: New York, NY, USA, 1964; pp. 109–127. [Google Scholar]
- Lanckriet, G.R.G.; Cristianini, N.; Bartlett, P.; El Ghaoui, L.; Jordan, M.I. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 2004, 5, 27–72. [Google Scholar]
- Zhang, M.-L.; Zhou, Z.-H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recogn. 2007, 40, 2038–2048. [Google Scholar] [CrossRef]
- Dietterich, T.G.; Lathrop, R.H.; Lozano-Perez, T. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 1997, 89, 31–71. [Google Scholar] [CrossRef]
- Sabato, S.; Tishby, N. Multi-instance learning with any hypothesis class. J. Mach. Learn. Res. 2012, 13, 2999–3039. [Google Scholar]
- Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076. [Google Scholar] [CrossRef]
- Chawla, N.; Bowyer, K.; Kegelmeyer, W. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Lin, Y.; Lee, Y.; Wahba, G. Support vector machines for classification in nonstandard situations. Mach. Learn. 2002, 46, 191–202. [Google Scholar] [CrossRef]
- Wu, G.; Cheng, E. Class-boundary alignment for imbalanced dataset learning. In Proceedings of the ICML 2003 Workshop Learning Imbalanced Data Sets II, Washington, DC, USA, 21 August 2003; pp. 49–56. [Google Scholar]
- Chao, W.L.; Changpinyo, S.; Gong, B.; Sha, F. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 52–68. [Google Scholar]
- Lampert, C.H.; Nickisch, H.; Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–26 June 2009; pp. 951–958. [Google Scholar]
- Rahman, S.; Khan, S.; Porikli, F. A unified approach for conventional zero-shot, generalized zero-shot, and few-shot learning. IEEE Trans. Image Process. 2018, 27, 5652–5667. [Google Scholar] [CrossRef] [PubMed]
- Wang, D.; Li, Y.; Lin, Y.; Zhuang, Y. Relational knowledge transfer for zero-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1–7. [Google Scholar]
- Tian, P.; Li, W.; Gao, Y. Consistent meta-regularization for better meta-knowledge in few-shot learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 7277–7288. [Google Scholar] [CrossRef] [PubMed]
- Rumelhart, D.E.; Durbin, R.; Golden, R.; Chauvin, Y. Backpropagation: The basic theory. In Backpropagation: Theory, Architecture, and Applications; Chauvin, Y., Rumelhart, D.E., Eds.; Lawrence Erlbaum: Hillsdale, NJ, USA, 1995; pp. 1–34. [Google Scholar]
- Baum, E.B.; Wilczek, F. Supervised learning of probability distributions by neural networks. In Neural Information Processing Systems; Anderson, D.Z., Ed.; American Institute Physics: New York, NY, USA, 1988; pp. 52–61. [Google Scholar]
- Matsuoka, K.; Yi, J. Backpropagation based on the logarithmic error function and elimination of local minima. In Proceedings of the International Joint Conference on Neural Networks, Seattle, WA, USA, 8–12 July 1991; pp. 1117–1122. [Google Scholar]
- Solla, S.A.; Levin, E.; Fleisher, M. Accelerated learning in layered neural networks. Complex Syst. 1988, 2, 625–640. [Google Scholar]
- Blum, A.L.; Rivest, R.L. Training a 3-node neural network is NP-complete. Neural Netw. 1992, 5, 117–127. [Google Scholar] [CrossRef]
- Sima, J. Back-propagation is not efficient. Neural Netw. 1996, 9, 1017–1023. [Google Scholar] [CrossRef]
- Auer, P.; Herbster, M.; Warmuth, M.K. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems; Touretzky, D.S., Mozer, M.C., Hasselmo, M.E., Eds.; MIT Press: Cambridge, MA, USA, 1996; Volume 8, pp. 316–322. [Google Scholar]
- Bartlett, P.L. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inf. Theory 1998, 44, 525–536. [Google Scholar] [CrossRef]
- Gish, H. A probabilistic approach to the understanding and training of neural network classifiers. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Albuquerque, NM, USA, 3–6 April 1990; pp. 1361–1364. [Google Scholar]
- Hinton, G.E. Connectionist learning procedure. Artif. Intell. 1989, 40, 185–234. [Google Scholar] [CrossRef]
- Rimer, M.; Martinez, T. Classification-based objective functions. Mach. Learn. 2006, 63, 183–205. [Google Scholar] [CrossRef]
- Hanson, S.J.; Burr, D.J. Minkowski back-propagation: Learning in connectionist models with non-Euclidean error signals. In Neural Information Processing Systems; Anderson, D.Z., Ed.; American Institute Physics: New York, NY, USA, 1988; pp. 348–357. [Google Scholar]
- Silva, L.M.; de Sa, J.M.; Alexandre, L.A. Data classification with multilayer perceptrons using a generalized error function. Neural Netw. 2008, 21, 1302–1310. [Google Scholar] [CrossRef] [PubMed]
- Liu, W.; Pokharel, P.P.; Principe, J.C. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Trans. Signal Process. 2007, 55, 5286–5298. [Google Scholar] [CrossRef]
- Huber, P.J. Robust Statistics; Wiley: New York, NY, USA, 1981. [Google Scholar]
- Poggio, T.; Girosi, F. Networks for approximation and learning. Proc. IEEE 1990, 78, 1481–1497. [Google Scholar] [CrossRef]
- Hui, L.; Belkin, M. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtually, 3–7 May 2021. [Google Scholar]
- Cichocki, A.; Unbehauen, R. Neural Networks for Optimization and Signal Processing; Wiley: New York, NY, USA, 1992. [Google Scholar]
- Chen, D.S.; Jain, R.C. A robust backpropagation learning algorithm for function approximation. IEEE Trans. Neural Netw. 1994, 5, 467–479. [Google Scholar] [CrossRef]
- Tabatabai, M.A.; Argyros, I.K. Robust estimation and testing for general nonlinear regression models. Appl. Math. Comput. 1993, 58, 85–101. [Google Scholar] [CrossRef]
- Singh, A.; Pokharel, R.; Principe, J.C. The C-loss function for pattern classification. Pattern Recogn. 2014, 47, 441–453. [Google Scholar] [CrossRef]
- Tikhonov, A.N. On solving incorrectly posed problems and method of regularization. Dokl. Akad. Nauk USSR 1963, 151, 501–504. [Google Scholar]
- Widrow, B.; Lehr, M.A. 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proc. IEEE 1990, 78, 1415–1442. [Google Scholar] [CrossRef]
- Niyogi, P.; Girosi, F. Generalization bounds for function approximation from scattered noisy data. Adv. Comput. Math. 1999, 10, 51–80. [Google Scholar] [CrossRef]
- Barron, A.R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 1993, 39, 930–945. [Google Scholar] [CrossRef]
- Shamir, O. Gradient methods never overfit on separable data. J. Mach. Learn. Res. 2021, 22, 1–20. [Google Scholar]
- Prechelt, L. Automatic early stopping using cross validation: Quantifying the criteria. Neural Netw. 1998, 11, 761–767. [Google Scholar] [CrossRef] [PubMed]
- Amari, S.; Murata, N.; Muller, K.R.; Finke, M.; Yang, H. Statistical theory of overtraining: Is cross-validation asymptotically effective? In Advances in Neural Information Processing Systems; Touretzky, D.S., Mozer, M.C., Hasselmo, M.E., Eds.; MIT Press: Cambridge, MA, USA, 1996; Volume 8, pp. 176–182. [Google Scholar]
- Liu, Y.; Starzyk, J.A.; Zhu, Z. Optimized approximation algorithm in neural networks without overfitting. IEEE Trans. Neural Netw. 2008, 19, 983–995. [Google Scholar]
- Hu, T.; Lei, Y. Early stopping for iterative regularization with general loss functions. J. Mach. Learn. Res. 2022, 23, 1–36. [Google Scholar]
- Geman, S.; Bienenstock, E.; Doursat, R. Neural networks and the bias/variance dilemma. Neural Comput. 1992, 4, 1–58. [Google Scholar] [CrossRef]
- Bishop, C.M. Neural Networks for Pattern Recognition; Oxford Press: New York, NY, USA, 1995. [Google Scholar]
- Bishop, C.M. Training with noise is equivalent to Tikhonov regularization. Neural Comput. 1995, 7, 108–116. [Google Scholar] [CrossRef]
- Reed, R.; Marks, R.J., II; Oh, S. Similarities of error regularization, sigmoid gain scaling, target smoothing, and training with jitter. IEEE Trans. Neural Netw. 1995, 6, 529–538. [Google Scholar] [CrossRef]
- Holmstrom, L.; Koistinen, P. Using additive noise in back-propagation training. IEEE Trans. Neural Netw. 1992, 3, 24–38. [Google Scholar] [CrossRef]
- Hinton, G.E.; van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; pp. 5–13. [Google Scholar]
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition; Rumelhart, D.E., McClelland, J.L., Eds.; MIT Press: Cambridge, MA, USA, 1986; Volume 1, pp. 318–362. [Google Scholar]
- Nowlan, S.J.; Hinton, G.E. Simplifying neural networks by soft weight-sharing. Neural Comput. 1992, 4, 473–493. [Google Scholar] [CrossRef]
- Sun, H.; Gatmiry, K.; Ahn, K.; Azizan, N. A unified approach to controlling implicit regularization via mirror descent. J. Mach. Learn. Res. 2023, 24, 1–58. [Google Scholar]
- Stankewitz, B.; Mucke, N.; Rosasco, L. From inexact optimization to learning via gradient concentration. Comput. Optim. Appl. 2023, 84, 265–294. [Google Scholar] [CrossRef]
- Neyshabur, B.; Tomioka, R.; Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 April 2015. [Google Scholar]
- Rosasco, L.; Villa, S. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Redhook, NY, USA, 2015; Volume 28. [Google Scholar]
- Wei, Y.T.; Yang, F.; Wainwright, M.J. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. IEEE Trans. Inf. Theory 2019, 65, 6685–6703. [Google Scholar] [CrossRef]
- Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; Bengio, S. Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 2010, 11, 625–660. [Google Scholar]
- Yao, Y.; Yu, B.; Gong, C.; Liu, T. Understanding how pretraining regularizes deep learning algorithms. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 5828–5840. [Google Scholar] [CrossRef]
- Chen, S.; Dobriban, E.; Lee, J.H. A group-theoretic framework for data augmentation. J. Mach. Learn. Res. 2020, 21, 1–71. [Google Scholar]
- Dao, T.; Gu, A.; Ratner, A.; Smith, V.; De Sa, C.; Re, C. A kernel theory of modern data augmentation. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 1528–1537. [Google Scholar]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollar, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 16000–16009. [Google Scholar]
- DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
- Chapelle, O.; Weston, J.; Bottou, L.; Vapnik, V. Vicinal risk minimization. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2001; pp. 416–422. [Google Scholar]
- Zhang, J.; Cho, K. Query-efficient imitation learning for end-to-end simulated driving. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
- Lin, C.-H.; Kaushik, C.; Dyer, E.L.; Muthukumar, V. The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective. J. Mach. Learn. Res. 2024, 25, 1–85. [Google Scholar]
- Bartlett, P.L.; Long, P.M.; Lugosi, G.; Tsigler, A. Benign overfitting in linear regression. Proc. Natl. Acad. Sci. USA 2020, 117, 30063–30070. [Google Scholar] [CrossRef]
- Muthukumar, V.; Narang, A.; Subramanian, V.; Belkin, M.; Hsu, D.; Sahai, A. Classification vs regression in overparameterized regimes: Does the loss function matter? J. Mach. Learn. Res. 2021, 22, 1–69. [Google Scholar]
- Shen, R.; Bubeck, S.; Gunasekar, S. Data augmentation as feature manipulation: A story of desert cows and grass cows. arXiv 2022, arXiv:2203.01572. [Google Scholar]
- Wu, D.; Xu, J. On the optimal weighted ℓ2 regularization in overparameterized linear regression. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2020; pp. 10112–10123. [Google Scholar]
- LeJeune, D.; Balestriero, R.; Javadi, H.; Baraniuk, R.G. Implicit rugosity regularization via data augmentation. arXiv 2019, arXiv:1905.11639. [Google Scholar]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
- Wan, L.; Zeiler, M.; Zhang, S.; LeCun, Y.; Fergus, R. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; pp. 1058–1066. [Google Scholar]
- Arora, R.; Bartlett, P.; Mianjy, P.; Srebro, N. Dropout: Explicit forms and capacity control. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtually, 18–24 July 2021; pp. 351–361. [Google Scholar]
- Baldi, P.; Sadowski, P. Understanding dropout. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2013; Volume 27, pp. 2814–2822. [Google Scholar]
- Cavazza, J.; Morerio, P.; Haeffele, B.; Lane, C.; Murino, V.; Vidal, R. Dropout as a low-rank regularizer for matrix factorization. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), Grimma, Germany, 9–11 April 2018; Volume 84, pp. 435–444. [Google Scholar]
- McAllester, D. A PAC-Bayesian tutorial with a dropout bound. arXiv 2013, arXiv:1307.2118. [Google Scholar]
- Mianjy, P.; Arora, R. On dropout and nuclear norm regularization. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 4575–4584. [Google Scholar]
- Senen-Cerda, A.; Sanders, J. Asymptotic convergence rate of dropout on shallow linear neural networks. In Proceedings of the ACM Measurement and Analysis of Computing Systems, Rome, Italy, 5–8 April 2022; Volume 6, pp. 32:1–32:53. [Google Scholar]
- Wager, S.; Wang, S.; Liang, P. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26; Curran Associates, Inc.: New York, NY, USA, 2013; pp. 351–359. [Google Scholar]
- Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. Comput. Sci. 2012, 3, 212–223. [Google Scholar]
- Hinton, G.E. Dropout: A Simple and Effective Way to Improve Neural Networks. 2012. Available online: https://videolectures.net (accessed on 14 December 2024).
- Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1050–1059. [Google Scholar]
- Helmbold, D.P.; Long, P.M. Surprising properties of dropout in deep networks. J. Mach. Learn. Res. 2018, 18, 1–28. [Google Scholar]
- Khan, S.H.; Hayat, M.; Porikli, F. Regularization of deep neural networks with spectral dropout. Neural Netw. 2019, 110, 82–90. [Google Scholar] [CrossRef]
- Kendall, A.; Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5574–5584. [Google Scholar]
- Sicking, J.; Akila, M.; Pintz, M.; Wirtz, T.; Wrobel, S.; Fischer, A. Wasserstein dropout. Mach. Learn. 2024, 113, 3161–3204. [Google Scholar] [CrossRef]
- Li, H.; Weng, J.; Mao, Y.; Wang, Y.; Zhan, Y.; Cai, Q.; Gu, W. Adaptive dropout method based on biological principles. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4267–4276. [Google Scholar] [CrossRef]
- Clara, G.; Langer, S.; Schmidt-Hieber, J. Dropout regularization versus ℓ2-penalization in the linear model. J. Mach. Learn. Res. 2024, 25, 1–48. [Google Scholar]
- Gao, W.; Zhou, Z.-H. Dropout Rademacher complexity of deep neural networks. Sci. China Inf. Sci. 2016, 59, 072104. [Google Scholar] [CrossRef]
- Zhai, K.; Wang, H. Adaptive dropout with Rademacher complexity regularization. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Mianjy, P.; Arora, R. On convergence and generalization of dropout training. In Advances in Neural Information Processing Systems 33; Curran Associates, Inc.: New York, NY, USA, 2020; pp. 21151–21161. [Google Scholar]
- Blanchet, J.; Kang, Y.; Olea, J.L.M.; Nguyen, V.A.; Zhang, X. Dropout training is distributionally robust optimal. J. Mach. Learn. Res. 2023, 24, 1–60. [Google Scholar]
- Murray, A.F.; Edwards, P.J. Synaptic weight noise during MLP training: Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training. IEEE Trans. Neural Netw. 1994, 5, 792–802. [Google Scholar] [CrossRef] [PubMed]
- Chiu, C.; Mehrotra, K.; Mohan, C.K.; Ranka, S. Modifying training algorithms for improved fault tolerance. In Proceedings of the IEEE International Conference on Neural Networks, Orlando, FL, USA, 26–29 June 1994; Volume 4, pp. 333–338. [Google Scholar]
- Edwards, P.J.; Murray, A.F. Towards optimally distributed computation. Neural Comput. 1998, 10, 997–1015. [Google Scholar] [CrossRef] [PubMed]
- Bernier, J.L.; Ortega, J.; Ros, E.; Rojas, I.; Prieto, A. A quantitative study of fault tolerance, noise immunity, and generalization ability of MLPs. Neural Comput. 2000, 12, 2941–2964. [Google Scholar] [CrossRef]
- Krogh, A.; Hertz, J.A. A simple weight decay improves generalization. In Proceedings of the Neural Information Processing Systems (NIPS) Conference, Denver, CO, USA, 7–10 December 1992; Morgan Kaufmann: San Mateo, CA, USA, 1992; pp. 950–957. [Google Scholar]
- Phatak, D.S. Relationship between fault tolerance, generalization and the Vapnik-Cervonenkis (VC) dimension of feedforward ANNs. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Washington, DC, USA, 10–14 July 1999; Volume 1, pp. 705–709. [Google Scholar]
- Sum, J.P.-F.; Leung, C.-S.; Ho, K.I.-J. On-line node fault injection training algorithm for MLP networks: Objective function and convergence analysis. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 211–222. [Google Scholar] [CrossRef] [PubMed]
- Ho, K.I.-J.; Leung, C.-S.; Sum, J. Convergence and objective functions of some fault/noise-injection-based online learning algorithms for RBF networks. IEEE Trans. Neural Netw. 2010, 21, 938–947. [Google Scholar] [CrossRef]
- Xiao, Y.; Feng, R.-B.; Leung, C.-S.; Sum, J. Objective function and learning algorithm for the general node fault situation. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 863–874. [Google Scholar] [CrossRef]
- Bousquet, O.; Elisseeff, A. Stability and Generalization. J. Mach. Learn. Res. 2002, 2, 499–526. [Google Scholar]
- Xu, H.; Caramanis, C.; Mannor, S. Robust regression and Lasso. IEEE Trans. Inf. Theory 2010, 56, 3561–3574. [Google Scholar] [CrossRef]
- Xu, H.; Caramanis, C.; Mannor, S. Sparse algorithms are not stable: A no-free-lunch theorem. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 187–193. [Google Scholar]
- Domingos, P. The role of Occam’s razor in knowledge discovery. Data Min. Knowl. Disc. 1999, 3, 409–425. [Google Scholar] [CrossRef]
- Zahalka, J.; Zelezny, F. An experimental test of Occam’s razor in classification. Mach. Learn. 2011, 82, 475–481. [Google Scholar] [CrossRef]
- Janssen, P.; Stoica, P.; Soderstrom, T.; Eykhoff, P. Model structure selection for multivariable systems by cross-validation. Int. J. Contr. 1988, 47, 1737–1758. [Google Scholar] [CrossRef]
- Stone, M. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. B 1974, 36, 111–147. [Google Scholar] [CrossRef]
- Bengio, Y.; Grandvalet, Y. No unbiased estimator of the variance of K-fold cross-validation. J. Mach. Learn. Res. 2004, 5, 1089–1105. [Google Scholar]
- Markatou, M.; Tian, H.; Biswas, S.; Hripcsak, G. Analysis of variance of cross-validation estimators of the generalization error. J. Mach. Learn. Res. 2005, 6, 1127–1168. [Google Scholar]
- Arlot, S.; Lerasle, M. Choice of V for V-fold cross-validation in least-squares density estimation. J. Mach. Learn. Res. 2016, 17, 1–50. [Google Scholar]
- Breiman, L.; Spector, P. Submodel selection and evaluation in regression: The X-random case. Int. Statist. Rev. 1992, 60, 291–319. [Google Scholar] [CrossRef]
- Plutowski, M.E.P. Survey: Cross-Validation in Theory and in Practice; Research Report; Department of Computational Science Research, David Sarnoff Research Center: Princeton, NJ, USA, 1996. [Google Scholar]
- Shao, J. Linear model selection by cross-validation. J. Am. Stat. Assoc. 1993, 88, 486–494. [Google Scholar] [CrossRef]
- Nadeau, C.; Bengio, Y. Inference for the generalization error. Mach. Learn. 2003, 52, 239–281. [Google Scholar] [CrossRef]
- Akaike, H. Fitting autoregressive models for prediction. Ann. Inst. Stat. Math. 1969, 21, 425–439. [Google Scholar] [CrossRef]
- Akaike, H. A new look at the statistical model identification. IEEE Trans. Auto. Contr. 1974, 19, 716–723. [Google Scholar] [CrossRef]
- Schwarz, G. Estimating the dimension of a model. Ann. Statist. 1978, 6, 461–464. [Google Scholar] [CrossRef]
- Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–477. [Google Scholar] [CrossRef]
- Stoica, P.; Selen, Y. A review of information criterion rules. IEEE Signal Process. Mag. 2004, 21, 36–47. [Google Scholar] [CrossRef]
- Akaike, H. Statistical prediction information. Ann. Inst. Stat. Math. 1970, 22, 203–217. [Google Scholar] [CrossRef]
- Rissanen, J. Hypothesis selection and testing by the MDL principle. Comput. J. 1999, 42, 260–269. [Google Scholar] [CrossRef]
- Ghodsi, A.; Schuurmans, D. Automatic basis selection techniques for RBF networks. Neural Netw. 2003, 16, 809–816. [Google Scholar] [CrossRef]
- Cawley, G.C.; Talbot, N.L.C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 2010, 11, 2079–2107. [Google Scholar]
- Cawley, G.C.; Talbot, N.L.C. Preventing over-fitting during model selection via Bayesian regularisation of the hyper-parameters. J. Mach. Learn. Res. 2007, 8, 841–861. [Google Scholar]
- Nelson, B.L. Control variate remedies. Oper. Res. 1990, 38, 974–992. [Google Scholar] [CrossRef]
- Glynn, P.W.; Szechtman, R. Some new perspectives on the method of control variates. In Monte Carlo and Quasi-Monte Carlo Methods 2000; Springer: Berlin/Heidelberg, Germany, 2002; pp. 27–49. [Google Scholar]
- Portier, F.; Segers, J. Monte Carlo integration with a growing number of control variates. J. Appl. Prob. 2018, 56, 1168–1186. [Google Scholar] [CrossRef]
- South, L.F.; Oates, C.J.; Mira, A.; Drovandi, C. Regularized zero-variance control variates. Bayes. Anal. 2023, 18, 865–888. [Google Scholar] [CrossRef]
- Deng, Z.; Kammoun, A.; Thrampoulidis, C. A model of double descent for high-dimensional binary linear classification. arXiv 2019, arXiv:1911.05822. [Google Scholar] [CrossRef]
- Kini, G.R.; Thrampoulidis, C. Analytic study of double descent in binary classification: The impact of loss. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Virtually, 12–17 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2527–2532. [Google Scholar]
- Montanari, A.; Ruan, F.; Sohn, Y.; Yan, J. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv 2019, arXiv:1911.01544. [Google Scholar]
- Chatterji, N.S.; Long, P.M. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. arXiv 2020, arXiv:2004.12019. [Google Scholar]
- Belkin, M.; Hsu, D.; Ma, S.; Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. USA 2019, 116, 15849–15854. [Google Scholar] [CrossRef]
- Geiger, M.; Spigler, S.; d’Ascoli, S.; Sagun, L.; Baity-Jesi, M.; Biroli, G.; Wyart, M. Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Phys. Rev. E 2019, 100, 012115. [Google Scholar] [CrossRef]
- Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
- Lin, L.; Dobriban, E. What causes the test error? Going beyond bias-variance via ANOVA. J. Mach. Learn. Res. 2021, 22, 1–82. [Google Scholar]
- Cherkassky, V.; Mulier, F. Learning from Data: Concepts, Theory, and Methods, 2nd ed.; Wiley: New York, NY, USA, 2007. [Google Scholar]
- Vapnik, V.N. The Nature of Statistical Learning Theory, 2nd ed.; Springer: New York, NY, USA, 2000. [Google Scholar]
- Du, K.-L. Several misconceptions and misuses of deep neural networks and deep learning. In Proceedings of the 2023 International Congress on Communications, Networking, and Information Systems (CNIS 2023), Guilin, China, 25–27 March 2023; CCIS 1893. Springer: Berlin/Heidelberg, Germany, 2023; pp. 155–171. [Google Scholar]
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: Berlin, Germany, 2005. [Google Scholar]
- Ji, Z.; Telgarsky, M. The implicit bias of gradient descent on nonseparable data. In Proceedings of the 32nd Conference on Learning Theory (COLT), Phoenix, AZ, USA, 5–8 July 2019; Volume 99, pp. 1772–1798. [Google Scholar]
- Soudry, D.; Hoffer, E.; Nacson, M.S.; Gunasekar, S.; Srebro, N. The implicit bias of gradient descent on separable data. J. Mach. Learn. Res. 2018, 19, 1–57. [Google Scholar]
- Belkin, M.; Hsu, D.; Xu, J. Two models of double descent for weak features. SIAM J. Math. Data Sci. 2020, 2, 1167–1180. [Google Scholar] [CrossRef]
- Hastie, T.; Montanari, A.; Rosset, S.; Tibshirani, R.J. Surprises in high-dimensional ridgeless least squares interpolation. Ann. Stat. 2022, 50, 949–986. [Google Scholar] [CrossRef] [PubMed]
- Mei, S.; Montanari, A. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv 2019, arXiv:1908.05355. [Google Scholar] [CrossRef]
- Adlam, B.; Pennington, J. Understanding double descent requires A fine-grained bias-variance decomposition. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 11022–11032. [Google Scholar]
- d’Ascoli, S.; Refinetti, M.; Biroli, G.; Krzakala, F. Double trouble in double descent: Bias and variance(s) in the lazy regime. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtually, 13–18 July 2020; Volume 119, pp. 2280–2290. [Google Scholar]
- Yang, Z.; Yu, Y.; You, C.; Steinhardt, J.; Ma, Y. Rethinking bias–variance trade-off for generalization of neural networks. In Proceedings of the International Conference on Machine Learning, Virtually, 13–18 July 2020; pp. 10767–10777. [Google Scholar]
- Ibragimov, I.A.; Has’minskii, R.Z. Statistical Estimation: Asymptotic Theory; Springer: Berlin/Heidelberg, Germany, 2013; Volume 16. [Google Scholar]
- Zou, D.; Wu, J.; Braverman, V.; Gu, Q.; Kakade, S.M. Benign overfitting of constant-stepsize SGD for linear regression. J. Mach. Learn. Res. 2023, 24, 1–58. [Google Scholar]
- Bartlett, P.L.; Foster, D.J.; Telgarsky, M.J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30, pp. 6240–6249. [Google Scholar]
- Bengio, Y. Learning deep architectures for AI. FNT Mach. Learn. 2009, 2, 1–127. [Google Scholar] [CrossRef]
- Dinh, L.; Pascanu, R.; Bengio, S.; Bengio, Y. Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1019–1028. [Google Scholar]
- Oneto, L.; Ridella, S.; Anguita, D. Do we really need a new theory to understand over-parameterization? Neurocomputing 2023, 543, 126227. [Google Scholar] [CrossRef]
- Chuang, C.-Y.; Mroueh, Y.; Greenewald, K.; Torralba, A.; Jegelka, S. Measuring generalization with optimal transport. In Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W., Eds.; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 8294–8306. [Google Scholar]
- Neyshabur, B.; Bhojanapalli, S.; Mcallester, D.; Srebro, N. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
- Neyshabur, B.; Li, Z.; Bhojanapalli, S.; LeCun, Y.; Srebro, N. The role of over-parametrization in generalization of neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Neyshabur, B.; Tomioka, R.; Srebro, N. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory, PMLR, Paris, France, 3–6 July 2015; Grünwald, P., Hazan, E., Kale, S., Eds.; Volume 40, pp. 1376–1401. [Google Scholar]
- Cherkassky, V.; Lee, E.H. To understand double descent, we need to understand VC theory. Neural Netw. 2024, 169, 242–256. [Google Scholar] [CrossRef]
- Shawe-Taylor, J.; Bartlett, P.L.; Williamson, R.C.; Anthony, M. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory 1998, 44, 1926–1940. [Google Scholar] [CrossRef]
- Lee, E.H.; Cherkassky, V. Understanding double descent using VC-theoretical framework. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 18838–18847. [Google Scholar] [CrossRef]
- Siegelmann, H.T.; Sontag, E.D. On the computational power of neural nets. In Proceedings of the Conference on Computational Learning Theory (COLT), Pittsburgh, PA, USA, 24–26 July 1992; pp. 440–449. [Google Scholar]
- Li, B.; Fong, R.S.; Tino, P. Simple cycle reservoirs are universal. J. Mach. Learn. Res. 2024, 25, 1–28. [Google Scholar]
- Paassen, B.; Schulz, A.; Stewart, T.C.; Hammer, B. Reservoir memory machines as neural computers. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 2575–2585. [Google Scholar] [CrossRef] [PubMed]
- Traversa, F.L.; Ramella, C.; Bonani, F.; Di Ventra, M. Memcomputing NP-complete problems in polynomial time using polynomial resources and collective states. Sci. Adv. 2015, 1, e1500031. [Google Scholar] [CrossRef] [PubMed]
- Pei, Y.R.; Traversa, F.L.; Di Ventra, M. On the universality of memcomputing machines. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1610–1620. [Google Scholar] [CrossRef]
- Cover, T.M. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 1965, 14, 326–334. [Google Scholar] [CrossRef]
- Hassoun, M.H. Fundamentals of Artificial Neural Networks; MIT Press: Cambridge, MA, USA, 1995. [Google Scholar]
- Denker, J.S.; Schwartz, D.; Wittner, B.; Solla, S.A.; Howard, R.; Jackel, L.; Hopfield, J. Large automatic learning, rule extraction, and generalization. Complex Syst. 1987, 1, 877–922. [Google Scholar]
- Muller, B.; Reinhardt, J.; Strickland, M. Neural Networks: An Introduction, 2nd ed.; Springer: Berlin, Germany, 1995. [Google Scholar]
- Friedrichs, F.; Schmitt, M. On the power of Boolean computations in generalized RBF neural networks. Neurocomputing 2005, 63, 483–498. [Google Scholar] [CrossRef]
- Kolmogorov, A.N. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk USSR 1957, 114, 953–956. [Google Scholar]
- Ismayilova, A.; Ismailov, V.E. On the Kolmogorov neural networks. Neural Netw. 2024, 176, 106333. [Google Scholar] [CrossRef]
- Schmidt-Hieber, J. The Kolmogorov–Arnold representation theorem revisited. Neural Netw. 2021, 137, 119–126. [Google Scholar] [CrossRef]
- Hecht-Nielsen, R. Kolmogorov’s mapping neural network existence theorem. In Proceedings of the 1st IEEE International Conference on Neural Networks, San Diego, CA, USA, 21–24 June 1987; Volume 3, pp. 11–14. [Google Scholar]
- Shen, Z.; Yang, H.; Zhang, S. Deep network approximation characterized by number of neurons. Commun. Comput. Phys. 2020, 28, 1768–1811. [Google Scholar] [CrossRef]
- Royden, H.L. Real Analysis, 2nd ed.; Macmillan: New York, NY, USA, 1968. [Google Scholar]
- Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Contr. Signals Syst. 1989, 2, 303–314. [Google Scholar] [CrossRef]
- Funahashi, K.-I. On the approximate realization of continuous mappings by neural networks. Neural Netw. 1989, 2, 183–192. [Google Scholar] [CrossRef]
- Chen, T.; Chen, H.; Liu, R.-W. A constructive proof and an extension of cybenko’s approximation theorem. In Computing Science and Statistics; Page, C., LePage, R., Eds.; Springer: New York, NY, USA, 1992; pp. 163–168. [Google Scholar]
- Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991, 4, 251–257. [Google Scholar] [CrossRef]
- Leshno, M.; Lin, V.Y.; Pinkus, A.; Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 1993, 6, 861–867. [Google Scholar] [CrossRef]
- Li, L.K. Approximation theory and recurrent networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Baltimore, MD, USA, 7–11 June 1992; pp. 266–271. [Google Scholar]
- Li, X.; Yu, W. Dynamic system identification via recurrent multilayer perceptrons. Inf. Sci. 2002, 147, 45–63. [Google Scholar] [CrossRef]
- Gonon, L.; Ortega, J.-P. Fading memory echo state networks are universal. Neural Netw. 2021, 138, 10–13. [Google Scholar] [CrossRef]
- Arena, P.; Fortuna, L.; Muscato, G.; Xibilia, M. Multilayer perceptrons to approximate quaternion valued functions. Neural Netw. 1997, 10, 335–342. [Google Scholar] [CrossRef]
- Buchholz, S.; Sommer, G. A hyperbolic multilayer perceptron. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Como, Italy, 24–27 July 2000; IEEE: Piscataway, NJ, USA, 2000; Volume 2, pp. 129–133. [Google Scholar]
- Buchholz, S.; Sommer, G. Clifford algebra multilayer perceptrons. In Geometric Computing with Clifford Algebras: Theoretical Foundations and Applications in Computer Vision and Robotics; Springer: Berlin/Heidelberg, Germany, 2001; pp. 315–334. [Google Scholar]
- Carniello, R.A.F.; Vital, W.L.; Valle, M.E. Universal approximation theorem for tessarine-valued neural networks. In Proceedings of the Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), Sao Carlos, Sao Paulo, Brazil, 25–29 2021; pp. 233–243. [Google Scholar]
- Voigtlaender, F. The universal approximation theorem for complex-valued neural networks. Appl. Comput. Harmon. Anal. 2023, 64, 33–61. [Google Scholar] [CrossRef]
- Valle, M.E.; Vital, W.L.; Vieira, G. Universal approximation theorem for vector-and hypercomplex-valued neural networks. Neural Netw. 2024, 180, 106632. [Google Scholar] [CrossRef]
- Valle, M.E. Understanding vector-valued neural networks and their relationship with real and hypercomplex-valued neural networks: Incorporating intercorrelation between features into neural networks [Hypercomplex signal and image processing]. IEEE Signal Process. Mag. 2024, 41, 49–58. [Google Scholar] [CrossRef]
- Manita, O.A.; Peletier, M.A.; Portegies, J.W.; Sanders, J.; Senen-Cerda, A. Universal approximation in dropout neural networks. J. Mach. Learn. Res. 2022, 23, 1–46. [Google Scholar]
- Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366. [Google Scholar] [CrossRef]
- Klassen, M.; Pao, Y.; Chen, V. Characteristics of the functional link net: A higher order delta rule net. In Proceedings of the IEEE International Conference on Neural Networks (ICNN), San Diego, CA, USA, 18–22 July 1988; pp. 507–513. [Google Scholar]
- Chen, C.L.P.; Liu, Z.L. Broad learning system: An effective and efficient incremental learning system without the need for deep architecture. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 10–24. [Google Scholar] [CrossRef] [PubMed]
- Chen, C.L.P.; Liu, Z.; Feng, S. Universal approximation capability of broad learning system and its structural variations. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1191–1204. [Google Scholar] [CrossRef]
- Kratsios, A.; Papon, L. Universal approximation theorems for differentiable geometric deep learning. J. Mach. Learn. Res. 2022, 23, 1–73. [Google Scholar]
- Ganea, O.; Becigneul, G.; Hofmann, T. Hyperbolic neural networks. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2018; Volume 31, pp. 5345–5355. [Google Scholar]
- Krishnan, R.G.; Shalit, U.; Sontag, D. Deep Kalman filters. In Proceedings of the NeurIPS—Advances in Approximate Bayesian Inference, Montreal, QC, Canada, 11 December 2015. [Google Scholar]
- Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R.R.; Smola, A.J. Deep sets. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30, pp. 3391–3401. [Google Scholar]
- Wagstaff, E.; Fuchs, F.B.; Engelcke, M.; Osborne, M.A.; Posner, I. Universal approximation of functions on sets. J. Mach. Learn. Res. 2022, 23, 1–56. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
- Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.R.; Choi, S.; Teh, Y.W. Set Transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 3744–3753. [Google Scholar]
- Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Murphy, R.L.; Srinivasan, B.; Rao, V.; Ribeiro, B. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Baldi, P.; Vershynin, R. The capacity of feedforward neural networks. Neural Netw. 2019, 116, 288–311. [Google Scholar] [CrossRef]
- Baldi, P.; Vershynin, R. On neuronal capacity. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2018; pp. 7740–7749. [Google Scholar]
- Bartlett, P.L.; Harvey, N.; Liaw, C.; Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 2019, 20, 1–17. [Google Scholar]
- Graves, A.; Wayne, G.; Danihelka, I. Neural Turing machines. arXiv 2014, arXiv:1410.5401. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2020; pp. 1877–1901. [Google Scholar]
- Perez, J.; Barcelo, P.; Marinkovic, J. Attention is Turing complete. J. Mach. Learn. Res. 2021, 22, 1–35. [Google Scholar]
- Dehghani, M.; Gouws, S.; Vinyals, O.; Uszkoreit, J.; Kaiser, L. Universal transformers. arXiv 2018, arXiv:1807.03819. [Google Scholar]
- Kleene, S.C. Representation of events in nerve nets and finite automata. In Automata Studies; Shannon, C., McCarthy, J., Eds.; Princeton University Press: Princeton, NJ, USA, 1956; pp. 3–42. [Google Scholar]
- Hahn, M. Theoretical limitations of self-attention in neural sequence models. arXiv 2019, arXiv:1906.06755. [Google Scholar] [CrossRef]
- Yun, C.; Bhojanapalli, S.; Rawat, A.S.; Reddi, S.J.; Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? In Proceedings of the International Conference on Learning Representations (ICLR), Virtually, 26–30 April 2020.
- Chen, Y.; Gilroy, S.; Maletti, A.; May, J.; Knight, K. Recurrent neural networks as weighted language recognizers. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), New Orleans, LA, USA, 1–6 June 2018; pp. 2261–2271. [Google Scholar]
- Maass, W. On the computational power of winner-take-all. Neural Comput. 2000, 12, 2519–2535. [Google Scholar] [CrossRef] [PubMed]
- Vapnik, V.N.; Chervonenkis, A.J. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Its Appl. 1971, 16, 264–280. [Google Scholar] [CrossRef]
- Valiant, P. A theory of the learnable. Commun. ACM 1984, 27, 1134–1142. [Google Scholar] [CrossRef]
- Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995. [Google Scholar]
- Schapire, R.E. The strength of weak learnability. Mach. Learn. 1990, 5, 197–227. [Google Scholar] [CrossRef]
- Bartlett, P.L.; Long, P.M.; Williamson, R.C. Fat-shattering and the learnability of real-valued functions. In Proceedings of the 7th Annual Conference on Computational Learning Theory, New Brunswick, NJ, USA, 10–12 July 1994; pp. 299–310. [Google Scholar]
- Mendelson, S. Rademacher averages and phase transitions in Glivenko-Cantelli classes. IEEE Trans. Inf. Theory 2002, 48, 251–263. [Google Scholar] [CrossRef]
- Hanneke, S.; Yang, L. Minimax analysis of active learning. J. Mach. Learn. Res. 2015, 16, 3487–3602. [Google Scholar]
- Cherkassky, V.; Ma, Y. Another look at statistical learning theory and regularization. Neural Netw. 2009, 22, 958–969. [Google Scholar] [CrossRef]
- Wolpert, D.H.; Macready, W.G. No Free Lunch Theorems for Search; SFI-TR-95-02-010; Santa Fe Institute: Santa Fe, NM, USA, 1995. [Google Scholar]
- Cataltepe, Z.; Abu-Mostafa, Y.S.; Magdon-Ismail, M. No free lunch for early stropping. Neural Comput. 1999, 11, 995–1009. [Google Scholar] [CrossRef]
- Magdon-Ismail, M. No free lunch for noise prediction. Neural Comput. 2000, 12, 547–564. [Google Scholar] [CrossRef] [PubMed]
- Zhu, H. No free lunch for cross validation. Neural Comput. 1996, 8, 1421–1426. [Google Scholar] [CrossRef]
- Goutte, C. Note on free lunches and cross-validation. Neural Comput. 1997, 9, 1245–1249. [Google Scholar] [CrossRef]
- Rivals, I.; Personnaz, L. On cross-validation for model selection. Neural Comput. 1999, 11, 863–870. [Google Scholar] [CrossRef] [PubMed]
- Haussler, D. Probably approximately correct learning. In Proceedings of the 8th National Conference on Artificial Intelligence, Boston, MA, USA, 29 July–3 August 1990; Volume 2, pp. 1101–1108. [Google Scholar]
- McAllester, D. Some PAC-Bayesian theorems. In Proceedings of the Annual Conference on Computational Learning Theory (COLT), Tuscaloosa, AL, USA, 23–26 July 1998; ACM: New York, NY, USA, 1998; pp. 230–234. [Google Scholar]
- Shawe-Taylor, J.; Williamson, R. A PAC analysis of a bayesian estimator. In Proceedings of the Annual Conference on Computational Learning Theory (COLT), Nashville, TN, USA, 29 July–1 August 1997; ACM: New York, NY, USA, 1997; pp. 2–9. [Google Scholar]
- Viallard, P.; Germain, P.; Habrard, A.; Morvant, E. A general framework for the practical disintegration of PAC-Bayesian bounds. Mach. Learn. 2024, 113, 519–604. [Google Scholar] [CrossRef]
- Blanchard, G.; Fleuret, F. Occam’s hammer. In Proceedings of the Annual Conference on Learning Theory (COLT), Angeles, CA, USA, 25–28 June 2007; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4539, pp. 112–126. [Google Scholar]
- Catoni, O. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. arXiv 2007, arXiv:0712.0248. [Google Scholar]
- Anthony, M.; Biggs, N. Computational Learning Theory; Cambridge University Press: Cambridge, UK, 1992. [Google Scholar]
- Shawe-Taylor, J. Sample sizes for sigmoidal neural networks. In Proceedings of the 8th Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 17–19 July 1995; pp. 258–264. [Google Scholar]
- Baum, E.B.; Haussler, D. What size net gives valid generalization? Neural Comput. 1989, 1, 151–160. [Google Scholar] [CrossRef]
- Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 1999. [Google Scholar]
- Gribonval, R.; Jenatton, R.; Bach, F.; Kleinsteuber, M.; Seibert, M. Sample complexity of dictionary learning and other matrix factorizations. IEEE Trans. Inf. Theory 2015, 61, 3469–3486. [Google Scholar] [CrossRef]
- Simon, H.U. An almost optimal PAC algorithm. In Proceedings of the 28th Conference on Learning Theory (COLT), Paris, France, 6–9 July 2015; pp. 1–12. [Google Scholar]
- Bartlett, P.L.; Maass, W. Vapnik-Chervonenkis dimension of neural nets. In The Handbook of Brain Theory and Neural Networks, 2nd ed.; Arbib, M.A., Ed.; MIT Press: Cambridge, MA, USA, 2003; pp. 1188–1192. [Google Scholar]
- Koiran, P.; Sontag, E.D. Neural networks with quadratic VC dimension. In Advances in Neural Information Processing Systems; Touretzky, D.S., Mozer, M.C., Hasselmo, M.E., Eds.; MIT Press: Cambridge, MA, USA, 1996; Volume 8, pp. 197–203. [Google Scholar]
- Schmitt, M. On the capabilities of higher-order neurons: A radial basis function approach. Neural Comput. 2005, 17, 715–729. [Google Scholar] [CrossRef]
- Bartlett, P.L. Lower bounds on the Vapnik-Chervonenkis dimension of multi-layer threshold networks. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory (COLT), Louisville, KY, USA, 10–13 July 1993; ACM Press: New York, NY, USA, 1993; pp. 144–150. [Google Scholar]
- Vapnik, V.; Levin, E.; Le Cun, Y. Measuring the VC-dimension of a learning machine. Neural Comput. 1994, 6, 851–876. [Google Scholar] [CrossRef]
- Shao, X.; Cherkassky, V.; Li, W. Measuring the VC-dimension using optimized experimental design. Neural Comput. 2000, 12, 1969–1986. [Google Scholar] [CrossRef] [PubMed]
- Mpoudeu, M.; Clarke, B. Model selection via the VC dimension. J. Mach. Learn. Res. 2019, 20, 1–26. [Google Scholar]
- Anthony, M.; Bartlett, P.L. Neural Network Learning: Theoretical Foundations; Cambridge University Press: Cambridge, CA, USA, 1999; Volume 9. [Google Scholar]
- Elesedy, B.; Zaidi, S. Provably strict generalisation benefit for equivariant models. In Proceedings of the International Conference on Machine Learning (ICML), Virtually, 18–24 July 2021; pp. 2959–2969. [Google Scholar]
- Cohen, T.; Welling, M. Group equivariant convolutional networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 2990–2999. [Google Scholar]
- Bietti, A.; Venturi, L.; Bruna, J. On the sample complexity of learning under geometric stability. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 18673–18684. [Google Scholar]
- Sannai, A.; Imaizumi, M.; Kawano, M. Improved generalization bounds of group invariant/equivariant deep networks via quotient feature spaces. In Uncertainty Inartificial Intelligence; PMLR; Elsevier: Amsterdam, The Netherlands, 2021; pp. 771–780. [Google Scholar]
- Shao, H.; Montasser, O.; Blum, A. A theory of pac learnability under transformation invariances. Adv. Neural Inf. Process. Syst. 2022, 35, 13989–14001. [Google Scholar]
- Philipp Christian Petersen, Anna Sepliarskaia VC dimensions of group convolutional neural networks. Neural Netw. 2024, 169, 462–474. [CrossRef]
- Goldman, S.; Kearns, M. On the complexity of teaching. J. Comput. Syst. Sci. 1995, 50, 20–31. [Google Scholar] [CrossRef]
- Shinohara, A.; Miyano, S. Teachability in computational learning. New Gener. Comput. 1991, 8, 337–348. [Google Scholar] [CrossRef]
- Liu, J.; Zhu, X. The teaching dimension of linear learners. J. Mach. Learn. Res. 2016, 17, 1–25. [Google Scholar]
- Simon, H.U.; Telle, J.A. MAP- and MLE-based teaching. J. Mach. Learn. Res. 2024, 25, 1–34. [Google Scholar]
- Koltchinskii, V. Rademacher penalties and structural risk minimization. IEEE Trans. Inf. Theory 2001, 47, 1902–1914. [Google Scholar] [CrossRef]
- Bartlett, P.L.; Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 2002, 3, 463–482. [Google Scholar]
- Anguita, D.; Ghio, A.; Oneto, L.; Ridella, S. A deep connection between the Vapnik-Chervonenkis entropy and the Rademacher complexity. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 2202–2211. [Google Scholar] [CrossRef] [PubMed]
- Bartlett, P.L.; Bousquet, O.; Mendelson, S. Local Rademacher complexities. Ann. Stat. 2005, 33, 1497–1537. [Google Scholar] [CrossRef]
- Dudley, R. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Funct. Anal. 1967, 1, 290–330. [Google Scholar] [CrossRef]
- Mendelson, S. A few notes on statistical learning theory. In Advanced Lectures on Machine Learning; Mendelson, S., Smola, A., Eds.; LNCS Volume 2600; Springer: Berlin, Germany, 2003; pp. 1–40. [Google Scholar]
- Yu, H.-F.; Jain, P.; Dhillon, I.S. Large-scale multi-label learning with missing labels. In Proceedings of the 21st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1–9. [Google Scholar]
- Lei, Y.; Ding, L.; Zhang, W. Generalization performance of radial basis function networks. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 551–564. [Google Scholar]
- Vapnik, V.N. Estimation of Dependences Based on Empirical Data; Springer: New York, NY, USA, 1982. [Google Scholar]
- Oneto, L.; Anguita, D.; Ridella, S. A local Vapnik-Chervonenkis complexity. Neural Netw. 2016, 82, 62–75. [Google Scholar] [CrossRef]
- Cherkassky, V.; Ma, Y. Comparison of model selection for regression. Neural Comput. 2003, 15, 1691–1714. [Google Scholar] [CrossRef]
- Grunwald, P.D.; Mehta, N.A. Fast rates for general unbounded loss functions: From ERM to generalized Bayes. J. Mach. Learn. Res. 2020, 21, 1–80. [Google Scholar]
- Vovk, V.; Papadopoulos, H.; Gammerman, A. (Eds.) Measures of complexity: Festschrift for Alexey Chervonenkis; Springer International Publishing: Cham, Switzerland, 2015. [Google Scholar]
- V’yugin, V.V. VC dimension, fat-shattering dimension, rademacher averages, and their applications. In Measures of Complexity; Vovk, V., Papadopoulos, H., Gammerman, A., Eds.; Springer: Cham, Switzerland, 2015; pp. 57–74. [Google Scholar]
- Koltchinskii, V.; Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Stat. 2002, 30, 1–50. [Google Scholar] [CrossRef]
- Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: New York, NY, USA, 2014. [Google Scholar]
- Du, K.-L.; Leung, C.-S.; Mow, W.H.; Swamy, M.N.S. Perceptron: Learning, generalization, model Selection, fault tolerance, and role in the deep learning era. Mathematics 2022, 10, 4730. [Google Scholar] [CrossRef]
- Kawaguchi, K.; Kaelbling, L.P.; Bengio, Y. Generalization in deep learning. arXiv 2017, arXiv:1710.05468. [Google Scholar]
- Balestriero, R.; Bottou, L.; LeCun, Y. The effects of regularization and data augmentation are class dependent. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2022; Volume 35, pp. 37878–37891. [Google Scholar]
- Ratner, A.J.; Ehrenberg, H.; Hussain, Z.; Dunnmon, J.; Re, C. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Redhook, NY, USA, 2017; Volume 30. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).