Abstract
The aim of this article is to establish a stochastic search algorithm for neural networks based on fractional stochastic processes with the Hurst parameter $H \in (0,1)$. We define and discuss the properties of fractional stochastic processes, which generalize a standard Brownian motion. Fractional stochastic processes capture useful yet different properties in order to simulate real-world phenomena. This approach provides new insights into stochastic gradient descent (SGD) algorithms in machine learning. We exhibit convergence properties for fractional stochastic processes.
Keywords:
fractional Brownian motion; fractional stochastic gradient descent; machine learning; stochastic gradient descent; complex systems
MSC:
37M05; 60G18; 60G22; 62M45; 68T05; 68Q01; 65Y04; 68M07
1. Introduction
The gradient descent methodology is not computationally efficient in all applications. Sometimes, optimization algorithms become stuck in the flat regions of manifolds. In those cases, the optimization algorithm requires a long time to escape. This is the challenge of a vanishing gradient where, for instance, the gradient $\nabla C$ of the cost function is almost zero (see Section 4). The method of stochastic gradient descent (SGD) generally overcomes this problem.
Recent advancements in the field of fractional stochastic processes exhibit the theoretical benefits of modeling complex systems [1,2,3,4]. Yet, so far, no literature exists on fractional stochastic gradient descent (fSGD) or fractional stochastic networks. This paper sketches the potential of such new literature for modeling complex systems. Moreover, we exhibit that fractional stochastic processes are an advancement in machine learning (ML) and artificial intelligence (AI).
The methodology of fractional stochastic gradient descent and the role of stochastic neural networks are based on a generalized assumption of randomness. Mandelbrot and Van Ness defined a fractional Brownian motion (fBM), $B^H_t$, together with a Hurst parameter $H \in (0,1)$ in 1968 [5]. For $H = 1/2$, we obtain a standard Brownian motion. Yet, for $H \neq 1/2$, we obtain new forms of randomness or stochastic processes that match real-world phenomena.
The new feature of a fractional Brownian motion (fBM) is that its increments are interdependent. In addition, an fBM is self-similar: a self-similar stochastic process is invariant with respect to the time scale (scaling invariance). A standard Brownian motion or a Lévy process displays different properties: they have independent increments and belong to the famous class of Markov processes.
However, in science, there is ubiquitous evidence that fractional stochastic processes are of relevance. For instance, we frequently observe probability densities with sharp peaks, which is related to the phenomenon of long-range dependence. In many real-world observations and applications, we find the presence of interdependence, too. This pattern can be captured by fractional stochastic processes.
Nonetheless, some phenomena are even more complicated and require further generalization towards sub-fractional stochastic processes. The literature on sub-fBMs demonstrates that those stochastic processes are useful in scientific applications [6]. A sub-fractional Brownian motion provides a nexus between a Brownian motion and a fractional stochastic process. Those processes were introduced by Tudor et al. [7,8] and Bojdecki et al. [9]. Note that, as sub-fractional stochastic processes are not martingales, the basic tools of stochastic analysis are insufficient. However, researchers have developed new machinery to handle fractional stochastic processes, such as [10] or [11,12,13,14,15].
In this paper, our purpose is to develop and study the idea of fractional stochastic gradient descent algorithms. Our approach generalizes the existing literature on stochastic gradient descent (SGD) and stochastic neural networks. For instance, Hopfield [16] developed neural networks consisting of several perceptrons with randomness. Similarly, a Boltzmann network is a type of stochastic neural network wherein the output of the activation function is interpreted as a probability.
Studies already exist on stochastic gradient descent and its challenges in machine learning [17,18]. Recent developments in the theory and applications of stochastic gradient descent are discussed in the following papers: Schmidt et al. [19], Haochen and Sra [20], Gotmare et al. [21], Curtis and Scheinberg [22], and de Roos et al. [23]. The focus of our research is the motivation of fractional stochastic gradient descent (fSGD) algorithms. Thus, our research goes beyond the scope of the current literature and focuses on the theoretical possibility of fractional stochastic gradient descent. We neglect potential computational limitations in machine learning.
The paper is organized as follows. Section 2 provides preliminary definitions. Subsequently, we introduce the foundations of fractional stochastic processes in Section 3. Section 4 introduces the idea of fractional stochastic search algorithms and derives the convergence results in general. Finally, in Section 5, we apply the method to two different cases. Section 6 concludes the paper.
2. Preliminaries
Machine learning is mainly based on neural networks and efficient optimization algorithms. The most primitive neural model is inspired by the work of Rosenblatt [24]. In the following section, we define the major elements from a machine learning perspective.
Definition 1.
A stochastic neuron is defined by n inputs $x_1,\ldots,x_n$, n weighting factors $w_1,\ldots,w_n$ and a bias, together with a sigmoid activation function $\sigma(v) = 1/(1+e^{-v})$ with $\sigma(v) \in (0,1)$ and a stochastic output $y \in \{0,1\}$,
where $v = \sum_{i=1}^n w_i x_i + b$ denotes the activation potential. Hence, we define the output for $y = 1$ by $P(y=1) = \sigma(v)$ and for $y = 0$ by the inverse probability: $P(y=0) = 1 - \sigma(v)$.
Note that, if the activation potential is greater than zero, such as $v > 0$, then this neuron is not necessarily activated according to Definition 1. An activation value of one only occurs with the probability given by the activation function.
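To make Definition 1 concrete, the following sketch simulates a stochastic neuron in Python. The inputs, weights, and bias are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    """Logistic activation: maps the potential to a firing probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def stochastic_neuron(x, w, b):
    """Output 1 with probability sigma(w.x + b), otherwise 0 (Definition 1)."""
    p = sigmoid(np.dot(w, x) + b)
    return int(rng.random() < p)

# A positive activation potential does not force activation:
x = np.array([1.0, 2.0])   # illustrative inputs
w = np.array([0.5, -0.2])  # illustrative weights
b = 0.1                    # illustrative bias
outputs = [stochastic_neuron(x, w, b) for _ in range(1000)]
# the empirical firing frequency approximates sigmoid(0.2), roughly 0.55
```

Repeating the draw shows both outcomes even though the potential $v = 0.2$ is positive, which is exactly the point of the remark above.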
In machine learning, the gradient descent algorithm is omnipresent in all optimization problems. Yet, it does not provide robust solutions in each case. There are computational obstacles, such as when the algorithm becomes stuck in a local minimum or lost in a plateau from which it takes a long time to get out. A plateau is defined as a flat surface region where the gradient is very small (or almost zero).
The optimization algorithm of a neural network always has the goal of finding the optimal weighting parameters $w_i$. The standard algorithm used to optimize the parameters is frequently reformulated in order to minimize the cost function. This is called the gradient descent method. This method is closely related to Newton's algorithm in numerical computing. The following definition summarizes the algorithm from a machine learning vantage point.
Definition 2.
The gradient descent algorithm is defined by
$$w_{k+1} = w_k - \eta\, M\, \nabla C(w_k),$$
where $\nabla C$ is the gradient of a cost function, $M$ is an optional conditioning matrix, and $\eta > 0$ is the learning rate.
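A minimal sketch of Definition 2, assuming the conditioning matrix defaults to the identity; the quadratic cost in the usage example is a hypothetical illustration:

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, M=None, n_steps=200):
    """Iterate w_{k+1} = w_k - eta * M @ grad(w_k); M defaults to the identity."""
    theta = np.asarray(theta0, dtype=float)
    if M is None:
        M = np.eye(theta.size)
    for _ in range(n_steps):
        theta = theta - eta * M @ grad(theta)
    return theta

# Hypothetical quadratic cost C(w) = ||w - 1||^2 with gradient 2 * (w - 1):
theta_star = gradient_descent(lambda w: 2.0 * (w - 1.0), np.zeros(2))
# theta_star converges to the minimizer [1, 1]
```

Setting `M` to an approximate inverse Hessian recovers a preconditioned (Newton-like) variant of the same iteration.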
The stochastic gradient descent (SGD) method overcomes this obstacle when the gradient is close to zero. Indeed, SGD reaches a minimum along a non-linear stochastic process. In the following sections, we first discuss the literature and then generalize the approach to fractional stochastic processes.
3. Fractional Stochastic Processes
3.1. General Definitions
Consider a stochastic process $B^H_t$ with a Hurst parameter H. Subsequently, we define the elementary tools of fractional calculus.
Definition 3.
Let $\alpha > 0$ and $f \in L^1([a,b])$. Let $t \in [a,b]$ and let $\Gamma$ denote the gamma function. The left- and right-sided fractional integrals of f of order α are defined for almost all $t \in [a,b]$, respectively, as
$$(I^\alpha_{a+} f)(t) = \frac{1}{\Gamma(\alpha)} \int_a^t (t-s)^{\alpha-1} f(s)\,ds$$
and
$$(I^\alpha_{b-} f)(t) = \frac{1}{\Gamma(\alpha)} \int_t^b (s-t)^{\alpha-1} f(s)\,ds.$$
This is the fractional integral of the Riemann–Liouville type. In the same vein, we define fractional derivatives, where we distinguish between left- and right-sided derivatives.
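As a numerical illustration of Definition 3, the left-sided Riemann–Liouville integral can be approximated by quadrature; the substitution $u = (t-s)^\alpha$ removes the kernel singularity. The function name and discretization are our own choices:

```python
import math

def rl_left_integral(f, a, t, alpha, n=20000):
    """Approximate the left-sided Riemann-Liouville integral (I_{a+}^alpha f)(t).

    Uses the substitution u = (t - s)^alpha, which turns the singular kernel
    into a bounded integrand handled by the composite midpoint rule.
    """
    U = (t - a) ** alpha
    h = U / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        total += f(t - u ** (1.0 / alpha))
    # 1 / (alpha * Gamma(alpha)) = 1 / Gamma(alpha + 1)
    return total * h / math.gamma(alpha + 1)

# Sanity check against the closed form I_{0+}^alpha 1 = t^alpha / Gamma(alpha + 1):
val = rl_left_integral(lambda s: 1.0, 0.0, 2.0, 0.5)
# val equals 2^{1/2} / Gamma(3/2) up to quadrature error
```

The closed-form checks for constant and linear integrands follow directly from the definition and the identity $\alpha\,\Gamma(\alpha) = \Gamma(\alpha+1)$.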
Definition 4.
The fractional left- and right-sided derivatives, for $0 < \alpha < 1$ and $t \in [a,b]$, are defined by
$$(D^\alpha_{a+} f)(t) = \frac{1}{\Gamma(1-\alpha)}\, \frac{d}{dt} \int_a^t \frac{f(s)}{(t-s)^{\alpha}}\,ds$$
and
$$(D^\alpha_{b-} f)(t) = \frac{-1}{\Gamma(1-\alpha)}\, \frac{d}{dt} \int_t^b \frac{f(s)}{(s-t)^{\alpha}}\,ds$$
for all $f \in I^\alpha_{a+}(L^1)$ and $f \in I^\alpha_{b-}(L^1)$, respectively, where $I^\alpha_{a\pm}(L^1)$ is the image of $L^1([a,b])$ under the corresponding fractional integral.
Let us assume $f \in C^1([a,b])$; then the above derivatives can be evaluated directly.
Notably, $(D^\alpha_{a\pm} f)(t)$ exists for all $t \in [a,b]$ if $0 < \alpha < 1$. Given those definitions, we are ready to define a fractional Brownian motion:
Definition 5.
Let $H \in (0,1)$, and let $B^H_0$ be an arbitrary real number. We call $B^H = \{B^H_t, t \geq 0\}$ a fractional Brownian motion (fBM) with Hurst parameter H and starting value $B^H_0$ at time 0, such that
- 1.
- $B^H_t$ has continuous sample paths and $B^H_0 = 0$ almost surely, and;
- 2.
- $B^H_t = \frac{1}{\Gamma(H+1/2)} \left( \int_{-\infty}^0 \left[ (t-s)^{H-1/2} - (-s)^{H-1/2} \right] dW_s + \int_0^t (t-s)^{H-1/2}\,dW_s \right)$ [Weyl fractional integral];
- 3.
- Equivalent to the Riemann–Liouville integral: $B^H_t = \frac{1}{\Gamma(H+1/2)} \int_0^t (t-s)^{H-1/2}\,dW_s$.
Next, let us consider the following corollary:
Corollary 1.
Consider Definition 5 and $H = 1/2$. Then the fractional Brownian motion is a standard Brownian motion $W_t$.
Proof.
Let $H = 1/2$; we find $B^{1/2}_t = \frac{1}{\Gamma(1)} \int_0^t (t-s)^{0}\,dW_s = W_t$. □
In the literature, there exists an alternative, yet useful, definition:
Definition 6.
A fractional Brownian motion is a Gaussian process $B^H_t$ for $t \geq 0$, defined by the following covariance function
$$R_H(s,t) = \mathbb{E}\left[ B^H_s B^H_t \right] = \frac{1}{2}\left( s^{2H} + t^{2H} - |t-s|^{2H} \right),$$
where the Hurst index is denoted by $H \in (0,1)$.
Since the covariance of a Brownian motion is given in the literature as $\mathbb{E}[W_s W_t] = \min(s,t)$, it is easy to extend the definition to an fBM with Hurst index H, such as
$$\mathbb{E}\left[ B^H_s B^H_t \right] = \frac{1}{2}\left( s^{2H} + t^{2H} - |t-s|^{2H} \right),$$
where we obtain the definition of a Brownian motion for $H = 1/2$. Following Herzog [15], we derive the covariance step-by-step:
Corollary 2.
Consider a fractional Brownian motion. The expectation values of non-overlapping increments are $\mathbb{E}\left[ B^H_t - B^H_s \right] = 0$ and the variance is $\mathbb{E}\left[ (B^H_t - B^H_s)^2 \right] = |t-s|^{2H}$ for all $s, t \geq 0$.
Proof.
See [15]. □
3.2. Properties
Next, we consider the properties of the fBM over time for different Hurst parameters. Suppose $H < 1/2$ or $H > 1/2$. If we assume that the Hurst parameter is of $H \in (0, 1/2)$, we say the fractional stochastic process has a short memory. Conversely, if $H \in (1/2, 1)$, we obtain the property of long-range dependence. Figure 1 illustrates sample processes for the three ranges of the Hurst parameter, H.
Figure 1.
Different fractional Brownian motions with the following Hurst index: (left panel) $H < 1/2$, (middle panel) $H = 1/2$ (standard BM), and (right panel) $H > 1/2$.
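Sample paths such as those in Figure 1 can be generated directly from the covariance function of the fBM. The following sketch uses a Cholesky factorization of the covariance matrix — an exact but $O(n^3)$ method; grid size and seed are illustrative choices:

```python
import numpy as np

def fbm_cholesky(n, H, T=1.0, seed=0):
    """Sample one fBM path on (0, T] from the covariance
    R(s, t) = 0.5 * (s^{2H} + t^{2H} - |t - s|^{2H})  (Definition 6)."""
    t = np.linspace(T / n, T, n)
    s, u = np.meshgrid(t, t)
    cov = 0.5 * (s ** (2 * H) + u ** (2 * H) - np.abs(s - u) ** (2 * H))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))  # jitter for stability
    z = np.random.default_rng(seed).standard_normal(n)
    return t, L @ z

# H > 1/2 gives persistent (long-memory) paths, H < 1/2 anti-persistent ones:
t, path = fbm_cholesky(200, 0.75)
```

For long paths, spectral methods such as Davies–Harte are the usual faster alternative; the Cholesky route is chosen here because it mirrors the covariance definition verbatim.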
Proposition 1.
Given a fractional Brownian motion, we obtain the following properties:
- 1.
- The fBM has stationary increments: $B^H_{t+s} - B^H_s \overset{d}{=} B^H_t - B^H_0$;
- 2.
- The fBM is H-self-similar, such as $B^H_{at} \overset{d}{=} a^H B^H_t$ for all $a > 0$;
- 3.
- The fBM has variance $\mathbb{E}\left[ (B^H_t)^2 \right] = t^{2H}$.
Proof.
The proof follows Herzog [15]. In order to prove the stationarity of the increments, we set $Y_t = B^H_{t+s} - B^H_s$. The equality of the covariance implies that $Y_t$ is a centered Gaussian process with the covariance of an fBM. Moreover, it has the same distribution, such that $Y_t \overset{d}{=} B^H_t$. This demonstrates that the distribution of an increment does not depend on its starting point in time. Consequently, we obtain stationary increments.
The second property of Proposition 1 is self-similarity. Consider the following definition, $Y_t = a^{-H} B^H_{at}$ for $a > 0$.
Here, we find that $Y_t$ is centered Gaussian with covariance $a^{-2H} R_H(as, at) = R_H(s,t)$, and hence $Y_t \overset{d}{=} B^H_t$. Part (3) is already given in Corollary 2. □
3.3. Definition of Sub-Fractional Processes
In a recent paper, Herzog [15] described a sub-fractional Brownian motion (sub-fBM) as an intermediate between a Brownian motion and a fractional Brownian motion. In general, a sub-fBM is a self-similar Gaussian process. Note that both the fBM and sub-fBM have the properties of self-similarity and long-range dependence, yet a sub-fBM does not have stationary increments [9].
Any Brownian motion is uniquely defined by its covariance. For the sub-fBM, we denote the covariance by $C_H(s,t)$.
Definition 7.
Consider a sub-fractional Brownian motion $S^H_t$ with Hurst parameter H, a centered mean-zero Gaussian process with the following covariance function
$$C_H(s,t) = s^{2H} + t^{2H} - \frac{1}{2}\left[ (s+t)^{2H} + |t-s|^{2H} \right],$$
where $s, t \geq 0$ and $H \in (0,1)$.
Note, a (sub-)fractional Brownian motion coincides with a Brownian motion if the Hurst parameter is $H = 1/2$. Thus, a Brownian motion on the real line has a covariance of $C_{1/2}(s,t) = \min(s,t)$. The process has the following representation (see [25]):
The kernel function of a sub-fractional Brownian motion is given by
3.4. Properties of Sub-Fractional Processes
In this subsection, we reiterate useful properties of sub-fractional Brownian motions such as those described in Herzog [15].
Lemma 1.
Consider $S^H_t$ to be a sub-fBM for all t. The properties of the sub-fBM are:
- 1.
- $S^H_0 = 0$ and $\mathbb{E}\left[ S^H_t \right] = 0$ for all t.
- 2.
- $\mathbb{E}\left[ (S^H_t)^2 \right] = \left( 2 - 2^{2H-1} \right) t^{2H}$.
- 3.
- If $H \neq 1/2$, then $\mathbb{E}\left[ (S^H_t - S^H_s)^2 \right] = -2^{2H-1}\left( t^{2H} + s^{2H} \right) + (s+t)^{2H} + (t-s)^{2H}$ for $t \geq s$, i.e., the increments are non-stationary.
Proof.
See [15]. □
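The non-stationarity of sub-fBM increments can be checked directly from the covariance function; the following sketch assumes the standard sub-fBM covariance of Bojdecki et al. [9], and the window endpoints are arbitrary:

```python
# Covariance of a sub-fBM as in Bojdecki et al. [9]:
# C_H(s, t) = s^{2H} + t^{2H} - ((s + t)^{2H} + |t - s|^{2H}) / 2.
def sub_fbm_cov(s, t, H):
    return s ** (2 * H) + t ** (2 * H) - 0.5 * ((s + t) ** (2 * H) + abs(t - s) ** (2 * H))

def incr_var(s, t, H):
    """E[(S_t - S_s)^2] computed from the covariance function."""
    return sub_fbm_cov(t, t, H) + sub_fbm_cov(s, s, H) - 2.0 * sub_fbm_cov(s, t, H)

# Windows of equal length but different start points have different variance:
assert abs(incr_var(0.0, 1.0, 0.75) - incr_var(1.0, 2.0, 0.75)) > 1e-6
# ...whereas for H = 1/2 (Brownian motion) the increments are stationary:
assert abs(incr_var(0.0, 1.0, 0.5) - incr_var(1.0, 2.0, 0.5)) < 1e-12
```

The increment variance depends on both endpoints, not only on their difference, which is exactly the non-stationarity stated in Lemma 1.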
Finally, we follow Herzog [15] and prove the following proposition:
Proposition 2.
Let $B^H_t$ be a fractional Brownian motion and $S^H_t$ be a sub-fractional Brownian motion. For $H > 1/2$, the following holds:
- 1.
- $\mathbb{E}\left[ (S^H_t)^2 \right] \leq \mathbb{E}\left[ (B^H_t)^2 \right]$;
- 2.
- $\mathbb{E}\left[ (S^H_t - S^H_s)^2 \right] \leq \mathbb{E}\left[ (B^H_t - B^H_s)^2 \right]$.
Proof.
Obviously, an fBM has the following variance: $\mathbb{E}[(B^H_t)^2] = t^{2H}$. Similarly, we obtain the variance of $\left( 2 - 2^{2H-1} \right) t^{2H}$ for a sub-fBM. Subsequently, we have $\left( 2 - 2^{2H-1} \right) t^{2H} \leq t^{2H}$ if $H \geq 1/2$.
The second part follows for $t \geq s \geq 0$ from the convexity of $x \mapsto x^{2H}$:
$$\mathbb{E}\left[ (S^H_t - S^H_s)^2 \right] = -2^{2H-1}\left( t^{2H} + s^{2H} \right) + (s+t)^{2H} + (t-s)^{2H} \leq (t-s)^{2H} = \mathbb{E}\left[ (B^H_t - B^H_s)^2 \right].$$
In the case of $H = 1/2$ or $s = t$, we have equality. □
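Proposition 2 (part 1) can be checked numerically; the variance formulas follow Corollary 2 and Lemma 1, and the specific values of H and t are arbitrary:

```python
# Variance of an fBM: t^{2H}; variance of a sub-fBM: (2 - 2^{2H-1}) t^{2H}.
H, t = 0.75, 3.0          # illustrative values with H > 1/2
var_fbm = t ** (2 * H)
var_sub = (2 - 2 ** (2 * H - 1)) * t ** (2 * H)
assert var_sub < var_fbm  # Proposition 2, part 1
# At H = 1/2 the prefactor 2 - 2^{2H-1} equals 1, so the variances coincide:
assert abs((2 - 2 ** 0.0) - 1.0) < 1e-12
```

For $H > 1/2$ the prefactor $2 - 2^{2H-1}$ is strictly smaller than one, so the sub-fBM fluctuates less than the fBM at every time t.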
4. Fractional Stochastic Search
Let $X_t$ be an m-dimensional stochastic process driven by a fractional Brownian motion $B^H_t$, where $H \in (0,1)$. The respective stochastic process is as follows:
$$dX_t = a(X_t)\,dt + \sigma(X_t)\,dB^H_t, \qquad X_0 = x_0, \tag{10}$$
where $x_0$ is the initial value and $B^H_t$ is an m-dimensional fractional Brownian motion. Next, consider a cost function $z \colon \mathbb{R}^m \to \mathbb{R}$ which needs to be optimized. Hence, we study the vector field for which the auxiliary function $t \mapsto z(X_t)$ is decreasing. This requires us to find the expectation value $\mathbb{E}[z(X_t)]$.
Thus, the function $z(X_t)$ is stochastic and dependent on time t. In general, an optimization algorithm of a neural network minimizes the expectation value of this function. Utilizing the machinery of stochastic analysis, Dynkin's formula, among others, and following the approach described in [26], we obtain
$$\mathbb{E}\left[ z(X_t) \right] = z(x_0) + \mathbb{E}\left[ \int_0^t (A z)(X_s)\,ds \right], \tag{12}$$
where the operator $A z = \nabla z \cdot a + \frac{1}{2}\operatorname{tr}\left( \sigma \sigma^\top \nabla^2 z \right)$. The usage of a Taylor-series approximation and the differentiation of Equation (12) yields
$$\frac{d}{dt}\,\mathbb{E}\left[ z(X_t) \right] = \mathbb{E}\left[ (A z)(X_t) \right].$$
The method of steepest descent computes the gradient of z such that the change of $z(X_t)$ over time is as negative as possible. However, since $X_t$ is a stochastic process, we need to study the expectation of the gradient, particularly where the value is as negative as possible, such as $\frac{d}{dt}\,\mathbb{E}[z(X_t)] < 0$.
In order to construct a stochastic process with this property, we specify the drift $a$ and the diffusion $\sigma$ in Equation (10), respectively. Next, we specify the diffusion term in Equation (10), $\sigma$, or the product $\sigma \sigma^\top$, which is a matrix, such that the algorithm in Equation (1) converges efficiently. Indeed, if we set the term $\sigma \sigma^\top$ as being inversely proportional to the Hessian matrix, $\nabla^2 z$, then the second-order term in the operator A becomes constant. Through this, we can show the convergence of the algorithm and the existence of the solution.
Given that the function z is of class $C^2$ and strictly convex, then, according to [27], the Hessian matrix is symmetric, real, positive definite, and non-degenerate. This guarantees that the Hessian matrix has an inverse, which is also positive definite. Efficient computation can be achieved by utilizing the Cholesky decomposition. One can show that the diffusion term is a lower triangular matrix $\sigma$ satisfying $\sigma \sigma^\top = (\nabla^2 z)^{-1}$. Under those conditions, we compute
In order to minimize the gradient of the expectation, we have to minimize the first term, because the second term is a constant. Choosing the drift proportional to the negative gradient and assuming strict convexity obtains the following condition:
Using the square vector norm and the assumption of strict convexity, it is sufficient to set the learning rate $\eta$ and the diffusion $\sigma$ as the main parameters in the SGD algorithm, such that $\frac{d}{dt}\,\mathbb{E}[z(X_t)] < 0$. In the sequel, we apply this algorithm to fractional stochastic search problems.
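The search dynamics of this section can be sketched with an Euler scheme for $dX_t = -\nabla z(X_t)\,dt + \sigma\,dB^H_t$. Exact sampling of fractional Gaussian noise via a Cholesky factor is standard; however, the $1/(k+1)$ damping of the noise, the one-dimensional setting, and all parameter values are illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

def fgn_increments(n, H, dt, rng):
    """Exact fractional Gaussian noise: fBM increments with step dt, sampled
    via the Cholesky factor of their stationary covariance."""
    d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]).astype(float)
    cov = 0.5 * dt ** (2 * H) * ((d + 1) ** (2 * H) + np.abs(d - 1) ** (2 * H) - 2 * d ** (2 * H))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))
    return L @ rng.standard_normal(n)

def fsgd(grad, x0, H=0.75, eta=0.05, sigma=0.3, n_steps=400, seed=0):
    """Euler scheme for dX = -grad z(X) dt + sigma dB^H (one-dimensional).

    The 1/(k+1) damping of the noise is an illustrative annealing choice."""
    rng = np.random.default_rng(seed)
    dB = fgn_increments(n_steps, H, eta, rng)   # eta doubles as the time step
    x = x0
    for k in range(n_steps):
        x = x - eta * grad(x) + sigma * dB[k] / (k + 1)
    return x

# Hypothetical cost z(x) = (x - 3)^2 with gradient 2 * (x - 3):
x_min = fsgd(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The correlated noise lets the iterate explore persistently early on, while the damping lets the drift dominate near the minimizer.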
5. Application of Fractional Stochastic Search
In this section, we demonstrate how a fractional stochastic search works. We exhibit the convergence of a fractional stochastic search within neural networks.
5.1. Stochastic Search: Case I
Suppose that we have a neural network with a strictly convex quadratic cost function. The stochastic gradient descent method searches for the minimum of this cost function.
Mathematically, the solution to this problem is obvious. Setting the first derivative to zero yields the minimizer and, consequently, the minimum value. Next, we show that we can obtain the same value under a fractional stochastic search algorithm in a neural network.
In step one, we establish an adequate stochastic differential equation according to Section 4. The gradient of the cost function is equal to the first derivative, and the Hessian of the cost function is the second derivative. Both conditions enable us to compute the Lipschitz-continuous coefficient functions. Hence, the stochastic differential equation has the form:
The SDE in Equation (14) is an Ornstein–Uhlenbeck process driven by a Brownian motion with the Hurst parameter $H = 1/2$ [28].
The solution is divided into two parts. In part one, we solve the non-stochastic problem, which is an ordinary differential equation with an exponential solution. In part two, we define an auxiliary function and apply the Itô–Doeblin lemma:
Note that, in this case, the process coincides with one driven by a standard Brownian motion. Next, integrating the last line yields the solution of the SDE in closed form, which is
Based on Equation (15), we find the expectation of $X_t$. Note that the expected integral with respect to a Brownian motion is zero. For $t \to \infty$, the expected value converges to the minimizer. Next, utilizing the general condition of Section 4, we obtain
Finally, it remains to show that the SDE in Equation (14) converges to the minimum value. Hence, we study the convergence of the sequence:
where, for $X_t$, we have substituted Equation (15). Next, we use the property that the expected stochastic integral is zero and the variance of the Brownian motion is $\mathbb{E}[W_t^2] = t$. Thus, we obtain
In order to show the convergence, we compute the limit of the sequence as time tends to infinity. We obtain the following:
Indeed, we find that the (fractional) stochastic algorithm converges to the same minimum value of our cost function as $t \to \infty$.
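Since the specific constants of Case I are not reproduced here, the following Monte-Carlo sketch checks the qualitative claim for an illustrative quadratic cost $z(x) = (x - \mu)^2$: the Ornstein–Uhlenbeck search drifts towards the minimizer $\mu$. All parameter values are assumptions for the illustration:

```python
import numpy as np

# Monte-Carlo sketch of Case I: the Ornstein-Uhlenbeck search
# dX = -z'(X) dt + sigma dW with z(x) = (x - mu)^2 drifts to the minimizer mu.
rng = np.random.default_rng(1)
mu, sigma, dt, n_steps, n_paths = 2.0, 0.5, 0.01, 1000, 5000

x = np.zeros(n_paths)  # all paths start at x0 = 0
for _ in range(n_steps):
    # Euler-Maruyama step: drift -z'(x) = -2 (x - mu), diffusion sigma
    x += -2.0 * (x - mu) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

# E[X_t] -> mu and Var[X_t] -> sigma^2 / (2 * theta) with drift rate theta = 2
mean_x, var_x = x.mean(), x.var()
```

The empirical mean settles at the minimizer while the residual variance matches the stationary variance of the OU process, mirroring the limit computed above.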
5.2. Stochastic Search: Case II
Conversely, suppose a fractional stochastic differential equation with a Hurst index H of the form
where we define the drift and diffusion coefficients. We search for the minimum of the function z, where $X_t$ is the solution of the SDE in Equation (16). This equation can be rewritten in the fractional Hida space as
where ⋄ denotes the Wick product. Using Wick calculus, we find the solution as
where we have used the Wick exponential. By applying the corresponding definitions for the fractional case, we obtain the final solution:
The solution of Equation (18) has a deterministic expectation. Hence, in the limit, the expected value is zero. It remains to show the convergence of the fractional SDE in machine learning:
In order to show convergence, we compute the limit of the sequence as time tends to infinity. We equally obtain convergence to the minimum value.
There are notable limitations of fractional stochastic gradient descent in general. Fractional calculus is built around the Riemann–Liouville integral, which is a non-local operator, is not uniquely defined, and depends on the initial conditions. Given that a fractional process is not a martingale, the common tools of stochastic analysis are not applicable. Whether those properties constrain fractional stochastic gradient descent remains an open research question. Computational aspects might also be a limiting factor. However, for the first time, this research studies the idea of a fractional search analogous to stochastic gradient descent in machine learning.
6. Conclusions
This article introduces fractional stochastic gradient descent algorithms for the optimization of neural networks. In the standard case, the fractional stochastic approach reduces to the well-known stochastic gradient descent method in machine learning. We discuss two special cases. First, we exhibit that fractional stochastic algorithms find the minima. This result might enhance algorithmic optimization in machine learning. Second, we describe the generalized patterns and properties of fractional stochastic processes. These insights may lead to a universal optimization approach in machine learning and AI in the future. We highlight the need for further research in that direction, particularly regarding the computational issues.
Funding
This research received no external funding except basic financial support from RRI—Reutlingen Research Institute, Reutlingen University. I appreciate the support for the advancement of scientific research and the betterment of society for the future.
Data Availability Statement
All data are available in the paper or upon request from the author.
Acknowledgments
I thank three anonymous reviewers for helpful comments.
Conflicts of Interest
The author declares no conflict of interest.
References
- Padhi, S.; Graef, J.; Pati, S. Multiple Positive Solutions for a boundary value problem with nonlinear nonlocal Riemann-Stieltjes Integral Boundary Conditions. Fract. Calc. Appl. Calc. 2018, 21, 716–745. [Google Scholar] [CrossRef]
- Ruiz, W. Dynamical system method for investigating existence and dynamical property of solution of nonlinear time-fractional PDEs. Nonlinear Dyn. 2019, 99, 1–20. [Google Scholar]
- Kamran, J.W.; Jamal, A.; Li, X. Numerical Solution of Fractional-Order Fredholm Integrodifferential Equation in the Sense of Atangana-Baleanu Derivative. Math. Probl. Eng. 2020, 2021, 6662808. [Google Scholar]
- Guariglia, E. Fractional calculus, zeta functions and Shannon entropy. Open Math. 2021, 19, 87–100. [Google Scholar] [CrossRef]
- Mandelbrot, B.; van Ness, J. Fractional Brownian Motions, Fractional Noises and Applications. SIAM Rev. 1968, 10, 422–437. [Google Scholar] [CrossRef]
- Monin, A.; Yaglom, A. Statistical Fluid Mechanics: Mechanics of Turbulence; Dover Publication: New York, NY, USA, 2007; Volume 2. [Google Scholar]
- Tudor, C. On the Wiener integral with respect to sub-fractional Brownian motion on an interval. J. Math. Anal. Appl. 2009, 351, 456–468. [Google Scholar] [CrossRef]
- Tudor, C.; Zili, M. Covariance measure and stochastic heat equation with fractional noise. Fract. Calc. Appl. Anal. 2014, 17, 807–826. [Google Scholar] [CrossRef]
- Bojdecki, T.; Gorostiza, L.; Talarczyk, A. Sub-fractional Brownian motion and its relation to occupation times. Statist. Probab. Lett. 2004, 69, 405–419. [Google Scholar] [CrossRef]
- Duncan, T.; Hu, Y.; Pasik-Duncan, B. Stochastic Calculus for Fractional Brownian Motion. SIAM J. Control Optim. 2000, 38, 582–612. [Google Scholar] [CrossRef]
- Shen, G.; Yan, L. The stochastic integral with respect to the sub-fractional Brownian motion with H > 1/2. J. Math. Sci. Adv. 2010, 6, 219–239. [Google Scholar]
- Yan, L.; Shen, G.; He, K. Itô’s formula for a sub-fractional Brownian motion. Commun. Stoch. Anal. 2011, 5, 135–159. [Google Scholar] [CrossRef]
- Liu, J.; Yan, L. Remarks on asymptotic behavior of weighted quadratic variation of subfractional Brownian motion. J. Korean Stat. Soc. 2012, 41, 177–187. [Google Scholar] [CrossRef]
- Prakasa, R. On some maximal and integral inequalities for sub-fractional Brownian motion. Stoch. Anal. Appl. 2017, 35, 2017. [Google Scholar]
- Herzog, B. Adopting Feynman–Kac Formula in Stochastic Differential Equations with (Sub-)Fractional Brownian Motion. Mathematics 2022, 10, 340. [Google Scholar] [CrossRef]
- Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558. [Google Scholar] [CrossRef] [PubMed]
- Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 2018, 60, 223–311. [Google Scholar] [CrossRef]
- Kochenderfer, H.; Wheeler, T. Algorithms for Optimization; MIT Press: Cambridge, MA, USA, 2019. [Google Scholar]
- Schmidt, M.; Roux, N.L.; Bach, F. Minimizing finite sums with the stochastic average gradient. Math. Program. 2017, 162, 83–112. [Google Scholar] [CrossRef]
- Haochen, J.; Sra, S. Random Shuffling Beats SGD after Finite Epochs. Proc. Mach. Learn. Res. 2019, 97, 2624–2633. [Google Scholar]
- Gotmare, A.; Keskar, N.S.; Xiong, C.; Socher, R. A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation. arXiv 2018, arXiv:1810.13243. [Google Scholar]
- Curtis, F.E.; Scheinberg, K. Adaptive Stochastic Optimization: A Framework for Analyzing Stochastic Optimization Algorithms. IEEE Signal Process 2020, 37, 32–42. [Google Scholar] [CrossRef]
- de Roos, F.; Jidling, C.; Wills, A.; Schön, T.; Hennig, P. A Probabilistically Motivated Learning Rate Adaptation for Stochastic Optimization. arXiv 2021, arXiv:2102.10880. [Google Scholar]
- Rosenblatt, F. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychol. Rev. 1958, 65, 386–408. [Google Scholar] [CrossRef] [PubMed]
- Alòs, E.; Mazet, O.; Nualart, D. Stochastic Calculus with Respect to Gaussian processes. Ann. Probab. 2001, 29, 766–801. [Google Scholar] [CrossRef]
- Calin, O. Deep Learning Architectures; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
- Golub, G.H.; van Loan, C. Matrix Computations; Johns Hopkins Press: Baltimore, MD, USA, 1996. [Google Scholar]
- Ornstein, L.; Uhlenbeck, G. On the theory of Brownian motion. Phys. Rev. 1930, 36, 823–841. [Google Scholar]
© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
