Abstract
The aim of this article is to establish a stochastic search algorithm for neural networks based on fractional stochastic processes with the Hurst parameter $H \in (0,1)$. We define and discuss the properties of fractional stochastic processes, which generalize a standard Brownian motion. Fractional stochastic processes capture useful yet different properties in order to simulate real-world phenomena. This approach provides new insights into stochastic gradient descent (SGD) algorithms in machine learning. We exhibit convergence properties for fractional stochastic processes.
Keywords:
fractional Brownian motion; fractional stochastic gradient descent; machine learning; stochastic gradient descent; complex systems
MSC:
37M05; 60G18; 60G22; 62M45; 68T05; 68Q01; 65Y04; 68M07
1. Introduction
The gradient descent methodology is not computationally efficient in all applications. Sometimes, optimization algorithms become stuck in the flat regions of manifolds. In those cases, the optimization algorithm requires a long time to escape. This is the challenge of a vanishing gradient where, for instance, the gradient $\nabla C$ of the cost function is almost zero (see Section 4). The method of stochastic gradient descent (SGD) generally overcomes this problem.
Recent advancements in the field of fractional stochastic processes exhibit the theoretical benefits of modeling complex systems [1,2,3,4]. Yet, so far, no literature exists on fractional stochastic gradient descent (fSGD) or fractional stochastic networks. This paper sketches the potential of such new literature for modeling complex systems. Moreover, we exhibit that fractional stochastic processes are an advancement in machine learning (ML) and artificial intelligence (AI).
The methodology of fractional stochastic gradient descent and the role of stochastic neural networks are based on a generalized assumption of randomness. Mandelbrot and Van Ness defined a fractional Brownian motion (fBM), $B^H_t$, together with a Hurst parameter $H \in (0,1)$ in 1968 [5]. For $H = 1/2$, we obtain a standard Brownian motion. Yet, for $H \neq 1/2$, we obtain new forms of randomness or stochastic processes that match real-world phenomena.
The new feature of a fractional Brownian motion (fBM) is that its increments are interdependent. In addition, an fBM is self-similar: a self-similar stochastic process is invariant with respect to the time scale (scaling invariance). A standard Brownian motion or a Lévy process displays different properties: they have independent increments and belong to the famous class of Markov processes.
However, in science, there is ubiquitous evidence that fractional stochastic processes are of relevance. For instance, we frequently observe probability densities with sharp peaks, which is related to the phenomenon of long-range dependence. In many real-world observations and applications, we find the presence of interdependence, too. This pattern can be captured by fractional stochastic processes.
Nonetheless, some phenomena are even more complicated and require further generalization towards sub-fractional stochastic processes. The literature on sub-fBMs demonstrates that those stochastic processes are useful in scientific applications [6]. A sub-fractional Brownian motion provides a nexus between a Brownian motion and a fractional stochastic process. Those processes were introduced by Tudor et al. [7,8] and Bojdecki et al. [9]. Note that, as sub-fractional stochastic processes are not martingales, the basic tools of stochastic analysis are insufficient. However, researchers have developed new machinery to handle fractional stochastic processes, such as [10] or [11,12,13,14,15].
In this paper, our purpose is to develop and study the idea of fractional stochastic gradient descent algorithms. Our approach generalizes the existing literature on stochastic gradient descent (SGD) and stochastic neural networks. For instance, Hopfield [16] developed neural networks consisting of several perceptrons with randomness. Similarly, a Boltzmann network is a type of stochastic neural network wherein the output of the activation function is interpreted as a probability.
Studies already exist on stochastic gradient descent and its challenges in machine learning [17,18]. Recent developments in the theory and applications of stochastic gradient descent are discussed in the following papers: Schmidt et al. [19], Haochen and Sra [20], Gotmare et al. [21], Curtis and Scheinberg [22], and de Roos et al. [23]. The focus of our research is the motivation of fractional stochastic gradient descent (fSGD) algorithms. Thus, our research goes beyond the scope of the current literature and focuses on the theoretical possibility of fractional stochastic gradient descent. We neglect potential computational limitations in machine learning.
The paper is organized as follows. Section 2 provides preliminary definitions. Subsequently, we introduce the foundations of fractional stochastic processes in Section 3. Section 4 introduces the idea of fractional stochastic search algorithms and derives the convergence results in general. Finally, in Section 5, we apply the method to two different cases. Section 6 concludes the paper.
2. Preliminaries
Machine learning is mainly based on neural networks and efficient optimization algorithms. The most primitive neural model is inspired by the work of Rosenblatt [24]. In the following section, we define the major elements from a machine learning perspective.
Definition 1.
A stochastic neuron is defined by n inputs $x_1,\ldots,x_n$, n weighting factors $w_1,\ldots,w_n$ and a bias, together with a sigmoid activation function $\sigma(v) = 1/(1+e^{-v})$ with $\sigma(v) \in (0,1)$ and a stochastic output $y \in \{0,1\}$,
where $v = \sum_{i=1}^n w_i x_i + b$ denotes the activation potential. Hence, we define the output for $y = 1$ by $P(y=1) = \sigma(v)$ and for $y = 0$ by the inverse probability: $P(y=0) = 1 - \sigma(v)$.
Note that, if the activation potential is greater than zero, such as $v > 0$, then this neuron is not necessarily activated according to Definition 1. An activation value of one only occurs with the probability given by the activation function.
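To make Definition 1 concrete, the following sketch simulates a stochastic neuron in Python. The inputs, weights, and bias are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    """Logistic activation: maps the potential to a firing probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def stochastic_neuron(x, w, b):
    """Output 1 with probability sigma(w.x + b), otherwise 0 (Definition 1)."""
    p = sigmoid(np.dot(w, x) + b)
    return int(rng.random() < p)

# A positive activation potential does not force activation:
x = np.array([1.0, 2.0])   # illustrative inputs
w = np.array([0.5, -0.2])  # illustrative weights
b = 0.1                    # illustrative bias
outputs = [stochastic_neuron(x, w, b) for _ in range(1000)]
# the empirical firing frequency approximates sigmoid(0.2), roughly 0.55
```

Repeating the draw shows both outcomes even though the potential $v = 0.2$ is positive, which is exactly the point of the remark above.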
In machine learning, the gradient descent algorithm is omnipresent in all optimization problems. Yet, it does not provide robust solutions in each case. There are computational obstacles, such as when the algorithm becomes stuck in a local minimum or lost in a plateau from which it takes a long time to get out. A plateau is defined as a flat surface region where the gradient is very small (or almost zero).
The optimization algorithm of a neural network always has the goal of finding the optimal weighting parameters $w_i$. The standard algorithm used to optimize the parameters is frequently reformulated in order to minimize the cost function. This is called the gradient descent method. This method is closely related to Newton's algorithm in numerical computing. The following definition summarizes the algorithm from a machine learning vantage point.
Definition 2.
The gradient descent algorithm is defined by
$$w_{k+1} = w_k - \eta\, M\, \nabla C(w_k),$$
where $\nabla C$ is the gradient of a cost function, $M$ is an optional conditioning matrix, and $\eta > 0$ is the learning rate.
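A minimal sketch of Definition 2, assuming the conditioning matrix defaults to the identity; the quadratic cost in the usage example is a hypothetical illustration:

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, M=None, n_steps=200):
    """Iterate w_{k+1} = w_k - eta * M @ grad(w_k); M defaults to the identity."""
    theta = np.asarray(theta0, dtype=float)
    if M is None:
        M = np.eye(theta.size)
    for _ in range(n_steps):
        theta = theta - eta * M @ grad(theta)
    return theta

# Hypothetical quadratic cost C(w) = ||w - 1||^2 with gradient 2 * (w - 1):
theta_star = gradient_descent(lambda w: 2.0 * (w - 1.0), np.zeros(2))
# theta_star converges to the minimizer [1, 1]
```

Setting `M` to an approximate inverse Hessian recovers a preconditioned (Newton-like) variant of the same iteration.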
The stochastic gradient descent (SGD) method overcomes this obstacle when the gradient is close to zero. Indeed, SGD reaches a minimum along a non-linear stochastic process. In the following sections, we first discuss the literature and then generalize the approach to fractional stochastic processes.
3. Fractional Stochastic Processes
3.1. General Definitions
Consider a stochastic process $B^H_t$ with a Hurst parameter H. Subsequently, we define the elementary tools of fractional calculus.
Definition 3.
Let $\alpha > 0$ and $f \in L^1([a,b])$. Let $t \in [a,b]$ and let $\Gamma$ denote the gamma function. The left- and right-sided fractional integrals of f of order α are defined for almost all $t \in [a,b]$, respectively, as
$$(I^\alpha_{a+} f)(t) = \frac{1}{\Gamma(\alpha)} \int_a^t (t-s)^{\alpha-1} f(s)\,ds$$
and
$$(I^\alpha_{b-} f)(t) = \frac{1}{\Gamma(\alpha)} \int_t^b (s-t)^{\alpha-1} f(s)\,ds.$$
This is the fractional integral of the Riemann–Liouville type. In the same vein, we define fractional derivatives, where we distinguish between left- and right-sided derivatives.
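As a numerical illustration of Definition 3, the left-sided Riemann–Liouville integral can be approximated by quadrature; the substitution $u = (t-s)^\alpha$ removes the kernel singularity. The function name and discretization are our own choices:

```python
import math

def rl_left_integral(f, a, t, alpha, n=20000):
    """Approximate the left-sided Riemann-Liouville integral (I_{a+}^alpha f)(t).

    Uses the substitution u = (t - s)^alpha, which turns the singular kernel
    into a bounded integrand handled by the composite midpoint rule.
    """
    U = (t - a) ** alpha
    h = U / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        total += f(t - u ** (1.0 / alpha))
    # 1 / (alpha * Gamma(alpha)) = 1 / Gamma(alpha + 1)
    return total * h / math.gamma(alpha + 1)

# Sanity check against the closed form I_{0+}^alpha 1 = t^alpha / Gamma(alpha + 1):
val = rl_left_integral(lambda s: 1.0, 0.0, 2.0, 0.5)
# val equals 2^{1/2} / Gamma(3/2) up to quadrature error
```

The closed-form checks for constant and linear integrands follow directly from the definition and the identity $\alpha\,\Gamma(\alpha) = \Gamma(\alpha+1)$.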
Definition 4.
The fractional left- and right-sided derivatives, for $0 < \alpha < 1$ and $t \in [a,b]$, are defined by
$$(D^\alpha_{a+} f)(t) = \frac{1}{\Gamma(1-\alpha)}\, \frac{d}{dt} \int_a^t \frac{f(s)}{(t-s)^{\alpha}}\,ds$$
and
$$(D^\alpha_{b-} f)(t) = \frac{-1}{\Gamma(1-\alpha)}\, \frac{d}{dt} \int_t^b \frac{f(s)}{(s-t)^{\alpha}}\,ds$$
for all $f \in I^\alpha_{a+}(L^1)$ and $f \in I^\alpha_{b-}(L^1)$, respectively, where $I^\alpha_{a\pm}(L^1)$ is the image of $L^1([a,b])$ under the corresponding fractional integral.
Let us assume $f \in C^1([a,b])$; then the above derivatives can be evaluated directly.
Notably, $(D^\alpha_{a\pm} f)(t)$ exists for all $t \in [a,b]$ if $0 < \alpha < 1$. Given those definitions, we are ready to define a fractional Brownian motion:
Definition 5.
Let $H \in (0,1)$, and let $B^H_0$ be an arbitrary real number. We call $B^H = \{B^H_t, t \geq 0\}$ a fractional Brownian motion (fBM) with Hurst parameter H and starting value $B^H_0$ at time 0, such that
- 1.
- $B^H_t$ has continuous sample paths and $B^H_0 = 0$ almost surely, and;
- 2.
- $B^H_t = \frac{1}{\Gamma(H+1/2)} \left( \int_{-\infty}^0 \left[ (t-s)^{H-1/2} - (-s)^{H-1/2} \right] dW_s + \int_0^t (t-s)^{H-1/2}\,dW_s \right)$ [Weyl fractional integral];
- 3.
- Equivalent to the Riemann–Liouville integral: $B^H_t = \frac{1}{\Gamma(H+1/2)} \int_0^t (t-s)^{H-1/2}\,dW_s$.
Next, let us consider the following corollary:
Corollary 1.
Consider Definition 5 and $H = 1/2$. Then the fractional Brownian motion is a standard Brownian motion $W_t$.
Proof.
Let $H = 1/2$; we find $B^{1/2}_t = \frac{1}{\Gamma(1)} \int_0^t (t-s)^{0}\,dW_s = W_t$. □
In the literature, there exists an alternative, yet useful, definition:
Definition 6.
A fractional Brownian motion is a Gaussian process $B^H_t$ for $t \geq 0$, defined by the following covariance function
$$R_H(s,t) = \mathbb{E}\left[ B^H_s B^H_t \right] = \frac{1}{2}\left( s^{2H} + t^{2H} - |t-s|^{2H} \right),$$
where the Hurst index is denoted by $H \in (0,1)$.
Since the covariance of a Brownian motion is given in the literature as $\mathbb{E}[W_s W_t] = \min(s,t)$, it is easy to extend the definition to an fBM with Hurst index H, such as
$$\mathbb{E}\left[ B^H_s B^H_t \right] = \frac{1}{2}\left( s^{2H} + t^{2H} - |t-s|^{2H} \right),$$
where we obtain the definition of a Brownian motion for $H = 1/2$. Following Herzog [15], we derive the covariance step-by-step:
Corollary 2.
Consider a fractional Brownian motion. The expectation values of non-overlapping increments are $\mathbb{E}\left[ B^H_t - B^H_s \right] = 0$ and the variance is $\mathbb{E}\left[ (B^H_t - B^H_s)^2 \right] = |t-s|^{2H}$ for all $s, t \geq 0$.
Proof.
See [15]. □
3.2. Properties
Next, we consider the properties of the fBM over time for different Hurst parameters. Suppose $H < 1/2$ or $H > 1/2$. If we assume that the Hurst parameter is of $H \in (0, 1/2)$, we say the fractional stochastic process has a short memory. Conversely, if $H \in (1/2, 1)$, we obtain the property of long-range dependence. Figure 1 illustrates sample processes for the three ranges of the Hurst parameter, H.
Figure 1.
Different fractional Brownian motions with the following Hurst index: (left panel) $H < 1/2$, (middle panel) $H = 1/2$ (standard BM), and (right panel) $H > 1/2$.
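Sample paths such as those in Figure 1 can be generated directly from the covariance function of the fBM. The following sketch uses a Cholesky factorization of the covariance matrix — an exact but $O(n^3)$ method; grid size and seed are illustrative choices:

```python
import numpy as np

def fbm_cholesky(n, H, T=1.0, seed=0):
    """Sample one fBM path on (0, T] from the covariance
    R(s, t) = 0.5 * (s^{2H} + t^{2H} - |t - s|^{2H})  (Definition 6)."""
    t = np.linspace(T / n, T, n)
    s, u = np.meshgrid(t, t)
    cov = 0.5 * (s ** (2 * H) + u ** (2 * H) - np.abs(s - u) ** (2 * H))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))  # jitter for stability
    z = np.random.default_rng(seed).standard_normal(n)
    return t, L @ z

# H > 1/2 gives persistent (long-memory) paths, H < 1/2 anti-persistent ones:
t, path = fbm_cholesky(200, 0.75)
```

For long paths, spectral methods such as Davies–Harte are the usual faster alternative; the Cholesky route is chosen here because it mirrors the covariance definition verbatim.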
Proposition 1.
Given a fractional Brownian motion, we obtain the following properties:
- 1.
- The fBM has stationary increments: $B^H_{t+s} - B^H_s \overset{d}{=} B^H_t - B^H_0$;
- 2.
- The fBM is H-self-similar, such as $B^H_{at} \overset{d}{=} a^H B^H_t$ for all $a > 0$;
- 3.
- The fBM has variance $\mathbb{E}\left[ (B^H_t)^2 \right] = t^{2H}$.
Proof.
The proof follows Herzog [15]. In order to prove the stationarity of the increments, we set $Y_t = B^H_{t+s} - B^H_s$. The equality of the covariance implies that $Y_t$ is a centered Gaussian process with the covariance of an fBM. Moreover, it has the same distribution, such that $Y_t \overset{d}{=} B^H_t$. This demonstrates that the distribution of an increment does not depend on its starting point in time. Consequently, we obtain stationary increments.
The second property of Proposition 1 is self-similarity. Consider the following definition, $Y_t = a^{-H} B^H_{at}$ for $a > 0$.
Here, we find that $Y_t$ is centered Gaussian with covariance $a^{-2H} R_H(as, at) = R_H(s,t)$, and hence $Y_t \overset{d}{=} B^H_t$. Part (3) is already given in Corollary 2. □
3.3. Definition of Sub-Fractional Processes
In a recent paper, Herzog [15] described a sub-fractional Brownian motion (sub-fBM) as an intermediate between a Brownian motion and a fractional Brownian motion. In general, a sub-fBM is a self-similar Gaussian process. Note that both the fBM and sub-fBM have the properties of self-similarity and long-range dependence, yet a sub-fBM does not have stationary increments [9].
Any Brownian motion is uniquely defined by its covariance. For the sub-fBM, we denote the covariance by $C_H(s,t)$.
Definition 7.
Consider a sub-fractional Brownian motion $S^H_t$ with Hurst parameter H, a centered mean-zero Gaussian process with the following covariance function
$$C_H(s,t) = s^{2H} + t^{2H} - \frac{1}{2}\left[ (s+t)^{2H} + |t-s|^{2H} \right],$$
where $s, t \geq 0$ and $H \in (0,1)$.
Note, a (sub-)fractional Brownian motion coincides with a Brownian motion if the Hurst parameter is $H = 1/2$. Thus, a Brownian motion on the real line has a covariance of $C_{1/2}(s,t) = \min(s,t)$. The process has the following representation (see [25]):
The kernel function of a sub-fractional Brownian motion is given by
3.4. Properties of Sub-Fractional Processes
In this subsection, we reiterate useful properties of sub-fractional Brownian motions such as those described in Herzog [15].
Lemma 1.
Consider $S^H_t$ to be a sub-fBM for all t. The properties of the sub-fBM are:
- 1.
- $S^H_0 = 0$ and $\mathbb{E}\left[ S^H_t \right] = 0$ for all t.
- 2.
- $\mathbb{E}\left[ (S^H_t)^2 \right] = \left( 2 - 2^{2H-1} \right) t^{2H}$.
- 3.
- If $H \neq 1/2$, then $\mathbb{E}\left[ (S^H_t - S^H_s)^2 \right] = -2^{2H-1}\left( t^{2H} + s^{2H} \right) + (s+t)^{2H} + (t-s)^{2H}$ for $t \geq s$, i.e., the increments are non-stationary.
Proof.
See [15]. □
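The non-stationarity of sub-fBM increments can be checked directly from the covariance function; the following sketch assumes the standard sub-fBM covariance of Bojdecki et al. [9], and the window endpoints are arbitrary:

```python
# Covariance of a sub-fBM as in Bojdecki et al. [9]:
# C_H(s, t) = s^{2H} + t^{2H} - ((s + t)^{2H} + |t - s|^{2H}) / 2.
def sub_fbm_cov(s, t, H):
    return s ** (2 * H) + t ** (2 * H) - 0.5 * ((s + t) ** (2 * H) + abs(t - s) ** (2 * H))

def incr_var(s, t, H):
    """E[(S_t - S_s)^2] computed from the covariance function."""
    return sub_fbm_cov(t, t, H) + sub_fbm_cov(s, s, H) - 2.0 * sub_fbm_cov(s, t, H)

# Windows of equal length but different start points have different variance:
assert abs(incr_var(0.0, 1.0, 0.75) - incr_var(1.0, 2.0, 0.75)) > 1e-6
# ...whereas for H = 1/2 (Brownian motion) the increments are stationary:
assert abs(incr_var(0.0, 1.0, 0.5) - incr_var(1.0, 2.0, 0.5)) < 1e-12
```

The increment variance depends on both endpoints, not only on their difference, which is exactly the non-stationarity stated in Lemma 1.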
Finally, we follow Herzog [15] and prove the following proposition:
Proposition 2.
Let $B^H_t$ be a fractional Brownian motion and $S^H_t$ be a sub-fractional Brownian motion. For $H > 1/2$, the following holds:
- 1.
- $\mathbb{E}\left[ (S^H_t)^2 \right] \leq \mathbb{E}\left[ (B^H_t)^2 \right]$;
- 2.
- $\mathbb{E}\left[ (S^H_t - S^H_s)^2 \right] \leq \mathbb{E}\left[ (B^H_t - B^H_s)^2 \right]$.
Proof.
Obviously, an fBM has the following variance: $\mathbb{E}[(B^H_t)^2] = t^{2H}$. Similarly, we obtain the variance of $\left( 2 - 2^{2H-1} \right) t^{2H}$ for a sub-fBM. Subsequently, we have $\left( 2 - 2^{2H-1} \right) t^{2H} \leq t^{2H}$ if $H \geq 1/2$.
The second part follows for $t \geq s \geq 0$ from the convexity of $x \mapsto x^{2H}$:
$$\mathbb{E}\left[ (S^H_t - S^H_s)^2 \right] = -2^{2H-1}\left( t^{2H} + s^{2H} \right) + (s+t)^{2H} + (t-s)^{2H} \leq (t-s)^{2H} = \mathbb{E}\left[ (B^H_t - B^H_s)^2 \right].$$
In the case of $H = 1/2$ or $s = t$, we have equality. □
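Proposition 2 (part 1) can be checked numerically; the variance formulas follow Corollary 2 and Lemma 1, and the specific values of H and t are arbitrary:

```python
# Variance of an fBM: t^{2H}; variance of a sub-fBM: (2 - 2^{2H-1}) t^{2H}.
H, t = 0.75, 3.0          # illustrative values with H > 1/2
var_fbm = t ** (2 * H)
var_sub = (2 - 2 ** (2 * H - 1)) * t ** (2 * H)
assert var_sub < var_fbm  # Proposition 2, part 1
# At H = 1/2 the prefactor 2 - 2^{2H-1} equals 1, so the variances coincide:
assert abs((2 - 2 ** 0.0) - 1.0) < 1e-12
```

For $H > 1/2$ the prefactor $2 - 2^{2H-1}$ is strictly smaller than one, so the sub-fBM fluctuates less than the fBM at every time t.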
4. Fractional Stochastic Search
Let $X_t$ be an m-dimensional stochastic process driven by a fractional Brownian motion $B^H_t$, where $H \in (0,1)$. The respective stochastic process is as follows:
$$dX_t = a(X_t)\,dt + \sigma(X_t)\,dB^H_t, \qquad X_0 = x_0, \tag{10}$$
where $x_0$ is the initial value and $B^H_t$ is an m-dimensional fractional Brownian motion. Next, consider a cost function $z \colon \mathbb{R}^m \to \mathbb{R}$ which needs to be optimized. Hence, we study the vector field for which the auxiliary function $t \mapsto z(X_t)$ is decreasing. This requires us to find the expectation value $\mathbb{E}[z(X_t)]$.
Thus, the function $z(X_t)$ is stochastic and dependent on time t. In general, an optimization algorithm of a neural network minimizes the expectation value of this function. Utilizing the machinery of stochastic analysis, Dynkin's formula, among others, and following the approach described in [26], we obtain
$$\mathbb{E}\left[ z(X_t) \right] = z(x_0) + \mathbb{E}\left[ \int_0^t (A z)(X_s)\,ds \right], \tag{12}$$
where the operator $A z = \nabla z \cdot a + \frac{1}{2}\operatorname{tr}\left( \sigma \sigma^\top \nabla^2 z \right)$. The usage of a Taylor-series approximation and the differentiation of Equation (12) yields
$$\frac{d}{dt}\,\mathbb{E}\left[ z(X_t) \right] = \mathbb{E}\left[ (A z)(X_t) \right].$$
The method of steepest descent computes the gradient of z such that the change of $z(X_t)$ over time is as negative as possible. However, since $X_t$ is a stochastic process, we need to study the expectation of the gradient, particularly where the value is as negative as possible, such as $\frac{d}{dt}\,\mathbb{E}[z(X_t)] < 0$.
In order to construct a stochastic process with this property, we specify the drift $a$ and the diffusion $\sigma$ in Equation (10), respectively. Next, we specify the diffusion term in Equation (10), $\sigma$, or the product $\sigma \sigma^\top$, which is a matrix, such that the algorithm in Equation (1) converges efficiently. Indeed, if we set the term $\sigma \sigma^\top$ as being inversely proportional to the Hessian matrix, $\nabla^2 z$, then the second-order term in the operator A becomes constant. Through this, we can show the convergence of the algorithm and the existence of the solution.
Given that the function z is of class $C^2$ and strictly convex, then, according to [27], the Hessian matrix is symmetric, real, positive definite, and non-degenerate. This guarantees that the Hessian matrix has an inverse, which is also positive definite. Efficient computation can be achieved by utilizing the Cholesky decomposition. One can show that the diffusion term is a lower triangular matrix $\sigma$ satisfying $\sigma \sigma^\top = (\nabla^2 z)^{-1}$. Under those conditions, we compute
In order to minimize the gradient of the expectation, we have to minimize the first term, because the second term is a constant. Choosing the drift proportional to the negative gradient and assuming strict convexity obtains the following condition:
Using the square vector norm and the assumption of strict convexity, it is sufficient to set the learning rate $\eta$ and the diffusion $\sigma$ as the main parameters in the SGD algorithm, such that $\frac{d}{dt}\,\mathbb{E}[z(X_t)] < 0$. In the sequel, we apply this algorithm to fractional stochastic search problems.
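The search dynamics of this section can be sketched with an Euler scheme for $dX_t = -\nabla z(X_t)\,dt + \sigma\,dB^H_t$. Exact sampling of fractional Gaussian noise via a Cholesky factor is standard; however, the $1/(k+1)$ damping of the noise, the one-dimensional setting, and all parameter values are illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

def fgn_increments(n, H, dt, rng):
    """Exact fractional Gaussian noise: fBM increments with step dt, sampled
    via the Cholesky factor of their stationary covariance."""
    d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]).astype(float)
    cov = 0.5 * dt ** (2 * H) * ((d + 1) ** (2 * H) + np.abs(d - 1) ** (2 * H) - 2 * d ** (2 * H))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))
    return L @ rng.standard_normal(n)

def fsgd(grad, x0, H=0.75, eta=0.05, sigma=0.3, n_steps=400, seed=0):
    """Euler scheme for dX = -grad z(X) dt + sigma dB^H (one-dimensional).

    The 1/(k+1) damping of the noise is an illustrative annealing choice."""
    rng = np.random.default_rng(seed)
    dB = fgn_increments(n_steps, H, eta, rng)   # eta doubles as the time step
    x = x0
    for k in range(n_steps):
        x = x - eta * grad(x) + sigma * dB[k] / (k + 1)
    return x

# Hypothetical cost z(x) = (x - 3)^2 with gradient 2 * (x - 3):
x_min = fsgd(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The correlated noise lets the iterate explore persistently early on, while the damping lets the drift dominate near the minimizer.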
5. Application of Fractional Stochastic Search
In this section, we demonstrate how a fractional stochastic search works. We exhibit the convergence of a fractional stochastic search within neural networks.
5.1. Stochastic Search: Case I
Suppose that we have a neural network with a strictly convex quadratic cost function. The stochastic gradient descent method searches for the minimum of this cost function.
Mathematically, the solution to this problem is obvious. Setting the first derivative to zero yields the minimizer and, consequently, the minimum value. Next, we show that we can obtain the same value under a fractional stochastic search algorithm in a neural network.
In step one, we establish an adequate stochastic differential equation according to Section 4. The gradient of the cost function is equal to the first derivative, and the Hessian of the cost function is the second derivative. Both conditions enable us to compute the Lipschitz-continuous coefficient functions. Hence, the stochastic differential equation has the form:
The SDE in Equation (14) is an Ornstein–Uhlenbeck process driven by a Brownian motion with the Hurst parameter $H = 1/2$ [28].
The solution is divided into two parts. In part one, we solve the non-stochastic problem, which is an ordinary differential equation with an exponential solution. In part two, we define an auxiliary function and apply the Itô–Doeblin lemma:
Note that, in this case, the process coincides with one driven by a standard Brownian motion. Next, integrating the last line yields the solution of the SDE in closed form, which is
Based on Equation (15), we find the expectation of $X_t$. Note that the expected integral with respect to a Brownian motion is zero. For $t \to \infty$, the expected value converges to the minimizer. Next, utilizing the general condition of Section 4, we obtain
Finally, it remains to show that the SDE in Equation (14) converges to the minimum value. Hence, we study the convergence of the sequence:
where, for $X_t$, we have substituted Equation (15). Next, we use the property that the expected stochastic integral is zero and the variance of the Brownian motion is $\mathbb{E}[W_t^2] = t$. Thus, we obtain
In order to show the convergence, we compute the limit of the sequence as time tends to infinity. We obtain the following:
Indeed, we find that the (fractional) stochastic algorithm converges to the same minimum value of our cost function as $t \to \infty$.
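Since the specific constants of Case I are not reproduced here, the following Monte-Carlo sketch checks the qualitative claim for an illustrative quadratic cost $z(x) = (x - \mu)^2$: the Ornstein–Uhlenbeck search drifts towards the minimizer $\mu$. All parameter values are assumptions for the illustration:

```python
import numpy as np

# Monte-Carlo sketch of Case I: the Ornstein-Uhlenbeck search
# dX = -z'(X) dt + sigma dW with z(x) = (x - mu)^2 drifts to the minimizer mu.
rng = np.random.default_rng(1)
mu, sigma, dt, n_steps, n_paths = 2.0, 0.5, 0.01, 1000, 5000

x = np.zeros(n_paths)  # all paths start at x0 = 0
for _ in range(n_steps):
    # Euler-Maruyama step: drift -z'(x) = -2 (x - mu), diffusion sigma
    x += -2.0 * (x - mu) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

# E[X_t] -> mu and Var[X_t] -> sigma^2 / (2 * theta) with drift rate theta = 2
mean_x, var_x = x.mean(), x.var()
```

The empirical mean settles at the minimizer while the residual variance matches the stationary variance of the OU process, mirroring the limit computed above.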
5.2. Stochastic Search: Case II
Conversely, suppose a fractional stochastic differential equation with a Hurst index H of the form
where we define the drift and diffusion coefficients. We search for the minimum of the function z, where $X_t$ is the solution of the SDE in Equation (16). This equation can be rewritten in the fractional Hida space as
where ⋄ denotes the Wick product. Using Wick calculus, we find the solution as
where we have used the Wick exponential. By applying the corresponding definitions for the fractional case, we obtain the final solution:
The solution of Equation (18) has a deterministic expectation. Hence, in the limit, the expected value is zero. It remains to show the convergence of the fractional SDE in machine learning:
In order to show convergence, we compute the limit of the sequence as time tends to infinity. We equally obtain convergence to the minimum value.
There are notable limitations of fractional stochastic gradient descent in general. Fractional calculus is built around the Riemann–Liouville integral, which is a non-local operator, is not uniquely defined, and depends on the initial conditions. Given that a fractional process is not a martingale, the common tools of stochastic analysis are not applicable. Whether those properties constrain fractional stochastic gradient descent remains an open research question. Computational aspects might also be a limiting factor. However, for the first time, this research studies the idea of a fractional search analogous to stochastic gradient descent in machine learning.
6. Conclusions
This article introduces fractional stochastic gradient descent algorithms for the optimization of neural networks. In the standard case, the fractional stochastic approach reduces to the well-known stochastic gradient descent method in machine learning. We discuss two special cases. First, we exhibit that fractional stochastic algorithms find the minima. This result might enhance algorithmic optimization in machine learning. Second, we describe the generalized patterns and properties of fractional stochastic processes. These insights may lead to a universal optimization approach in machine learning and AI in the future. We highlight the need for further research in that direction, particularly regarding the computational issues.
Funding
This research received no external funding except basic financial support from RRI—Reutlingen Research Institute, Reutlingen University. I appreciate the support for the advancement of scientific research and the betterment of society for the future.
Data Availability Statement
All data are available in the paper or upon request from the author.
Acknowledgments
I thank three anonymous reviewers for helpful comments.
Conflicts of Interest
The author declares no conflict of interest.
References
- Padhi, S.; Graef, J.; Pati, S. Multiple Positive Solutions for a boundary value problem with nonlinear nonlocal Riemann-Stieltjes Integral Boundary Conditions. Fract. Calc. Appl. Calc. 2018, 21, 716–745. [Google Scholar] [CrossRef]
- Ruiz, W. Dynamical system method for investigating existence and dynamical property of solution of nonlinear time-fractional PDEs. Nonlinear Dyn. 2019, 99, 1–20. [Google Scholar]
- Kamran, J.W.; Jamal, A.; Li, X. Numerical Solution of Fractional-Order Fredholm Integrodifferential Equation in the Sense of Atangana-Baleanu Derivative. Math. Probl. Eng. 2020, 2021, 6662808. [Google Scholar]
- Guariglia, E. Fractional calculus, zeta functions and Shannon entropy. Open Math. 2021, 19, 87–100. [Google Scholar] [CrossRef]
- Mandelbrot, B.; van Ness, J. Fractional Brownian Motions, Fractional Noises and Applications. SIAM Rev. 1968, 10, 422–437. [Google Scholar] [CrossRef]
- Monin, A.; Yaglom, A. Statistical Fluid Mechanics: Mechanics of Turbulence; Dover Publication: New York, NY, USA, 2007; Volume 2. [Google Scholar]
- Tudor, C. On the Wiener integral with respect to sub-fractional Brownian motion on an interval. J. Math. Anal. Appl. 2009, 351, 456–468. [Google Scholar] [CrossRef]
- Tudor, C.; Zili, M. Covariance measure and stochastic heat equation with fractional noise. Fract. Calc. Appl. Anal. 2014, 17, 807–826. [Google Scholar] [CrossRef]
- Bojdecki, T.; Gorostiza, L.; Talarczyk, A. Sub-fractional Brownian motion and its relation to occupation times. Statist. Probab. Lett. 2004, 69, 405–419. [Google Scholar] [CrossRef]
- Duncan, T.; Hu, Y.; Pasik-Duncan, B. Stochastic Calculus for Fractional Brownian Motion. SIAM J. Control Optim. 2000, 38, 582–612. [Google Scholar] [CrossRef]
- Shen, G.; Yan, L. The stochastic integral with respect to the sub-fractional Brownian motion with H > 1/2. J. Math. Sci. Adv. 2010, 6, 219–239. [Google Scholar]
- Yan, L.; Shen, G.; He, K. Itô’s formula for a sub-fractional Brownian motion. Commun. Stoch. Anal. 2011, 5, 135–159. [Google Scholar] [CrossRef]
- Liu, J.; Yan, L. Remarks on asymptotic behavior of weighted quadratic variation of subfractional Brownian motion. J. Korean Stat. Soc. 2012, 41, 177–187. [Google Scholar] [CrossRef]
- Prakasa, R. On some maximal and integral inequalities for sub-fractional Brownian motion. Stoch. Anal. Appl. 2017, 35, 2017. [Google Scholar]
- Herzog, B. Adopting Feynman–Kac Formula in Stochastic Differential Equations with (Sub-)Fractional Brownian Motion. Mathematics 2022, 10, 340. [Google Scholar] [CrossRef]
- Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558. [Google Scholar] [CrossRef] [PubMed]
- Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 2018, 60, 223–311. [Google Scholar] [CrossRef]
- Kochenderfer, H.; Wheeler, T. Algorithms for Optimization; MIT Press: Cambridge, MA, USA, 2019. [Google Scholar]
- Schmidt, M.; Roux, N.L.; Bach, F. Minimizing finite sums with the stochastic average gradient. Math. Program. 2017, 162, 83–112. [Google Scholar] [CrossRef]
- Haochen, J.; Sra, S. Random Shuffling Beats SGD after Finite Epochs. Proc. Mach. Learn. Res. 2019, 97, 2624–2633. [Google Scholar]
- Gotmare, A.; Keskar, N.S.; Xiong, C.; Socher, R. A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation. arXiv 2018, arXiv:1810.13243. [Google Scholar]
- Curtis, F.E.; Scheinberg, K. Adaptive Stochastic Optimization: A Framework for Analyzing Stochastic Optimization Algorithms. IEEE Signal Process 2020, 37, 32–42. [Google Scholar] [CrossRef]
- de Roos, F.; Jidling, C.; Wills, A.; Schön, T.; Hennig, P. A Probabilistically Motivated Learning Rate Adaptation for Stochastic Optimization. arXiv 2021, arXiv:2102.10880. [Google Scholar]
- Rosenblatt, F. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychol. Rev. 1958, 65, 386–408. [Google Scholar] [CrossRef] [PubMed]
- Alòs, E.; Mazet, O.; Nualart, D. Stochastic Calculus with Respect to Gaussian processes. Ann. Probab. 2001, 29, 766–801. [Google Scholar] [CrossRef]
- Calin, O. Deep Learning Architectures; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
- Golub, G.H.; van Loan, C. Matrix Computations; Johns Hopkins Press: Baltimore, MD, USA, 1996. [Google Scholar]
- Ornstein, L.; Uhlenbeck, G. On the theory of Brownian motion. Phys. Rev. 1930, 36, 823–841. [Google Scholar]
© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
