1. Introduction
In the field of deep learning, the study of optimization algorithms has long been an important research direction, and algorithms based on gradient descent have become the mainstream. Gradient Descent (GD) is the most basic optimization algorithm, and the various improved optimizers used in deep learning today are built on it. It converges quickly, and its update rule can be written simply as $\theta_{t+1} = \theta_t - \mu \nabla f(\theta_t)$, where $\mu$ is the learning rate and $\nabla f(\theta_t)$ is the gradient of the objective $f$ at $\theta_t$ [1]. GD is hardly practical in deep learning, however: it evaluates the entire dataset at each iteration, and as datasets grow larger this easily exhausts GPU memory and RAM. Stochastic Gradient Descent (SGD) accelerates convergence and avoids this excessive memory consumption, but it causes the gradient direction to fluctuate strongly from iteration to iteration. Because of the disadvantages of GD and SGD, Mini-Batch Gradient Descent (MBGD) was proposed; it neither overflows GPU memory nor suffers from the gradient-direction fluctuation problem [2]. MBGD is a compromise between GD and SGD: it degenerates to SGD when the batch size is 1 and becomes GD when the batch size equals the total sample size. In deep learning, a moderate batch size can speed up training [3], so a batch size is generally set; in this application scenario SGD is therefore equated with MBGD, and the same convention is used for the fractional order gradient descent optimizer proposed in this paper (FCGD_G-L). Polyak introduced the concept of momentum [4], which Nesterov later analyzed in detail in the context of convex optimization [5]. The introduction of momentum in deep learning has long been shown to benefit parameter convergence [6]: it speeds up convergence and prevents the parameters from falling into local optima. In addition, several adaptive learning rate algorithms have been proposed, for example AdaGrad by Duchi [7], RMSProp by Tieleman [8], and Adadelta by Zeiler [9]; all of them use the current gradient state to change the learning rate or to calibrate its changes. Kingma proposed Adam [10] by combining the momentum method with an adaptive learning rate. Although these algorithms and their refinements each have their own strengths, their gradients are all based on the first order derivative, which limits further development.
As research on fractional order gradient descent and on deep learning optimization algorithms has intensified, more and more scholars have introduced fractional calculus into deep learning optimizers, making it possible for such optimizers to rely on fractional order differentiation, and some good results have been obtained by exploiting fractional order properties. Under convexity conditions and the Caputo definition, Li studied the convergence rate of GD at different orders; by jointly using the integer order and the fractional order, the parameters finally converge to the integer order extreme points [11]. Using the Riemann–Liouville (R-L) and Caputo definitions, Chen studied GD under convexity conditions and proposed a deformed formula that converges quickly to the integer order extreme points [12]. By transforming the R-L definition and seeking special initial parameters, Wang designed a deep learning optimization algorithm that guarantees the same convergence result as the integer order under convexity conditions, and validated the corresponding optimizer experimentally on the MNIST dataset [13]. Yu designed a deep learning optimizer using the G-L definition with the step size set to two; its current gradient is determined by the gradients within a past fixed-size time window according to specific weights [14]. Kan studied a deep learning optimization algorithm based on the G-L definition, validated the corresponding optimizers on the MNIST and CIFAR-10 datasets with momentum included, and discussed the effect of different step sizes on the results [15]. Khan designed a deep learning optimizer using a fractional order power series and applied it to a recommender system with good results [16,17,18]; however, the constraints of the fractional order power series limit the kinds of loss functions it supports, and the parameter updates can only be kept in the positive range [19]. Constrained optimization problems have been studied by Yaghooti using Caputo [20] and by Viola using R-L [21].
There are various definitions of the fractional order derivative; the commonly used ones are the G-L, R-L, and Caputo definitions [22]. SGD is rarely used directly in deep learning, but it is the basis of the improved optimization algorithms. SGDM is an optimization algorithm with momentum [23]. AdaGrad, RMSProp, and Adadelta form a class of optimization algorithms with adaptive learning rates [24]. Adam is an optimization algorithm that combines momentum with an adaptive learning rate [25,26].
Based on the above discussion, FCGD_G-L is designed in this paper using the G-L fractional order definition. Its current fractional order gradient is obtained by summing the current first order gradient and the first order gradients of the past 10 time steps according to the fractional order property. Compared with the integer order, which can only add momentum and disturbances on top of the gradient descent step, FCGD_G-L can add perturbations inside its own differentiation process to accelerate the descent and to avoid falling into local optima. At the same time, the integer order needs extra momentum computations at every iteration, which increases the computational workload; because the fractional order is, by its nature, equivalent to carrying momentum of its own, this workload is eliminated in FCGD_G-L. The optimizer designed in this paper adds small perturbations to the fractional order differentiation process, which maximizes its ability to find the global optimum while preserving the fractional order properties. The major contributions of this paper are as follows:
In this paper, a novel deep learning optimizer is designed. It is written according to the PyTorch documentation specification and is invoked in the same way as the existing PyTorch optimizers, enriching the variety of available optimizers.
Compared with other fractional order deep learning optimizers, the use of the G-L fractional order definition removes the work of adding momentum and perturbation to the gradient at each iteration, reducing the computational workload and improving the efficiency of the fractional order deep learning optimizer.
The new G-L fractional order definition uses improved Grünwald coefficients, avoiding the use of the Gamma function. In addition, it solves the problem that previous fractional order optimizers are not perfectly compatible with the integer order.
Obtaining the global optimal solution is the goal in this paper, but training easily falls into local optima. Therefore, a constant factor is added before each term of the G-L fractional order definition, and a small internal perturbation is added at the current time step by fine-tuning this factor, which effectively prevents the parameters from falling into local optima.
The deep learning algorithm in this paper provides a new way of thinking: by introducing the fractional order, the optimizer gains one extra hyperparameter, the order. By adjusting the order, the optimizer can be adapted well to different application scenarios and can achieve faster convergence and higher convergence accuracy than the integer order.
The remainder of this paper is organized as follows.
Section 2 introduces the G-L fractional order definition, SGD, and its related improved algorithms.
Section 3 introduces the definition of fractional order gradient descent in this paper and gives the corresponding fractional order optimizer algorithms for SGD and Adam.
Section 4 validates the optimizers of this paper on deep neural network models using two time series datasets, compares them with the corresponding integer order optimizers, analyzes the train loss of each optimizer, and evaluates the effectiveness of the resulting models on the test sets.
Section 5 summarizes FCGD_G-L, pointing out its shortcomings and future improvement directions.
3. G-L Fractional Order Definition
In this section, the fractional order derivative formula used in this paper is first given, and its rationality is illustrated. The corresponding SGD and Adam algorithms are then presented.
3.1. G-L Fractional Order Definition of the Model
From the G-L fractional order definition in Equation (1), and because of the characteristics of computers, the expansion clearly cannot be carried out infinitely in practical calculations; a finite expansion is therefore required. Some scholars have shown that, in neural networks, expanding to the 10th term already characterizes the properties of fractional order derivatives well [15,28,29]. Taking several values of the order, the variation of the coefficients of Equation (2) with the term index is plotted in Figure 1. Figure 1 shows the resulting coefficient curves: once Equation (2) is expanded to the 10th term, the effect of the remaining coefficients on the overall fractional order derivative becomes small. Therefore, the fractional order differentiation in this paper accumulates only the past 10 time steps.
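As a quick numerical check of this truncation, the classical Grünwald coefficients $(-1)^k\binom{\alpha}{k}$ can be generated with the recurrence $c_0 = 1$, $c_k = c_{k-1}\,(1-(\alpha+1)/k)$. The short sketch below is an illustration only, using the classical coefficients rather than the improved coefficients of Equation (2); it prints their values for a few orders so that the decay with the term index can be inspected.

```python
# Minimal sketch: values of the classical Grünwald coefficients (-1)^k * C(alpha, k),
# generated with the recurrence c_0 = 1, c_k = c_{k-1} * (1 - (alpha + 1) / k).
# Illustrative only; the paper's Equation (2) uses improved coefficients.

def grunwald_coefficients(alpha, n_terms=11):
    coeffs = [1.0]
    for k in range(1, n_terms):
        coeffs.append(coeffs[-1] * (1.0 - (alpha + 1.0) / k))
    return coeffs

for alpha in (-0.4, -0.1, 0.2, 0.9):
    values = ", ".join(f"{c:+.4f}" for c in grunwald_coefficients(alpha))
    print(f"alpha = {alpha:+.1f}: {values}")
```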
On the other hand, the step size in Equation (1) is not a continuous value in the parameter update of a neural network [14,15]; it can only take positive integer values. In this paper, the step size is taken to be its minimum value, namely 1, and according to Equations (3) and (4), a derivative equation that can be used for updating the parameters is obtained (Equation (11)). Equation (11) eliminates the computation of the Gamma function and achieves the unification of the fractional order and integer order optimizers. In order to improve the ability of the algorithm to find the global optimal solution, a coefficient is added before each accumulated term; each coefficient takes the value 1 with probability 0.9 and the value 0 with probability 0.1. This yields Equation (12).
According to Equation (12), when the order is taken to be 1, the algorithm degenerates to SGD without momentum. In order to make this degenerate case exactly equal to SGD, Equation (4) is further improved in this paper, giving Equation (13). By using Equation (13), when the order equals 1, the fractional order derivative becomes the integer order derivative and corresponds exactly to SGD; thus, the unification of fractional order and integer order gradient descent is achieved. In the original formula, a negative order denotes integration and a positive order denotes differentiation. Because of the transformation of Equation (4), negative orders in this paper also retain good gradient descent capability while preserving the fractional order property.
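To make the structure of the truncated update concrete, the sketch below shows the general shape of such a fractional order gradient: a weighted combination of the current first order gradient and the first order gradients of the past 10 time steps, with each past term multiplied by a Bernoulli coefficient that is 1 with probability 0.9 and 0 with probability 0.1. The classical Grünwald weights are used as placeholders; the exact weights, normalisation, and placement of the disturbance are defined by the paper's Equations (12) and (13) and are not reproduced here.

```python
import random

def truncated_gl_gradient(grad_history, alpha, n_terms=11, keep_prob=0.9):
    """Illustrative truncated G-L fractional gradient for scalar gradients.

    grad_history[0] is the current first order gradient; grad_history[1:]
    are the gradients of the past time steps, most recent first.
    """
    frac_grad = grad_history[0]                     # k = 0 term: current gradient
    weight = 1.0
    for k in range(1, min(n_terms, len(grad_history))):
        weight *= 1.0 - (alpha + 1.0) / k           # next Grünwald weight
        bern = 1.0 if random.random() < keep_prob else 0.0  # Bernoulli disturbance
        frac_grad += bern * weight * grad_history[k]
    return frac_grad

# Note: the paper further transforms the weights (Equation (13)) so that the
# order 1 case coincides exactly with SGD; the classical weights above are
# only meant to show the overall structure of the accumulation.
```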
3.2. FCGD_G-L Algorithm
From Equations (12) and (13), an SGD based on FCGD_G-L and an Adam based on FCGD_G-L are proposed. In these two algorithms, the integer order differentiation process becomes a fractional order differentiation process, and, thanks to the long memory property of fractional order differentiation, extra momentum is no longer needed.
Algorithm 1 gives FCSGD_G-L, which combines FCGD_G-L and SGD:
Algorithm 1: The SGD optimization algorithm based on FCGD_G-L
Input: learning rate (lr), parameters (params), objective, weight decay, order, disturbance coefficient, fractional coefficient
Initialize: the optimizer state, including the buffer of past first order gradients
For each iteration: compute the first order gradient of the objective, form the fractional order gradient from the current gradient and the gradients of the past 10 time steps according to Equation (13), apply the weight decay and the disturbance coefficient, and update the parameters with the learning rate; return the updated parameters.
In Algorithm 1, because fractional order derivatives are used, a hyperparameter that adjusts the order is added; in addition, the momentum and Nesterov options are removed from the algorithm.
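To illustrate how an update of this kind can be packaged in the PyTorch optimizer interface, the following sketch implements a simplified FCSGD_G-L-style step: it keeps a per-parameter buffer of the past 10 first order gradients, combines them with truncated G-L weights and a Bernoulli disturbance, and applies a plain SGD-style update. The class name, the classical weights, and the placement of the disturbance are illustrative assumptions; this is not the released implementation of Algorithm 1.

```python
import torch
from torch.optim import Optimizer

class FCSGD_GL(Optimizer):
    """Simplified sketch of an SGD step driven by a truncated G-L gradient.

    The weights follow the classical Grünwald recurrence; the paper uses an
    improved coefficient formula (Equation (13)), so this sketch approximates
    the idea rather than reimplementing the published optimizer.
    """

    def __init__(self, params, lr=1e-2, order=0.1, weight_decay=0.0,
                 keep_prob=0.9, n_terms=11):
        defaults = dict(lr=lr, order=order, weight_decay=weight_decay,
                        keep_prob=keep_prob, n_terms=n_terms)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, alpha = group["lr"], group["order"]
            wd, keep_prob = group["weight_decay"], group["keep_prob"]
            n_terms = group["n_terms"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if wd != 0.0:
                    grad = grad.add(p, alpha=wd)
                state = self.state[p]
                history = state.setdefault("grad_history", [])
                # Truncated G-L combination of the current and past gradients.
                frac_grad = grad.clone()
                weight = 1.0
                for k, past in enumerate(history, start=1):
                    weight *= 1.0 - (alpha + 1.0) / k          # Grünwald weight
                    if torch.rand(1).item() < keep_prob:       # Bernoulli disturbance
                        frac_grad.add_(past, alpha=weight)
                # Update the gradient buffer (most recent first, past 10 steps kept).
                history.insert(0, grad.detach().clone())
                del history[n_terms - 1:]
                p.add_(frac_grad, alpha=-lr)
        return loss
```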
Algorithm 2 gives FCAdam_G-L, which combines FCGD_G-L and Adam:
Algorithm 2: The Adam optimization algorithm based on FCGD_G-L
Input: learning rate (lr), betas, parameters (params), objective, weight decay, order, disturbance coefficient, fractional coefficient
Initialize: first moment, second moment, eps, and the buffer of past first order gradients
For each iteration: compute the first order gradient of the objective, form the fractional order gradient according to Equation (13), use it in place of the first order gradient to update the first and second moments, apply the bias correction, and update the parameters with the learning rate; return the updated parameters.
In Algorithm 2, a hyperparameter that can adjust the order is added.
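A corresponding sketch of an Adam-style step driven by the same truncated G-L gradient is given below; the moment updates and bias correction follow standard Adam, while the first order gradient is replaced by the fractional order combination. Again, the class name and the weight formula are assumptions, not the released implementation of Algorithm 2.

```python
import torch
from torch.optim import Optimizer

class FCAdam_GL(Optimizer):
    """Sketch of Adam driven by a truncated G-L gradient (illustrative only)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 order=0.1, weight_decay=0.0, keep_prob=0.9, n_terms=11):
        defaults = dict(lr=lr, betas=betas, eps=eps, order=order,
                        weight_decay=weight_decay, keep_prob=keep_prob,
                        n_terms=n_terms)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                if group["weight_decay"] != 0.0:
                    grad = grad.add(p, alpha=group["weight_decay"])
                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                    state["grad_history"] = []
                state["step"] += 1
                # Replace the first order gradient by the truncated G-L gradient.
                frac_grad, weight = grad.clone(), 1.0
                for k, past in enumerate(state["grad_history"], start=1):
                    weight *= 1.0 - (group["order"] + 1.0) / k
                    if torch.rand(1).item() < group["keep_prob"]:
                        frac_grad.add_(past, alpha=weight)
                state["grad_history"].insert(0, grad.detach().clone())
                del state["grad_history"][group["n_terms"] - 1:]
                # Standard Adam moments and bias correction on the G-L gradient.
                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                exp_avg.mul_(beta1).add_(frac_grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(frac_grad, frac_grad, value=1 - beta2)
                bc1 = 1 - beta1 ** state["step"]
                bc2 = 1 - beta2 ** state["step"]
                denom = (exp_avg_sq / bc2).sqrt_().add_(group["eps"])
                p.addcdiv_(exp_avg / bc1, denom, value=-group["lr"])
        return loss
```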
The above two algorithms combine FCGD_G-L with two classical gradient descent algorithms and have the same time complexity as the original algorithms. Because of FCGD_G-L, the deep learning optimization algorithms become more flexible.
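Assuming the sketched classes above are in scope (FCSGD_GL is used here; FCAdam_GL would be used identically), they plug into a training loop exactly like the built-in PyTorch optimizers; the data and hyperparameter values below are purely illustrative.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data standing in for a real dataset; FCSGD_GL is the illustrative
# optimizer class sketched above, not the released implementation.
x = torch.randn(512, 5)
y = torch.randn(512, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = nn.Linear(5, 1)
criterion = nn.MSELoss()
optimizer = FCSGD_GL(model.parameters(), lr=1e-2, order=0.2)  # order is illustrative

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```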
4. Experiment
In this section, two time series datasets are used to validate FCSGD_G-L and FCAdam_G-L. The first is the Dow Jones Industrial Average (DJIA), which after preprocessing contains 24,298 rows of data spanning 3 February 1930 to 13 October 2022, with five dimensions: Open, High, Low, Volume, and Close; Close is predicted, and the data are split 8:2 into training and test sets [30]. The other is the Electricity Transformer dataset (ETTh1), with 17,420 rows of data and seven dimensions: HUFL, HULL, MUFL, MULL, LUFL, LULL, and OT; OT is predicted, and the data are likewise split 8:2 into training and test sets [31].
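A minimal sketch of the sliding window preparation implied by this setup is given below, assuming the raw data have already been loaded into a NumPy array with the target variable in a known column; the window construction and the 8:2 split follow the description above, while everything else (normalisation, exact column order) is left out.

```python
import numpy as np

def make_windows(values, target_col, window):
    """Turn a (rows, features) array into (samples, window, features) inputs
    and the target value one step after each window."""
    xs, ys = [], []
    for start in range(len(values) - window):
        xs.append(values[start:start + window])
        ys.append(values[start + window, target_col])
    return np.stack(xs), np.array(ys)

def split_8_2(values, target_col, window):
    """8:2 chronological split into training and test sets."""
    x, y = make_windows(values, target_col, window)
    cut = int(0.8 * len(x))
    return (x[:cut], y[:cut]), (x[cut:], y[cut:])

# Example: for DJIA, `values` would hold the columns Open, High, Low, Volume,
# Close, with Close (the last column) as the prediction target.
```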
The computer configuration for the experiments is as follows: the CPU is an AMD Ryzen 7 5800H with Radeon Graphics at 3.20 GHz, and the GPU is an RTX 3060 Laptop GPU.
The order is selected according to which value gives the fastest convergence speed and the highest accuracy during training.
The neural network used throughout the experiments consists of a three-layer LSTM [32] and two Linear layers; its structure is shown in Figure 2.
In Figure 2, each input sample is a matrix whose number of rows is the feature size and whose number of columns is the sliding window size; the intermediate representations are vectors, and the output is a scalar predicted value. As can be seen from Figure 2, the input passes through the three-layer LSTM, and the hidden state of the last layer at the final time step is taken as its output. This output is processed by the first Linear layer, and the result is processed by the second Linear layer to obtain the predicted value.
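A sketch of the network in Figure 2 under the stated structure (a three-layer LSTM followed by two Linear layers producing a scalar prediction) is given below; the hidden sizes are illustrative assumptions, since the exact dimensions are not specified here.

```python
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    """Three stacked LSTM layers followed by two Linear layers (Figure 2 sketch)."""

    def __init__(self, n_features, hidden_size=64, mid_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=3, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, mid_size)
        self.fc2 = nn.Linear(mid_size, 1)

    def forward(self, x):
        # x: (batch, sliding_window, n_features)
        out, _ = self.lstm(x)          # out: (batch, sliding_window, hidden_size)
        last = out[:, -1, :]           # hidden state of the final time step
        return self.fc2(self.fc1(last)).squeeze(-1)  # one scalar per sample

# Example: DJIA uses 5 features and a sliding window of 30.
model = LSTMRegressor(n_features=5)
pred = model(torch.randn(8, 30, 5))    # -> shape (8,)
```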
4.1. Metrics
In this paper, the Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) are used as evaluation metrics [32]. Let $n$ be the total number of samples, $y_i$ the true value, $\hat{y}_i$ the predicted value, and $\bar{y}$ the sample mean; the metrics are then given by Equations (14)–(17).
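For reference, the standard forms of these metrics, written with the symbols just introduced, are as follows; this is a reconstruction of the usual definitions and is assumed to correspond to Equations (14)–(17).

```latex
\mathrm{MSE}  = \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}, \qquad
\mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|, \qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|.
```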
4.2. Training for DJIA
In the training of DJIA, FCSGD_G-L and FCAdam_G-L are used as optimizers, respectively, and the train loss and convergence accuracy of the different optimizers are recorded. The sliding window size is 30 and the batch size is 256. MSE is used as the loss function, and the learning rate of FCSGD_G-L during training is set as in Equation (18).
After 250 iterations, the train loss of FCSGD_G-L at different orders is shown in Figure 3. Figure 3 shows the decreasing trend of the train loss with epoch for FCSGD_G-L fractional orders of −0.1, 0.0, 0.1, 0.2, 0.3, and 0.4. From Figure 3, it can be seen that the train loss on DJIA converges fastest, and the convergence accuracy is highest, at the best-performing order of FCSGD_G-L. Therefore, FCSGD_G-L with this order is compared with SGD and with SGD with momentum (SGDM), the other hyperparameters being left at their defaults, as shown in Figure 4.
It can be seen from Figure 4 that, on DJIA, the train loss of FCSGD_G-L converges faster than that of SGD and SGDM, and FCSGD_G-L reaches a higher convergence accuracy.
When FCAdam_G-L is used to train DJIA, a learning rate smaller than that of SGD is selected to avoid divergence, owing to the characteristics of Adam, while the other hyperparameters remain unchanged; the learning rate is set as in Equation (19).
After 250 iterations, Figure 5 shows the decreasing trend of the train loss with epoch for FCAdam_G-L fractional orders of −0.1, 0.0, 0.1, 0.2, 0.3, and 0.4.
As can be seen from Figure 5, the train loss on DJIA converges with the highest precision at the best-performing order of FCAdam_G-L, and the convergence rates of the different orders are roughly the same. Therefore, FCAdam_G-L with this order is compared with Adam, the other hyperparameters being left at their defaults, resulting in Figure 6.
As can be seen from Figure 6, the train loss of FCAdam_G-L at this order reaches a higher convergence accuracy than Adam on DJIA. In terms of convergence speed, Adam and FCAdam_G-L are the same; combined with Figure 5, it can be seen that FCAdam_G-L and Adam converge at the same speed on DJIA.
4.3. Training for ETTh1
In the training of ETTh1, FCSGD_G-L and FCAdam_G-L are used as optimizers, respectively, and the train loss and convergence accuracy of the different optimizers are recorded. The sliding window size is 72 and the batch size is 256. MSE is used as the loss function; the learning rate of FCSGD_G-L during training is set as in Equation (18), and the learning rate of FCAdam_G-L is set as in Equation (19). After 250 iterations, Figure 7 shows the decreasing trend of the train loss with epoch for FCSGD_G-L fractional orders of −0.7, −0.6, −0.5, −0.4, −0.3, −0.2, −0.1, 0.0, 0.1, 0.2, 0.3, and 0.4, and Figure 8 shows the decreasing trend of the train loss with epoch for FCAdam_G-L fractional orders of −0.1, 0.0, 0.1, 0.2, 0.3, 0.4, and 0.5.
In Figure 7, for ETTh1, FCSGD_G-L performs best at its best-performing order, where its convergence speed and convergence accuracy are the highest. In Figure 8, FCAdam_G-L likewise performs best at its best-performing order, where its convergence accuracy is the highest. The two figures show roughly the same train loss descent range, so, on ETTh1, SGD, SGDM, Adam, FCSGD_G-L, and FCAdam_G-L are compared together in Figure 9.
In Figure 9, FCSGD_G-L at its best order shows the fastest decrease in train loss and the highest convergence accuracy. SGDM with a suitably chosen momentum also performs well, although its convergence accuracy is worse than that of FCSGD_G-L; at the default setting, however, SGDM diverges without any other hyperparameters being changed. FCGD_G-L is also essentially an algorithm with momentum, but as can be seen from Figure 7 and Figure 8, on ETTh1 it not only converges quickly and with high accuracy but is also robust and less likely to diverge. On ETTh1, both Adam and FCAdam_G-L perform poorly; nevertheless, FCAdam_G-L is better than Adam. Among the optimizers that do converge, SGD is the least effective, converging slowly and with the lowest accuracy.
4.4. Evaluation of DJIA and ETTh1
Four evaluation metrics, MSE, RMSE, MAE, and MAPE, were obtained from Equations (14)–(17); they are used in this paper to evaluate the results on the two test sets, and the values are recorded in Table 1 and Table 2. These metrics are used to compare the optimizers related to FCSGD_G-L, without considering the existing optimal network models. The main hyperparameters are those initially set in Section 4, with the learning rates following Equations (18) and (19) and the order and momentum set to the best values discussed above for FCSGD_G-L and FCAdam_G-L on DJIA and ETTh1, respectively. At the end of each epoch, the train loss is compared, the model with the smallest train loss is saved, and this model is then evaluated on the test sets.
In Table 1, because of the high volatility of DJIA, using the full test set is not informative and makes it difficult to show the advantages and disadvantages of each optimizer; therefore, only the first half of the DJIA test set is used. It can be seen from Table 1 that the four metrics of FCAdam_G-L are the best among the five optimizers, and that for FCSGD_G-L all four metrics are better than those of SGD and SGDM. This indicates that FCGD_G-L has an obvious advantage on DJIA.
In Table 2, FCSGD_G-L achieves the best MSE, RMSE, and MAE, and all four metrics of FCAdam_G-L are better than those of Adam. This indicates that FCGD_G-L has an obvious advantage on ETTh1.
5. Conclusions
On DJIA and ETTh1, the train loss of FCGD_G-L converges faster and to a higher accuracy than the corresponding integer order optimizers, and its evaluation results on the test sets are also better. Taking advantage of the fractional order long memory property, FCGD_G-L does not need additional momentum, because it is equivalent to carrying momentum internally. In addition, because of the properties of the G-L fractional order definition, adding perturbations during the iteration becomes flexible. By using the transformed formula of the G-L fractional order definition, the Gamma function is removed in this paper. Moreover, FCGD_G-L encompasses both the integer order and the fractional order, thus achieving the unification of the two.
Algorithms 1 and 2 make full use of the Autograd package of PyTorch to avoid the complicated differentiation process in complex neural networks. The optimizers designed according to Algorithms 1 and 2 are fully compatible with PyTorch: they can be used just like any other existing optimizer. By adjusting the fractional order, the results of the optimizer can be fine-tuned to obtain better convergence speed and convergence results. In Table 1 and Table 2, the evaluation results on the test sets are likewise better than those of the integer order after adjusting the order.
In the foreseeable future, we will further explore the influence of fractional calculus gradient descent on deep neural networks, how to select an appropriate order quickly, and how to reduce the number of hyperparameters. Extending FCGD_G-L to other fields is also a significant research direction.