SAMD Optimization Algorithm
The traditional gradient descent algorithm is slow when the training data are complex. While optimizing an objective function, gradient descent is apt to converge to a local minimum [28]. Moreover, the gradient depends on derivatives, which are imprecise when the objective function is nonsmooth. In addition, nonsmooth and unknown gradient flows must be solved by numerical schemes, and these schemes require derivatives of higher than second order. Implementing such gradient flows is difficult, because numerical artifacts occur once derivatives above second order are involved, which is usually the case for these gradients.
The optimization objective of the mirror descent algorithm using the Bregman divergence as the mirror map is [13]:

x_{t+1} = argmin_{x ∈ Ω} { ⟨g_t, x⟩ + L · D(x, x_t) },   (4)

where L is a Lipschitz constant, x indicates the parameters to be trained, Ω is the search space of x, g_t is the (sub)gradient at step t, and D(·, ·) is the Bregman divergence.
However, for the two terms, the inner-product term ⟨g_t, x⟩ and the Bregman-divergence term D(x, x_t), there is only the single parameter L in the projection calculation. In the original mirror descent algorithm (see Equation (4)), the constant L (called the Lipschitz constant) is put forward to regulate the proportion between these two terms. L is preset as a hyperparameter before the mirror descent iteration is performed, so during the iterative process the algorithm cannot dynamically adjust the influence of the two terms on the iterative results according to the iteration step. This makes the mirror descent algorithm rather inflexible. Moreover, the single parameter can make the algorithm oscillate during the iteration, resulting in poor performance of the optimization algorithm. We therefore borrow the idea of momentum and introduce a step factor α_t and a deviation correction factor β_t to dynamically adjust the speed of change during the iteration, so that the process becomes more stable. In this paper, the optimization objective is changed to:

x_{t+1} = argmin_{x ∈ Ω} { α_t ⟨g_t, x⟩ + β_t D(x, x_t) }.   (5)
Among them, α_t is used to control the change of the inner-product term, and β_t is used to control the change of the Bregman-divergence term. According to the projection calculation process of the mirror descent algorithm, the weights of the two terms change with the iterations, which can accelerate the progress of the optimization algorithm toward the optimal solution to a certain extent. In consideration of the above trends and related knowledge, we choose α_t and β_t so that α_t changes with the number of iterations and its rate of change is greater than that of β_t.
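As a concrete illustration of the modified projection step, the following is a minimal sketch, assuming a squared-Euclidean Bregman divergence D(x, y) = ½‖x − y‖² (so the argmin has a closed form) and an illustrative schedule for the step factor; the schedules here are assumptions, not the paper's exact settings:

```python
import numpy as np

def samd_step(x_t, subgrad, alpha_t, beta_t):
    """One projection step of the modified mirror descent objective:
    x_{t+1} = argmin_x { alpha_t*<g, x> + beta_t*D(x, x_t) }.
    With D(x, y) = 0.5*||x - y||^2 the minimizer is x_t - (alpha_t/beta_t)*g.
    """
    return x_t - (alpha_t / beta_t) * subgrad

# Toy usage on the nonsmooth f(x) = ||x||_1, whose subgradient is sign(x).
x = np.array([1.0, -2.0])
for t in range(1, 50):
    g = np.sign(x)              # a subgradient of the L1 norm
    alpha_t = 1.0 / np.sqrt(t)  # step factor varying with t (assumed schedule)
    beta_t = 1.0                # deviation correction factor (assumed constant here)
    x = samd_step(x, g, alpha_t, beta_t)
```

With β_t fixed, the effective step size is α_t/β_t, which makes explicit how the two factors jointly regulate the proportion between the inner-product term and the Bregman term.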
The mirror descent algorithm is effective when the gradient can be obtained by a numerical calculation algorithm. However, when the objective function is nonsmooth or the derivative does not exist, the gradient cannot be calculated, and at this point gradient descent algorithms no longer work. Therefore, we use the subgradient instead of the traditional gradient in this paper. The subgradient performs well on nondifferentiable convex optimization problems. The subgradients of the objective function f can be regarded as a cluster of supporting hyperplanes, represented by the subdifferential ∂f(x). Furthermore, an unbiased estimate of the subgradient is used to replace the gradient, so that the results of the numerical analysis are unique, unbiased, and bounded. To prove this point, we give Theorem 1.
Theorem 1. Given an objective function f, for any x ∈ Ω, there exists a real number M subject to E‖g̃(x)‖ ≤ M, where g̃(x) is the unbiased estimate of the subgradient of f at x.
Proof. f is a finite function on a finite set; that is, it is uniformly bounded. Therefore, the unbiased estimate of its subgradients is also bounded. The mathematical expectation of a bounded function is bounded. Obviously, g̃(x) is a bounded function. So, E‖g̃(x)‖ is bounded, and there exists a real number M subject to E‖g̃(x)‖ ≤ M. □
Based on some knowledge of mathematical analysis, we can prove Theorem 1. Theorem 1 shows that when we calculate the unbiased estimate of the subgradient of f, the result is bounded. Hence, using g̃(x) can avoid the vanishing gradient problem. g̃(x_t) is calculated by averaging samples of the subgradient, where each sample is drawn from the supporting hyperplanes of f at x_t. In this way, the gradient of f can be replaced by g̃(x_t), which avoids the vanishing gradient problem.
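To make the unbiased-estimate idea concrete, here is a small sketch (not the paper's exact construction): for f(x) = |x|, any value in [−1, 1] is a valid subgradient at the kink x = 0, and sampling uniformly from that interval gives a stochastic subgradient that is bounded and whose average is an unbiased estimate, in line with Theorem 1:

```python
import random

def subgradient_abs(x, rng=random):
    """A stochastic subgradient of f(x) = |x|: deterministic away from 0,
    uniformly sampled from the subdifferential [-1, 1] at the kink."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return rng.uniform(-1.0, 1.0)  # any g in [-1, 1] supports f at 0

random.seed(0)
samples = [subgradient_abs(0.0) for _ in range(10_000)]
estimate = sum(samples) / len(samples)  # unbiased estimate; expectation is 0
# Every sample lies in [-1, 1], so the estimate is bounded as Theorem 1 states.
```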
Finally, the iterative weighted average method is used to further control the change of the objective function so that the algorithm can reach the optimal solution faster. We define x̄_t as the weighted average value at the t-th iteration step. It can be calculated by the following equation:

x̄_t = (1 − λ_t) x̄_{t−1} + λ_t x_t,

where x̄_{t−1} is the weighted average value in the previous step and x_t is calculated by Equation (5). λ_t is a dynamically adjusted parameter, which is used to adjust the weight between x̄_{t−1} and x_t in different iteration steps. λ_t is determined by accumulating the reciprocals of the inner-product-term weights of the current step and the previous steps. The accumulated reciprocal of the inner-product-term weight at step t is represented by A_t, and the calculation method [29] is:

A_t = A_{t−1} + 1/α_t,   λ_t = (1/α_t) / A_t.
This setting keeps the step span of each iteration from becoming too large, so the algorithm can reach the optimal solution more accurately.
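The averaging recursion described above can be sketched as follows; the accumulator and blend weight follow the description in the text, while the concrete α values in the usage line are only illustrative:

```python
def weighted_average(iterates, alphas):
    """Iterative weighted average: x_bar_t = (1 - lam_t)*x_bar_{t-1} + lam_t*x_t,
    where lam_t = (1/alpha_t) / A_t and A_t accumulates the reciprocals
    of the inner-product-term weights [29]."""
    A = 0.0
    x_bar = 0.0
    for x_t, alpha_t in zip(iterates, alphas):
        w = 1.0 / alpha_t   # reciprocal of the inner-product-term weight
        A += w              # A_t = A_{t-1} + 1/alpha_t
        lam = w / A         # lam_t = (1/alpha_t) / A_t
        x_bar = (1.0 - lam) * x_bar + lam * x_t
    return x_bar

# With equal alphas the recursion reduces to the plain mean of the iterates.
avg = weighted_average([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Because λ_t shrinks as A_t grows, late iterates move the running average less, which is what damps overly large steps.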
The approach is summarized in Algorithm 1.
Algorithm 1 shows our proposed stochastic unbiased subgradient accelerated mirror descent algorithm. Line 1 gives the initialization of the whole algorithm. Lines 3–4 calculate the step factor α_t and the deviation correction factor β_t, which are put forward to dynamically regulate the impact of the inner-product term and the Bregman-divergence term on the iterative equation (see Line 6). Line 5 calculates our proposed unbiased subgradient estimate. Lines 7–8 show our proposed iterative weighted average method. By using this method, our proposed SAMD algorithm can converge faster and avoid the error caused by too large a step size. Lines 9–12 evaluate whether the SAMD algorithm fulfills the convergence condition. Through these iterations, our proposed SAMD algorithm can train the diagnostic network until it reaches the optimal condition. The convergence of our proposed SAMD is analyzed below.
Algorithm 1 A stochastic unbiased subgradient accelerated mirror descent algorithm
Input: Loss function f
Output: The optimization result x̄_T
1: Initial: x_0, x̄_0, A_0; obtain the hyperparameters by grid search
2: while t ≤ T do
3:   Calculate the step factor α_t
4:   Calculate the deviation correction factor β_t
5:   Calculate the unbiased subgradient estimate g̃(x_t)
6:   Set x_{t+1} by the iterative equation (Equation (5))
7:   Calculate A_t and λ_t
8:   Set x̄_t by the weighted average equation
9:   if the convergence condition is satisfied then
10:     Let x̄_T = x̄_t
11:     break
12:   end if
13: end while
14: return x̄_T.
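The loop of Algorithm 1 can be sketched as runnable code under simplifying assumptions: a squared-Euclidean Bregman divergence (so Line 6 has a closed form), illustrative schedules α_t = 1/√t and β_t = 1, and a simple subgradient-norm convergence test. The paper's exact schedules and convergence condition come from grid search and are not reproduced here:

```python
import numpy as np

def samd(loss_subgrad, x0, T=1000, tol=1e-3, seed=0):
    """Sketch of the stochastic unbiased subgradient accelerated mirror
    descent loop. loss_subgrad(x, rng) returns a stochastic subgradient."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    x_bar, A = x.copy(), 0.0
    for t in range(1, T + 1):
        alpha_t = 1.0 / np.sqrt(t)        # Line 3: step factor (assumed schedule)
        beta_t = 1.0                      # Line 4: correction factor (assumed)
        g = loss_subgrad(x, rng)          # Line 5: unbiased subgradient estimate
        x = x - (alpha_t / beta_t) * g    # Line 6: mirror step, Euclidean case
        A += 1.0 / alpha_t                # Line 7: accumulate reciprocal weights
        lam = (1.0 / alpha_t) / A
        x_bar = (1.0 - lam) * x_bar + lam * x  # Line 8: weighted average
        if np.linalg.norm(g) < tol:       # Lines 9-12: convergence check
            break
    return x_bar                          # Line 14

# Usage: minimize the strongly convex f(x) = ||x||^2, whose subgradient is 2x.
result = samd(lambda x, rng: 2.0 * x, x0=[4.0, -4.0])
```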
Convergence of SAMD
The following lemma can be obtained according to the properties of strongly convex functions and of the Bregman divergence constructed from a Bregman function.
Lemma 1. For any x, y, z ∈ Ω, we have:

D(x, z) − D(y, z) − D(x, y) = ⟨∇ψ(y) − ∇ψ(z), x − y⟩,

where the constrained set Ω is a closed convex set and ψ is the Bregman function generating the divergence D. Proof. According to the definition of Bregman divergence (see Equation (1)), the following three equations can be deduced:

D(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩,   (10)
D(x, z) = ψ(x) − ψ(z) − ⟨∇ψ(z), x − z⟩,   (11)
D(y, z) = ψ(y) − ψ(z) − ⟨∇ψ(z), y − z⟩.   (12)

Then, we have the following by subtracting Equation (12) from Equation (11):

D(x, z) − D(y, z) = ψ(x) − ψ(y) − ⟨∇ψ(z), x − y⟩.   (13)

Subtracting Equation (10) from Equation (13), we then have

D(x, z) − D(y, z) − D(x, y) = ⟨∇ψ(y) − ∇ψ(z), x − y⟩.

So, the lemma holds. □
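The three-point identity of the Bregman divergence, D(x, z) − D(y, z) − D(x, y) = ⟨∇ψ(y) − ∇ψ(z), x − y⟩, can be checked numerically. In this sketch ψ is taken as the negative entropy (an assumption for illustration), whose Bregman divergence is the generalized KL divergence:

```python
import numpy as np

def bregman_negentropy(a, b):
    """Bregman divergence of psi(v) = sum(v*log(v)): generalized KL divergence."""
    return float(np.sum(a * np.log(a / b) - a + b))

def grad_psi(v):
    """Gradient of the negative entropy: log(v) + 1, elementwise."""
    return np.log(v) + 1.0

rng = np.random.default_rng(1)
x, y, z = rng.uniform(0.1, 2.0, size=(3, 4))  # random points in the domain
lhs = bregman_negentropy(x, z) - bregman_negentropy(y, z) - bregman_negentropy(x, y)
rhs = float(np.dot(grad_psi(y) - grad_psi(z), x - y))
# lhs and rhs agree up to floating-point error, confirming the identity.
```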
This lemma is proved by the properties of the Bregman divergence and is used to obtain the convergence of the SAMD algorithm. Based on some knowledge of mathematical analysis, and to prove the convergence of the unbiased subgradient accelerated mirror descent algorithm, the following lemma can be deduced:
Lemma 2. When the step factors satisfy the setting of Algorithm 1, for any x ∈ Ω, we have the following inequality, where F_t is a sigma-field, i.e., the filtration generated by the randomness of the algorithm up to step t. Proof. From the first-order necessary condition, we can obtain
Then, it can be obtained that
From Lemma 1, it can be concluded that
Substituting Equation (18) into Equation (17), we can obtain
The Bregman divergence has the following property (see Equations (2) and (3)):
Then, Equation (
19) can be transformed to
Rearranging the above inequalities, we obtain:
By using the Fenchel inequality, Equation (
23) is transformed to
The above expression is rearranged as:
Substituting Equation (21) into Equation (25), we have
According to Algorithm 1, the step factors are constrained, so the relevant coefficient has a bounded value range. Then, we take conditional expectations with respect to F_t:
□
From Lemma 2, we can deduce the following theorem, which establishes the convergence of our proposed unbiased subgradient accelerated mirror descent algorithm.
Theorem 2. For an optimization problem min_{x ∈ Ω} f(x), where f is a strongly convex function, x̄_T is the solution obtained by the iterative algorithm, and x* is the optimal solution, we have the bound below. Proof. In Lemma 2, let x = x*, and we can obtain
By the strong convexity of f, we can obtain
Substituting Equation (30) into Equation (29), dividing both sides of the inequality by the accumulated weight, and taking the total expectation, we have:
When the algorithm iterates to step T, we can draw the following conclusion:
□
According to Theorem 2, our proposed SAMD algorithm can converge at O(1/t), a square order faster than gradient descent algorithms such as Adam and SGD (stochastic gradient descent), which have O(1/√t) convergence. It is proved theoretically that the SAMD algorithm proposed in our paper can accelerate the convergence of the training process of the diagnostic network. At the same time, the description of the algorithm and the proofs of the theorems show that the SAMD algorithm can avoid the problems of vanishing gradients and local optimal solutions.