**4. The Proposed Method**

The main idea starts from a fixed-point iteration method, together with the condition on the cost function (0 ≤ *C*(*w*, *b*)) and the condition on its first derivative. We define an auxiliary function H as

$$\mathbb{H}(w) = \lambda \mathbb{C}(w) + \frac{\partial \mathbb{C}(w)}{\partial w}.$$

where *λ* is chosen to be positive or negative according to the initial sign of *∂C*(*w*)/*∂w*. Using H(*w*), we keep *w* moving whenever the value of *C*(*w*) is still large, even if *w* falls into a local minimum.
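
As an illustration, the following is a minimal sketch (ours, not the authors' code) of how H(*w*) can be assembled for a one-variable cost function. The finite-difference derivative, the magnitude of *λ*, and the rule that *λ* takes the same sign as the initial derivative are our reading of the description above and of the hyperparameter choices reported in Section 5.

```python
import numpy as np

def make_H(C, w0, lam_mag=1e-2, h=1e-6):
    """Build the auxiliary function H(w) = lambda*C(w) + dC/dw.

    Assumption: the sign of lambda follows the sign of dC/dw at the
    starting point w0 (our inference from the reported settings)."""
    dC = lambda w: (C(w + h) - C(w - h)) / (2 * h)   # central finite difference
    s = np.sign(dC(w0))
    lam = (s if s != 0 else 1.0) * lam_mag
    return lambda w: lam * C(w) + dC(w)

# Example with the one-variable cost function used later in Section 5.1
C = lambda w: (w + 5) * (w + 3) * (w - 1) * (w - 10) / 800 + 3
H = make_H(C, w0=-9.0)       # dC/dw < 0 at w0 = -9, so lambda = -1e-2 here
```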

#### *Optimization*

*The iteration method* is

$$w\_{i+1} = w\_i - \eta \frac{\hat{m}\_i}{\sqrt{\hat{v}\_i + \epsilon}} \tag{1}$$

where

$$\begin{aligned} m\_i &= \ \beta\_1 m\_{i-1} + (1 - \beta\_1) \mathbb{H}(w\_i), \\ v\_i &= \ \beta\_2 v\_{i-1} + (1 - \beta\_2) \left(\mathbb{H}(w\_i)\right)^2, \end{aligned} \tag{2}$$

and

$$\hat{m}\_i = m\_i/(1 - \beta\_1), \qquad \hat{v}\_i = v\_i/(1 - \beta\_2).$$
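
For concreteness, a minimal sketch (ours) of one run of the iteration (1)–(2) for a scalar parameter is shown below; the variable names and the fixed iteration count are illustrative.

```python
import numpy as np

def proposed_update(H, w0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=100):
    """Iterate Equations (1)-(2), using the auxiliary function H(w) in place
    of the raw gradient, starting from m_0 = v_0 = 0."""
    w, m, v = w0, 0.0, 0.0
    for _ in range(n_iter):
        h = H(w)
        m = beta1 * m + (1 - beta1) * h           # Eq. (2): first moment
        v = beta2 * v + (1 - beta2) * h ** 2      # Eq. (2): second moment
        m_hat = m / (1 - beta1)                   # hat{m}_i
        v_hat = v / (1 - beta2)                   # hat{v}_i
        w = w - eta * m_hat / np.sqrt(v_hat + eps)   # Eq. (1)
    return w
```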

**Theorem 1.** *The iteration method in (1) is convergent.*

**Proof.** Equation (2) can be rewritten as

$$\begin{aligned} m\_i &= \beta\_1^i m\_0 + (1 - \beta\_1) \sum\_{k=1}^i \beta\_1^{k-1} \mathbb{H}(w\_{i-k+1}) \\ &= (1 - \beta\_1) \sum\_{k=1}^i \beta\_1^{k-1} \mathbb{H}(w\_{i-k+1}) \end{aligned}$$

and

$$\begin{aligned} v\_i &=& \beta\_2^i v\_0 + (1 - \beta\_2) \sum\_{k=1}^i \beta\_2^{k-1} \left( \mathbb{H}(w\_{i-k+1}) \right)^2 \\ &=& \left( 1 - \beta\_2 \right) \sum\_{k=1}^i \beta\_2^{k-1} \left( \mathbb{H}(w\_{i-k+1}) \right)^2 \end{aligned}$$

under *m*0 = 0 and *v*0 = 0. Therefore, Equation (1) can be rewritten as

$$\begin{array}{rcl}w\_{i+1} &=& w\_i - \eta \frac{\sqrt{1-\beta\_2}}{1-\beta\_1} \times \frac{\left(1-\beta\_1\right) \sum\_{k=1}^i \beta\_1^{k-1} \mathbb{H}(w\_{i-k+1})}{\sqrt{\left(1-\beta\_2\right) \sum\_{k=1}^i \beta\_2^{k-1} \left(\mathbb{H}(w\_{i-k+1})\right)^2 + \epsilon}}\\ &=& w\_i - \eta \frac{\sum\_{k=1}^i \beta\_1^{k-1} \mathbb{H}(w\_{i-k+1})}{\sqrt{\sum\_{k=1}^i \beta\_2^{k-1} \left(\mathbb{H}(w\_{i-k+1})\right)^2 + \epsilon}}\\ &=& w\_i - \eta \frac{S\_i}{\sqrt{SS\_i + \epsilon}}\end{array} \tag{3}$$

where

$$S\_i = \sum\_{k=1}^i \beta\_1^{k-1} \mathbb{H}(w\_{i-k+1}) \quad \text{and} \quad SS\_i = \sum\_{k=1}^i \beta\_2^{k-1} \left(\mathbb{H}(w\_{i-k+1})\right)^2.$$

Here, *ε* is introduced to exclude the case of dividing by 0, and it can be regarded as 0 whenever division by 0 does not occur. As a result of a simple calculation, the following relation can be obtained:

$$\left(S\_i\right)^2 \le \left(\frac{1 - \left(\frac{\beta\_1^2}{\beta\_2}\right)^i}{1 - \left(\frac{\beta\_1^2}{\beta\_2}\right)}\right) SS\_i.$$

The calculation behind this relation between *Si* and *SSi* is explained in Corollary 1. From Equation (3), we have

$$w\_{i+1} = w\_i - \eta \frac{S\_i}{\left(SS\_i + \epsilon\right)^{1/2}}.$$

Using the relation between *Si* and *SSi*,

$$\implies \qquad |w\_{i+1} - w\_i| \le \eta \left( \frac{1 - \left(\frac{\beta\_1^2}{\beta\_2}\right)^i}{1 - \left(\frac{\beta\_1^2}{\beta\_2}\right)} \right)$$


Here, since *ε* is introduced only to prevent division by zero, it can be considered to be zero in the calculation. After a sufficiently large iteration number *τ*, with a sufficiently small value of *η*, and using Taylor's theorem, we can obtain

$$\begin{array}{rcl} \mathrm{H}(w\_{i+1}) & \approx & \mathrm{H}(w\_{i}) + \frac{\partial \mathrm{H}}{\partial w} \left(w\_{i+1} - w\_{i}\right) \\ \mathrm{H}(w\_{i+1}) & \approx & \mathrm{H}(w\_{i}) + \eta \frac{\partial \mathrm{H}}{\partial w} \left(\frac{1 - \left(\frac{\beta\_{1}^{2}}{\beta\_{2}}\right)^{i}}{1 - \left(\frac{\beta\_{1}^{2}}{\beta\_{2}}\right)}\right) \\ \mathrm{H}(w\_{i+1}) & \approx & \mathrm{H}(w\_{i}) + \xi\_{i} \end{array} \tag{4}$$

where

$$\xi\_i = \eta \frac{\partial \mathrm{H}}{\partial w} \left( \frac{1 - \left(\frac{\beta\_1^2}{\beta\_2}\right)^i}{1 - \left(\frac{\beta\_1^2}{\beta\_2}\right)} \right)$$

and *ξi* is negligibly small. This is possible because we set *η* to a very small value.

Looking at the relationship between *Si* and *Si*−1, and using (4),

$$\begin{aligned} S\_i &= \mathcal{H}(w\_i) + \beta\_1 S\_{i-1} \\ \left( S\_i - \frac{\mathcal{H}(w\_i)}{1 - \beta\_1} \right) &= \beta\_1 \left( S\_{i-1} - \frac{\mathcal{H}(w\_{i-1})}{1 - \beta\_1} \right) - \frac{\beta\_1}{1 - \beta\_1} \xi \\ \left( S\_i - \frac{\mathcal{H}(w\_i)}{1 - \beta\_1} + \frac{\beta\_1}{(1 - \beta\_1)^2} \xi \right) &= \beta\_1 \left( S\_{i-1} - \frac{\mathcal{H}(w\_{i-1})}{1 - \beta\_1} + \frac{\beta\_1}{(1 - \beta\_1)^2} \xi \right) . \end{aligned}$$

Applying this recursion repeatedly, we have

$$\left( S\_i - \frac{\mathcal{H}(w\_i)}{1 - \beta\_1} + \frac{\beta\_1}{(1 - \beta\_1)^2} \xi \right) = \beta\_1^i \left( S\_0 - \frac{\mathcal{H}(w\_0)}{1 - \beta\_1} + \frac{\beta\_1}{(1 - \beta\_1)^2} \xi \right).$$

Therefore,

$$S\_i = \frac{\mathcal{H}(w\_i)}{1 - \beta\_1} - \frac{\beta\_1}{(1 - \beta\_1)^2} \xi + \beta\_1^i \left( S\_0 - \frac{\mathcal{H}(w\_0)}{1 - \beta\_1} + \frac{\beta\_1}{(1 - \beta\_1)^2} \xi \right).$$

If the initial condition *S*<sup>0</sup> is defined as a fairly large value, the following equation can be obtained:

$$\frac{S\_i}{S\_{i-1}} = \frac{\frac{\frac{\mathcal{H}(w\_i)}{1-\beta\_1} - \gamma\_1}{\gamma\_2} + \beta\_1^i}{\frac{\frac{\mathcal{H}(w\_{i-1})}{1-\beta\_1} - \gamma\_1}{\gamma\_2} + \beta\_1^{i-1}} \approx \beta\_1,$$

where

$$\begin{array}{rcl}\gamma\_1 &=& \frac{\beta\_1}{\left(1-\beta\_1\right)^2}\xi, \\\gamma\_2 &=& S\_0 - \frac{\mathcal{H}(w\_0)}{1-\beta\_1} + \frac{\beta\_1}{\left(1-\beta\_1\right)^2}\xi. \end{array}$$

Through a similar process, we have

$$\begin{aligned} SS\_i &= \left(\mathbb{H}(w\_i)\right)^2 + \beta\_2 \, SS\_{i-1} \\ \left(SS\_i - \frac{\left(\mathbb{H}(w\_i)\right)^2}{1 - \beta\_2} - \delta\right) &= \beta\_2 \left(SS\_{i-1} - \frac{\left(\mathbb{H}(w\_{i-1})\right)^2}{1 - \beta\_2} - \delta\right), \end{aligned}$$

where *δ* is 2*ξ* (H(*wi*−1) + *ξ*/2) /(1 − *β*2) and can be treated as a constant because *ξ* is sufficiently small. Iterating this recursion gives

$$SS\_i = \frac{(\mathbb{H}(w\_i))^2}{1-\beta\_2} + \delta + \beta\_2^i \left( SS\_0 - \frac{(\mathbb{H}(w\_0))^2}{1-\beta\_2} - \delta \right).$$

If the initial condition *SS*<sup>0</sup> is defined as a fairly large value, the following equation can be obtained:

$$\frac{SS\_i}{SS\_{i-1}} = \frac{\frac{\left(\mathcal{H}(w\_i)\right)^2 + \delta}{\gamma\_3} + \beta\_2^i}{\frac{\left(\mathcal{H}(w\_{i-1})\right)^2 + \delta}{\gamma\_3} + \beta\_2^{i-1}} \approx \beta\_2$$

where

$$\gamma\_3 = \left(1 - \beta\_2\right) SS\_0 - \left(\mathcal{H}(w\_0)\right)^2 - \delta.$$

Assuming that the initial values *S*<sup>0</sup> and *SS*<sup>0</sup> are sufficiently large, Equation (3) can be changed as follows:

$$\left| \frac{w\_{i+1} - w\_i}{w\_i - w\_{i-1}} \right| = \left| \frac{\frac{S\_i}{S\_{i-1}}}{\left(\frac{SS\_i}{SS\_{i-1}}\right)^{1/2}} \right| = \frac{\beta\_1}{\beta\_2^{1/2}} < 1.$$

For *wi* to converge (i.e., to form *a Cauchy sequence*), the following condition should be satisfied:

$$\left| \frac{w\_{i+1} - w\_i}{w\_i - w\_{i-1}} \right| \le \gamma < 1.$$

To satisfy this condition, the larger the value of *β*2 is, the better. However, if *β*2 is larger than 1, the value of *vi* can become negative and we would have to compute complex values. As a result, *β*2 is preferably as close to 1 as possible. Conversely, the smaller the value of *β*1 is, the better. However, if *β*1 is too small, *wi* converges quickly with only small changes, so the convergence value that we want cannot be achieved. As a result, *β*1 is also preferably as close to 1 as possible. Therefore, it is better to choose the values in the range *β*1<sup>2</sup> ≤ *β*2. Generally, *β*1 is 0.9 and *β*2 is 0.999. After computing, we have

$$|w\_{i+1} - w\_i| \le \gamma^{i-\tau} |w\_{\tau+1} - w\_\tau|.$$

As the iteration continues, the value of *γ*<sup>*i*−*τ*</sup> converges to zero. Therefore, after a sufficiently large number of iterations (greater than *τ*), consecutive iterates *wi* become essentially equal.
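
As a quick numerical check with the default values *β*1 = 0.9 and *β*2 = 0.999,

$$\frac{\beta\_1^2}{\beta\_2} = \frac{0.81}{0.999} \approx 0.811 < 1, \qquad \gamma = \frac{\beta\_1}{\beta\_2^{1/2}} = \frac{0.9}{0.9995} \approx 0.9005 < 1,$$

so *γ*<sup>*i*−*τ*</sup> shrinks by a factor of roughly *γ*<sup>100</sup> ≈ 2.8 × 10<sup>−5</sup> for every 100 iterations beyond *τ*.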

**Corollary 1.** *The relationship between Si and SSi is*

$$\left(S\_i\right)^2 \le \left(\frac{1 - \left(\frac{\beta\_1^2}{\beta\_2}\right)^i}{1 - \left(\frac{\beta\_1^2}{\beta\_2}\right)}\right) SS\_i.$$

**Proof.**

$$\begin{aligned} S\_i &= \sum\_{k=1}^i \beta\_1^{k-1} \mathbb{H}(w\_{i-k+1}) \\ &= \mathbb{H}(w\_i) + \beta\_1 \mathbb{H}(w\_{i-1}) + \dots + \beta\_1^{i-1} \mathbb{H}(w\_1) \\ &= \mathbb{H}(w\_i) + \frac{\beta\_1}{\left(\beta\_2\right)^{1/2}} \left(\beta\_2\right)^{1/2} \mathbb{H}(w\_{i-1}) \\ &+ \dots + \frac{\beta\_1^{i-1}}{\left(\beta\_2^{i-1}\right)^{1/2}} \left(\beta\_2^{i-1}\right)^{1/2} \mathbb{H}(w\_1) \end{aligned}$$

Using the general Cauchy–Schwarz inequality, we have

$$\begin{split} S\_i^2 &\le \quad \left\{ 1 + \left( \frac{\beta\_1}{\beta\_2^{1/2}} \right)^2 + \dots + \left( \frac{\beta\_1^{i-1}}{\beta\_2^{(i-1)/2}} \right)^2 \right\} \\ &\quad \times \left\{ \left( \mathbf{H}(w\_i) \right)^2 + \beta\_2 \left( \mathbf{H}(w\_{i-1}) \right)^2 + \dots + \beta\_2^{i-1} \left( \mathbf{H}(w\_1) \right)^2 \right\} \\ &\le \quad \left( \frac{1 - \left( \frac{\beta\_1^2}{\beta\_2} \right)^i}{1 - \left( \frac{\beta\_1^2}{\beta\_2} \right)} \right) S S\_i. \end{split}$$
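
This bound can also be spot-checked numerically; the sketch below (ours) draws arbitrary values for H(*w*1), …, H(*wi*) and compares both sides of the inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, i = 0.9, 0.999, 50
h = rng.normal(size=i)                        # stand-ins for H(w_1), ..., H(w_i)

k = np.arange(1, i + 1)
S  = np.sum(beta1 ** (k - 1) * h[::-1])       # S_i  = sum_k beta1^(k-1) H(w_{i-k+1})
SS = np.sum(beta2 ** (k - 1) * h[::-1] ** 2)  # SS_i = sum_k beta2^(k-1) H(w_{i-k+1})^2

r = beta1 ** 2 / beta2
assert S ** 2 <= (1 - r ** i) / (1 - r) * SS  # Corollary 1 holds for any such values
```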

**Theorem 2.** *The limit of wi satisfies* lim*i*→∞ H(*wi*) = 0*.*

**Proof.** When the limit of *wi* is *w*∗, using (5) and the continuity of H, the following equation is obtained:

$$\begin{aligned} w\_\* &= w\_\* - \eta \frac{\sum\_{k=\tau}^{\*} \beta\_1^{k-1} \mathbf{H}(w\_{i-k+1})}{\sqrt{\sum\_{k=\tau}^{\*} \beta\_2^{k-1} \left(\mathbf{H}(w\_{i-k+1})\right)^2 + \epsilon}},\\ 0 &= \frac{\sum\_{k=\tau}^{\*} \beta\_1^{k-1} \mathbf{H}(w\_{i-k+1})}{\sqrt{\sum\_{k=\tau}^{\*} \beta\_2^{k-1} \left(\mathbf{H}(w\_{i-k+1})\right)^2 + \epsilon}}.\end{aligned}$$

The effect of *ε* is to avoid making the denominator zero, so the denominator of the above equation is not zero. We can therefore obtain

$$\sum\_{k=\tau}^{\*} \beta\_1^{k-1} \mathbf{H}(w\_{i-k+1}) = 0.$$

Since *β*1 < 1, *β*1<sup>2</sup> < *β*2, and *wi* converges to *w*∗, and assuming that ∗ is a fairly large number, *w*∗ and *w*∗−1 are close, and *β*1<sup>*κ*</sup> can be regarded as 0 beyond an appropriately large integer *κ*. Therefore, we can get

$$\begin{split} 0 &= \mathbb{H}(w\_\*) + \beta\_1 \mathbb{H}(w\_{\*-1}) + \beta\_1^2 \mathbb{H}(w\_{\*-2}) + \dots + \beta\_1^{\kappa} \mathbb{H}(w\_{\*-\kappa}) \\ &\approx \mathbb{H}(w\_\*) \left( 1 + \beta\_1 + \beta\_1^2 + \dots + \beta\_1^{\kappa} \right) \\ &= \mathbb{H}(w\_\*) \frac{1 - \beta\_1^{\kappa+1}}{1 - \beta\_1} .\end{split}$$

The limit *w*∗ of the sequence *wi* is thus a value that brings the auxiliary function H close to zero.

#### **5. Numerical Tests**

In these numerical tests, we perform several experiments to show the novelty of the proposed method. The GD method, the ADAM method, the AdaMax method, and the proposed method are compared in each experiment. Please note that techniques such as mini-batching, epochs, and dropout are not included. The *β*1 and *β*2 used in ADAM and AdaMax are fixed at 0.9 and 0.999, respectively, and *ε* is set to 10<sup>−8</sup>. These are the default values of *β*1, *β*2, and *ε*.

#### *5.1. One-Variable Non-Convex Function Test*

Since cost functions generally have a non-convex property, we test each method with a simple non-convex function in this experiment. The cost function is *C*(*w*) = (*w* + 5)(*w* + 3)(*w* − 1)(*w* − 10)/800 + 3 and has the global minimum at *w* ≈ 7.1047. The starting point, *w*0, is initialized at −9 and the iteration number is 100. The reason this cost function is divided by 800 is that, without this scaling, the degree-4 function takes very large values, which is far from a realistic problem.
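
A minimal sketch (ours, not the authors' code) of this experiment is given below, using the hyperparameters reported in the caption of Figure 3 (learning rate 0.2, *β*1 = 0.95, *β*2 = 0.9999, *λ* = −10<sup>−2</sup>); the analytic derivative is written out by hand.

```python
import numpy as np

C  = lambda w: (w + 5) * (w + 3) * (w - 1) * (w - 10) / 800 + 3
dC = lambda w: ((w + 3) * (w - 1) * (w - 10) + (w + 5) * (w - 1) * (w - 10)
                + (w + 5) * (w + 3) * (w - 10) + (w + 5) * (w + 3) * (w - 1)) / 800

lam = -1e-2                                   # lambda from the caption of Figure 3
H = lambda w: lam * C(w) + dC(w)              # auxiliary function of Section 4

w, m, v = -9.0, 0.0, 0.0                      # w_0 = -9
eta, beta1, beta2, eps = 0.2, 0.95, 0.9999, 1e-8
for _ in range(100):                          # 100 iterations
    h = H(w)
    m = beta1 * m + (1 - beta1) * h
    v = beta2 * v + (1 - beta2) * h ** 2
    w -= eta * (m / (1 - beta1)) / np.sqrt(v / (1 - beta2) + eps)

# Figure 3 reports that, with these settings, the proposed method passes the
# local minimum near w = -4 and reaches the global minimum near w = 7.1.
print(w)
```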

Figure 2 shows the change in the cost function (*C*(*w*)) according to the change in *w*.

**Figure 2.** Cost function with one local minimum.

Figure 3 shows the iterations of the four methods over *C*(*w*) with *w*0 = −9. In this experiment, GD, ADAM, and AdaMax fall into a local minimum, and it is confirmed that they show no further movement. On the other hand, the proposed method is confirmed to settle at the global minimum beyond the local minimum. Although the global minimum is near *w* = 7, the other methods stayed near *w* = −4.

**Figure 3.** Comparison with the other schemes. The learning rate is 0.2, and the *β*1, *β*2, and *λ* used in the proposed method are 0.95, 0.9999, and −10<sup>−2</sup>, respectively.

#### *5.2. Two-Variable Non-Convex Function Test*

In this section, we experiment with three two-variable cost functions. The first and second experiments are to find the global minimum of the Beale function and the Styblinski–Tang function, respectively, and the third experiment is to test whether the proposed method works effectively at the saddle point.

#### 5.2.1. Beale function

The Beale function is defined by

$$C(w\_1, w\_2) = (1.5 - w\_1 + w\_1 w\_2)^2 + (2.25 - w\_1 + w\_1 w\_2^2)^2 + (2.625 - w\_1 + w\_1 w\_2^3)^2$$

and has the global minimum at (3, 0.5). Figure 4 shows the Beale function.
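
As a quick check of the stated minimizer, each of the three squared terms vanishes at (3, 0.5), so the global minimum value is 0:

$$1.5 - 3 + 3(0.5) = 0, \qquad 2.25 - 3 + 3(0.5)^2 = 0, \qquad 2.625 - 3 + 3(0.5)^3 = 0.$$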

**Figure 4.** Beale Function.

Figure 5 shows the results of each method. GD's learning rate was set to 10<sup>−4</sup>, which is different from the other methods, because only GD shows a very large gradient and diverging weights with a larger rate. We confirm that all methods converge well because this function is convex around the given starting point.

**Figure 5.** Result of the Beale function with an initial point of (2, 2). *W*0 = (2, 2) and the iteration number is 1000. The learning rate is 0.1, and the *β*1, *β*2, and *λ* used in the proposed method are 0.9, 0.999, and −0.5, respectively. GD's learning rate was set to 10<sup>−4</sup>.

Table 1 shows the weight change of each method as the iterations proceed. As can be seen from this table, the proposed method shows the best performance.


**Table 1.** Change of weights of each method.

To see the results from another starting point, we experiment with the same cost function using a different starting point (−4, 4), the hyperparameter *λ* = −0.2, and 50,000 iterations.

Figure 6 shows the results of each method. In this experiment, we confirm that GD, ADAM, and AdaMax fall into a local minimum and stop there, whereas the proposed method reaches the global minimum effectively.

**Figure 6.** Result of Beale function with initial point (−4, 4).

#### 5.2.2. Styblinski–Tang function

The Styblinski–Tang function is defined by

$$C(w\_1, w\_2) = \left( (w\_1^4 - 16w\_1^2 + 5w\_1) + (w\_2^4 - 16w\_2^2 + 5w\_2) \right)/2 + 80$$

and has the global minimum at (−2.903534, −2.903534).

Figure 7 shows the Styblinski–Tang function. In this experiment, we present a result with the starting point *W*<sup>0</sup> = (6, 0). Please note that a local minimum point ≈ (2.7468, −2.9035) is located between *W*<sup>0</sup> and the global minimum point.

**Figure 7.** Styblinski–Tang Function.

Figure 8 shows the results of each method. Only the proposed method finds the global minimum; the other methods could not escape the local minimum.

**Figure 8.** Result of the Styblinski–Tang function with initial point (6, 0). The learning rate is 0.1, and the *β*1, *β*2, and *λ* used in the proposed method are 0.95, 0.9999, and 0.2, respectively. GD's learning rate was set to 10<sup>−3</sup>.

#### 5.2.3. Function with a Saddle Point

The cost function

$$C(w\_1, w\_2) = w\_2^2 - w\_1^2 + 2$$

is shown in Figure 9. The Hessian matrix of this cost function is

$$\begin{pmatrix} -2 & 0 \\ 0 & 2 \end{pmatrix},$$

so this cost function has a saddle point at (0, 0).
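
To make the saddle structure explicit, the gradient and its value at the origin are

$$\nabla C(w\_1, w\_2) = \left(-2 w\_1,\ 2 w\_2\right), \qquad \nabla C(0, 0) = (0, 0),$$

and the two Hessian eigenvalues, −2 and 2, have opposite signs, so (0, 0) is neither a minimum nor a maximum.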

In this experiment, we present results based on two different starting points. The first starting point is *W*<sup>0</sup> = (0.001, 0.2).


**Figure 9.** Cost function *C*(*w*1, *w*2) = *w*2<sup>2</sup> − *w*1<sup>2</sup> + 2.

Figure 10 and Table 2 show the results of each method. In this experiment, we see that the proposed method changes the parameters much more rapidly than the other three methods near the saddle point.

**Figure 10.** The results near the saddle point with an initial point of (0.001, 2). The iteration number is 50, the learning rate is 10<sup>−2</sup>, and the *β*1, *β*2, and *λ* used in the proposed method are 0.95, 0.9999, and −5 × 10<sup>−3</sup>, respectively.



Then, we perform an experiment with the same cost function, but with another starting point, (0, 0.01), and with an iteration number of 100. The hyperparameters in this experiment are the same as above; the only change is the starting point.

Figure 11 shows the result of this experiment with an initial point of (0, 0.01). In this experiment, since one coordinate of the initial value is 0, the corresponding gradient component *∂C*/*∂w*1 = −2*w*1 is zero and remains zero, so the purely gradient-based methods never move in the *w*1 direction; in contrast, H also contains the term *λC*(*w*), which is nonzero there. As a result, the other methods do not work.

**Figure 11.** The result of the saddle point with an initial point of (0, 0.01).

### *5.3. Two-Dimensional Region Segmentation Test*

To see how the methods divide a region in a more complicated situation, we introduce a region-separation problem in the shape of a whirlwind. The values 0 and 1 are assigned along the shape of the whirlwind, and the regions are divided by training with each method. Table 3b shows the pseudocode for generating the training dataset for this experiment, and Table 3a is a visualization of the dataset.



To solve this problem, we use a 2-layer neural network with 25 nodes in each layer. First, the neural network is trained with each method. After that, [−2, 2] × [−2, 2] is divided into a 60 × 60 grid, and each of the 3600 grid points is checked to decide which region it belongs to.
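
A minimal sketch (ours; the actual training loop and the Table 3 pseudocode are not reproduced here) of this grid evaluation is shown below. The stand-in network uses two hidden layers of 25 nodes with random weights purely to make the snippet runnable; in the experiment, these weights come from training with each optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Stand-in for the trained network (two hidden layers of 25 nodes, our reading
# of the "2-layer, 25-row" description); weights are random placeholders.
W1, b1 = rng.normal(size=(2, 25)),  np.zeros(25)
W2, b2 = rng.normal(size=(25, 25)), np.zeros(25)
W3, b3 = rng.normal(size=(25, 1)),  np.zeros(1)
predict = lambda X: sigmoid(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) @ W3 + b3).ravel()

# Divide [-2, 2] x [-2, 2] into a 60 x 60 grid (3600 coordinates) and assign
# each point to region 0 or 1 according to the network output.
xs = np.linspace(-2.0, 2.0, 60)
grid = np.array([(x, y) for x in xs for y in xs])
labels = (predict(grid) > 0.5).astype(int)
```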

The results are presented in Figure 12. Compared with the other schemes, the proposed method divides the region along the direction of the whirlwind more faithfully.

With the same artificial neural network structure, different results were obtained for each method, which reflects how accurately each method learned the locations.
