**Dokkyun Yi <sup>1</sup>, Jaehyun Ahn <sup>2</sup> and Sangmin Ji <sup>2,\*</sup>**


Received: 11 December 2019; Accepted: 25 January 2020; Published: 5 February 2020

**Abstract:** A machine is taught by finding the minimum value of the cost function induced by the learning data. Unfortunately, as the amount of learning increases, the non-linear activation functions in the artificial neural network (ANN), the complexity of the artificial intelligence structure, and the non-convex complexity of the cost function all increase. We know that a non-convex function has local minima, and that the first derivative of the cost function is zero at a local minimum. Therefore, methods based on gradient descent optimization do not undergo further change when they fall into a local minimum, because they rely on the first derivative of the cost function. This paper introduces a novel optimization method to make machine learning more efficient; in other words, we construct an effective optimization method for non-convex cost functions. The proposed method solves the problem of falling into a local minimum by adding the cost function to the parameter update rule of the ADAM method. We prove the convergence of the sequences generated by the proposed method and demonstrate its superiority by numerical comparison with gradient descent (GD), ADAM, and AdaMax.

**Keywords:** numerical optimization; ADAM; machine learning; stochastic gradient methods

### **1. Introduction**

Machine learning is a field of computer science that gives computer systems the ability to learn from data without being explicitly programmed. For a machine to learn from data, it introduces a cost function and learns how to minimize it. The cost function is mostly built from the difference between the true value and the value computed by the Artificial Neural Network (ANN) [1–7]. Therefore, the cost function varies with the amount of training data, the non-linear activation functions in the ANN, and the structure of the ANN. These factors generate both singularities and local minima within the cost function. We know that, at such a point (a singularity or local minimum), the first derivative of a differentiable function is zero.

Gradient-based optimization (gradient descent optimization) is widely used to find the minimum value of the cost function [8–15]. Gradient descent (GD) was the first such method to be introduced; it applies a fixed-point iteration that drives the first derivative of the cost function to zero. This method works reasonably well; however, it runs into many difficulties in complex ANNs. To overcome these difficulties, the ADAM (Adaptive moment estimation) method [15] was introduced, which adds the momentum method and controls the size of each step. In other words, this method accumulates a weighted sum of past gradient values, which is the idea of momentum; this is called the 'first momentum'. The weighted sum of squared gradients is computed in the same way and is called the 'second momentum'. The ratio of the first and second momentum values is then used to search for the minimum. More detailed information can be found in [16–22].

The ADAM method is still widely used and works well in most areas. There is a more advanced method called AdaMax (Adaptive moment estimation with Maximum). The AdaMax method replaces the second momentum of ADAM with a running maximum, which makes the method more stable. Various other methods exist; we are particularly interested in GD, ADAM, and AdaMax, because GD was the first to be introduced, ADAM is still widely used, and AdaMax is a modification of ADAM. It has been empirically observed that these algorithms can fail to converge toward an optimal solution [23]. We numerically analyze the most basic GD method, the most widely used ADAM method, and the modified AdaMax method, and, motivated by these numerical observations, we introduce the proposed method. The existing gradient-descent-based methods operate by changing the parameter so that the first derivative of the cost function becomes zero, which is how they find the minimum value of the cost function. For this approach to be valid, the cost function is assumed to be convex [24,25]. However, the first derivative of the cost function is also zero at a local minimum, so these existing methods may converge to a local minimum when one exists. If the value of the cost function at that local minimum is not as small as desired, the parameter should keep changing. To solve this problem, we formulate the optimization using both the first derivative of the cost function and the cost function itself: if the value of the cost function is non-zero, the parameter changes even when the first derivative of the cost function is zero. This is the reason for adding the cost function. We also incorporate the idea of the ADAM method when using these quantities to update the parameter, and we demonstrate the convergence of the resulting sequence.

In summary, our research question is why neural networks are often poorly trained by known optimization methods. Our research goal is to find a new optimization method that resolves this phenomenon. To this end, we first prove the convergence of the new method. Next, we use the simplest cost function to visualize the movements of our method and the basic methods near a local minimum. Then, we compare the performance of our method and ADAM on practical datasets such as MNIST and CIFAR10.

This paper is organized as follows. In Section 2, the definition of the cost function and the cause of its non-convexity are explained. In Section 3, we briefly describe the known gradient-descent-based algorithms: the original GD (Gradient Descent) method, the widely used ADAM method, and finally, AdaMax, an improved version of ADAM. In Section 4, we explain the proposed method, its convergence, and the related conditions. In Section 5, we present several numerical comparisons between the proposed method and the discussed methods. The first case is the numerical comparison on a one-variable non-convex function. We then perform numerical comparisons on non-convex functions in a two-dimensional space. The third case is a numerical comparison between the four methods on a two-dimensional region-separation problem. Finally, we test with MNIST (Modified National Institute of Standards and Technology) and CIFAR10 (The Canadian Institute for Advanced Research), the most basic examples of image analysis. MNIST classifies 10 types of grayscale images, as seen in Figure 1, and this experiment shows that our method is also efficient at analyzing images (https://en.wikipedia.org/wiki/MNIST\_database). CIFAR10 is a dataset that, like MNIST, classifies 10 types, but it consists of RGB images and therefore requires more computation than MNIST. Through these numerical comparisons, we confirm that the proposed method is more efficient than the existing methods described in this paper. In Section 6, we present the conclusions and future work.

**Figure 1.** Examples of MNIST dataset.

#### **2. Cost Function**

In this section, we explain basic machine learning; among the various modified machine learning algorithms, 'machine learning' here refers, for convenience, to this basic form. To understand the working principle of machine learning, we consider the structure with one neuron. Let *x* be the input data and *H*(*x*) the output data, which is obtained by

$$H(x) = \sigma(wx + b),$$

where *w* is the weight, *b* is the bias, and *σ* is a sigmoid function (it is universally called an activation function, and various functions can be used). Therefore, the result of the function *H* is a value between 0 and 1. Non-linearity is added to the cost function by using the non-linear activation function. For machine learning, let *LS* be the set of learning data and let *l* > 2 be the size of *LS*. In other words, indexing the learning data from 1, the learning dataset is *LS* = {(*x*<sub>1</sub>, *y*<sub>1</sub>), (*x*<sub>2</sub>, *y*<sub>2</sub>), ..., (*x*<sub>*l*</sub>, *y*<sub>*l*</sub>)}, where each *x*<sub>*s*</sub> is a real number and each *y*<sub>*s*</sub> is a value between 0 and 1. From *LS*, we can define a cost function as follows:

$$C(w, b) = \frac{1}{l} \sum_{s=1}^{l} \left( y_s - H(x_s) \right)^2.$$

Machine learning is completed by finding the *w* and *b* that minimize the cost function. Unfortunately, the cost function has several local minima, because it is not convex. Furthermore, deepening the structure of an ANN means that the activation function is composed many times, which increases the number of local minima of the cost function. A more complete interpretation and analysis is in progress and will be reported in future work.
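For concreteness, the following minimal Python sketch (not part of the original formulation; the tiny dataset and variable names are purely illustrative) evaluates the single-neuron output *H*(*x*) = *σ*(*wx* + *b*) and the cost *C*(*w*, *b*) over a small learning set.

```python
import numpy as np

def H(x, w, b):
    """Single-neuron output: sigmoid of the affine map w*x + b."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def cost(w, b, xs, ys):
    """Mean-squared cost C(w, b) over the learning set LS = {(x_s, y_s)}."""
    return np.mean((ys - H(xs, w, b)) ** 2)

# A tiny illustrative learning set (l = 4 points), with targets y_s in (0, 1).
xs = np.array([-2.0, -1.0, 1.0, 2.0])
ys = np.array([0.1, 0.3, 0.7, 0.9])

print(cost(0.5, 0.0, xs, ys))  # cost at an arbitrary initial (w, b)
```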

#### **3. Learning Methods**

In this section, we provide a brief description of the well-known optimization methods—GD, ADAM, and AdaMax.

#### *3.1. Gradient Descent Method*

GD is the most basic method and was the first to be introduced. In GD, a fixed-point iteration is built from the first derivative of the cost function. The parameter is changed in each iteration as follows:

$$w_{i+1} = w_i - \eta \frac{\partial C}{\partial w}(w_i).$$

The pseudocode version of this method is given in Algorithm 1.


**Algorithm 1:** Pseudocode of Gradient Descent Method.

*η* : Learning rate
*C*(*w*) : Cost function with parameters *w*
*w*<sub>0</sub> : Initial parameter vector
*i* ← 0 (Initialize time step)
**while** *w* not converged **do**
  *i* ← *i* + 1
  *w*<sub>*i*+1</sub> ← *w*<sub>*i*</sub> − *η* (*∂C*/*∂w*)(*w*<sub>*i*</sub>)
**end while**
**return** *w*<sub>*i*</sub> (Resulting parameters)

As the above formula shows, if the gradient is zero, the parameter no longer changes, so the iteration can become stuck at a local minimum.
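As an illustration of this stalling behaviour, the sketch below (a hypothetical one-dimensional toy cost; the finite-difference gradient is used only to keep the example self-contained) implements Algorithm 1 and converges to the nearby local minimum instead of the global one.

```python
def gradient_descent(cost, w0, eta=0.1, max_iter=1000, tol=1e-8):
    """Plain gradient descent (Algorithm 1) with a finite-difference gradient."""
    w, h = w0, 1e-6
    for _ in range(max_iter):
        grad = (cost(w + h) - cost(w - h)) / (2 * h)  # numerical dC/dw
        w_new = w - eta * grad
        if abs(w_new - w) < tol:  # gradient ~ 0: the update stalls here
            return w_new
        w = w_new
    return w

# Toy non-convex cost: local minimum near w = 0.96, global minimum near w = -1.04.
cost = lambda w: (w ** 2 - 1) ** 2 + 0.3 * w

# Starting at w0 = 1.5, GD is trapped at the local minimum near w = 0.96.
print(gradient_descent(cost, w0=1.5))
```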

#### *3.2. ADAM Method*

The ADAM method is the most widely used method; it is based on the GD method combined with the momentum method and, additionally, an adaptive step size. The first momentum is obtained by

$$m_i = \beta_1 m_{i-1} + (1 - \beta_1) \frac{\partial C}{\partial w}.$$

The second momentum is obtained by

$$\begin{aligned} v_i &= \beta_2 v_{i-1} + (1 - \beta_2) \left( \frac{\partial C}{\partial w} \right)^2, \\ w_{i+1} &= w_i - \eta \frac{\hat{m}_i}{\sqrt{\hat{v}_i} + \epsilon}, \end{aligned}$$

where *m̂*<sub>*i*</sub> = *m*<sub>*i*</sub>/(1 − *β*<sub>1</sub><sup>*i*</sup>) and *v̂*<sub>*i*</sub> = *v*<sub>*i*</sub>/(1 − *β*<sub>2</sub><sup>*i*</sup>). The pseudocode version of this method is given in Algorithm 2.
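The role of the bias-correction factors 1 − *β*<sub>1</sub><sup>*i*</sup> and 1 − *β*<sub>2</sub><sup>*i*</sup> can be seen by unrolling the first-momentum recursion with *m*<sub>0</sub> = 0 (this short derivation is standard for ADAM and is included here only for clarity):

$$m_i = (1 - \beta_1) \sum_{j=1}^{i} \beta_1^{\,i-j} g_j, \qquad (1 - \beta_1) \sum_{j=1}^{i} \beta_1^{\,i-j} = 1 - \beta_1^{\,i},$$

where *g*<sub>*j*</sub> denotes (*∂C*/*∂w*)(*w*<sub>*j*</sub>). Hence, if all gradients were equal to a constant *g*, then *m*<sub>*i*</sub> = (1 − *β*<sub>1</sub><sup>*i*</sup>)*g*, and the corrected estimate *m̂*<sub>*i*</sub> = *m*<sub>*i*</sub>/(1 − *β*<sub>1</sub><sup>*i*</sup>) recovers *g* exactly; the same argument applies to *v̂*<sub>*i*</sub>.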


#### **Algorithm 2:** Pseudocode of ADAM Method.

*η* : Learning rate
*β*<sub>1</sub>, *β*<sub>2</sub> ∈ [0, 1) : Exponential decay rates for the moment estimates
*C*(*w*) : Cost function with parameters *w*
*w*<sub>0</sub> : Initial parameter vector
*m*<sub>0</sub> ← 0, *v*<sub>0</sub> ← 0
*i* ← 0 (Initialize time step)
**while** *w* not converged **do**
  *i* ← *i* + 1
  *m*<sub>*i*</sub> ← *β*<sub>1</sub> · *m*<sub>*i*−1</sub> + (1 − *β*<sub>1</sub>) · (*∂C*/*∂w*)(*w*<sub>*i*</sub>)
  *v*<sub>*i*</sub> ← *β*<sub>2</sub> · *v*<sub>*i*−1</sub> + (1 − *β*<sub>2</sub>) · ((*∂C*/*∂w*)(*w*<sub>*i*</sub>))<sup>2</sup>
  *m̂*<sub>*i*</sub> ← *m*<sub>*i*</sub>/(1 − *β*<sub>1</sub><sup>*i*</sup>)
  *v̂*<sub>*i*</sub> ← *v*<sub>*i*</sub>/(1 − *β*<sub>2</sub><sup>*i*</sup>)
  *w*<sub>*i*+1</sub> ← *w*<sub>*i*</sub> − *η* · *m̂*<sub>*i*</sub>/(√*v̂*<sub>*i*</sub> + *ε*)
**end while**
**return** *w*<sub>*i*</sub> (Resulting parameters)

The ADAM method is a first-order method; thus, it has low time complexity. As the parameter updates repeat, the effective learning rate becomes smaller due to the influence of *m̂*<sub>*i*</sub>/(√*v̂*<sub>*i*</sub> + *ε*), so the iterates vary slowly around the global minimum.
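A minimal scalar Python sketch of the ADAM update in Algorithm 2 is given below (the objective and hyperparameter values are illustrative only; the defaults shown are the ones commonly used for ADAM).

```python
import math

def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=1000):
    """ADAM (Algorithm 2): bias-corrected first and second momentum estimates."""
    w, m, v = w0, 0.0, 0.0
    for i in range(1, n_iter + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g       # first momentum
        v = beta2 * v + (1 - beta2) * g * g   # second momentum
        m_hat = m / (1 - beta1 ** i)          # bias correction
        v_hat = v / (1 - beta2 ** i)
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Example: C(w) = (w - 3)^2 with gradient 2(w - 3); the iterates approach w = 3.
print(adam(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, n_iter=500))
```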

#### *3.3. AdaMax Method*

The AdaMax method is based on the ADAM method and replaces the second momentum calculation with a running maximum:

$$\begin{aligned} u_i &= \max \left( \beta_2 \cdot u_{i-1}, \left| \frac{\partial C}{\partial w} (w_i) \right| \right), \\ w_{i+1} &= w_i - \eta \frac{\hat{m}_i}{u_i}. \end{aligned}$$

The pseudocode version of this method is given in Algorithm 3.

#### **Algorithm 3:** Pseudocode of AdaMax.

*η* : Learning rate
*β*<sub>1</sub>, *β*<sub>2</sub> ∈ [0, 1) : Exponential decay rates for the moment estimates
*C*(*w*) : Cost function with parameters *w*
*w*<sub>0</sub> : Initial parameter vector
*m*<sub>0</sub> ← 0, *u*<sub>0</sub> ← 0
*i* ← 0 (Initialize time step)
**while** *w* not converged **do**
  *i* ← *i* + 1
  *m*<sub>*i*</sub> ← *β*<sub>1</sub> · *m*<sub>*i*−1</sub> + (1 − *β*<sub>1</sub>) · (*∂C*/*∂w*)(*w*<sub>*i*</sub>)
  *u*<sub>*i*</sub> ← max(*β*<sub>2</sub> · *u*<sub>*i*−1</sub>, |(*∂C*/*∂w*)(*w*<sub>*i*</sub>)|)
  *w*<sub>*i*+1</sub> ← *w*<sub>*i*</sub> − (*η*/(1 − *β*<sub>1</sub><sup>*i*</sup>)) · *m*<sub>*i*</sub>/*u*<sub>*i*</sub>
**end while**
**return** *w*<sub>*i*</sub> (Resulting parameters)
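For comparison with the ADAM sketch above, a minimal scalar Python version of the AdaMax update in Algorithm 3 might look as follows (again, the objective and hyperparameter values are illustrative only).

```python
def adamax(grad, w0, eta=0.002, beta1=0.9, beta2=0.999, n_iter=1000):
    """AdaMax (Algorithm 3): the second momentum is replaced by a running maximum."""
    w, m, u = w0, 0.0, 0.0
    for i in range(1, n_iter + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g            # first momentum
        u = max(beta2 * u, abs(g))                 # infinity-norm second momentum
        w = w - (eta / (1 - beta1 ** i)) * m / u   # bias-corrected update
    return w

# Example: C(w) = (w - 3)^2 with gradient 2(w - 3); the iterates approach w = 3.
print(adamax(lambda w: 2 * (w - 3), w0=0.0, eta=0.1, n_iter=500))
```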
