#### *5.4. MNIST with CNN*

In the previous sections, we examined simple classification problems using ANNs. Here, we conduct a more practical experiment using a CNN (Convolutional Neural Network). In general, an ANN is given more layers to solve more difficult and complex problems; however, as the number of layers increases, the number of parameters increases, and so do the memory usage and running time. CNNs were introduced for this reason, and their performance was demonstrated at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [11]. Therefore, in this experiment, we examine how our method works with the CNN structure. We used TensorFlow and, for the fairness of the experiment, the MNIST-with-CNN example, which is one of the basic examples provided in TensorFlow (see https://github.com/tensorflow/models/tree/master/tutorials/image/mnist). We ran 8600 iterations and recorded the minibatch cost and validation error of each method. The results are shown in Figure 13; as shown there, all methods reduce both the cost and the error.

**Figure 13.** Comparing the proposed scheme with other schemes: (**a**) minibatch cost of each method; (**b**) validation error of each method.
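For reference, the following is a minimal sketch of how such a comparison can be set up in TensorFlow. It uses the high-level tf.keras API rather than the tutorial's low-level code, and the layer sizes are illustrative rather than the tutorial's exact architecture; 'adam' and 'sgd' stand in for the optimizers being compared, since the proposed optimizer itself is not reproduced here.

```python
# Minimal sketch (not the exact tutorial code): train a small CNN on MNIST
# and record minibatch cost and validation error, as in Figure 13.
# 'adam' and 'sgd' are stand-ins for the optimizers being compared.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0  # shape (60000, 28, 28, 1), scaled to [0, 1]
x_test = x_test[..., None] / 255.0

def make_model():
    # Illustrative architecture; the tutorial's network differs in detail.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 5, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

histories = {}
for name in ('adam', 'sgd'):  # 'sgd' plays the role of plain GD
    model = make_model()
    model.compile(optimizer=name,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # validation_split holds out part of the training set to track the
    # validation error over training, analogous to Figure 13b.
    histories[name] = model.fit(x_train, y_train, batch_size=64,
                                epochs=10, validation_split=0.1)
```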

For further investigation, we magnify the data in Figure 13 from the 6000th iteration onward and show the result in Figure 14.

**Figure 14.** Comparing the proposed scheme with other schemes for iterations beyond 6000.

Figure 14a shows that GD converges the slowest and that the proposed method is more accurate than ADAM. Figure 14b shows the accuracies of the proposed method and ADAM, measured on the test data. From this, we see that the proposed method classifies images better than ADAM does.

Figure 15 shows four sample digits from the MNIST test data that the proposed method classified correctly but that ADAM misclassified in this experiment.

**Figure 15.** Examples that the proposed method predicted correctly, while ADAM did not: (**a**) the answer is 1, but ADAM predicts 7; (**b**) the answer is 2, but ADAM predicts 1; (**c**) the answer is 7, but ADAM predicts 3; (**d**) the answer is 9, but ADAM predicts 8.
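Such examples can be located programmatically. The sketch below assumes the test arrays and two trained models from the previous sketch; the handles `model_proposed` and `model_adam` are ours, for illustration only. It collects test digits that one model classifies correctly and the other does not.

```python
import numpy as np

# Predicted labels on the MNIST test set for two trained models.
# model_proposed and model_adam are hypothetical handles for the
# networks trained with each optimizer.
pred_proposed = np.argmax(model_proposed.predict(x_test), axis=1)
pred_adam = np.argmax(model_adam.predict(x_test), axis=1)

# Indices where the proposed method is correct but ADAM is not,
# i.e., candidates for panels like those in Figure 15.
wins = np.where((pred_proposed == y_test) & (pred_adam != y_test))[0]
for i in wins[:4]:
    print(f"index {i}: answer {y_test[i]}, ADAM predicts {pred_adam[i]}")
```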

#### *5.5. CIFAR-10 with RESNET*

We showed in Section 5.4 that the proposed method is more effective than ADAM in a CNN structure on the MNIST dataset. Since MNIST images are grayscale, each image has size 28 × 28 × 1. In this section, we show that the proposed method is also more effective than ADAM on a dataset larger than MNIST, namely CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html). CIFAR10 is a popular dataset of 32 × 32 × 3 RGB images in 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). It consists of 50,000 training images and 10,000 test images. In this experiment, as in Section 5.4, we used the example code (the same model and the same hyperparameter settings) provided by TensorFlow (see https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator). In Section 5.4, we used a simple CNN, but in this subsection, we use a more complex and high-performance CNN-based network called RESNET (Residual Network). RESNET is a well-known network with good performance at ILSVRC, and we used RESNET 44 in this experiment.
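For readers unfamiliar with the residual structure, the following sketch shows the building block that distinguishes RESNET from a plain CNN: the block's input is added back to the output of a small stack of convolutions through a shortcut (skip) connection. This is an illustrative tf.keras sketch, not the tutorial's estimator code.

```python
import tensorflow as tf

def residual_block(x, filters):
    """One basic residual block: two 3x3 convolutions plus a shortcut.

    Illustrative sketch of the ResNet idea. It assumes x already has
    `filters` channels; real ResNets use a projection shortcut when the
    shape changes, and RESNET 44 stacks many such blocks with
    downsampling between stages.
    """
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding='same')(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding='same')(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Add()([shortcut, y])  # skip connection
    return tf.keras.layers.ReLU()(y)
```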

Figure 16 shows the results of training on the CIFAR10 dataset using the RESNET structure, compared with ADAM. In Figure 16a, the training cost of each method was calculated and plotted every 100 iterations. In Figure 16b, the validation cost of each method was calculated and plotted every 5000 iterations. This shows that the proposed method works effectively for complex networks as well as for simple ones.

**Figure 16.** Results on the CIFAR10 dataset using the RESNET 44 model. The following parameters are the default values: iterations: 80,000; batch size: 128; weight decay: 2 × 10<sup>−4</sup>; learning rate: 0.1; batch norm decay: 0.997.
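As a rough guide, these defaults correspond to the following standard TensorFlow settings. This is an approximate sketch: the tutorial's estimator passes them as command-line flags rather than Keras arguments, and in the comparison above, the SGD baseline would be swapped for ADAM or the proposed method.

```python
import tensorflow as tf

# Approximate Keras equivalents of the defaults listed in Figure 16
# (the tutorial's cifar10_estimator wires these through flags).
WEIGHT_DECAY = 2e-4        # applied as L2 regularization on conv kernels
LEARNING_RATE = 0.1
BATCH_SIZE = 128
BATCH_NORM_DECAY = 0.997   # Keras calls this parameter 'momentum'

regularizer = tf.keras.regularizers.l2(WEIGHT_DECAY)
batch_norm = tf.keras.layers.BatchNormalization(momentum=BATCH_NORM_DECAY)
optimizer = tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE)
```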

#### **6. Conclusions**

In this paper, we proposed a new optimization method based on ADAM. Many existing optimization methods (including ADAM) may fail to work properly depending on the initial point. The proposed method, however, finds the global minimum better than other methods even when there is a local minimum near the starting point, and it shows better overall performance. We tested our method only on models for image datasets such as MNIST and CIFAR10. Our future work is to test it on various other models, such as RNN models for time-series prediction and models for natural language processing.

**Author Contributions:** Conceptualization, D.Y.; Data curation, S.J.; Formal analysis, D.Y. and J.A.; Funding acquisition, D.Y.; Investigation, D.Y.; Methodology, D.Y. and J.A.; Project administration, D.Y.; Resources, S.J.; Software, S.J.; Supervision, D.Y. and J.A.; Validation, J.A. and S.J.; Visualization, S.J.; Writing—original draft, D.Y.; Writing—review & editing, S.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the basic science research program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (grant number NRF-2017R1E1A1A03070311). The second author, Ahn, was also supported by the research fund of Chungnam National University.

**Acknowledgments:** We sincerely thank the anonymous reviewers whose suggestions helped greatly improve and clarify this manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.
