An Improvement of Adam Based on a Cyclic Exponential Decay Learning Rate and Gradient Norm Constraints
Abstract
1. Introduction
2. CN-Adam Algorithm Design
2.1. Adam Optimization Algorithm
2.2. Cyclic Exponential Decay Learning Rate
2.3. Gradient Norm Constraint Strategy
2.4. CN-Adam Algorithm
Algorithm 1 CN-Adam

Input: initial point θ_0, first-moment decay β_1, second-moment decay β_2, regularization constant ε
Initialize m_0 ← 0, v_0 ← 0
for t = 1 to T do
    g_t ← ∇f_t(θ_{t-1})    (stochastic gradient)
    while the gradient norm ‖g_t‖ violates the norm constraint do
        rescale g_t according to the gradient norm constraint strategy (Section 2.3)
    end while
    m_t ← β_1 · m_{t-1} + (1 - β_1) · g_t    (first-moment estimate)
    v_t ← β_2 · v_{t-1} + (1 - β_2) · g_t²    (second-moment estimate)
    m̂_t ← m_t / (1 - β_1^t),  v̂_t ← v_t / (1 - β_2^t)    (bias correction)
    η_t ← learning rate from the cyclic exponential decay schedule between η_min and η_max (Section 2.2)
    θ_t ← θ_{t-1} - η_t · m̂_t / (√v̂_t + ε)
end for
Return θ_T
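To make Algorithm 1 concrete, the sketch below expresses the same idea as a PyTorch optimizer: a standard Adam update whose step size is drawn from a cyclic exponential decay schedule bounded by a minimum and maximum learning rate, with the gradient rescaled whenever its L2 norm exceeds a threshold. The class name CNAdam, the decay constant, the per-tensor norm check, and the default values of cycle_steps and max_norm are assumptions for illustration only; the authors' exact schedule and constraint are the ones defined in Sections 2.2 and 2.3 and in their released code.

```python
import math
import torch
from torch.optim import Optimizer


class CNAdam(Optimizer):
    """Illustrative sketch of Adam with a cyclic exponential decay learning
    rate and a gradient norm constraint (not the authors' reference code)."""

    def __init__(self, params, lr=1e-3, min_lr=1e-4, max_lr=1e-2,
                 cycle_steps=1000, betas=(0.9, 0.999), eps=1e-8, max_norm=1.0):
        # lr is kept for interface parity; the effective step size in this
        # sketch comes from the cyclic schedule bounded by [min_lr, max_lr].
        defaults = dict(lr=lr, min_lr=min_lr, max_lr=max_lr,
                        cycle_steps=cycle_steps, betas=betas, eps=eps,
                        max_norm=max_norm)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad

                # Gradient norm constraint: rescale the gradient whenever its
                # L2 norm exceeds the threshold (applied per tensor here).
                norm = grad.norm()
                if norm > group["max_norm"]:
                    grad = grad * (group["max_norm"] / (norm + 1e-12))

                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                state["step"] += 1
                t = state["step"]

                # Cyclic exponential decay: restart at max_lr at the start of
                # every cycle and decay exponentially toward min_lr.
                pos = (t - 1) % group["cycle_steps"]
                decay = math.exp(-5.0 * pos / group["cycle_steps"])
                lr_t = group["min_lr"] + (group["max_lr"] - group["min_lr"]) * decay

                # Standard Adam moment estimates with bias correction.
                m, v = state["m"], state["v"]
                m.mul_(beta1).add_(grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)

                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-lr_t)
        return loss
```

With cycle_steps set to the number of mini-batches per epoch, the schedule restarts from the maximum learning rate once per epoch, which is one plausible reading of the cyclic design.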
3. Experimental Design and Analysis of Results
3.1. Configuration of the Experimental Environment
3.2. Experimental Results and Analysis
- (1) Application domain: The study used three datasets: the MNIST handwritten digit recognition dataset, the CIFAR10 dataset of color images in 10 classes, and a medical image dataset from the healthcare domain. Download paths for the datasets are given in the authors' GitHub repository and in the Data Availability Statement of this article.
- (2) Optimization algorithms: The experiments compared seven optimization algorithms: SGD, AdaGrad, Adadelta, Adam, NAdam, StochGradAdam, and CN-Adam.
- (3) Batch size: A batch size of 128 was used in every experiment to ensure consistency and fairness.
- (4) Learning rate setting: The initial learning rate was 0.001 for all three datasets. For the CN-Adam algorithm, the maximum learning rate was 0.01 and the minimum learning rate was 0.0001 (a configuration sketch follows this list).
- (5) Epochs: All experiments were run for 100 epochs so that each optimizer could be assessed on adequately trained models.
- (6) Adjustment of key parameters: Key parameters of the algorithm were fine-tuned for each dataset to keep the experimental results comparable and accurate.
- (7) Data preprocessing: Before the experiments, the necessary preprocessing operations, such as normalization, standardization, and data augmentation, were applied to ensure the quality and consistency of the input data (a preprocessing sketch follows this list).
- (8) Experimental results: Several comparison experiments were conducted on the MNIST, CIFAR10, and medical datasets, considering accuracy (Acc), loss, and GPU power consumption to demonstrate the advantages of the CN-Adam algorithm.
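As an illustration of item (7), the following is a minimal preprocessing and data loading sketch for CIFAR10 using torchvision. The specific augmentation choices (random crop and horizontal flip) and the normalization statistics are assumptions rather than the authors' exact pipeline; the batch size of 128 follows item (3).

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Training transforms: light augmentation plus per-channel normalization.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])
# Test transforms: normalization only, no augmentation.
test_tf = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

train_set = datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
test_set = datasets.CIFAR10("data", train=False, download=True, transform=test_tf)

# Batch size 128, as stated in item (3).
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False, num_workers=2)
```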
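Continuing the sketch above, and reusing the illustrative CNAdam class from Section 2.4, the loop below wires together the stated settings: an initial learning rate of 0.001, CN-Adam learning rate bounds of 0.0001 and 0.01, and 100 training epochs. The ResNet-18 backbone is an assumption chosen only to make the example runnable; it is not necessarily the network used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

device = "cuda" if torch.cuda.is_available() else "cpu"
model = resnet18(num_classes=10).to(device)       # assumed backbone, 10 classes
criterion = nn.CrossEntropyLoss()

# Learning rate settings from item (4); CNAdam is the sketch from Section 2.4.
optimizer = CNAdam(model.parameters(), lr=1e-3, min_lr=1e-4, max_lr=1e-2,
                   cycle_steps=len(train_loader))

for epoch in range(100):                          # 100 epochs, item (5)
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```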
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Jiang, Y.; Liu, J.; Xu, D.; Mandic, D.P. UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic Optimization. arXiv 2023.
- Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Han, J. On the Variance of the Adaptive Learning Rate and Beyond. arXiv 2021.
- Yuan, W.; Gao, K.-X. EAdam Optimizer: How ε Impact Adam. arXiv 2020.
- Liu, M.; Zhang, W.; Orabona, F.; Yang, T. Adam+: A Stochastic Method with Adaptive Variance Reduction. arXiv 2020.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017.
- Guan, L. AdaPlus: Integrating Nesterov Momentum and Precise Stepsize Adjustment on AdamW Basis. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5210–5214.
- Dozat, T. Incorporating Nesterov Momentum into Adam. February 2016. Available online: https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ (accessed on 19 February 2024).
- Zhuang, J.; Tang, T.; Ding, Y.; Tatikonda, S.C.; Dvornek, N.; Papademetris, X.; Duncan, J. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2020; Volume 33, pp. 18795–18806.
- Yun, J. StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling. arXiv 2023.
- Zhang, C.; Shao, Y.; Sun, H.; Xing, L.; Zhao, Q.; Zhang, L. The WuC-Adam algorithm based on joint improvement of Warmup and cosine annealing algorithms. Math. Biosci. Eng. 2023, 21, 1270–1285.
- Tang, Q.; Lécuyer, M. DP-Adam: Correcting DP Bias in Adam's Second Moment Estimation. arXiv 2023.
- Tang, Q.; Shpilevskiy, F.; Lécuyer, M. DP-AdamBC: Your DP-Adam Is Actually DP-SGD (Unless You Apply Bias Correction). arXiv 2023.
- Xia, L.; Massei, S. AdamL: A fast adaptive gradient method incorporating loss function. arXiv 2023.
- Asadi, K.; Fakoor, R.; Sabach, S. Resetting the Optimizer in Deep RL: An Empirical Study. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2023; Volume 36, pp. 72284–72324.
- Bieringer, S.; Kasieczka, G.; Steffen, M.F.; Trabs, M. AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based Optimization. arXiv 2023.
- Xie, X.; Zhou, P.; Li, H.; Lin, Z.; Yan, S. Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models. arXiv 2023.
- Shao, Y.; Fan, S.; Sun, H.; Tan, Z.; Cai, Y.; Zhang, C.; Zhang, L. Multi-Scale Lightweight Neural Network for Steel Surface Defect Detection. Coatings 2023, 13, 1202.
- Shao, Y.; Zhang, C.; Xing, L.; Sun, H.; Zhao, Q.; Zhang, L. A new dust detection method for photovoltaic panel surface based on Pytorch and its economic benefit analysis. Energy AI 2024, 16, 100349.
- Gupta, A.; Dixit, M.; Mishra, V.K.; Singh, A.; Dayal, A. Brain Tumor Segmentation from MRI Images Using Deep Learning Techniques. In Advanced Computing; Springer Nature: Cham, Switzerland, 2023; pp. 434–448.
- Tang, L.Y.W. Severity classification of ground-glass opacity via 2-D convolutional neural network and lung CT scans: A 3-day exploration. arXiv 2023.
- Pandit, B.R.; Alsadoon, A.; Prasad, P.W.; Al Aloussi, S.; Rashid, T.A.; Alsadoon, O.H.; Jerew, O.D. Deep Learning Neural Network for Lung Cancer Classification: Enhanced Optimization Function. Multimed. Tools Appl. 2023, 82, 6605–6624.
- Nanni, L.; Manfe, A.; Maguolo, G.; Lumini, A.; Brahnam, S. High performing ensemble of convolutional neural networks for insect pest image detection. Ecol. Inform. 2022, 67, 101515.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017.
- Smith, L.N. Cyclical Learning Rates for Training Neural Networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472.
| Software | Version |
|---|---|
| Python | 3.10 |
| CUDA | cu118 |
| torch | 2.0.1 |
| torchvision | 0.15.0 |
| Lightning | 2.1.2 |
| wandb | 0.16.0 |
| Dataset | Sample Size | Training Set | Validation Set | Test Set | Categories | Features |
|---|---|---|---|---|---|---|
| MNIST | 70,000 | 55,000 | 5000 | 10,000 | 10 | Grayscale images, few classifications, and difficult to recognize |
| CIFAR10 | 60,000 | 45,000 | 5000 | 10,000 | 10 | Color RGB images, fewer classifications, and difficult to recognize |
| Medical | 1885 | 900 | 485 | 500 | 8 | Color RGB images, fewer classifications, and difficult to recognize |
| Dataset | Optimization Algorithm | Accuracy | Loss |
|---|---|---|---|
| MNIST | SGD | 98.42% | 0.081 |
| | AdaGrad | 98.51% | 0.057 |
| | Adadelta | 96.44% | 0.13 |
| | Adam | 98.53% | 0.056 |
| | NAdam | 98.48% | 0.061 |
| | StochGradAdam | 98.09% | 0.07 |
| | CN-Adam | 98.54% | 0.06 |
| CIFAR10 | SGD | 49.68% | 1.434 |
| | AdaGrad | 30.21% | 1.935 |
| | Adadelta | 23.95% | 2.032 |
| | Adam | 68.49% | 1.21 |
| | NAdam | 68.67% | 1.638 |
| | StochGradAdam | 68.07% | 1.04 |
| | CN-Adam | 72.10% | 0.902 |
| Medical | SGD | 63.60% | 1.185 |
| | AdaGrad | 67.40% | 1.012 |
| | Adadelta | 56.00% | 2.261 |
| | Adam | 72.40% | 0.872 |
| | NAdam | 72.80% | 0.857 |
| | StochGradAdam | 70.20% | 1.055 |
| | CN-Adam | 78.80% | 0.7245 |