Article

An Improved Medical Image Classification Algorithm Based on Adam Optimizer

Haijing Sun, Wen Zhou, Jiapeng Yang, Yichuan Shao, Lei Xing, Qian Zhao and Le Zhang
1 School of Intelligent Science and Engineering, Shenyang University, Shenyang 110044, China
2 School of Information Engineering, Shenyang University, Shenyang 110044, China
3 School of Chemistry and Chemical Engineering, University of Surrey, Guildford GU2 7XH, UK
4 School of Science, Shenyang University of Technology, Shenyang 110044, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(16), 2509; https://doi.org/10.3390/math12162509
Submission received: 8 July 2024 / Revised: 9 August 2024 / Accepted: 13 August 2024 / Published: 14 August 2024

Abstract

The complexity and poor legibility of medical images make diagnosis inconvenient and difficult for medical personnel. To address these issues, this paper proposes GSL (Gradient Sine Linear), an optimization algorithm that improves on the Adam algorithm by introducing a gradient clipping strategy, periodic adjustment of the learning rate, and a linear interpolation strategy. The gradient clipping technique rescales gradients to prevent gradient explosion, while the periodic learning rate adjustment and linear interpolation strategies adjust the learning rate according to a sinusoidal function, accelerating convergence while reducing drastic parameter fluctuations and improving the efficiency and stability of training. The experimental results show that, compared with the classic Adam algorithm, the proposed algorithm achieves better classification accuracy: the GSL algorithm reaches 78% and 75.2% accuracy on the MobileNetV2 and ShuffleNetV2 networks with the Gastroenterology dataset, and 84.72% and 83.12% on the MobileNetV2 and ShuffleNetV2 networks with the Glaucoma dataset. The GSL optimizer achieved significant performance improvements on various neural network structures and datasets, demonstrating its effectiveness and practicality in deep learning and providing new ideas and methods for addressing the difficulties of medical image recognition.

1. Introduction

Medical image classification plays a crucial role in medical diagnosis. Accurate and efficient classification methods can help doctors quickly identify lesions and develop more precise treatment plans. In recent years, the rapid development of deep learning techniques has provided new possibilities for medical image classification. Optimization algorithms are crucial in deep learning and have achieved great success in image classification and medical image analysis. Lingyi Ouyang, Tao He, and Yiqiao Xing [1] summarized the changes in the structure and function of the retinal neurovascular unit (RNVU) in glaucoma and explored the potential pathogenesis of glaucoma, aiming to provide new ideas for its intervention and treatment. In some cases, doctors cannot reliably judge whether a disease is present, so computer-aided diagnosis [2] is particularly important in the treatment of glaucoma.
Commonly used optimization algorithms include Adam [3], SGD [4], and NAdam [5]. Chen [6] systematically investigated optimal hyperparameter settings for different gradient descent algorithms to improve their performance. Kuangyu Ding et al. [7] addressed the problem of insufficient weight decay in the Adam algorithm, studying the Adam family of optimization methods combined with decoupled weight decay to improve the performance of deep learning models. Adam+ [8] addressed the limitations of the Adam optimizer in variance handling and proposed an improved Adam+ via an adaptive variance reduction method. EAdam [9] targets Adam's deficiencies in convergence speed and stability, analyzes the design principle of EAdam in detail, and demonstrates its performance improvement in different application scenarios. Lu Xia et al. [10] proposed AdamL, a fast adaptive gradient method that incorporates the loss function and investigates the limitations of the Adam optimizer with respect to gradients. Ran Tian and Ankur P. Parikh [11] proposed Amos, an optimizer that combines Adam-style optimization with adaptive weight decay to adapt to the scale of the model. Byeongho Heo et al. [12] proposed AdamP, an optimization algorithm that aims to mitigate the slowdown of momentum optimizers on scale-invariant weights. RAdam, proposed by Liu, Jiang et al. [13], introduces a learning rate rectification mechanism to address training instability caused by the large variance of Adam's adaptive learning rate. Yichuan Shao et al. [14] proposed an improved Adam algorithm based on a cyclic exponential decay learning rate and gradient norm constraints. Hailiang Liu [15] introduced an adaptive gradient method with energy and momentum, providing new ideas for improving optimization algorithms. The multi-objective optimization method of Sedjro S. Hotegni et al. [16] for sparse deep multi-task learning also provides new perspectives and methods for the optimization of multi-task learning. StochGradAdam [17] improves on Adam to address the performance degradation that may occur with large-scale data, using stochastic gradient sampling to accelerate neural network training. Can Zhang et al. [18] proposed the WuC-Adam algorithm, which combines warmup and cosine annealing strategies to address instability in the Adam algorithm. Yichuan Shao, Jiantao Wang et al. [19] introduced the BGE-Adam optimization algorithm, which combines entropy weighting and an adaptive gradient strategy to improve Adam's handling of sparse gradients and convergence speed.
This study provides insights into the challenges faced by the Adam algorithm in terms of convergence and robustness. This study finds that the performance of the Adam algorithm may suffer when the dataset is complex or when the learning rate is not set properly. To address this problem, this study introduces three core techniques: a gradient clipping strategy, periodic learning rate adjustment, and a linear interpolation [20] strategy. Dynamic gradient clipping restricts the gradients to solve the gradient explosion problem: the total norm of all gradients is calculated, and if it exceeds a threshold, all gradients are scaled proportionally so that their total norm does not exceed the threshold. Periodic changes in the learning rate allow the model to escape the current local optimum and accelerate convergence, thereby finding better solutions. Linear interpolation combines the current update with the previous parameter values during the parameter update, smoothing parameter changes and avoiding the drastic shifts caused by a single update. These improvements increase the robustness and effectiveness of the model during training.

2. GSL Algorithm Design

2.1. Dynamic Gradient Clipping Strategy

Gradient explosion is a common problem in deep learning, especially when training deep neural networks. It leads to excessively large gradient values, which in turn cause excessive parameter updates and even numerical overflow, preventing the training process from proceeding normally. To effectively address the gradient explosion problem, this study adopts a dynamic gradient clipping technique to limit the gradients.
First, the L2 norm of the gradient is calculated, where the L2 norm refers to the Euclidean norm. The calculation formula is shown in Equation (1):
$$\|\nabla\theta\|_2 = \sqrt{\sum_{p}\left(\|\nabla p\|_2\right)^2}$$
In the formula, $\|\nabla\theta\|_2$ denotes the total norm of the gradients and $\|\nabla p\|_2$ denotes the L2 norm of the gradient of parameter $p$.
To control the size of the parameter gradients, a clipping coefficient ($clip\_coef$) is defined, with a preset maximum norm $max\_grad\_norm$ as the numerator and the total gradient norm as the denominator. If the total gradient norm exceeds the preset maximum norm, i.e., $clip\_coef < 1$, the gradient of each parameter is scaled proportionally; otherwise, it keeps its original value. The purpose is to prevent gradient explosion caused by overly large gradients and to maintain the stability of gradient updates. The calculation formulas are shown in Equations (2) and (3):
$$clip\_coef = \frac{max\_grad\_norm}{\|\nabla\theta\|_2 + \varepsilon}$$
$$\nabla p \leftarrow \nabla p \times clip\_coef$$
In the formula, $\varepsilon$ is a very small value used to prevent the denominator from being zero, and $\nabla p$ denotes the gradient of parameter $p$.
In this way, the gradients of all parameters are scaled so that the total gradient norm is equal to or less than the maximum norm $max\_grad\_norm$. Traditional gradient clipping usually restricts the total norm of all gradients to a predetermined threshold: the total norm is computed, and if it exceeds the threshold, all gradients are scaled in equal proportion so that the total norm does not exceed it. In contrast, this study uses a more flexible approach to gradient clipping, in which gradients of different parameters may be clipped to different degrees rather than simply being scaled uniformly overall.
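As a minimal PyTorch-style sketch of the uniform-scaling step described by Equations (1)–(3) (the function name, default threshold, and epsilon value are our assumptions, and the more flexible per-parameter variant is not shown):

```python
import torch

def clip_total_grad_norm(parameters, max_grad_norm=1.0, eps=1e-6):
    """Uniform-scaling gradient clipping following Equations (1)-(3)."""
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return None
    # Equation (1): total L2 norm over all parameter gradients
    total_norm = torch.sqrt(sum(torch.sum(g ** 2) for g in grads))
    # Equation (2): clipping coefficient
    clip_coef = max_grad_norm / (total_norm + eps)
    # Equation (3): scale every gradient only when the total norm is too large
    if clip_coef < 1.0:
        for g in grads:
            g.mul_(clip_coef)
    return total_norm
```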

2.2. Periodic Adjustment of Learning Rate and Linear Interpolation Strategies

The learning rate determines the step size of each parameter update. Periodically increasing the learning rate allows the model to escape the current local optimum and accelerates convergence toward a better solution; decreasing the learning rate allows the model to converge more robustly to a new region and make fine adjustments near a good solution, thus finding an even better one.
The linear interpolation technique combines the current update value with the previous parameter value during the parameter update, which smooths parameter changes and avoids the drastic shifts caused by a single update. This helps stabilize the training process, especially when the learning rate is high or the gradients change sharply, and helps prevent the gradient explosion and gradient vanishing problems.
The learning rate is periodically adjusted according to a sinusoidal function, calculated as shown in Equation (4):
$$lr_{adjusted} = lr \times \left(1 + lr\_factor \times \sin\left(\frac{2\pi t}{lr\_period}\right)\right)$$
In the formula, $lr$ represents the base learning rate; $t$ represents the current step number; $lr\_period$ represents the cycle length of the learning rate change; $lr\_factor$ represents the magnitude of the learning rate change; and $lr_{adjusted}$ represents the adjusted learning rate. This periodic variation of the learning rate makes the training process smoother, avoiding oscillations caused by too large a learning rate and overly slow training caused by too small a learning rate.
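A small sketch of Equation (4) follows; the function name is ours, and the default lr_period and lr_factor values mirror the 2000/0.8 setting used in the experiments below:

```python
import math

def sinusoidal_lr(lr, step, lr_period=2000, lr_factor=0.8):
    """Equation (4): periodically scale the base learning rate with a sine term."""
    return lr * (1.0 + lr_factor * math.sin(2.0 * math.pi * step / lr_period))
```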
The standard Adam optimizer dynamically adjusts the learning rate during the training process while performing parameter updates. Its parameter update rule formula is shown in Equation (5):
$$\theta_t = \theta_{t-1} - \frac{\alpha\,\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$
In the formula, $\alpha$ denotes the learning rate; $\varepsilon$ is a very small value used to prevent the denominator from being zero; $\hat{m}_t$ is the bias-corrected first-moment estimate; and $\hat{v}_t$ is the bias-corrected second-moment estimate.
Linear interpolation is a mathematical method for estimating an unknown data point between two known data points under the assumption that the data change linearly between them. Given two points $(x_0, y_0)$ and $(x_1, y_1)$ and a point $x$ between $x_0$ and $x_1$, the linear interpolation formula for the corresponding value of $y$ is shown in Equation (6):
$$y = y_0 + \frac{y_1 - y_0}{x_1 - x_0}\,(x - x_0)$$
In the parameter update process, the idea of linear interpolation is added to smooth parameter changes: the current update value is combined with the previous parameter value to reduce drastic parameter fluctuations, improving the stability of the training process and helping to prevent overfitting. The parameter values computed from the current update are combined by weight with the previous parameter values to ensure a smooth, progressive parameter update. The calculation formulas are shown in Equations (7) and (8):
$$\theta_t^{new} = \theta_{t-1} - \frac{lr_{adjusted}\,\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$
$$\theta_t^{l} = (1 - interp\_factor)\,\theta_t^{new} + interp\_factor\,\theta_{t-1}$$
In the formulas, $\theta_t^{l}$ is the interpolated parameter value at step $t$; $\theta_{t-1}$ is the parameter value at step $t-1$; $\theta_t^{new}$ is the new parameter value computed by the optimization algorithm without interpolation; and $interp\_factor$ denotes the interpolation factor, which controls the proportion of linear interpolation. When $interp\_factor = 0$, there is no interpolation, i.e., all weight is on the current update value, and the parameter update follows the standard Adam update rule directly. When $interp\_factor = 1$, there is full interpolation, i.e., all weight is on the previous parameter value, and no update is performed. When $0 < interp\_factor < 1$, there is partial interpolation: the interpolation factor sets the proportions of the weighted linear combination of the current update value and the previous parameter value, which smooths the parameter update and avoids drastic changes.
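A sketch of the interpolated update of Equations (7) and (8) on a single parameter tensor (the function name, tensor names, and default interp_factor are illustrative):

```python
import torch

def interpolated_step(theta_prev, m_hat, v_hat, lr_adjusted, interp_factor=0.1, eps=1e-8):
    """Blend the Adam-style update (Equation (7)) with the previous value (Equation (8))."""
    theta_new = theta_prev - lr_adjusted * m_hat / (torch.sqrt(v_hat) + eps)
    # interp_factor = 0 keeps the full update; interp_factor = 1 keeps the old parameters
    return (1.0 - interp_factor) * theta_new + interp_factor * theta_prev
```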

2.3. GSL Algorithm

An improved GSL algorithm based on Adam is proposed in this study, which combines a gradient clipping strategy with periodic adjustment of the learning rate and a linear interpolation strategy. The steps are as follows:
Initialization: set the initial learning rate $lr$, the period length of the learning rate change $lr\_period$, the amplitude of the learning rate change $lr\_factor$, the interpolation factor $interp\_factor$, and the maximum norm $max\_grad\_norm$; the first-order momentum $m_t$ and the second-order momentum $v_t$ are both initialized to zero.
Gradient clipping step: calculate the L2 norm of the gradients and define a clipping coefficient ($clip\_coef$) with the preset maximum norm $max\_grad\_norm$ as the numerator and the total gradient norm as the denominator. If the total gradient norm exceeds the maximum norm (i.e., $clip\_coef < 1$), scale the gradients in equal proportion so that the total norm does not exceed the threshold. The purpose is to prevent an excessively large gradient from causing gradient explosion and to maintain the balance and stability of the parameter update.
Periodic learning rate adjustment step: the learning rate is periodically adjusted according to a sinusoidal function.
Linear interpolation parameter update step: by changing the value of the interpolation factor, the proportions of the weighted linear combination of the current update value and the previous parameter value are adjusted to reduce drastic parameter fluctuations, improving the stability of the training process and preventing overfitting.
The GSL optimizer not only adjusts the learning rate periodically and uses the idea of linear interpolation for the parameter update, but also introduces gradient clipping, computing the L2 norm of the gradients and clipping against a threshold to prevent gradient explosion and increase the stability of the algorithm. The specific procedure is shown in Algorithm 1.
Algorithm 1: GSL Algorithm
1: Input: initial point $\theta_0$, first-order momentum $m_t$, second-order momentum $v_t$, first-moment decay $\beta_1$, second-moment decay $\beta_2$, constant $\varepsilon$ to avoid a zero denominator.
2: Initialize $m_0 = 0$ and $v_0 = 0$, $lr$, $lr\_period$, $lr\_factor$, $interp\_factor$, $max\_grad\_norm$
3: For $t = 1$ to $lr\_period$ do
4: $\nabla\theta_t = \nabla f_t(\theta_{t-1})$
5: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla\theta_t$
6: $v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla\theta_t)^2$
7: $\hat{m}_t = m_t / (1 - \beta_1^t)$
8: $\hat{v}_t = v_t / (1 - \beta_2^t)$
9: $\|\nabla\theta\|_2 = \sqrt{\sum_p \left(\|\nabla p\|_2\right)^2}$
10: $clip\_coef = \frac{max\_grad\_norm}{\|\nabla\theta\|_2 + \varepsilon}$
11: If $clip\_coef < 1$
12: $\nabla p = \nabla p \times clip\_coef$
13: $lr_{adjusted} = lr\left(1 + lr\_factor \cdot \sin\left(\frac{2\pi t}{lr\_period}\right)\right)$
14: $step\_size = \frac{lr_{adjusted}}{\sqrt{\hat{v}_t} + \varepsilon}$
15: $\theta_t^{new} = \theta_{t-1} - \frac{lr_{adjusted}\,\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$
16: $\theta_t^{l} = (1 - interp\_factor)\,\theta_t^{new} + interp\_factor\,\theta_{t-1}$
17: End for
18: Return $\theta_t^{l}$
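As a runnable sketch of Algorithm 1, the following PyTorch optimizer reflects our reading of the GSL update; the class name, default hyperparameters, and the uniform clipping step are assumptions, and the authors' reference code is linked in the Data Availability Statement.

```python
import math
import torch
from torch.optim import Optimizer

class GSL(Optimizer):
    """Sketch of the GSL update: gradient clipping + sinusoidal learning rate + interpolation."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 lr_period=2000, lr_factor=0.8, interp_factor=0.1, max_grad_norm=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps, lr_period=lr_period,
                        lr_factor=lr_factor, interp_factor=interp_factor,
                        max_grad_norm=max_grad_norm)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            params = [p for p in group['params'] if p.grad is not None]
            if not params:
                continue
            # Equations (1)-(3): clip all gradients if the total L2 norm exceeds max_grad_norm
            total_norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
            clip_coef = group['max_grad_norm'] / (total_norm + group['eps'])
            if clip_coef < 1.0:
                for p in params:
                    p.grad.mul_(clip_coef)
            beta1, beta2 = group['betas']
            for p in params:
                state = self.state[p]
                if len(state) == 0:
                    state['step'] = 0
                    state['m'] = torch.zeros_like(p)
                    state['v'] = torch.zeros_like(p)
                state['step'] += 1
                t = state['step']
                m, v = state['m'], state['v']
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)              # first moment
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)  # second moment
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)
                # Equation (4): sinusoidal learning rate
                lr_adj = group['lr'] * (1 + group['lr_factor'] *
                                        math.sin(2 * math.pi * t / group['lr_period']))
                # Equations (7)-(8): Adam-style step blended with the previous parameters
                theta_new = p - lr_adj * m_hat / (v_hat.sqrt() + group['eps'])
                p.copy_((1 - group['interp_factor']) * theta_new
                        + group['interp_factor'] * p)
        return loss
```

Under these assumptions it can be dropped in wherever torch.optim.Adam would be used, e.g., optimizer = GSL(model.parameters(), lr=1e-3).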

3. Experiments

3.1. Experimental Dataset and Pre-Processing

In order to evaluate the effectiveness of the GSL algorithm, three public datasets were selected for image classification experiments. CIFAR10 is a fixed-size color image dataset in which all images are 32 × 32 pixels; it contains 10 categories, including airplanes, cars, birds, cats, deer, dogs, and trucks, covering a wide range of real-world objects and allowing the model to learn diverse features. Gastroenterology [21] is a color medical image dataset with eight classes. Glaucoma [22,23,24] is a fundus image dataset for glaucoma detection containing two classes, 1 (glaucoma present) and 0 (glaucoma not present); 8621 images were used for the experiments. The sizes of the experimental datasets and their splits are shown in Table 1.
The images are preprocessed using data augmentation operations common in deep learning, including random rotation, center cropping, image enlargement, and normalization. The processed images are fed into deep learning network models (MobileNetV2, ShuffleNetV2) and trained with different optimization algorithms (including SGD, Adagrad, Adam, NAdam, and the GSL algorithm). Several experiments were conducted to comprehensively evaluate their performance. Taking one of the glaucoma-present images in the Glaucoma dataset as an example, the experimental training diagram for the GSL algorithm on the Glaucoma dataset is shown in Figure 1. The processed images are fed into the neural network model architecture for training, and the resulting outputs are used to evaluate the performance of the GSL algorithm. The experimental procedure flowchart is shown in Figure 2.
Taking one of the polyp images in the Gastroenterology dataset as an example, the image is enlarged by 1.25 times after random rotation and center cropping operations, and the enlarged image is normalized using the per-channel mean and standard deviation of the R, G, and B channels. The Gastroenterology image preprocessing process is shown in Figure 3.
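A sketch of this preprocessing with torchvision; the rotation angle, crop size, and normalization statistics are illustrative assumptions, since the paper does not give exact values:

```python
from torchvision import transforms

# Random rotation, center crop, 1.25x enlargement, and per-channel normalization of R, G, B
preprocess = transforms.Compose([
    transforms.RandomRotation(degrees=15),            # random rotation (angle assumed)
    transforms.CenterCrop(224),                       # center cropping (size assumed)
    transforms.Resize(280),                           # enlarge by 1.25x (224 * 1.25 = 280)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
```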

3.2. Experimental Model Architecture

This study evaluated the performance of the GSL algorithm using the MobileNetV2 and ShuffleNetV2 image classification model architectures. MobileNetV2 [25] is a lightweight convolutional neural network proposed by Google that uses depthwise-separable convolution to decompose a standard convolution into depthwise and pointwise convolutions, which significantly reduces the computational cost and the number of parameters. This makes it suitable for mobile and embedded devices with limited computational resources and power budgets while maintaining high accuracy on a variety of tasks. ShuffleNetV2 is a lightweight convolutional neural network architecture designed for efficient mobile computing that reduces overall computational complexity by using fewer parameters and less computation. It improves on the ShuffleNetV1 architecture by introducing new unit structures that combine depthwise-separable convolution and pointwise convolution with channel shuffle operations to enhance information flow while maintaining high accuracy and performance.
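Both architectures are available in torchvision. A minimal sketch of how they could be instantiated for the eight-class Gastroenterology task; replacing the final layer is our assumption of the usual fine-tuning recipe, not a detail given in the paper:

```python
import torch.nn as nn
from torchvision import models

num_classes = 8  # e.g., the Gastroenterology dataset

# weights=None trains from scratch; pretrained weights could also be used
mobilenet = models.mobilenet_v2(weights=None)
mobilenet.classifier[1] = nn.Linear(mobilenet.last_channel, num_classes)

shufflenet = models.shufflenet_v2_x1_0(weights=None)
shufflenet.fc = nn.Linear(shufflenet.fc.in_features, num_classes)
```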

3.3. Experimental Evaluation Criteria

The performance of the GSL algorithm was compared with other optimization algorithms from several aspects: training the models and evaluating their accuracy and loss value on the validation or test set, and observing their versatility by training on different datasets and different neural network architectures. The GSL algorithm is compared experimentally with classical optimization algorithms such as SGD, Adam, and Adagrad, which allows a clearer assessment of its strengths and weaknesses relative to other algorithms. The following aspects are specifically considered:
(1)
Experimental setup:
Three typical datasets were selected, including CIFAR10, Gastroenterology and Glaucoma, and on each dataset this study used the same neural network architecture and hyperparameter settings to ensure a fair comparison.
(2)
Experimental comparison of different algorithms:
Compare the performance of the GSL algorithm with classical optimization algorithms such as SGD, Adam, Adagrad, and NAdam on the target task. This study observes the performance of the different algorithms within the same number of training iterations, including accuracy and loss value. Accuracy is a basic measure of the performance of a classification model, indicating the proportion of correctly classified samples out of the total number of samples. The accuracy is calculated as shown in Equation (9):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
In the formula, $TP$ denotes the number of true positives, correctly predicted as positive; $TN$ denotes the number of true negatives, correctly predicted as negative; $FP$ denotes the number of false positives, incorrectly predicted as positive; and $FN$ denotes the number of false negatives, incorrectly predicted as negative.
The loss function is used to measure the gap between the predicted and true values of the model. The loss value is calculated as shown in Equation (10):
$$H(p, q) = -\sum_{i=1}^{n} p(x_i)\log\left(q(x_i)\right)$$
In the formula, $H(p, q)$ represents the cross-entropy loss; $x_i$ represents the possible classes; $p(x_i)$ is the expected output for the sample; $q(x_i)$ is the actual output for the sample; and $n$ represents the number of sample classes.
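A small sketch of how these two metrics are typically computed for one batch in PyTorch (the function and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def evaluate_batch(logits, labels):
    """Return cross-entropy loss (Equation (10)) and accuracy (Equation (9)) for one batch."""
    loss = F.cross_entropy(logits, labels)        # -sum p(x) log q(x), averaged over the batch
    preds = logits.argmax(dim=1)
    accuracy = (preds == labels).float().mean()   # correctly classified / total samples
    return loss, accuracy
```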
If the GSL algorithm achieves better performance within the same number of training iterations, then it can be considered as an effective improvement.
(3)
Versatility ability evaluation:
This study conducts experiments on different datasets (CIFAR10, Gastroenterology, and Glaucoma) and different neural network architectures (two lightweight neural networks, MobileNetV2 and ShuffleNetV2) to observe the versatility of the GSL algorithm. By observing the performance of the GSL algorithm on different datasets and neural network architectures, its versatility as well as its strengths and weaknesses relative to other algorithms can be evaluated.

3.4. Experimental Results and Analysis

The choice of hyperparameters is very important for the performance of the GSL algorithm: different hyperparameter values perform differently when training different neural networks, and the same holds for different datasets. This study uses the public classical dataset CIFAR10 to illustrate this issue. The experiment was designed with three combinations of the cycle length of the learning rate change $lr\_period$, the magnitude of the learning rate change $lr\_factor$, and the interpolation factor $interp\_factor$, and training was set to 20 epochs. The performance of the algorithm varies with these three hyperparameters. The results of experiments with different hyperparameters and different neural network architectures on the CIFAR10 dataset are shown in Table 2, which reports the performance of the GSL algorithm for different hyperparameter settings on the MobileNetV2 and ShuffleNetV2 networks.
As Table 2 shows, different hyperparameter settings yield different experimental results on the same neural network, and the GSL algorithm therefore performs differently; the same holds when training on different neural networks. Because of this dependence on the hyperparameter settings, the best performance in a given situation can be obtained by tuning the hyperparameter values to achieve the desired experimental results.
Based on the hyperparameter experiments on the CIFAR10 dataset, the hyperparameters were set to the 2000-0.8-0.1 combination for the following training. Table 3 shows the training results of the five algorithms on the CIFAR10 dataset for the different neural network architectures; the cases where the GSL algorithm performs best are denoted in bold font.
The experimental results in Table 3 show that the GSL algorithm performs better when trained on the MobileNetV2 neural network with the CIFAR10 dataset, and its accuracy outperforms the remaining four algorithms: it improves accuracy by 5.93% compared with the Adam algorithm and by 4.57% compared with the NAdam algorithm, which is itself an improvement of Adam. When trained on the ShuffleNetV2 neural network, the accuracy of the GSL algorithm improved by 5.46% compared with the Adam algorithm and by 4.74% compared with the NAdam algorithm. The GSL algorithm performs well across different neural network training tasks, which also reflects its good versatility.
Table 4 shows the training results of the five algorithms on the Gastroenterology dataset. The experimental results of different algorithms and different neural network architectures on the Gastroenterology dataset are shown in Table 4, and the cases where the GSL algorithm performs well in the table are denoted by bold font.
Figure 4 and Figure 5 show the training results of the five algorithms for different neural network models on the Gastroenterology dataset, respectively.
The results in Figure 4 show that on the Gastroenterology dataset, the GSL algorithm is slightly more accurate than the other four algorithms as the number of training iterations increases, and it converges faster than the other algorithms. The results in Figure 5 show that although the accuracy of the GSL algorithm is slightly lower than that of the Adam algorithm at the beginning of training, the GSL algorithm outperforms the remaining four algorithms as training progresses. This advantage can be attributed to the gradient clipping strategy employed by the GSL optimizer. During training, the GSL algorithm restricts the size of the total gradient norm through the clipping coefficient to keep gradient updates smooth. By adjusting the magnitude of the parameter gradients, the GSL algorithm effectively avoids drastic fluctuations in the gradient, thus improving the stability of model training.
Table 5 shows the training results of the five algorithms on the Glaucoma dataset. The experimental results of different algorithms and different neural network architectures on the Glaucoma dataset are shown in Table 5, and the cases where the GSL algorithm performs well in the table are denoted by bold font.
Figure 6 and Figure 7 show the training results of the five algorithms for different neural network models on the Glaucoma dataset, respectively.
According to the results in Figure 6, the GSL algorithm performs better throughout the training process as the number of training rounds increases on the MobileNetV2 network, improving accuracy by 1.93% and 1.98% compared with the Adam algorithm and the NAdam algorithm (itself an Adam-based improvement), respectively. The results in Figure 7 show that on the ShuffleNetV2 network the GSL algorithm improves accuracy by 1.22% compared with the Adam algorithm. This advantage can be attributed to the periodic learning rate strategy: the learning rate determines the step size of each parameter update and is continuously varied during training, while linear interpolation adjusts the proportions of the weighted combination of the current update value and the previous parameter value to reduce drastic parameter fluctuations.

4. Discussion

Although the research shows significant progress in the performance of the GSL algorithm on these datasets and neural networks, there are also some limitations. One limitation is that when parameters have different scales and sensitivities, further research is needed on how to adapt the gradient clipping strategy to achieve optimal performance. In addition, the performance of the GSL algorithm may be affected by the characteristics of the dataset or neural network, and additional hyperparameter tuning may be required to achieve optimal performance.

5. Conclusions

To address the problems and deficiencies of the Adam optimization algorithm, a new optimization algorithm, the GSL algorithm, is proposed in this study. Experimental validation on different public datasets and neural networks confirms that the GSL algorithm performs better than traditional optimization methods. The GSL algorithm provides a more flexible and efficient training approach by combining gradient clipping, periodic adjustment of the learning rate, and linear interpolation to adjust the weight proportions of the parameter update values. The results show that the GSL algorithm not only improves the performance of the models on the standard test sets, but also provides a feasible path for improving the performance and stability of deep learning models in practical applications, such as helping doctors to diagnose whether a patient's intestines are diseased and the type of disease, and to identify glaucomatous lesions efficiently, thus helping doctors to recognize whether a patient shows glaucomatous features. In the future, we will investigate how the GSL algorithm can achieve optimal results on different datasets and models.

Author Contributions

Conceptualization, methodology, and writing—original draft preparation, W.Z.; software, project administration, and resources, Y.S.; data curation, J.Y.; writing—review and editing, supervision, and formal analysis, H.S.; supervision, Q.Z., L.X. and L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

1. Liaoning Provincial Department of Education Basic Research Project for Higher Education Institutions (General Project), Shenyang University of Technology, Research on Optimization Design of Wind Turbine Cone Angle Based on Fluid Physics Method (LJKZ0159). 2. Basic Research Project of Liaoning Provincial Department of Education "Training and Application of Multimodal Deep Neural Network Models for Vertical Fields", Project Number: JYTMS20231160. 3. Research on the Construction of a New Artificial Intelligence Technology and High-Quality Education Service Supply System in the 14th Five-Year Plan for Education Science in Liaoning Province, 2023–2025, Project Number: JG22DB488. 4. "Chun hui Plan" of the Ministry of Education, Research on Optimization Model and Algorithm for Microgrid Energy Scheduling Based on Biological Behavior, Project No. 202200209. 5. Shenyang Science and Technology Plan "Special Mission for Leech Breeding and Traditional Chinese Medicine Planting in Dengshibao Town, Faku County", Project No. 22-319-2-26.

Data Availability Statement

Optimizer and basic program code: zhou0618/GSL (github.com) (accessed on 18 Jun 2024); CIFAR10 dataset: https://www.kaggle.com/datasets/gazu468/cifar10-classification-image (accessed on 6 May 2024); and Gastroenterology dataset: https://doi.org/10.1038/s41597-020-00622-y (accessed on 6 May 2024); Glaucoma dataset: https://www.kaggle.com/datasets/sabari50312/fundus-pytorch (accessed on 6 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The list of abbreviations and symbols is shown below.
GSL algorithm: Gradient Sine Linear algorithm
lr_period: cycle length of the learning rate change
lr_factor: amplitude of the learning rate change
interp_factor: interpolation factor
max_grad_norm: maximum gradient norm
clip_coef: gradient clipping coefficient

References

  1. Ouyang, L.Y.; He, T.; Xing, Y.Q. Progress of retinal neurovascular unit injury in glaucoma. Int. J. Ophthalmol. 2024, 24, 230–235. [Google Scholar]
  2. Song, H.; Nguyen, A.D.; Gong, M.; Lee, S. A review of computer vision methods for purpose on computer-aided diagnosis. J. International Soc. Simul. Surg. 2016, 3, 1–8. [Google Scholar] [CrossRef]
  3. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar] [CrossRef]
  4. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
  5. Dozat, T. Incorporating Nesterov Momentum into Adam. 2016. Available online: https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ (accessed on 6 April 2024).
  6. Chen, A.C. Exploring the Optimized Value of Each Hyperparameter in Various Gradient Descent Algorithms. arXiv 2022, arXiv:2212.12279. [Google Scholar] [CrossRef]
  7. Ding, K.; Xiao, N.; Toh, K.-C. Adam-family Methods with Decoupled Weight Decay in Deep Learning. arXiv 2023, arXiv:2310.08858. [Google Scholar] [CrossRef]
  8. Liu, M.; Zhang, W.; Orabona, F.; Yang, T. Adam+: A Stochastic Method with Adaptive Variance Reduction. arXiv 2020, arXiv:2011.11985. [Google Scholar] [CrossRef]
  9. Yuan, W.; Gao, K.-X. EAdam Optimizer: How ε Impact Adam. arXiv 2020, arXiv:2011.02150. [Google Scholar] [CrossRef]
  10. Xia, L.; Massei, S. AdamL: A fast adaptive gradient method incorporating loss function. arXiv 2023, arXiv:2312.15295. [Google Scholar] [CrossRef]
  11. Tian, R.; Parikh, A.P. Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale. arXiv 2022, arXiv:2210.11693. [Google Scholar] [CrossRef]
  12. Heo, B.; Chun, S.; Oh, S.J.; Han, D.; Yun, S.; Kim, G.; Uh, Y.; Ha, J.-W. AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights. arXiv 2021, arXiv:2006.08217. [Google Scholar] [CrossRef]
  13. Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Han, J. On the variance of the adaptive learning rate and beyond. arXiv 2019, arXiv:1908.03265. [Google Scholar]
  14. Shao, Y.; Yang, J.; Zhou, W.; Xing, L.; Zhao, Q.; Zhang, L. An Improvement of Adam Based on a Cyclic Exponential Decay Learning Rate and Gradient Norm Constraints. Electronics 2024, 13, 1778. [Google Scholar] [CrossRef]
  15. Liu, H.; Tian, X. An Adaptive Gradient Method with Energy and Momentum. Ann. Appl. Math. 2022, 38, 183–222. [Google Scholar] [CrossRef]
  16. Hotegni, S.S.; Berkemeier, M.; Peitz, S. Multi-Objective Optimization for Sparse Deep Multi-Task Learning. arXiv 2024, arXiv:2308.12243. [Google Scholar] [CrossRef]
  17. Yun, J. StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling. arXiv 2024, arXiv:2310.17042. [Google Scholar] [CrossRef]
  18. Zhang, C.; Shao, Y.; Sun, H.; Xing, L.; Zhao, Q.; Zhang, L. The WuC-Adam algorithm based on joint improvement of Warmup and cosine annealing algorithms. Math. Biosci. Eng. MBE 2024, 21, 1270–1285. [Google Scholar] [CrossRef] [PubMed]
  19. Shao, Y.; Wang, J.; Sun, H.; Yu, H.; Xing, L.; Zhao, Q.; Zhang, L. An Improved BGE-Adam Optimization Algorithm Based on Entropy Weighting and Adaptive Gradient Strategy. Symmetry 2024, 16, 623. [Google Scholar] [CrossRef]
  20. Sun, H.; Zhou, W.; Shao, Y.; Cui, J.; Xing, L.; Zhao, Q.; Zhang, L. A Linear Interpolation and Curvature-Controlled Gradient Optimization Strategy Based on Adam. Algorithms 2024, 17, 185. [Google Scholar] [CrossRef]
  21. Borgli, H.; Thambawita, V.; Smedsrud, P.H.; Hicks, S.; Jha, D.; Eskeland, S.L.; Randel, K.R.; Pogorelov, K.; Lux, M.; Nguyen, D.T.D.; et al. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Sci. Data 2020, 7, 283. [Google Scholar] [CrossRef]
  22. Kiefer, R.; Abid, M.; Steen, J.; Ardali, M.R.; Amjadian, E. A Catalog of Public Glaucoma Datasets for Machine Learning Applications: A detailed description and analysis of public glaucoma datasets available to machine learning engineers tackling glaucoma-related problems using retinal fundus images and OCT images. In Proceedings of the 2023 7th International Conference on Information System and Data Mining, Atlanta, CA, USA, 10–12 May 2023. [Google Scholar]
  23. Kiefer, R.; Abid, M.; Ardali, M.R.; Steen, J.; Amjadian, E. Automated Fundus Image Standardization Using a Dynamic Global Foreground Threshold Algorithm. In Proceedings of the 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 27–29 July 2023; IEEE: New York, NY, USA, 2023; pp. 460–465. [Google Scholar]
  24. Kiefer, R.; Steen, J.; Abid, M.; Ardali, M.R.; Amjadian, E. A Survey of Glaucoma Detection Algorithms using Fundus and OCT Images. In Proceedings of the 2022 IEEE 13th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Virtual, 12–15 October 2022; IEEE: New York, NY, USA, 2022; pp. 0191–0196. [Google Scholar]
  25. Sun, H.; Cai, Y.; Tao, R.; Shao, Y.; Xing, L.; Zhang, C.; Zhao, Q. An Improved Reacceleration Optimization Algorithm Based on the Momentum Method for Image Recognition. Mathematics 2024, 12, 1759. [Google Scholar] [CrossRef]
Figure 1. Experimental training plot for the GSL algorithm Glaucoma dataset.
Figure 2. Experimental procedure flowchart.
Figure 3. Gastroenterology dataset image preprocessing process.
Figure 4. Performance comparison of different algorithms on MobileNetV2 for the Gastroenterology dataset on the validation set: (a) accuracy comparison; (b) loss value comparison.
Figure 5. Performance comparison of different algorithms on ShuffleNetV2 for the Gastroenterology dataset on the validation set: (a) accuracy comparison; (b) loss value comparison.
Figure 6. Performance comparison of different algorithms on MobileNetV2 for the Glaucoma dataset on the validation set: (a) accuracy comparison; (b) loss value comparison.
Figure 7. Performance comparison of different algorithms on ShuffleNetV2 for the Glaucoma dataset on the validation set: (a) accuracy comparison; (b) loss value comparison.
Table 1. Data volume of the experimental dataset and dataset partitioning.

Dataset | Number of Samples | Training Set | Test Set | Validation Set | Classes
CIFAR10 | 60,000 | 45,000 | 5000 | 10,000 | 10
Gastroenterology | 1885 | 900 | 485 | 500 | 8
Glaucoma | 8621 | 5000 | 1500 | 2121 | 2
Table 2. Experimental results with different hyperparameters and different neural network architectures on the CIFAR10 dataset.

Hyperparameter 1 | MobileNetV2 Accuracy | MobileNetV2 Loss | ShuffleNetV2 Accuracy | ShuffleNetV2 Loss
1500, 0.5, 0.2 | 0.6986 | 0.9903 | 0.6533 | 1.4250
1500, 0.8, 0.2 | 0.7111 | 0.9481 | 0.6775 | 0.9875
2000, 0.8, 0.1 | 0.7184 | 0.8641 | 0.6902 | 1.0810

1 The hyperparameters are listed in the order lr_period, lr_factor, interp_factor; for example, 1500, 0.5, 0.2 means that lr_period is set to 1500, lr_factor to 0.5, and interp_factor to 0.2.
Table 3. Experimental results of different algorithms and neural network architectures on the CIFAR10 dataset on the test set.

Algorithm | MobileNetV2 Accuracy | MobileNetV2 Loss | ShuffleNetV2 Accuracy | ShuffleNetV2 Loss
SGD | 0.4832 | 1.4660 | 0.4623 | 1.5750
Adagrad | 0.2987 | 1.9500 | 0.3398 | 1.9090
Adam | 0.6849 | 1.2200 | 0.6419 | 1.4320
NAdam | 0.6985 | 1.6850 | 0.6494 | 1.5040
GSL | 0.7442 | 0.9232 | 0.6965 | 1.2350
Table 4. Experimental results of different algorithms and different neural network architectures on the Gastroenterology dataset on the test set.

Algorithm | MobileNetV2 Accuracy | MobileNetV2 Loss | ShuffleNetV2 Accuracy | ShuffleNetV2 Loss
SGD | 0.6000 | 1.0800 | 0.5560 | 2.0190
Adagrad | 0.6740 | 1.0120 | 0.6620 | 1.0230
Adam | 0.6980 | 1.0460 | 0.7040 | 0.9261
NAdam | 0.7380 | 0.9189 | 0.7340 | 1.0400
GSL | 0.7800 | 1.0300 | 0.7520 | 1.0410
Table 5. Experimental results of different algorithms and different neural network architectures on the Glaucoma dataset on the test set.

Algorithm | MobileNetV2 Accuracy | MobileNetV2 Loss | ShuffleNetV2 Accuracy | ShuffleNetV2 Loss
SGD | 0.7841 | 0.4751 | 0.6209 | 0.6562
Adagrad | 0.7765 | 0.7450 | 0.7529 | 0.7160
Adam | 0.8152 | 0.4510 | 0.8190 | 0.6830
NAdam | 0.8147 | 0.4713 | 0.8223 | 0.8146
GSL | 0.8345 | 0.7161 | 0.8312 | 0.5959
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
