Article

A Gastrointestinal Image Classification Method Based on Improved Adam Algorithm

1 School of Intelligent Science & Engineering, Shenyang University, Shenyang 110044, China
2 School of Information Engineering, Shenyang University, Shenyang 110044, China
3 School of Chemistry and Chemical Engineering, University of Surrey, Surrey GU2 7XH, UK
4 School of Science, Shenyang University of Technology, Shenyang 110044, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(16), 2452; https://doi.org/10.3390/math12162452
Submission received: 8 July 2024 / Revised: 1 August 2024 / Accepted: 5 August 2024 / Published: 7 August 2024

Abstract: In this study, a gastrointestinal image classification method based on the improved Adam algorithm is proposed. Gastrointestinal image classification is of great significance in the field of medical image analysis, but it presents numerous challenges, including slow convergence, susceptibility to local minima, and the complexity and imbalance of medical image data. Although the Adam algorithm is widely used in stochastic gradient descent, it tends to suffer from overfitting and gradient explosion issues when dealing with complex data. To address these problems, this paper proposes an improved Adam algorithm, AdamW_AGC, which combines the weight decay and Adaptive Gradient Clipping (AGC) strategies. Weight decay is a common regularization technique used to prevent machine learning models from overfitting. Adaptive gradient clipping avoids the gradient explosion problem by restricting the gradient to a suitable range and helps accelerate the convergence of the optimization process. In order to verify the effectiveness of the proposed algorithm, we conducted experiments on the HyperKvasir dataset and validation experiments on the MNIST and CIFAR10 standard datasets. Experimental results on the HyperKvasir dataset demonstrate that the improved algorithm achieved a classification accuracy of 75.8%, compared to 74.2% for the traditional Adam algorithm, an improvement of 1.6 percentage points. Furthermore, validation experiments on the MNIST and CIFAR10 datasets resulted in classification accuracies of 98.69% and 71.7%, respectively. These results indicate that the AdamW_AGC algorithm has advantages in handling complex, high-dimensional medical image classification tasks, effectively improving both classification accuracy and training stability. This study provides new ideas and directions for future optimizer research.

1. Introduction

Gastrointestinal diseases, such as gastric ulcers, colon polyps, and gastrointestinal tumors, are common digestive disorders whose timely diagnosis and treatment are of great value in clinical practice. Gastrointestinal endoscopy is a commonly used clinical diagnostic tool that allows doctors to observe lesions by taking images of the inside of the gastrointestinal tract. However, since analyzing and classifying large numbers of gastrointestinal images is a laborious task for doctors, it is important to automate the classification of gastrointestinal images using computer-aided diagnostic techniques.
In recent years, with the rapid development of deep learning technology, deep learning-based gastrointestinal image classification [1] methods have gradually attracted the attention of researchers. Feature extraction and classification of gastrointestinal images can be performed effectively using deep learning models to achieve automated disease detection and diagnosis. However, traditional optimization algorithms suffer from problems such as slow convergence and a tendency to fall into local extrema when dealing with large-scale data, which limit their performance. Therefore, this paper proposes a gastrointestinal image classification method based on an improved Adam algorithm, aiming to improve the accuracy and stability of gastrointestinal image classification by improving the convergence speed and performance of the Adam algorithm. In this paper, the Adam algorithm is improved by combining the weight decay and adaptive gradient clipping strategies to better suit the gastrointestinal image classification task, and the proposed method is experimentally verified and analyzed.
The Adam (Adaptive Moment Estimation) algorithm [2] is an effective stochastic gradient descent method that largely reduces fluctuations in training by adaptively adjusting the learning rate, and it has the advantages of high efficiency and fast learning. Although the Adam algorithm has been successful in several fields, few studies have sought to improve its performance on gastrointestinal image classification tasks, and the algorithm itself has some shortcomings. In recent years, researchers have proposed various strategies to improve the Adam algorithm. Reddi et al. [3] identified shortcomings in the convergence of the Adam algorithm and proposed a variant called AMSGrad, which solves the problem of model non-convergence by enhancing the second-order moment iteration and preventing the learning rate from oscillating. The AdaFamily algorithms proposed by Fassold [4] form a family of Adam-like methods, parameterized by a hyperparameter µ in the range [0, 1], whose variants can be regarded as hybrids of the AdamW [5], AdaBelief [6], and AdaMomentum algorithms, designed to improve image classification accuracy. Chen et al. [7] proposed a global optimization method with adaptive momentum estimation (Adam-CBO), aiming to improve the success rate and accuracy of the global minimum search for non-convex objective functions while reducing the cost. Xie et al. [8] proposed Adan, a faster version of the Adam algorithm that employs a new Nesterov momentum estimation method to estimate the first- and second-order moments of the gradient, as in the Adam adaptive gradient algorithm, thereby improving convergence speed. HN_Adam, proposed by Reyad et al. [9], improves the Adam algorithm by automatically adjusting the parameter update step size during training and combining the standard Adam algorithm with the AMSGrad algorithm in a hybrid mechanism; this gives HN_Adam better generalization on large-scale datasets and allows it to outperform several comparative algorithms in both accuracy and convergence speed. AdaPlus, proposed by Guan [10], improves optimization efficiency by integrating Nesterov momentum and precise step-size adjustment on top of AdamW; it combines the advantages of AdamW, Nadam [11], and AdaBelief and demonstrates excellent performance and high stability in a variety of machine learning [12] tasks, especially image classification, language modeling, and the training of GANs [13], achieving results of 90.55%, 86.22%, and 63.7%, respectively. Zhang et al. [14] proposed the WuC-Adam algorithm, which addresses learning rate instability, local optima, and overfitting by integrating Warm Up [15] and dynamic cosine annealing techniques. The ACGB-Adam algorithm proposed by Liu et al. [16] improves optimization speed and accuracy and reduces computational cost by adjusting the gradient through adaptive coefficients and combining composite gradients with stochastic block coordinate descent. Yun [17] proposed the StochGradAdam algorithm, which improves the traditional Adam algorithm by reducing noise effects through gradient sampling, effectively improving training stability and model performance in image processing tasks. Malviya et al. [18] proposed a memory-augmented version of the Adam algorithm that promotes exploration towards flatter minima by maintaining a buffer of critical momentum terms during training, overcoming the tendency of traditional adaptive methods to select sharp minima that harm generalization. Although these improved Adam algorithms have achieved good results in many respects, difficulties remain, such as sensitivity to imbalanced data and a tendency to fall into local extrema, and there is still room for improvement when they are applied to medical images.
In this context, we propose an improved optimization algorithm for the gastrointestinal image classification task, which adds the weight decay and Adaptive Gradient Clipping (AGC) techniques [19] to the Adam algorithm and combines the advantages of AdamW and AGC to improve the generalization ability and training stability of the model. This improved algorithm overcomes some limitations of the traditional Adam algorithm. Because it lacks effective regularization, the traditional Adam algorithm is prone to overfitting [20] during training; by introducing a weight decay strategy, the method proposed in this paper alleviates this overfitting. In addition, the traditional Adam algorithm, as well as its combination with weight decay, may encounter gradient explosion or gradient vanishing when dealing with complex data. To further improve the stability and performance of the model, we introduce the AGC technique, which improves the stability of the training process by dynamically adjusting the gradient clipping threshold. Combining these two techniques, our AdamW_AGC algorithm provides higher accuracy and a more stable training process when dealing with complex medical image classification tasks.
The innovations proposed in this study are as follows:
(1) We propose an improved Adam algorithm that combines the weight decay and Adaptive Gradient Clipping (AGC) strategies and experimentally verify its performance on several image classification datasets. The experiments demonstrate that the improved Adam algorithm proposed in this paper has higher accuracy.
(2) The traditional Adam algorithm is prone to overfitting and gradient explosion problems when dealing with complex data. We solve these common problems by introducing the weight decay and AGC strategies, which effectively improve the generalization ability and training stability of the model.
(3) Although the Adam algorithm has been successful in several fields, relatively little research has been carried out on its application to medical image classification. For the first time, we apply the improved Adam algorithm to gastrointestinal image classification and experimentally validate its effectiveness, demonstrating its advantages in this specific application area.
(4) The improved optimization algorithm effectively copes with the high dimensionality and sample imbalance problems in the classification of complex medical images. This is important for improving the accuracy and efficiency of the automatic diagnosis of gastrointestinal diseases.

2. Design of AdamW_AGC Algorithm

2.1. Adam Algorithm

The Adam (Adaptive Moment Estimation) algorithm is an efficient adaptive optimization algorithm, one of the gradient descent algorithms used to optimize neural network training. It combines the Momentum and RMSprop methods and dynamically adjusts the learning rate of each parameter by computing first-order and second-order moment estimates of the gradient, achieving more efficient training. The Momentum method accelerates the convergence of gradient descent by taking into account both the direction of the previous update and the direction of the current gradient at each parameter update. The RMSprop method addresses the learning rate selection problem by normalizing each parameter's learning rate with an exponentially decaying average of the squared gradient, so that the step size of the parameter update is independent of the magnitude of the gradient and the problem of an inappropriate learning rate caused by overly large or small gradients is avoided. By combining the benefits of Momentum and RMSprop, the Adam algorithm is particularly effective for problems with large amounts of data and high parameter dimensions. The moment estimates of the gradient used in the parameter update are given in Equations (1) and (2):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,\nabla\theta \tag{1}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,(\nabla\theta)^2 \tag{2}$$
In Equations (1) and (2), $t$ is the time step, $m_t$ and $v_t$ are the first-order and second-order moment estimates, respectively, the parameters $\beta_1, \beta_2 \in [0, 1)$ are the exponential decay rates controlling the first-order and second-order moments, and $\nabla\theta$ is the gradient of the loss with respect to $\theta$. The moment estimates are biased towards 0 during the initial time steps, especially when the decay rates are small.
During training, the exponential moving averages of the gradient and the squared gradient are affected by their zero initialization. To eliminate this initialization bias, the estimates are corrected so that they remain accurate and stable throughout training. The bias-corrected expressions for the first and second moment estimates are given in Equations (3) and (4), respectively:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{3}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{4}$$
In Equations (3) and (4), $\hat{m}_t$ and $\hat{v}_t$ represent the bias-corrected first moment estimate and the bias-corrected second moment estimate, respectively. The Adam parameter update is shown in Equation (5):
$$\theta = \theta - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon}\,\hat{m}_t \tag{5}$$
In Equation (5), $\theta$ is the parameter vector, $\eta$ is the learning rate, and $\varepsilon$ is a sufficiently small value greater than zero.
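To make Equations (1)–(5) concrete, the following minimal NumPy sketch performs Adam updates for a single parameter vector on a toy quadratic loss. The function name, the toy loss, and the hyperparameter defaults (η = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8) are illustrative assumptions rather than values prescribed by this paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following Equations (1)-(5)."""
    m = beta1 * m + (1 - beta1) * grad                     # Eq. (1): first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2                # Eq. (2): second-moment estimate
    m_hat = m / (1 - beta1 ** t)                           # Eq. (3): bias correction
    v_hat = v / (1 - beta2 ** t)                           # Eq. (4): bias correction
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)   # Eq. (5): parameter update
    return theta, m, v

# Illustrative usage on the toy loss L(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = np.array([1.0, -2.0, 3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    grad = theta                                           # gradient of the toy loss
    theta, m, v = adam_step(theta, grad, m, v, t)
```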

2.2. Weight Decay

Weight decay, also known as $L_2$ regularization, is a common technique used to reduce overfitting. In neural network training, weight decay penalizes large weight values by adding the squared $L_2$ norm of the weights to the loss function, thus encouraging the model to learn a simpler distribution of weights and improve generalization. Overfitting occurs when a model learns the noise in the training data rather than the actual signal, resulting in poor performance on unseen data. By penalizing large weights, weight decay helps to constrain model complexity, ensuring that the model captures the underlying data pattern rather than fitting the noise. Weight decay can be implemented by adding a regularization term to the original loss function $L(\theta)$ to obtain a new loss function $L'(\theta)$, as shown in Equation (6):
$$L'(\theta) = L(\theta) + \frac{\lambda}{2}\,\|\theta\|^2 \tag{6}$$
In Equation (6), $L(\theta)$ is the original loss function, which measures the difference between the model prediction and the actual value; $\theta$ denotes the model parameters, including the weights and biases; $\|\theta\|^2$ denotes the squared $L_2$ norm of the model parameters; and $\lambda$ is a positive hyperparameter that controls the strength of the weight decay. Larger $\lambda$ values result in stronger regularization but can lead to underfitting if $\lambda$ is too large. The regularization term $\frac{\lambda}{2}\|\theta\|^2$ added to the loss function $L(\theta)$ serves as a penalty on large weights. This term keeps the model parameters $\theta$ small, which in turn reduces model complexity. The parameter $\lambda$ controls the extent of this penalty: a larger $\lambda$ imposes a heavier penalty on the magnitude of the weights, promoting simpler models that are less likely to overfit the training data, whereas a smaller $\lambda$ results in a lighter penalty, allowing the model to learn more complex patterns at an increased risk of overfitting.
In the Adam algorithm, weight decay is tightly integrated with the parameter update process. The Adam algorithm optimizes the model by adaptively adjusting the learning rate of each parameter. At each parameter update, in addition to adjusting the parameters according to the gradient, an additional weight decay term is introduced to ensure that the weights are not too large. This combination ensures that the model remains fast converging and stable during training while avoiding overfitting problems.
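As a brief illustration of the difference between the $L_2$ penalty of Equation (6) and the decoupled weight decay step used by AdamW (and by the algorithm in Section 2.4), the sketch below contrasts the two for a single parameter vector. The gradient values and hyperparameters are illustrative assumptions.

```python
import numpy as np

lam, eta = 0.01, 1e-3                # assumed weight decay and learning rate
theta = np.array([0.5, -1.2, 2.0])   # current parameters
grad = np.array([0.1, -0.3, 0.2])    # gradient of the original loss L(theta), assumed given

# (a) L2 regularization as in Equation (6): the penalty's gradient lambda * theta
#     is folded into the loss gradient before the optimizer sees it.
grad_with_l2 = grad + lam * theta

# (b) Decoupled weight decay (AdamW style): the parameters are shrunk directly,
#     independently of the adaptive gradient step, as in Equation (9) later on.
theta_decayed = theta * (1 - eta * lam)
```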

2.3. Adaptive Gradient Clipping

In deep learning, gradient explosion and gradient vanishing are two common problems. Gradient explosion leads to excessively large weight updates, making the model unstable or even causing training to fail. The traditional gradient clipping method limits the size of the gradient with a fixed threshold, but this threshold must be re-tuned for different models and tasks, which is not flexible enough. In contrast, the Adaptive Gradient Clipping (AGC) method dynamically adjusts the clipping threshold of the gradient according to the magnitude of the parameter itself, which improves the stability and training effect of the model. For each parameter, the ratio of the norm of the gradient to the norm of the parameter is calculated, and a clipping threshold is set dynamically for each parameter based on this ratio; the threshold therefore adapts to the state of the parameter. If the ratio of the gradient norm to the weight norm exceeds the set threshold, the gradient is scaled proportionally so that the ratio no longer exceeds the threshold. Adaptive gradient clipping is thus an effective gradient control method that improves the training stability and performance of deep learning models, and its adaptive nature makes model training more flexible and robust. When the gradient norm exceeds the clipping threshold, the gradient is scaled back to within the threshold, while the gradient is kept unchanged when its norm is within the threshold, as shown in Equations (7) and (8):
$$\tau' = \min\!\left(\frac{\tau\,\|\theta\|}{\|\nabla\theta\| + \varepsilon},\, 1\right) \tag{7}$$
$$\nabla\theta = \tau'\,\nabla\theta \tag{8}$$
In Equations (7) and (8), $\tau$ is the clipping threshold of AGC, $\tau'$ is the resulting scaling factor, $\|\nabla\theta\|$ is the $L_2$ norm of the original gradient, $\|\theta\|$ is the $L_2$ norm of the parameter, and $\varepsilon$ is a sufficiently small value greater than zero. This fine adjustment of the gradient ensures the stability and efficiency of the training process, prevents the gradient explosion or vanishing problem, and is a key stabilizing control mechanism in the algorithm.
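A minimal sketch of the clipping rule in Equations (7) and (8) for a single parameter tensor is given below; the default threshold and ε values are illustrative assumptions.

```python
import numpy as np

def agc_clip(theta, grad, tau=0.01, eps=1e-6):
    """Scale grad so that its norm does not exceed tau * ||theta||, following Eqs. (7)-(8)."""
    scale = min(tau * np.linalg.norm(theta) / (np.linalg.norm(grad) + eps), 1.0)  # Eq. (7)
    return scale * grad                                                           # Eq. (8)

# If ||grad|| <= tau * ||theta||, scale is 1 and the gradient is left unchanged.
```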

2.4. AdamW_AGC Algorithm

In the standard Adam optimization algorithm, weight decay is added directly to the gradient, which sometimes conflicts with the gradient direction and affects the tuning of the learning rate. In contrast, the AdamW algorithm treats weight decay as a separate step and regularizes by adjusting the parameter values rather than modifying the gradient. This approach provides better training results in many cases, but stability remains an issue. To address this, the AdamW_AGC algorithm is proposed, which combines weight decay and adaptive gradient clipping to improve the stability and efficiency of the model. In the AdamW_AGC algorithm, the initialization phase begins with setting the initial time step and initializing the first- and second-order momentum parameters, $m$ and $v$. An existence test is performed for each parameter gradient of the model; if the gradient exists, processing continues, otherwise the parameter is not updated. In addition, the algorithm explicitly does not support sparse gradients: encountering a sparse gradient causes the algorithm to report an error and terminate. The model parameters are then regularized by a decoupled weight decay step applied directly to the parameters, which reduces the complexity of the model and enhances its generalization ability, as shown in Equation (9):
$$\theta = \theta\,(1 - \eta\lambda) \tag{9}$$
In Equation (9), $\lambda$ is the weight decay value and $\eta$ is the learning rate. Adaptive gradient clipping is performed before the parameter update, and the clipping threshold is dynamically adjusted based on the gradient and parameter norms to keep the gradient within an appropriate range.
The AdamW_AGC algorithm improves the stability and efficiency of training by combining weight decay and adaptive gradient clipping techniques. The specific steps of this algorithm are shown in Algorithm 1.
Algorithm 1: AdamW_AGC Algorithm
1. Inputs: model parameters $\theta$, learning rate $\eta$, exponential decay rates for the moment estimates $(\beta_1, \beta_2)$, numerical-stability term $\varepsilon$, weight decay $\lambda$, and clipping threshold $\tau$
2. Output: the optimized parameters $\theta$
3. Initialize $t \leftarrow 0$
4. $t \leftarrow t + 1$
5. for each parameter $\theta$ in params do
6.   if $\nabla\theta$ (the gradient of $\theta$) exists then
7.     if $\nabla\theta$ is sparse then
8.       raise an error: "Sparse gradients are not supported."
9.     end if
10.    $\theta \leftarrow \theta\,(1 - \eta\lambda)$
11.    $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\,\nabla\theta$
12.    $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\,(\nabla\theta)^2$
13.    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$
14.    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$
15.    $\tau' \leftarrow \min(\tau\,\|\theta\| / (\|\nabla\theta\| + \varepsilon),\, 1)$
16.    $\nabla\theta \leftarrow \tau'\,\nabla\theta$
17.    $\theta \leftarrow \theta - \eta\,\hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)$
18.  end if
19. end for
20. return $\theta$
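The PyTorch-style sketch below mirrors Algorithm 1 for a list of parameter tensors. It is a simplified illustration rather than the authors' released implementation: the hyperparameter defaults are assumptions, and the AGC scaling of steps 15–16 is applied here before the moment updates so that the clipped gradient is the one that enters the update.

```python
import torch

@torch.no_grad()
def adamw_agc_step(params, state, t, eta=1e-3, betas=(0.9, 0.999),
                   eps=1e-8, weight_decay=0.01, clip_threshold=0.01):
    """One AdamW_AGC update over a list of parameter tensors (sketch of Algorithm 1)."""
    beta1, beta2 = betas
    for p in params:
        if p.grad is None:                  # step 6: skip parameters without a gradient
            continue
        if p.grad.is_sparse:                # steps 7-9: sparse gradients are not supported
            raise RuntimeError("Sparse gradients are not supported.")
        g = p.grad

        # Step 10: decoupled weight decay, Eq. (9)
        p.mul_(1 - eta * weight_decay)

        # Steps 15-16: adaptive gradient clipping, Eqs. (7)-(8),
        # applied before the moment updates so the clipped gradient is used below.
        scale = torch.clamp(clip_threshold * p.norm() / (g.norm() + eps), max=1.0)
        g = g * scale

        # Steps 11-14: moment estimates with bias correction, Eqs. (1)-(4)
        if p not in state:
            state[p] = {"m": torch.zeros_like(p), "v": torch.zeros_like(p)}
        m, v = state[p]["m"], state[p]["v"]
        m.mul_(beta1).add_(g, alpha=1 - beta1)
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        # Step 17: parameter update
        p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-eta)
```

In practice, such a step function would be wrapped in a torch.optim.Optimizer subclass (holding the state dictionary and the step counter t) so that it can be dropped into a standard training loop.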

3. Experiments and Analysis

3.1. Configuration of Experimental Environment

The experiments use the popular deep learning framework PyTorch, version 2.0.1, to train the classification models. Python version 3.10 was used, and the lightning-hydra-template framework was run in PyCharm version 2023.3.4. Comparative experiments were conducted using the AdamW_AGC algorithm and other existing optimization algorithms. The evaluation criteria chosen for the experiments are classification accuracy (Acc) and loss value (Loss), which assess the accuracy of each optimization algorithm in image classification; accuracy measures the model's ability to correctly classify an image. After each training iteration, the evaluation metrics are calculated and compared, and the saved best model is updated accordingly. The final result is recorded after all iterations are completed.
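The sketch below illustrates the evaluation and best-model bookkeeping described above; the loader variables, training routine, and checkpoint path are illustrative assumptions.

```python
import torch

def evaluate(model, loader, device="cpu"):
    """Compute classification accuracy (Acc) and mean loss on a data loader."""
    model.eval()
    criterion = torch.nn.CrossEntropyLoss()
    correct, total, loss_sum = 0, 0, 0.0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            loss_sum += criterion(logits, labels).item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return correct / total, loss_sum / total

# Illustrative best-checkpoint tracking across epochs (routine names and path assumed):
# best_acc = 0.0
# for epoch in range(100):
#     train_one_epoch(model, train_loader, optimizer)
#     acc, loss = evaluate(model, val_loader)
#     if acc > best_acc:
#         best_acc = acc
#         torch.save(model.state_dict(), "best_model.pt")
```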

3.2. Experimental Datasets

The HyperKvasir dataset [21] is a medical image dataset focusing on gastrointestinal disorders. It was collected at Bærum Hospital in Norway between 2008 and 2016 and contains over 110,000 high-quality images and a large amount of video data, annotated by professional medical experts. The dataset is valuable for the automated detection of gastrointestinal diseases, classification research, and medical education. By applying machine learning and data analytics, HyperKvasir can facilitate technological development in the medical field and improve the accuracy of clinical diagnosis. In this paper, 1885 labeled images from the HyperKvasir public dataset are used for the experiments.
The MNIST dataset is a classic benchmark in machine learning, consisting of 70,000 images in 10 categories; each sample is a 28 × 28-pixel grayscale image of a handwritten digit from 0 to 9.
CIFAR-10 is a widely used image classification dataset containing 60,000 32 × 32-pixel color images divided into 10 categories of 6000 images each. These images cover common objects such as airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks, providing rich variety and a real challenge because of the visual similarities that may exist between categories. The division of the three datasets into training, testing, and validation sets is shown in Table 1.

3.3. Data Preprocessing

To ensure the quality of the data and the validity of the model, we performed the following preprocessing steps on the gastrointestinal images:
(1) Image Resizing: All images were resized to 224 × 224 pixels to match the input requirements of the deep learning models. This step ensures that all input images have the same dimensions, facilitating batch processing and model training.
(2) Data Normalization: The image data were normalized to accelerate model training and improve training stability.
(3) Data Augmentation: To enhance the diversity of the training data and mitigate overfitting, various data augmentation techniques were applied, including random horizontal flips, random resized cropping, and color jitter. Random horizontal flipping was applied with a 50% probability to increase variability. Random resized cropping involves cropping a random region of the image and resizing it to the target size, improving robustness to different scales and positions. Color jitter adjusted the brightness, contrast, saturation, and hue of the images to enhance the model’s adaptability to color variations.
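A torchvision transform pipeline consistent with the preprocessing described above is sketched below; the jitter strengths and normalization statistics are illustrative assumptions, since the paper does not report exact values.

```python
from torchvision import transforms

# Training-time preprocessing: resize/crop to 224 x 224, augment, normalize.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                     # random crop region resized to 224 x 224
    transforms.RandomHorizontalFlip(p=0.5),                # 50% probability horizontal flip
    transforms.ColorJitter(brightness=0.2, contrast=0.2,   # illustrative jitter strengths
                           saturation=0.2, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Evaluation-time preprocessing: deterministic resize and normalization only.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```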

3.4. Experimental Results and Analysis

The AdamW_AGC optimization algorithm improves the Adam algorithm by combining weight decay and adaptive gradient clipping, which effectively prevents overfitting and gradient explosion during training and thus improves the model's performance on gastrointestinal image classification. Six commonly used optimization algorithms, SGD (Stochastic Gradient Descent), Adam, Nadam, AdamW, Adagrad (Adaptive Gradient), and Adadelta, were selected for comparative experiments. The performance of the improved algorithm is assessed by comparing the evaluation metrics of each optimization algorithm on the test and validation sets of the HyperKvasir dataset. The accuracy and loss of each algorithm on the test set are shown in Table 2.
The traditional Adam algorithm is prone to overfitting and gradient explosion when dealing with complex data, while AdamW_AGC effectively addresses these problems by combining weight decay and adaptive gradient clipping, which improves training stability and model generalization. Comparing the Nadam and AdamW algorithms, Nadam combines the advantages of Nesterov momentum and the Adam algorithm, but the gradient explosion problem may still occur when dealing with high-dimensional complex data, while AdamW improves the regularization effect by treating weight decay as an independent step but lacks an effective gradient control mechanism. AdamW_AGC not only retains the regularization effect of AdamW but also improves training stability and efficiency by dynamically adjusting the gradient clipping threshold. Table 2 shows that the AdamW_AGC algorithm achieves the highest accuracy of 75.8% and the lowest loss of 0.7556. Compared to the next-best NAdam and to Adam, accuracy is improved by 1.0 and 1.6 percentage points, respectively, which shows that AdamW_AGC is superior to the other algorithms. The results in Table 2 summarize the performance of each optimization algorithm on the test set and highlight the superiority of the AdamW_AGC algorithm, which not only achieves the highest accuracy but also exhibits the lowest loss, demonstrating its superior performance and stability in complex model training. These results also emphasize the importance of choosing the right optimization algorithm for the performance of deep learning models.
The experimental configuration is as follows:
(1) Algorithms: The experiments compare six baseline optimization algorithms, SGD, Adam, Nadam, AdamW, Adagrad, and Adadelta, against AdamW_AGC (a construction sketch follows this list).
(2) Learning rate: The initial learning rate is set to 0.001.
(3) Epochs: The number of epochs is set to 100 to ensure that all algorithms are trained for the same number of iterations.
(4) Network: To test the performance of the improved algorithm on different neural networks, different types of networks were trained on the different datasets: a simple fully connected neural network on the MNIST dataset, and a MobileNetV2 lightweight neural network on the CIFAR10 dataset and the gastrointestinal tract dataset.
(5) Weight decay: The parameters were adjusted several times during the experiments, and the final value of the weight decay was set to 0.01.
(6) Comparison experiments: The different optimization algorithms were run on each dataset and the results were recorded.
(7) Experimental results: The results are analyzed to show the advantages of the AdamW_AGC algorithm in solving the problems of the standard Adam algorithm and improving generalization ability.
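The construction sketch referenced in item (1) is shown below. It is an assumption of how the baselines might be instantiated with the settings above (learning rate 0.001, weight decay 0.01); AdamW_AGC itself would use the custom update step sketched in Section 2.4, and the backbone choice follows item (4).

```python
import torch

def build_optimizer(name, params, lr=0.001, weight_decay=0.01):
    """Construct one of the baseline optimizers used in the comparison."""
    baselines = {
        "SGD": lambda: torch.optim.SGD(params, lr=lr),
        "Adam": lambda: torch.optim.Adam(params, lr=lr),
        "NAdam": lambda: torch.optim.NAdam(params, lr=lr),
        "AdamW": lambda: torch.optim.AdamW(params, lr=lr, weight_decay=weight_decay),
        "Adagrad": lambda: torch.optim.Adagrad(params, lr=lr),
        "Adadelta": lambda: torch.optim.Adadelta(params, lr=lr),
    }
    return baselines[name]()

# model = torchvision.models.mobilenet_v2(num_classes=NUM_CLASSES)  # backbone per item (4)
# optimizer = build_optimizer("AdamW", model.parameters())
```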
The performance of the AdamW_AGC algorithm on the gastrointestinal tract dataset for each optimization algorithm is shown in Figure 1.
Compared to other algorithms such as SGD, Adadelta, and Adagrad, the AdamW_AGC algorithm not only improves performance faster but also consistently maintains high accuracy, which reflects its excellent global search capabilities and efficient use of data. Moreover, the loss value is very low in the early stage of training, and as the training progresses, the loss value also stays low and fluctuates less, which indicates that the model has a very good generalization ability during the learning process.
To verify the effectiveness of the AdamW_AGC optimization algorithm proposed in the paper, the MNIST and CIFAR10 datasets, which are popular in deep learning, were selected for validation experiments, and the results are shown in Table 3 and Table 4.
Different optimization algorithms perform differently on the MNIST and CIFAR10 datasets. On the simpler MNIST dataset, most of the optimization algorithms achieve high accuracy and low loss values, and the AdamW_AGC algorithm achieves 98.69% accuracy; the differences in performance are more obvious on the more complex CIFAR10 dataset. On both datasets, AdamW_AGC shows excellent performance, with the highest accuracy and low loss values, which indicates that it is not only adaptable but also provides stable and efficient optimization on different types of problems. Its advantage over the other optimization algorithms is larger on the CIFAR10 dataset, which may be attributed to the effective gradient management performed by AGC, allowing AdamW_AGC to better avoid overfitting and gradient problems when dealing with more diverse and complex data.
In the experimental results on the MNIST dataset, the accuracy of the Adadelta algorithm improved the most at the beginning of training, which corresponded to a rapid decline in its loss. As training progresses, the loss of all algorithms gradually decreases, and AdamW_AGC converges faster than the other algorithms, which suggests that adaptive gradient clipping helps optimization; the accuracy of all algorithms gradually increases and eventually levels off. The AdamW_AGC algorithm maintains high accuracy almost from the beginning to the end of training, which shows good learning ability and strong final performance. After 100 epochs of training, AdamW_AGC achieves the best accuracy, closely followed by the NAdam algorithm. Combining the information from the graphs, we can see that the AdamW_AGC algorithm not only converges quickly but also achieves a high level of accuracy, which suggests that the algorithm has been optimized for both stability and performance. The performance of each optimization algorithm on the validation set of the MNIST dataset is shown in Figure 2.
The experimental results of the improved AdamW_AGC algorithm on the CIFAR10 dataset are shown in Figure 3.
Most optimization algorithms improve their accuracy rapidly after a few epochs and then stabilize. The accuracy of three algorithms, AdamW, Adam, and AdamW_AGC, improves faster; in particular, the accuracy of the AdamW_AGC algorithm is consistently higher than that of the other optimization algorithms and remains highly stable throughout the training process.
In order to prove the applicability of the optimization algorithm, computing time and memory usage should be considered in addition to accuracy and loss. The computation time and GPU usage required to train each algorithm on the CIFAR-10 dataset are shown in Figure 4. During the whole training process, the GPU utilization of the AdamW_AGC algorithm is lower than that of the other six algorithms, fluctuating around 20%. However, because of the additional computation AdamW_AGC performs in each iteration, its training time increases slightly compared to the other algorithms.
The experimental results fully validate the advantages of the AdamW_AGC optimization algorithm in effectively preventing the overfitting phenomenon and the gradient explosion problem by combining weight decay and adaptive gradient clipping, thus improving the performance of the model on gastrointestinal image classification. Compared with traditional optimization algorithms, AdamW_AGC shows better generalization ability and stability in dealing with highly complex medical image classification tasks, especially in gastrointestinal image classification. In summary, the introduction of the AdamW_AGC optimization algorithm provides a more efficient and reliable model training method in the field of medical image processing, which is expected to promote the progress and application of related technologies.

4. Conclusions

In this paper, we propose a gastrointestinal image classification method based on the AdamW_AGC algorithm, which combines the weight decay and adaptive gradient clipping strategies to enhance model generalization and training stability. This innovative algorithm addresses the issues of overfitting and gradient explosion commonly encountered with traditional optimization algorithms when handling complex data by introducing weight decay and adaptive gradient clipping techniques.
The experimental results show that the improved Adam algorithm is superior in image classification tasks. Compared with the traditional Adam algorithm and other improved algorithms, AdamW_AGC achieves higher accuracy and lower loss values on the gastrointestinal image classification task and demonstrates good generalization in practical applications.
Despite the positive outcomes of this research, there are still some limitations. Compared to other algorithms, Adaptive Gradient Clipping (AGC) increases the computational burden in each iteration, so the algorithm takes slightly longer. Additionally, the method primarily targets dense gradient optimization and has not fully explored the capability to handle sparse gradients, which may limit its applicability in certain sparse data scenarios.
Given the existing limitations of the research, our future work will focus on several directions: Firstly, we plan to further test and optimize the AdamW_AGC algorithm on larger and more diverse medical image datasets to comprehensively assess its performance and adaptability. Secondly, to expand the algorithm’s application scope, we will explore effective strategies for managing sparse gradients. This involves adjusting the algorithm to maintain performance while ensuring computational efficiency when dealing with sparse data. Future research can continue to deepen the exploration and improvement of various aspects of the Adam algorithm, with the aim of further improving the performance and application of the optimization algorithm in tasks such as medical image classification.

Author Contributions

Conceptualization and methodology and writing—original draft preparation, J.C.; software and project administration and resources, Y.S.; data curation, J.Y.; writing—review and editing and supervision and formal analysis, H.S.; funding acquisition, Q.Z., L.X. and L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

1. Liaoning Provincial Department of Education Basic Research Project for Higher Education Institutions (General Project), Shenyang University of Technology, Research on Optimization Design of Wind Turbine Cone Angle Based on Fluid Physics Method (LJKZ0159).
2. Basic Research Project of Liaoning Provincial Department of Education, "Training and Application of Multimodal Deep Neural Network Models for Vertical Fields", Project Number: JYTMS20231160.
3. Research on the Construction of a New Artificial Intelligence Technology and High Quality Education Service Supply System, 14th Five Year Plan for Education Science in Liaoning Province, 2023–2025, Project Number: JG22DB488.
4. "Chunhui Plan" of the Ministry of Education, Research on Optimization Model and Algorithm for Microgrid Energy Scheduling Based on Biological Behavior, Project No. 202200209.
5. Shenyang Science and Technology Plan, "Special Mission for Leech Breeding and Traditional Chinese Medicine Planting in Dengshibao Town, Faku County", Project No. 22-319-2-26.

Data Availability Statement

The standard dataset MNIST dataset was derived from the following resources available in the public domain: http://yann.lecun.com/exdb/mnist/ (accessed on 19 May 2024). The website for the standard dataset CIFAR10 dataset is https://www.kaggle.com/datasets/gazu468/cifar10-classification-image (accessed on 19 May 2024). The website for HyperKvasir datasets is https://doi.org/10.1038/s41597-020-00622-y (accessed on 19 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Gao, S.Q. A Research on Traditional Tangka Image Classification Based on Visual Features. In Proceedings of the 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 12–14 May 2023; pp. 13–16.
2. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
3. Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. arXiv 2019.
4. Fassold, H. AdaFamily: A Family of Adam-like Adaptive Gradient Methods. arXiv 2022.
5. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019.
6. Zhuang, J.; Tang, T.; Tatikonda, S.; Dvornek, N.; Duncan, J.S. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients. arXiv 2020.
7. Chen, J.; Jin, S.; Lyu, L. A Consensus-Based Global Optimization Method with Adaptive Momentum Estimation. CiCP 2022, 31, 1296–1316.
8. Xie, X.; Zhou, P.; Li, H.; Lin, Z.; Yan, S. Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models. arXiv 2023.
9. Reyad, M.; Sarhan, A.M.; Arafa, M. A Modified Adam Algorithm for Deep Neural Network Optimization. Neural Comput. Appl. 2023, 35, 17095–17112.
10. Guan, L. AdaPlus: Integrating Nesterov Momentum and Precise Stepsize Adjustment on AdamW Basis. arXiv 2023.
11. Dozat, T. Incorporating Nesterov Momentum into Adam. In Proceedings of the International Conference on Learning Representations (ICLR) 2016, San Juan, Puerto Rico, 2–4 May 2016.
12. Kumar, N.; Hashmi, A.; Gupta, M.; Kundu, A. Automatic Diagnosis of Covid-19 Related Pneumonia from CXR and CT-Scan Images. Eng. Technol. Appl. Sci. Res. 2022, 12, 7993–7997.
13. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014. Available online: http://arxiv.org/abs/1406.2661 (accessed on 16 March 2024).
14. Zhang, C.; Shao, Y.; Sun, H.; Xing, L.; Zhao, Q.; Zhang, L. The WuC-Adam Algorithm Based on Joint Improvement of Warmup and Cosine Annealing Algorithms. MBE 2023, 21, 1270–1285.
15. Shao, Y.; Zhang, C.; Xing, L.; Sun, H.; Zhao, Q.; Zhang, L. A New Dust Detection Method for Photovoltaic Panel Surface Based on Pytorch and Its Economic Benefit Analysis. Energy AI 2024, 16, 100349.
16. Liu, M.; Yao, D.; Liu, Z.; Guo, J.; Chen, J. An Improved Adam Optimization Algorithm Combining Adaptive Coefficients and Composite Gradients Based on Randomized Block Coordinate Descent. Comput. Intell. Neurosci. 2023, 2023, 4765891.
17. Yun, J. StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling. arXiv 2024.
18. Malviya, P.; Mordido, G.; Baratin, A.; Harikandeh, R.B.; Huang, J.; Lacoste-Julien, S.; Pascanu, R.; Chandar, S. Promoting Exploration in Memory-Augmented Adam using Critical Momenta. arXiv 2023.
19. Lin, G.; Yan, H.; Kou, G.; Huang, T.; Peng, S.; Zhang, Y.; Dong, C. Understanding Adaptive Gradient Clipping in DP-SGD, Empirically. Int. J. Intell. Syst. 2022, 37, 9674–9700.
20. Shao, Y.; Fan, S.; Sun, H.; Tan, Z.; Cai, Y.; Zhang, C.; Zhang, L. Multi-Scale Lightweight Neural Network for Steel Surface Defect Detection. Coatings 2023, 13, 1202.
21. Borgli, H.; Thambawita, V.; Smedsrud, P.H.; Hicks, S.; Jha, D.; Eskeland, S.L.; Randel, K.R.; Pogorelov, K.; Lux, M.; Nguyen, D.T.D.; et al. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Sci. Data 2020, 7, 283.
Figure 1. Performance of each optimization algorithm in the validation set under the HyperKvasir dataset. The left figure (a) shows a comparison of accuracy, while the right figure (b) shows a comparison of loss values.
Figure 2. Performance of each optimization algorithm in the validation set under the MNIST dataset. The left figure (a) shows a comparison of accuracy, while the right figure (b) shows a comparison of loss values.
Figure 3. Accuracy of each optimization algorithm in the validation set under the CIFAR10 dataset.
Figure 4. GPU occupancy for the seven tested algorithms.
Table 1. Division of the datasets.

Dataset       Train    Test     Validation
HyperKvasir   1400     250      235
MNIST         55,000   10,000   5000
CIFAR10       45,000   10,000   5000
Table 2. The accuracy and loss of each optimization algorithm in the test set.

Dataset       Algorithm    Acc     Loss
HyperKvasir   SGD          60.0%   1.0780
              Adam         74.2%   0.8825
              NAdam        74.8%   0.8166
              AdamW        72.0%   0.8147
              Adagrad      68.4%   0.9926
              Adadelta     55.2%   2.1140
              AdamW_AGC    75.8%   0.7556

Bold font represents the best-performing data in the same settings.
Table 3. Accuracy and loss of each optimization algorithm in the test set under the MNIST dataset.

Dataset   Algorithm    Acc      Loss
MNIST     SGD          97.59%   0.0878
          Adam         98.52%   0.0640
          NAdam        98.55%   0.0580
          AdamW        98.47%   0.0640
          Adagrad      97.87%   0.0700
          Adadelta     96.52%   0.1285
          AdamW_AGC    98.69%   0.0610

Bold font represents the best-performing data in the same settings.
Table 4. Accuracy and loss of each optimization algorithm in the test set under the CIFAR10 dataset.

Dataset   Algorithm    Acc      Loss
CIFAR10   SGD          49.18%   1.466
          Adam         69.85%   1.440
          NAdam        69.67%   1.359
          AdamW        69.78%   1.240
          Adagrad      33.54%   1.852
          Adadelta     25.68%   1.950
          AdamW_AGC    71.70%   1.100

Bold font represents the best-performing data in the same settings.