1. Introduction
Gastrointestinal diseases, such as gastric ulcers, colon polyps, and gastrointestinal tumors, are common digestive disorders, and their accurate diagnosis and treatment are of great clinical importance. Gastrointestinal endoscopy is a commonly used clinical diagnostic tool that captures images of the inside of the gastrointestinal tract so that patients’ lesions can be observed. However, analyzing and classifying a large number of gastrointestinal images is laborious for doctors, so it is important to automate the classification of gastrointestinal images using computer-aided diagnostic techniques.
In recent years, with the rapid development of deep learning technology, deep learning-based gastrointestinal image classification methods [1] have gradually attracted the attention of researchers. Deep learning models can effectively extract features from gastrointestinal images and classify them, enabling automated disease detection and diagnosis. However, traditional optimization algorithms suffer from problems such as slow convergence and a tendency to become trapped in local extrema when dealing with large-scale data, which limits their performance. Therefore, this paper proposes a gastrointestinal image classification method based on an improved Adam algorithm, which improves the accuracy and stability of gastrointestinal image classification by optimizing the convergence speed and performance of the Adam algorithm. The Adam algorithm is improved by combining the weight decay and adaptive gradient clipping strategies so that it adapts better to the gastrointestinal image classification task, and the proposed method is experimentally verified and analyzed.
The Adam (Adaptive Moment Estimation) algorithm [2] is an effective stochastic gradient descent method that can largely reduce fluctuations during training by adaptively adjusting the learning rate, and it has the advantages of high efficiency and fast learning speed. Although the Adam algorithm has been successful in several fields, few studies have sought to improve its performance in gastrointestinal image classification tasks. In addition, the Adam algorithm has some shortcomings, and in recent years researchers have proposed various strategies to improve it. Reddi et al. [3] identified shortcomings in the convergence of the Adam algorithm and proposed a variant called AMSGrad, which addresses the non-convergence problem by modifying the second-order moment iteration and preventing the learning rate from oscillating. The variants of the AdaFamily algorithm proposed by Fassold [4] can be regarded as a hybrid of the AdamW [5], AdaBelief [6], and AdaMomentum algorithms; they form a family of Adam-like algorithms parameterized by a hyperparameter µ in the range [0, 1] and are intended to improve image classification accuracy. Chen et al. [7] proposed a global optimization method with adaptive momentum estimation (Adam-CBO), aiming to improve the success rate and accuracy of the global minimum search for non-convex objective functions while reducing the cost. Xie et al. [8] proposed a faster variant of the Adam algorithm called Adan, which employs a new Nesterov momentum estimation method to estimate the first- and second-order moments of the gradient in an Adam-style adaptive gradient algorithm and thereby improves convergence speed. HN_Adam, proposed by Reyad et al. [9], improves the Adam algorithm by automatically adjusting the parameter update step size during training and by combining the standard Adam algorithm with the AMSGrad algorithm in a hybrid mechanism. This gives HN_Adam better generalization performance on large-scale datasets and allows it to outperform several comparative algorithms in terms of accuracy and convergence speed. AdaPlus, proposed by Guan [10], improves optimization efficiency by integrating Nesterov momentum and exact step-size tuning on top of AdamW, combining the advantages of AdamW, Nadam [11], and AdaBelief. It demonstrates excellent performance and high stability in a variety of machine learning [12] tasks, especially image classification, language modeling, and the training of GANs [13], with reported results of 90.55%, 86.22%, and 63.7%, respectively. Zhang et al. [14] proposed the WuC-Adam optimization algorithm, which addresses learning rate instability, local optima, and overfitting by integrating Warm Up [15] and dynamic cosine annealing techniques. The ACGB-Adam algorithm proposed by Liu et al. [16] improves optimization speed and accuracy and reduces the computational cost by adjusting the gradient through adaptive coefficients and by combining composite gradients with stochastic block coordinate descent. Yun [17] proposed the Stoch GradAdam algorithm, which improves the traditional Adam algorithm by reducing the effect of noise through a gradient sampling technique, effectively improving the training stability and performance of the model in image processing tasks. Malviya et al. [18] proposed a memory-enhanced version of the Adam algorithm that uses a buffer of key momentum terms during training to encourage exploration towards flatter minima, overcoming the tendency of traditional adaptive methods to select sharp minima that harm generalization. Although these improved Adam algorithms have achieved good results in many respects, they still face difficulties, such as sensitivity to unbalanced data and a tendency to become trapped in local extrema, and there is still room for improvement when they are applied to medical images.
In this context, we propose an improved optimization algorithm for the gastrointestinal image classification task. It adds the weight decay and Adaptive Gradient Clipping (AGC) techniques [19] to the Adam algorithm, combining the advantages of AdamW and AGC to improve the generalization ability and training stability of the model. This improved algorithm overcomes some limitations of the traditional Adam algorithm. Because it lacks an effective means of regularization, the traditional Adam algorithm is prone to overfitting [20] during training. By introducing a weight decay strategy, the proposed method alleviates overfitting. In addition, the traditional Adam algorithm, even when combined with weight decay, may encounter exploding or vanishing gradients when dealing with complex data. To further improve the stability and performance of the model, we introduce the AGC technique, which improves the stability of the training process by dynamically adjusting the gradient clipping threshold. Combining these two techniques, our AdamW_AGC algorithm can provide higher accuracy and a more stable training process when dealing with complex medical image classification tasks.
The innovations proposed in this study are as follows:
(1) We propose an improved Adam algorithm that combines the weight decay and Adaptive Gradient Clipping (AGC) strategies and experimentally verify its performance on several image classification datasets. The experiments demonstrate that the improved Adam algorithm proposed in this paper has higher accuracy.
(2) The traditional Adam algorithm is prone to overfitting and gradient explosion problems when dealing with complex data. We solve these common problems by introducing the weight decay and AGC strategies, which effectively improve the generalization ability and training stability of the model.
(3) Although the Adam algorithm has been successful in several fields, relatively little research has been carried out on its application to medical image classification. For the first time, we apply the improved Adam algorithm to gastrointestinal image classification and experimentally validate its effectiveness, demonstrating its advantages in this specific application area.
(4) The proposed optimization algorithm copes effectively with the high dimensionality and sample imbalance problems that arise in the classification of complex medical images. This is important for improving the accuracy and efficiency of the automatic diagnosis of gastrointestinal diseases.
2. Design of AdamW_AGC Algorithm
2.1. Adam Algorithm
The Adam (Adaptive Moment Estimation) algorithm is an efficient adaptive optimization algorithm, one of the gradient descent algorithms used to optimize neural network training. It combines the Momentum method and RMSprop method and dynamically adjusts the learning rate of each parameter by calculating the first-order moment estimation and the second-order moment estimation of the gradient to achieve more efficient training. Momentum methods accelerate the convergence of the gradient descent algorithm by taking into account the direction of the previous update and the direction of the current gradient in each parameter update. The RMSprop method solves the learning rate selection problem by adjusting the learning rate of each parameter by normalizing the exponentially decaying mean of the gradient squared, so that the step size of the parameter update is independent of the magnitude of the gradient, avoiding the problem of inappropriate learning rate due to too large or too small a gradient. The Adam algorithm combines the benefits of two optimization methods, Momentum and RMSprop, making it particularly effective for solving problems with large amounts of data and high parameter dimensions. The formulas for moment estimation vectors for the gradient used in parameter updating are shown in Equations (1) and (2):
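$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \tag{1}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \tag{2}$$
In Equations (1) and (2), $g_t$ is the gradient at time step $t$, $m_t$ and $v_t$ are the first-order and second-order moment estimates, respectively, the parameters $\beta_1, \beta_2 \in [0, 1)$ represent the exponential decay rates controlling the first-order and second-order moments, and $g_t = \nabla_\theta L(\theta_{t-1})$ is the partial derivative of the loss function $L$ with respect to the parameters $\theta$. Because the moment estimates are initialized to zero, they are biased towards 0 during the initial time steps, especially when the decay rates are small (i.e., $\beta_1$ and $\beta_2$ are close to 1).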
During the training process, the exponential moving averages of the gradient and of the squared gradient are affected by this initialization bias. To eliminate it, both moving averages are bias-corrected at every step, which keeps the estimates accurate and stable. The corrected expressions for the exponential moving average of the gradient and of the squared gradient are given in Equation (3) and Equation (4), respectively:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{3}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{4}$$
In Equations (3) and (4), $\hat{m}_t$ and $\hat{v}_t$ represent, respectively, the bias-corrected first moment estimate and bias-corrected second moment estimate. The Adam algorithm is updated as shown in Equation (5):
$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{5}$$
In Equation (5), $\theta_t$ is the parameter vector, $\alpha$ is the learning rate, and $\epsilon$ is a sufficiently small value greater than zero.
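The update described by Equations (1) to (5) can be illustrated with a minimal pure-Python sketch. The function name `adam_step` and the default values `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are assumptions taken from common practice, not settings reported in this paper.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of parameters.

    theta, grad, m, v are lists of floats of equal length; t is the
    1-based time step. Default hyperparameters are illustrative assumptions.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias-corrected first moment
        v_hat = v_i / (1.0 - beta2 ** t)             # bias-corrected second moment
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 201):
    theta, m, v = adam_step(theta, [2.0 * theta[0]], m, v, t, lr=0.05)
```

Note that at the first step the bias correction makes the update magnitude approximately equal to the learning rate, regardless of the gradient scale.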
2.2. Weight Decay
Weight decay, also known as $L_2$ regularization, is a common technique used to reduce overfitting. In neural network training, weight decay penalizes larger weight values by adding the squared norm of the weights to the loss function, thus encouraging the model to learn a simpler distribution of weights and improving generalization. Overfitting occurs when a model learns the noise in the training data rather than the actual signal, resulting in poor performance on unseen data. By penalizing large weights, weight decay helps to constrain model complexity, ensuring that the model captures the underlying data pattern rather than fitting the noise. Weight decay can be achieved by adding a regularization term to the original loss function $L(\theta)$ to obtain a new loss function $L_{\text{new}}(\theta)$, as shown in Equation (6):
$$L_{\text{new}}(\theta) = L(\theta) + \frac{\lambda}{2}\, \lVert \theta \rVert_2^2 \tag{6}$$
In Equation (6), $L(\theta)$ is the original loss function, which measures the difference between the model prediction and the actual value; $\theta$ denotes the model parameters, including the weights and biases; $\lVert \theta \rVert_2^2$ denotes the squared $L_2$ norm of the model parameters; and $\lambda$ is a positive hyperparameter that controls the strength of the weight decay. The regularization term serves as a penalty for large weights: it keeps the model parameters small, which in turn reduces model complexity. A larger $\lambda$ imposes a heavier penalty on the magnitude of the weights, promoting simpler models that are less likely to overfit the training data, but it can also lead to underfitting if it is too large. Conversely, a smaller $\lambda$ results in a lighter penalty, allowing the model to learn more complex patterns at an increased risk of overfitting.
In the Adam algorithm, weight decay is tightly integrated with the parameter update process. The Adam algorithm optimizes the model by adaptively adjusting the learning rate of each parameter. At each parameter update, in addition to adjusting the parameters according to the gradient, an additional weight decay term is introduced to ensure that the weights are not too large. This combination ensures that the model remains fast converging and stable during training while avoiding overfitting problems.
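As a small illustration of the penalty term, the sketch below evaluates a regularized loss and the extra gradient contribution it produces. It assumes the common $\frac{\lambda}{2}\lVert\theta\rVert_2^2$ form of the penalty, so that each gradient component gains $\lambda\theta_i$; the function names are hypothetical.

```python
def l2_regularized_loss(loss, theta, lam):
    """Original loss plus (lam / 2) times the squared L2 norm of theta."""
    return loss + 0.5 * lam * sum(p * p for p in theta)

def l2_regularized_grad(grad, theta, lam):
    """Gradient of the regularized loss: each component gains lam * theta_i."""
    return [g + lam * p for g, p in zip(grad, theta)]

# Toy usage with illustrative numbers (not values from the experiments).
theta = [3.0, -4.0]
total = l2_regularized_loss(1.0, theta, lam=0.01)
grad = l2_regularized_grad([0.5, 0.5], theta, lam=0.01)
```

The added term always points each weight back towards zero, which is why larger weights are penalized more strongly.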
2.3. Adaptive Gradient Clipping
In deep learning, gradient explosion and gradient vanishing are two common problems. Gradient explosion leads to excessive weight updates, making the model unstable or even causing training to fail. The traditional gradient clipping method limits the size of the gradient by setting a fixed threshold, but this threshold has to be retuned for different models and tasks, which is not flexible enough. In contrast, the Adaptive Gradient Clipping (AGC) method dynamically adjusts the clipping threshold of the gradient according to the size of the parameter itself, which improves the stability and training effect of the model. For each parameter, the ratio of the norm of the gradient to the norm of the parameter is calculated. If this ratio exceeds the set threshold, the gradient is scaled proportionally so that the ratio no longer exceeds the threshold; if the ratio is within the threshold, the gradient is left unchanged. Because the effective clipping bound follows the state of each parameter, model training becomes more flexible and robust. The procedure is shown in Equations (7) and (8):
$$r_t = \frac{\lVert g_t \rVert_2}{\max\left(\lVert \theta_{t-1} \rVert_2,\ \epsilon\right)} \tag{7}$$
$$\tilde{g}_t = \begin{cases} \dfrac{\tau}{r_t}\, g_t, & r_t > \tau \\ g_t, & \text{otherwise} \end{cases} \tag{8}$$
In Equations (7) and (8), $\tau$ is the clipping threshold of AGC, $\lVert g_t \rVert_2$ is the $L_2$-norm of the original gradient, and $\epsilon$ is a sufficiently small value greater than zero. By finely adjusting the gradient value, AGC helps keep the training process stable and efficient and guards against the gradient explosion problem, making it a key stabilization control mechanism in the algorithm.
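The clipping rule described above can be sketched as follows. This is a per-parameter simplification written under stated assumptions: the function name `adaptive_gradient_clip` and the defaults `clip_threshold=0.01` and `eps=1e-3` are illustrative and are not values reported in this paper.

```python
import math

def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in values))

def adaptive_gradient_clip(theta, grad, clip_threshold=0.01, eps=1e-3):
    """Rescale grad when its norm exceeds clip_threshold times the parameter norm.

    eps keeps the bound positive for parameters that are (close to) zero.
    Returns a new list; the input gradient is not modified.
    """
    max_norm = clip_threshold * max(l2_norm(theta), eps)
    grad_norm = l2_norm(grad)
    if grad_norm > max_norm:
        scale = max_norm / grad_norm
        return [g * scale for g in grad]
    return list(grad)

# Toy usage: parameter norm 5, gradient norm 10, threshold 0.1 -> bound 0.5.
clipped = adaptive_gradient_clip([3.0, 4.0], [6.0, 8.0], clip_threshold=0.1)
```

Scaling by `max_norm / grad_norm` preserves the gradient direction and only shortens its length, in contrast to element-wise value clipping.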
2.4. AdamW_AGC Algorithm
In the standard Adam optimization algorithm, weight decay is added directly to the gradient, which sometimes conflicts with the gradient direction and affects the adaptive tuning of the learning rate. In contrast, the AdamW algorithm treats weight decay as a separate step and regularizes by adjusting the parameter values rather than modifying the gradient. This approach provides better training results in many cases, but stability remains an issue. To solve this problem, the AdamW_AGC algorithm is proposed, which combines weight decay and adaptive gradient clipping to improve the stability and efficiency of the model. In the AdamW_AGC algorithm, the initialization phase begins with determining the initial time step and initializing the first- and second-order moment estimates, $m_0$ and $v_0$. An existence test is performed for the gradient of each model parameter: if the gradient exists, processing continues; otherwise, the parameter is not updated. In addition, the algorithm does not support sparse gradients, and encountering a sparse gradient causes it to report an error and terminate. The model parameters are regularized by a weight decay step applied directly to the parameters, which reduces the complexity of the model and enhances its generalization ability, as shown in Equation (9):
$$\theta \leftarrow \theta - \alpha \lambda \theta \tag{9}$$
In Equation (9), $\lambda$ is the value of weight decay and $\alpha$ is the learning rate. Adaptive gradient clipping is performed before the parameter update, and the clipping threshold is dynamically adjusted based on the gradient and parameter norms to keep the gradient within an appropriate range.
The AdamW_AGC algorithm improves the stability and efficiency of training by combining weight decay and adaptive gradient clipping techniques. The specific steps of this algorithm are shown in Algorithm 1.
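Algorithm 1: AdamW_AGC Algorithm
1. Inputs: parameters of the model $\theta$, learning rate $\alpha$, exponential decay rates for moment estimates $\beta_1, \beta_2$, term to improve numerical stability $\epsilon$, weight_decay $\lambda$, and clip_threshold $\tau$
2. Output: the optimized parameters $\theta$
3. Initialize $t \leftarrow 0$, $m \leftarrow 0$, $v \leftarrow 0$
4. $t \leftarrow t + 1$
5. For each parameter $\theta$ in params do
6.   if $g$ (gradient of $\theta$) exists then
7.     if $g$ is sparse then
8.       Raise an error: “Sparse gradients are not supported.”
9.     end if
10.     $\theta \leftarrow \theta - \alpha \lambda \theta$
11.     $r \leftarrow \lVert g \rVert_2 / \max(\lVert \theta \rVert_2, \epsilon)$
12.     if $r > \tau$ then $g \leftarrow (\tau / r)\, g$
13.     $m \leftarrow \beta_1 m + (1 - \beta_1)\, g$
14.     $v \leftarrow \beta_2 v + (1 - \beta_2)\, g^2$
15.     $\hat{m} \leftarrow m / (1 - \beta_1^t)$
16.     $\hat{v} \leftarrow v / (1 - \beta_2^t)$
17.     $\theta \leftarrow \theta - \alpha\, \hat{m} / (\sqrt{\hat{v}} + \epsilon)$
18.   end if
19. end for
20. return $\theta$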
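The steps of Algorithm 1 can be sketched in plain Python as shown below. This is an illustrative sketch rather than the authors' implementation: the class name `AdamWAGC`, the dictionary-based parameter layout, and the defaults for `betas`, `eps`, `clip_threshold`, and `agc_eps` are assumptions, while `lr=0.001` and `weight_decay=0.01` follow the experimental configuration reported in Section 3.4. The sparse-gradient check is omitted because plain Python lists are always dense, and a per-parameter step counter is used so that bias correction stays valid for parameters that receive their first gradient late.

```python
import math

def _norm(values):
    return math.sqrt(sum(x * x for x in values))

class AdamWAGC:
    """Sketch of Adam with decoupled weight decay and adaptive gradient clipping."""

    def __init__(self, lr=0.001, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.01, clip_threshold=0.01, agc_eps=1e-3):
        self.lr, self.betas, self.eps = lr, betas, eps
        self.weight_decay = weight_decay
        self.clip_threshold, self.agc_eps = clip_threshold, agc_eps
        self.state = {}  # parameter name -> (m, v, step)

    def step(self, params, grads):
        """params: dict name -> list of floats (updated in place).
        grads: dict name -> list of floats, or None when no gradient exists."""
        b1, b2 = self.betas
        for name, theta in params.items():
            g = grads.get(name)
            if g is None:
                continue  # no gradient: leave this parameter untouched
            # Decoupled weight decay applied directly to the parameters.
            theta[:] = [p - self.lr * self.weight_decay * p for p in theta]
            # Adaptive gradient clipping relative to the parameter norm.
            max_norm = self.clip_threshold * max(_norm(theta), self.agc_eps)
            g_norm = _norm(g)
            if g_norm > max_norm:
                g = [x * max_norm / g_norm for x in g]
            # Moment estimates with bias correction.
            zeros = [0.0] * len(theta)
            m, v, t = self.state.get(name, (zeros, zeros, 0))
            t += 1
            m = [b1 * mi + (1.0 - b1) * gi for mi, gi in zip(m, g)]
            v = [b2 * vi + (1.0 - b2) * gi * gi for vi, gi in zip(v, g)]
            self.state[name] = (m, v, t)
            bc1, bc2 = 1.0 - b1 ** t, 1.0 - b2 ** t
            # Adam parameter update.
            theta[:] = [p - self.lr * (mi / bc1) / (math.sqrt(vi / bc2) + self.eps)
                        for p, mi, vi in zip(theta, m, v)]
        return params

# Toy usage: "b" has no gradient and is therefore skipped.
params = {"w": [1.0, -2.0], "b": [0.5]}
AdamWAGC().step(params, {"w": [0.3, -0.1], "b": None})
```

Clipping before the moment updates means that an unusually large gradient is bounded before it can enter the moving averages.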
3. Experiments and Analysis
3.1. Configuration of Experimental Environment
The experiments use the popular deep learning framework PyTorch (version 2.0.1) to train the classification models. The code was written in Python 3.10, and the lightning-hydra-template framework was used within PyCharm 2023.3.4. Comparative experiments were conducted using the AdamW_AGC algorithm and other existing optimization algorithms. The evaluation criteria chosen for the experiments are classification accuracy (Acc) and loss value (Loss), which are used to assess the performance of each optimization algorithm in image classification. Accuracy measures the model’s ability to correctly classify an image. After each training iteration, the model is saved and the evaluation metrics are calculated. The metric values are then compared with those of the previously saved model, and the saved model is updated accordingly. The final result is recorded after all iterations are completed.
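The evaluation procedure can be illustrated with the following sketch. The selection rule is an assumption: the text does not state how the saved model is chosen, so the sketch keeps the record with the highest accuracy and breaks ties by the lower loss. The function names and the sample history are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def select_best(history):
    """history: list of (epoch, acc, loss) records.

    Assumed rule: highest accuracy wins, ties are broken by lower loss.
    """
    return max(history, key=lambda record: (record[1], -record[2]))

# Toy usage with made-up numbers (not results from the paper).
acc = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
best = select_best([(1, 0.70, 0.9), (2, 0.75, 0.8), (3, 0.75, 0.7)])
```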
3.2. Experimental Datasets
The HyperKvasir dataset [21] is a medical image dataset focusing on gastrointestinal disorders. It was collected at a single hospital (Bærum Hospital, Norway) from 2008 to 2016 and contains over 110,000 high-quality images and a large amount of video data, annotated by professional medical experts. This dataset is valuable for the automated detection of gastrointestinal diseases, classification research, and medical education. By applying machine learning and data analytics, HyperKvasir can facilitate technological development in the medical field and improve the accuracy of clinical diagnosis. In this paper, 1885 labeled images from the HyperKvasir public dataset are used for the experiments.
MNIST is a classic dataset in the field of machine learning. It consists of 70,000 images in 10 categories, and each sample is a 28 × 28-pixel grayscale image of a handwritten digit from 0 to 9.
CIFAR-10 is a widely used image classification dataset containing 60,000 color images of 32 × 32 pixels, divided into 10 categories of 6000 images each. These images cover a wide range of common objects such as airplanes, cars, birds, cats, deer, dogs, frogs, horses, and ships, and the visual similarities that may exist between categories make the dataset both varied and challenging. The division of the three datasets into training, testing, and validation sets is shown in Table 1.
3.3. Data Preprocessing
To ensure the quality of the data and the validity of the model, we performed the following preprocessing steps on the gastrointestinal images:
(1) Image Resizing: All images were resized to 224 × 224 pixels to match the input requirements of the deep learning models. This step ensures that all input images have the same dimensions, facilitating batch processing and model training.
(2) Data Normalization: normalization of the image data is performed to accelerate the training of the model and to improve the stability of the model.
(3) Data Augmentation: To enhance the diversity of the training data and mitigate overfitting, various data augmentation techniques were applied, including random horizontal flips, random resized cropping, and color jitter. Random horizontal flipping was applied with a 50% probability to increase variability. Random resized cropping involves cropping a random region of the image and resizing it to the target size, improving robustness to different scales and positions. Color jitter adjusted the brightness, contrast, saturation, and hue of the images to enhance the model’s adaptability to color variations.
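Two of the preprocessing steps above, normalization and random horizontal flipping, can be sketched on a toy image stored as nested lists. The mean and standard deviation of 0.5 are placeholder assumptions, since the normalization statistics are not reported, and the function names are hypothetical. In a PyTorch pipeline these steps would more typically be expressed with image transform utilities rather than written by hand.

```python
import random

def normalize(image, mean=0.5, std=0.5):
    """Shift and scale every pixel; mean and std are placeholder assumptions."""
    return [[(px - mean) / std for px in row] for row in image]

def random_horizontal_flip(image, p=0.5, rng=random):
    """Mirror the image left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

# Toy usage on a 2 x 3 single-channel image with pixel values in [0, 1].
image = [[0.0, 0.5, 1.0],
         [1.0, 0.5, 0.0]]
flipped = random_horizontal_flip(image, p=1.0)   # p=1.0 always flips
scaled = normalize(image)
```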
3.4. Experimental Results and Analysis
The AdamW_AGC optimization algorithm improves the Adam algorithm by combining weight decay and adaptive gradient clipping, which helps prevent overfitting and gradient explosion during training and thus improves the model’s performance on gastrointestinal image classification. Six commonly used optimization algorithms, SGD (Stochastic Gradient Descent), Adam, Nadam, AdamW, Adagrad (Adaptive Gradient), and Adadelta, were selected for comparative experiments. The performance of the improved optimization algorithm is assessed by comparing the evaluation metrics of each optimization algorithm on the validation and test sets, in particular the accuracy and loss of each algorithm on the test set of the HyperKvasir dataset. The accuracy and loss of each optimization algorithm on the test set are shown in Table 2.
The traditional Adam algorithm is prone to overfitting and gradient explosion problems when dealing with complex data, while AdamW_AGC addresses these problems by combining weight decay and adaptive gradient clipping, which improves training stability and model generalization. Of the Nadam and AdamW algorithms, Nadam combines the advantages of Nesterov momentum and the Adam algorithm, but gradient explosion may still occur when dealing with high-dimensional complex data, while AdamW improves the regularization effect by treating weight decay as an independent step but lacks an effective gradient control mechanism. By introducing adaptive gradient clipping, AdamW_AGC retains the regularization effect of AdamW and additionally improves training stability and efficiency by dynamically adjusting the gradient clipping threshold. Table 2 shows that the AdamW_AGC algorithm has the highest accuracy, 75.8%, and the lowest loss, 0.7556. Compared with Nadam and Adam, the next-best-performing algorithms, the accuracy is improved by 1.0% and 1.6%, respectively. AdamW_AGC therefore outperforms the other algorithms on the test set in both accuracy and loss, which indicates its performance and stability in complex model training. These results also emphasize the importance of choosing the right optimization algorithm for the performance of deep learning models.
The experimental configuration is as follows:
(1) Algorithms: the experiments are compared using six optimization algorithms: SGD, Adam, Nadam, AdamW, Adagrad, and Adadelta.
(2) Learning Rate: the initial learning rate is set to 0.001.
(3) Epochs: the number of epochs is set to 100 to ensure that all algorithms are trained the same number of times.
(4) Network: to test the performance of the improved algorithm on different neural networks, different types of neural networks were trained on different datasets. A simple fully connected neural network was trained on the MNIST dataset and a MobileNetV2 lightweight neural network was trained on the CIFAR10 dataset and the gastrointestinal tract dataset.
(5) Weight decay: the parameters were adjusted several times during the experiment, and the finalized value of the weight decay was 0.01.
(6) Comparison experiment: an experiment with different optimization algorithms in one dataset was carried out and the results were recorded.
(7) Experimental results: an analysis is carried out to reflect the advantages of the AdamW_AGC algorithm in solving the problems with the standard Adam algorithm and improving the generalization ability.
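The configuration listed above can be summarized as a small Python structure. The values are those stated in the list; the dictionary keys, the dataset labels, and the helper `enumerate_runs` are hypothetical names introduced only for illustration.

```python
# Values come from the configuration list; key names are illustrative.
EXPERIMENT_CONFIG = {
    "optimizers": ["SGD", "Adam", "Nadam", "AdamW", "Adagrad", "Adadelta", "AdamW_AGC"],
    "learning_rate": 0.001,
    "epochs": 100,
    "weight_decay": 0.01,
    "networks": {
        "MNIST": "fully_connected",
        "CIFAR10": "MobileNetV2",
        "HyperKvasir": "MobileNetV2",
    },
}

def enumerate_runs(config):
    """List every (dataset, network, optimizer) combination to be trained."""
    return [(dataset, network, optimizer)
            for dataset, network in config["networks"].items()
            for optimizer in config["optimizers"]]

runs = enumerate_runs(EXPERIMENT_CONFIG)
```

With three datasets and seven optimizers (the six baselines plus AdamW_AGC), this yields 21 training runs, each using the same learning rate and number of epochs.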
The performance of each optimization algorithm on the gastrointestinal tract dataset is shown in Figure 1.
Compared to other algorithms such as SGD, Adadelta, and Adagrad, the AdamW_AGC algorithm not only improves performance faster but also consistently maintains high accuracy, which reflects its excellent global search capabilities and efficient use of data. Moreover, the loss value is very low in the early stage of training, and as the training progresses, the loss value also stays low and fluctuates less, which indicates that the model has a very good generalization ability during the learning process.
To verify the effectiveness of the AdamW_AGC optimization algorithm proposed in the paper, the MNIST and CIFAR10 datasets, which are popular in deep learning, were selected for validation experiments, and the results are shown in Table 3 and Table 4.
Different optimization algorithms perform differently on the MNIST and CIFAR10 datasets. For the simpler MNIST dataset, most of the optimization algorithms achieve high accuracy and low loss values, and the AdamW_AGC algorithm achieves 98.69% accuracy; however, the differences in performance are more obvious on the more complex CIFAR10 dataset. On both datasets, AdamW_AGC shows excellent performance with the highest accuracy and also low loss values, which indicates that it is not only adaptable but also provides stable and efficient optimization on different types of problems. In particular, with the CIFAR10 dataset, it has a higher advantage over other optimization algorithms, which may be attributed to the effective management of gradients performed by AGC, which allows AdamW_AGC to better avoid overfitting and gradient problems when dealing with more diverse and complex data.
In the experimental results on the MNIST dataset, the accuracy of the Adadelta algorithm improved the most at the beginning of training, which corresponded to a rapid decline in its loss. As training progresses, the loss of all algorithms gradually decreases, and AdamW_AGC converges faster than the other algorithms, which suggests that adaptive gradient clipping helps optimization. In addition, the accuracy of all algorithms gradually increases and eventually levels off. The AdamW_AGC algorithm maintains high accuracy almost from the beginning to the end of training, which shows good learning ability and final performance. After 100 epochs of training, AdamW_AGC shows the best results in terms of accuracy, closely followed by the Nadam algorithm. Taken together, the curves show that the AdamW_AGC algorithm not only converges quickly but also reaches a high level of accuracy, which suggests that it balances stability and performance. The performance of each optimization algorithm on the validation set of the MNIST dataset is shown in Figure 2.
The experimental results of the improved AdamW_AGC algorithm on the CIFAR10 dataset are shown in Figure 3.
Most optimization algorithms improve their accuracy rapidly after a few epochs and then stabilize. The accuracy of three algorithms, AdamW, Adam, and AdamW_AGC, improves faster; in particular, the accuracy of the AdamW_AGC algorithm is consistently higher than that of the other optimization algorithms and remains highly stable throughout the training process.
To demonstrate the applicability of the optimization algorithm, computing time and memory usage should be considered in addition to accuracy and loss. The computation time and GPU usage required to train each algorithm on the CIFAR-10 dataset are shown in Figure 4. During the whole training process, the GPU utilization of the AdamW_AGC algorithm is lower than that of the other six algorithms and fluctuates around 20%. However, because of the additional computation that AdamW_AGC performs in each iteration, its training time is slightly longer than that of the other algorithms.
The experimental results fully validate the advantages of the AdamW_AGC optimization algorithm in effectively preventing the overfitting phenomenon and the gradient explosion problem by combining weight decay and adaptive gradient clipping, thus improving the performance of the model on gastrointestinal image classification. Compared with traditional optimization algorithms, AdamW_AGC shows better generalization ability and stability in dealing with highly complex medical image classification tasks, especially in gastrointestinal image classification. In summary, the introduction of the AdamW_AGC optimization algorithm provides a more efficient and reliable model training method in the field of medical image processing, which is expected to promote the progress and application of related technologies.
4. Conclusions
In this paper, we propose a gastrointestinal image classification method based on the AdamW_AGC algorithm, which combines the weight decay and adaptive gradient clipping strategies to enhance model generalization and training stability. This innovative algorithm addresses the issues of overfitting and gradient explosion commonly encountered with traditional optimization algorithms when handling complex data by introducing weight decay and adaptive gradient clipping techniques.
The experimental results show that the improved Adam algorithm performs well in image classification tasks. Compared with the traditional Adam algorithm and other improved algorithms, AdamW_AGC achieves higher accuracy and lower loss values on the gastrointestinal image classification task, and it demonstrates good generalization in practical applications.
Despite the positive outcomes of this research, there are still some limitations. Compared to other algorithms, Adaptive Gradient Clipping (AGC) increases the computational burden in each iteration, so the algorithm takes slightly longer. Additionally, the method primarily targets dense gradient optimization and has not fully explored the capability to handle sparse gradients, which may limit its applicability in certain sparse data scenarios.
Given these limitations, our future work will focus on several directions. Firstly, we plan to further test and optimize the AdamW_AGC algorithm on larger and more diverse medical image datasets to comprehensively assess its performance and adaptability. Secondly, to expand the algorithm’s application scope, we will explore effective strategies for managing sparse gradients, which involves adjusting the algorithm to maintain performance while ensuring computational efficiency when dealing with sparse data. Future research can also continue to explore and improve various aspects of the Adam algorithm, with the aim of further improving the performance and applicability of the optimization algorithm in tasks such as medical image classification.