Article

MAMGD: Gradient-Based Optimization Method Using Exponential Decay

Nikita Sakovich, Dmitry Aksenov, Ekaterina Pleshakova and Sergey Gataullin *
1 Financial University under the Government of the Russian Federation, Moscow 109456, Russia
2 MIREA—Russian Technological University, 78 Vernadsky Avenue, Moscow 119454, Russia
* Author to whom correspondence should be addressed.
Technologies 2024, 12(9), 154; https://doi.org/10.3390/technologies12090154
Submission received: 10 August 2024 / Revised: 27 August 2024 / Accepted: 31 August 2024 / Published: 6 September 2024

Abstract

Optimization methods, namely, gradient optimization methods, are a key part of neural network training. In this paper, we propose a new gradient optimization method using exponential decay and the adaptive learning rate using a discrete second-order derivative of gradients. The MAMGD optimizer uses an adaptive learning step, exponential smoothing and gradient accumulation, parameter correction, and some discrete analogies from classical mechanics. The experiments included minimization of multivariate real functions, function approximation using multilayer neural networks, and training neural networks on popular classification and regression datasets. The experimental results of the new optimization technology showed a high convergence speed, stability to fluctuations, and resistance to the build-up of the gradient accumulators. The research methodology is based on the quantitative performance analysis of the algorithm by conducting computational experiments on various optimization problems and comparing it with existing methods.

1. Introduction

Deep learning is widely used to solve problems such as image recognition and classification [1,2,3,4], target localization [4,5], object detection [6,7,8], speech and action recognition [9,10], face recognition [11,12], and machine translation [13,14,15], improving the efficiency of agribusiness [16] and intelligent industrial robotics [17], as well as in the field of cybersecurity [18,19,20,21]. The model structure and the optimization algorithm play a key role in the performance of a deep neural network, and the optimization algorithm can significantly improve the performance and accuracy of deep neural network models. The choice of optimization algorithm is therefore a priority when building a model: for a fixed network architecture and dataset, different optimization algorithms will yield different training efficiencies.
Optimization methods, namely, gradient-based optimization methods, are a key part of training neural networks, which play an important role in various fields of science and business. With the development of deep learning and its widespread application to various tasks, there is a need for efficient optimization methods that can accelerate learning and improve the accuracy of models. Optimization methods are also used to minimize multivariate real functions: because most mathematical models are functions of several, and usually many, variables, optimizers help to find the extrema of such functions, and these extrema constitute solutions to applied problems in various fields of mathematical modeling. In deep learning, a neural network with a given number of layers and parameters defines a multidimensional function; the optimizer, in turn, adjusts the parameters of the neural network so as to minimize the deviation of its predictions from the training data. A large number of metrics exist to measure such deviations, and they serve as the target function for optimization or, more specifically, minimization. Currently, there are various optimization methods that combine the ideas of stochastic gradient descent, the adaptive learning rate, exponential smoothing, and the iterative correction of optimization algorithm parameters. The most popular optimization methods are SGD [22], Adagrad [23], RMSprop [24], and Adam [25], as well as Adam-based algorithms that extend it with various ideas, such as Nadam [22], AdaMax [25], and AdaFactor [26]. All these algorithms are implemented in the keras deep machine learning library and are used in training neural networks.
The authors of [27] propose a method that incurs no additional computational cost and requires no backward iterations; this method is based on step-size adaptation, and the entire performance of the learning trajectory is considered as a function of the step size. Leslie N. Smith uses a cyclic learning rate method that does not need to find the best values and plan global learning rates, which reduces the number of iterations [28]. Zhibin Zhu et al. [29] propose a method that does not add additional computational effort, and the Powell restart strategy developed by the authors allows for improving numerical performance. The online adaptive gradient descent with communication event-triggered method [30] provides minimization of dynamic regret by a group of agents, which is a cumulative loss for agent evaluations of the optimal strategy.
SAM [31] is an efficient sharpness-aware minimization optimizer for training deep learning (DL) models. The main advantage of the optimizer is improved generalization for training deep learning models. The improvement is achieved by adding additional perturbation steps, which results in the flattening of the deep learning model landscape. The SAM optimizer is applied in various machine learning problems [32,33,34,35]. Recently, many new data-driven and network-based layout generation methods have been introduced in the field of deep learning, which provide performance improvements and automation after the training stage. The authors of [36] propose a new geometric representation that can encode correlations between objects, the ground, and the camera position.
Izuchukwu et al. [37] propose a method that demonstrates the improvement by including inertial terms in the reflected gradient projection methods, and the method includes one projection onto the feasible set and one cost operator estimate per iteration. Meruza Kubentayeva et al. [38] propose a method for solving dual optimization problems using accelerated gradient methods. They present a network equilibrium model, and this model is formulated as a convex minimization program. Xiaolin Zhou et al. [39] propose a reflected gradient projection method with a new step size to solve the variational inequality problem in Hilbert spaces. The authors of [40] propose a first-order gradient descent optimization algorithm based on NWM-Adam. The authors’ proposed algorithm eliminates the undesirable convergence behavior of some optimization algorithms that use a fixed window of past gradients to scale gradient updates and improves the performance of Adam and AMSGrad.
Traditional pixel-based classifications face problems, such as misclassification, omissions, and noise. Object-based classifications effectively deal with these problems by taking into account the provided domain information and dividing the image into several objects based on the specified criteria. Regarding the importance of pixel-based and sub-pixel-based methods, the study by Valjarević et al. [41] provides a valuable insight.
This paper presents a new optimization method that combines the ideas of existing optimizers and new methods. The algorithm shows good convergence speed and accuracy results, including outperforming the current optimization methods with the proper selection of hyperparameters. The MAMGD optimizer uses an adaptive learning step, exponential smoothing and gradient accumulation, parameter correction, and some discrete analogies from classical mechanics. The MAMGD optimizer is designed to address the slow convergence and instability inherent in conventional gradient optimization methods, such as SGD and its variations. The main goal of MAMGD is to provide a high convergence speed and stability when dealing with different types of data, including those characterized by high gradient variability. This is achieved by adaptively tuning the learning rate and using a discrete second-order representation of the derivatives to better account for the curvature of the loss function. Like all gradient-based optimization methods, the MAMGD method can be used for training neural networks, in optimization problems in which it is possible to take partial derivatives of the target function, and in mathematical modeling problems that are based on previous problems.

2. Materials and Methods

2.1. Classic Gradient Descent

The simplest and most basic algorithm is gradient descent [21], and all other optimization algorithms are formed based on this method. Classical gradient descent is easy to implement and understand, which makes it attractive to a wide range of users and researchers. Also, this method guarantees convergence to the local minimum of the loss function if certain conditions, such as the convexity and continuity of the loss function, are met (Algorithm 1).
Algorithm 1 Gradient descent
$g_t \leftarrow \nabla_{\theta_{t-1}} f(\theta_{t-1})$
$\theta_t \leftarrow \theta_{t-1} - \eta g_t$
The classical gradient descent algorithm (Algorithm 1) is simple to implement and can work well with a good choice of initial approximations and the learning rate parameter. However, gradient descent is very dependent on the choice of initial values and may become stuck in local minima. All subsequent optimization methods are based on the idea of gradient descent and are improvements of it. To improve on gradient descent, derived optimization algorithms implement adaptive stepping and gradient accumulation methods to mitigate the problem of getting stuck in a local minimum.
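For readers who prefer code, the update rule of Algorithm 1 can be written as a few lines of NumPy; this is a minimal sketch for illustration (the test function, learning rate, and step count are arbitrary choices), not the implementation used in the experiments.

```python
import numpy as np

def gradient_descent(grad_f, theta0, lr=0.1, steps=100):
    """Plain gradient descent: theta_t = theta_{t-1} - lr * grad f(theta_{t-1})."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        g = grad_f(theta)          # g_t
        theta = theta - lr * g     # parameter update
    return theta

# Example: minimize f(x) = x^2 - x, whose gradient is 2x - 1 (minimum at x = 0.5).
print(gradient_descent(lambda x: 2 * x - 1, theta0=[0.0]))
```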

2.2. Momentum

The Momentum optimization method [22] eliminates some problems in classical gradient descent and allows for converging to a minimum faster. The idea of the optimizer is to accumulate gradients, i.e., the algorithm uses information about previous iterations of the algorithm (Algorithm 2).
Algorithm 2 Momentum
$g_t \leftarrow \nabla_{\theta_{t-1}} f(\theta_{t-1})$
$m_t \leftarrow \beta m_{t-1} + g_t$
$\theta_t \leftarrow \theta_{t-1} - \eta m_t$
At the same time, the method may overshoot the global minimum due to the accumulation of gradients, and it can also be strongly affected by exploding gradients, which greatly inflate the gradient accumulator and cause the optimizer to jump over the global minimum. The problem of sensitivity to the choice of the learning rate remains, just as with classical gradient descent: at high learning rates, the optimizer overshoots the global minimum, and at low learning rates, it converges more slowly, although still faster than classical gradient descent because of the gradient accumulator. In any case, if the learning rate value is too small, learning may be slow and there is a risk of getting stuck in a local minimum; if the value is too large, oscillation or divergence of the algorithm may occur.
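A minimal NumPy sketch of Algorithm 2 follows, with an illustrative value of β; it only mirrors the update rule above and is not the experimental implementation.

```python
import numpy as np

def momentum(grad_f, theta0, lr=0.01, beta=0.9, steps=100):
    """Momentum: accumulate past gradients in m_t and step along the accumulator."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)       # g_t
        m = beta * m + g        # m_t = beta * m_{t-1} + g_t
        theta = theta - lr * m  # theta_t = theta_{t-1} - lr * m_t
    return theta
```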

2.3. Adagrad

The optimization algorithm Adagrad [23] applies the idea of the adaptive learning rate, which can solve the problem of the very fast or very slow convergence of the method to a minimum. Adagrad uses an accumulator of gradient squares to accumulate velocities and adjusts the step depending on the gradient vector (Algorithm 3).
Algorithm 3 Adagrad
$g_t \leftarrow \nabla_{\theta_{t-1}} f(\theta_{t-1})$
$n_t \leftarrow n_{t-1} + g_t^2$
$\theta_t \leftarrow \theta_{t-1} - \eta \frac{g_t}{\sqrt{n_t} + \epsilon}$
The method adjusts the update by reducing the influence of large gradients and increasing the influence of small values. This idea addresses the problem of sparse models with a large number of rarely updated weights and increases the convergence rate by changing the learning rate. However, it also introduces the problem of exploding gradients, whose values can greatly inflate the accumulator, leading to moving away from the global minimum or converging to a local minimum.
Also, Adagrad tends to have slow convergence and may even stop learning in some cases. This is due to the fact that weights that are rarely updated have small gradient values, as well as being due to the accumulation of the sum of squares of gradients in the denominator. Because of this, the method may "freeze" on the way to a local minimum.
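The accumulator-based step adjustment of Algorithm 3 can be sketched as follows (illustrative hyperparameter values, not the experimental code):

```python
import numpy as np

def adagrad(grad_f, theta0, lr=0.1, eps=1e-7, steps=100):
    """Adagrad: per-parameter steps shrink as squared gradients accumulate in n_t."""
    theta = np.asarray(theta0, dtype=float)
    n = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)                            # g_t
        n = n + g ** 2                               # n_t = n_{t-1} + g_t^2
        theta = theta - lr * g / (np.sqrt(n) + eps)  # adaptive step
    return theta
```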

2.4. RMSProp

In order to reduce the influence of explosive and damped gradients, the RMSProp optimization method [24] uses the idea of exponential smoothing. This helps to increase the influence of the accumulator of gradient squares or gradients at the current iteration step. At the same time, a new parameter of the optimizer, the smoothing factor, appears, which is responsible for adjusting the influence of the accumulator of gradients (Algorithm 4).
Algorithm 4 RMSProp
$g_t \leftarrow \nabla_{\theta_{t-1}} f(\theta_{t-1})$
$n_t \leftarrow \beta n_{t-1} + (1 - \beta) g_t^2$
$\theta_t \leftarrow \theta_{t-1} - \eta \frac{g_t}{\sqrt{n_t} + \epsilon}$
The algorithm, like Adagrad, uses an adaptive learning step based on an accumulator of gradient squares. The improvement of RMSProp compared to Adagrad is that outliers have less influence on the optimization progress due to exponential smoothing. RMSProp is less prone to overshooting the minimum and to drifting due to accumulated gradients, but it is still subject to oscillations. There is also a new exponential smoothing hyperparameter that can affect the convergence rate, which requires more parameter experiments when training the model but can significantly speed up training or reduce oscillations and overshoots of global minima.
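A corresponding sketch of Algorithm 4, where the only change relative to Adagrad is the exponential smoothing of the squared-gradient accumulator (the values are illustrative):

```python
import numpy as np

def rmsprop(grad_f, theta0, lr=0.01, beta=0.9, eps=1e-7, steps=100):
    """RMSProp: exponentially smoothed accumulator of squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    n = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_f(theta)                            # g_t
        n = beta * n + (1.0 - beta) * g ** 2         # exponential smoothing of g_t^2
        theta = theta - lr * g / (np.sqrt(n) + eps)  # adaptive step
    return theta
```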

2.5. Adam

The Adam optimizer [25] combines the Momentum and RMSProp optimization methods. The method uses an accumulator of gradients, gradient squares, and exponential smoothing while adding a correction to the exponential smoothing coefficients to increase the impact of the first iteration steps for faster convergence to a minimum (Algorithm 5).
At the moment, the Adam optimizer is the most popular in training neural networks; this is due to its high convergence rate and low susceptibility to oscillations. At the same time, Adam is highly dependent on its initial parameters, which, if properly selected experimentally, can significantly speed up convergence to the global minimum. Adam does not completely solve the problems of large gradient accumulations and damping, which are, in turn, greatly alleviated by the correction of the exponential damping coefficients. In summary, Adam combines the methods of the adaptive learning step, gradient accumulators, exponential smoothing, and coefficient correction, which helps to solve the problems of a low convergence rate, overshooting of the global minimum, and susceptibility to oscillations.
Algorithm 5 Adam
$g_t \leftarrow \nabla_{\theta_{t-1}} f(\theta_{t-1})$
$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$
$n_t \leftarrow \beta_2 n_{t-1} + (1 - \beta_2) g_t^2$
$\hat{m}_t \leftarrow \frac{m_t}{1 - \beta_1^t}$
$\hat{n}_t \leftarrow \frac{n_t}{1 - \beta_2^t}$
$\theta_t \leftarrow \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{n}_t} + \epsilon}$
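Algorithm 5 translates into the following NumPy sketch; the default hyperparameters follow the commonly used Adam settings and are illustrative rather than prescriptive.

```python
import numpy as np

def adam(grad_f, theta0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7, steps=1000):
    """Adam: smoothed first and second moments with bias correction."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    n = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_f(theta)
        m = beta1 * m + (1.0 - beta1) * g        # first moment (gradient accumulator)
        n = beta2 * n + (1.0 - beta2) * g ** 2   # second moment (squared gradients)
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        n_hat = n / (1.0 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)
    return theta
```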

3. Algorithm

The MAMGD optimization method combines the ideas of the Adam algorithm and supplements them with exponential damping and discrete acceleration, which is a discrete form of the second derivative of the gradient. These methods are used to reduce the influence of the accumulated gradient accumulators as the number of iterations of the algorithm increases. Thus, if the damping hyperparameter is chosen well, the learning rate can be reduced in time, so that a minimum that would otherwise be passed over when the accumulated gradients are large is not missed (Algorithm 6).
Algorithm 6 MAMGD
$g_t \leftarrow \nabla_{\theta_{t-1}} f(\theta_{t-1})$
$\hat{\beta}_1 \leftarrow \beta_1 \exp(-kt)$
$\hat{\beta}_2 \leftarrow \beta_2 \exp(-kt)$
$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$
$n_t \leftarrow \beta_2 n_{t-1} + (1 - \beta_2) g_t^2$
$a \leftarrow \left( \frac{g_t - g_{t-1}}{\theta_t - \theta_{t-1} + \epsilon} \right)^2 + n_t$
$\hat{m}_t \leftarrow \frac{m_t}{1 - \hat{\beta}_1^t}$
$\hat{a} \leftarrow \frac{a}{1 - \hat{\beta}_2^t}$
$\theta_t \leftarrow \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{a}} + \epsilon}$
The hyperparameters used are the learning rate $\eta$ (usually taken as $\eta = 0.001$ for the initial approximation), the exponential smoothing coefficients $\beta_1, \beta_2 \in [0, 1)$, the exponential damping factor $k \in [0, 1)$, and $\epsilon = 1 \times 10^{-7}$, a small number used to prevent division by zero. Determining a successful attenuation coefficient is, in most cases, a separate problem: for very small values, the algorithm can still overshoot the global minimum just as other optimization methods do, and for very large values, the damping may occur too early. As can be seen, at each iteration the smoothing coefficients are updated using the rule $\hat{\beta} \leftarrow \beta \exp(-kt)$, so the more iterations that pass, the greater the influence of the current gradient and the smaller that of the gradient accumulator. This is performed in order to reduce the step size over time. Because the optimizer cannot know in advance when the global minimum will be reached, the selection of a successful damping parameter is a purely heuristic problem.
MAMGD achieves a high convergence rate and stability through a combination of several methods. Exponential decay gradually reduces the influence of old gradients and accumulated values, contributing to a more accurate adaptation to current optimization conditions. The use of second-order derivative discretization to estimate the curvature of the loss function allows for a more precise adjustment of the optimization step, avoiding overly large or small corrections, which improves stability. The adaptive learning rate adjusts according to the current state of gradients and accumulated values, enabling more efficient progress toward the minimum of the loss function.
The discrete second derivative of gradients in MAMGD is computed by calculating the difference between the current and previous gradients, $g_t - g_{t-1}$, which is then divided by the difference in the corresponding parameters, $\theta_t - \theta_{t-1}$. The result is squared to account for the magnitude of the gradient change and added to the accumulated value $n_t$, which also accounts for past gradient squares. This approach allows us to take into account not only the current gradient change but also the accumulated effects, which provides a more accurate estimate of the curvature of the loss function and hence a more efficient adaptation of the optimization step.
The parameter correction in MAMGD is performed through normalization of the moments $m_t$ and $a$ (analogous to $n_t$). The correction by dividing by $(1 - \hat{\beta}_1^t)$ and $(1 - \hat{\beta}_2^t)$ compensates for the initial bias of the moments $m_t$ and $a$, which avoids their underestimation in the early stages of optimization. This is important to ensure the correct adaptation of the learning rate in the initial iterations.
The MAMGD optimizer is less oscillatory due to the exponential damping hyperparameter, which reduces the influence of accumulated gradients as the number of iterations grows. What distinguishes it from the other presented optimizers is the final form of the weight update, which uses the idea of discrete acceleration. This gives it some similarity to the Newton–Raphson method, but without computing the Hessian of the error function; hence, the optimizer can be partially called quasi-Newtonian, differing in the update formula and in the absence of the additional memory and computation costs of the Hessian. On the other hand, because of the discretization, the weights and gradients of the previous iteration have to be stored, which has a negative impact when the number of weights of the neural network is large.
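The sketch below is one possible NumPy reading of Algorithm 6 and is not the authors' reference implementation (that code is linked in the Data Availability Statement). In particular, the discrete second derivative is taken between the two most recent iterates, the first iteration falls back to the squared-gradient accumulator because no previous point exists yet, and the decayed coefficients enter only the bias-correction terms; these choices are assumptions.

```python
import numpy as np

def mamgd(grad_f, theta0, lr=0.001, beta1=0.9, beta2=0.999, k=1e-4,
          eps=1e-7, steps=1000):
    """One possible reading of MAMGD (Algorithm 6) for a vector of parameters."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    n = np.zeros_like(theta)
    prev_g = np.zeros_like(theta)     # gradient stored from the previous iteration
    prev_theta = theta.copy()         # parameters stored from the previous iteration
    for t in range(1, steps + 1):
        g = grad_f(theta)
        b1_hat = beta1 * np.exp(-k * t)   # exponentially decayed coefficients
        b2_hat = beta2 * np.exp(-k * t)
        m = beta1 * m + (1.0 - beta1) * g
        n = beta2 * n + (1.0 - beta2) * g ** 2
        if t == 1:
            a = n  # no previous point yet: fall back to the squared-gradient accumulator
        else:
            # squared discrete second derivative of the gradients plus the accumulator
            a = ((g - prev_g) / (theta - prev_theta + eps)) ** 2 + n
        m_hat = m / (1.0 - b1_hat ** t)   # bias correction with the decayed coefficients
        a_hat = a / (1.0 - b2_hat ** t)
        prev_g, prev_theta = g.copy(), theta.copy()
        theta = theta - lr * m_hat / (np.sqrt(a_hat) + eps)
    return theta

# Example: minimize the sphere function f(x, y) = x^2 + y^2 (gradient 2x, 2y).
print(mamgd(lambda p: 2 * p, theta0=[1.0, -1.0], lr=0.05, steps=500))
```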
The main parameters of MAMGD include the following:
  • The attenuation coefficients $\beta_1$ and $\beta_2$: They decrease exponentially with each iteration according to the formulas $\hat{\beta}_1 \leftarrow \beta_1 \exp(-kt)$ and $\hat{\beta}_2 \leftarrow \beta_2 \exp(-kt)$, where $k$ is the attenuation coefficient.
  • The learning rate coefficient η : This parameter is chosen similarly to other optimizers, such as Adam, and can be tuned empirically.
  • The smoothing parameter $\epsilon$: This is used to avoid division by zero and is usually chosen to be very small ($10^{-7}$).
The parameters are chosen based on experiments with different datasets to ensure a balance between the convergence speed and stability of the optimization. One option for a parameter search is a grid search.
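As an illustration of such a search, the sketch below enumerates a small grid over the learning rate and the decay factor k; train_and_evaluate is a hypothetical user-supplied callback (not part of the paper's code) that trains a model with the given hyperparameters and returns a validation loss.

```python
import itertools

def grid_search(train_and_evaluate,
                lrs=(1e-2, 1e-3, 1e-4),
                ks=(1e-1, 1e-2, 1e-3, 1e-4)):
    """Try every (learning rate, decay factor) pair and keep the best validation loss.

    train_and_evaluate(lr, k) is a hypothetical callback supplied by the user.
    """
    best = None
    for lr, k in itertools.product(lrs, ks):
        loss = train_and_evaluate(lr=lr, k=k)
        if best is None or loss < best[0]:
            best = (loss, lr, k)
    return {"val_loss": best[0], "learning_rate": best[1], "k": best[2]}
```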
The advantages of the MAMGD optimization method are its rather high convergence speed and its resistance to oscillations and to the build-up of gradient accumulators, with these factors being closely related to exponential damping and the discrete analog of the second derivative of the gradients. The disadvantages are the possible higher computational and memory costs, as it is necessary to store the results of gradient and weight calculations from previous iterations.

4. Experiments and Analysis of Results

For the experiments, the MAMGD optimization method was implemented using the deep machine learning libraries tensorflow and keras. The matplotlib library was used for plotting the function graphs. The Optuna library was used to select the optimal hyperparameters. The experiments included the minimization of multivariate real functions, approximation of functions using multilayer neural networks, and training of neural networks on popular classification and regression datasets.

4.1. Minimization of Functions

In this section, we review the results of the MAMGD gradient method for minimizing various functions and compare it with the other popular methods discussed in the previous sections. The MSE (mean squared error) metric was used as an error estimate. The MSE is expressed as follows:
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
where $n$ is the number of observations, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.
The MSE (mean squared error) is a metric used to assess model quality in regression problems. It measures the mean square deviation of predicted values from actual values.
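For concreteness, the metric can be computed directly in NumPy; the sample values below are illustrative.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
mse = np.mean((y_true - y_pred) ** 2)  # (0.01 + 0.01 + 0.04) / 3 = 0.02
print(mse)
```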

4.1.1. Quadratic Function of One Variable

A quadratic function of one variable is one of the most trivial variants for minimization testing, so it was chosen as the first test. In this case, we considered a function of the following form: $f(x) = x^2 - x$ (Figure 1).
The graph and the table of values (Table 1) show that the classical gradient descent and RMSprop methods converged to the minimum the fastest, and the speeds of the MAMGD and Adam methods were almost the same, although MAMGD was able to reach the minimum faster.

4.1.2. Sphere Function

The sphere function is a function of two variables and shows how gradient methods deal with multivariate optimization. In this case, we considered a function of the following form: $f(x, y) = x^2 + y^2$ (Figure 2).
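A minimization run of this kind can be driven with tensorflow and a built-in keras optimizer, as in the sketch below; Adam is used here only as a stand-in (MAMGD itself is provided by the authors' repository), and the initial point, learning rate, and step count are illustrative.

```python
import tensorflow as tf

# Variables to optimize and the sphere function f(x, y) = x^2 + y^2.
x = tf.Variable(1.2)
y = tf.Variable(-0.8)
opt = tf.keras.optimizers.Adam(learning_rate=0.1)  # stand-in optimizer

for step in range(30):
    with tf.GradientTape() as tape:
        loss = x ** 2 + y ** 2
    grads = tape.gradient(loss, [x, y])
    opt.apply_gradients(zip(grads, [x, y]))

print(float(loss))  # loss value recorded at the final step
```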
As in the previous case, the measurement results show that the MAMGD method quickly reaches a minimum when the exponential decay hyperparameter is properly chosen (Table 2). At the same time, the SGD and RMSprop methods are most effective at the initial stages.

4.1.3. Three-Hump Camel Function

The three-hump camel function was chosen as the next function to be optimized, which is described as follows: $f(x, y) = 2x^2 - 1.05x^4 + \frac{x^6}{6} + xy + y^2$. It has several local minima and one global minimum. The main characteristics are as follows:
  • The area of definition is usually considered on $x, y \in [-5, 5]$.
  • The global minimum is $f(0, 0) = 0$.
  • It has three local minima, which gave the name of the function.
It can be seen from the data that the SGD and RMSprop methods quickly converged to a minimum, as in the previous cases (Figure 3). The Adam method underwent correction. The MAMGD method converged slower than the Adam method but was not corrected and obtained better results (Table 3).

4.1.4. Matyas Function

Further experiments were performed for the Matyas function, which is expressed as follows: $f(x, y) = 0.26(x^2 + y^2) - 0.48xy$. This function is less difficult to optimize than the three-hump camel function but is still of interest for testing algorithms, especially with respect to its accuracy and convergence rate near the minimum (Figure 4).
Some properties of the function are as follows:
  • The area of definition is usually considered on $x, y \in [-10, 10]$.
  • The global minimum is $f(0, 0) = 0$.
  • It is a unimodal function, which means it has only one minimum.
  • The function is symmetric about the origin.
  • Despite the simplicity of the formula, the Matyas function can be difficult for some optimization algorithms because of its flat region near the minimum.
The results show that in this case, SGD did not show the best results, probably due to the flat region near the minimum of the function (Table 4). The best result was shown by the RMSprop method. Adam underwent correction as in the previous case. MAMGD achieved a good result at the 14th iteration, but after that it also underwent a small correction.

4.1.5. Booth Function

Another function chosen for optimization was the Booth function, which is described as follows: $f(x, y) = (x + 2y - 7)^2 + (2x + y - 5)^2$. The Booth function represents a good balance between simplicity and complexity, which makes it useful for testing optimization algorithms, especially when it is important to test the algorithm’s ability to find a minimum that is not located in the center of the search area (Figure 5).
Some properties include the following:
  • The area of definition is usually considered on $x, y \in [-10, 10]$.
  • The global minimum is $f(1, 3) = 0$.
  • It is a unimodal function, which means that it has only one minimum.
As can be seen from the graph and measurement table, MAMGD accelerated sharply to the global minimum after 16 iterations; this result may be related to the choice of the exponential decay hyperparameter (Table 5). The SGD and RMSprop methods showed stable results, and Adam and Adagrad were rather slow to converge to the minimum in this function.

4.1.6. Himmelblau’s Function

The last function used for optimization was Himmelblau’s function, which is described as follows: $f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2$. Himmelblau’s function is a popular choice for testing global optimization algorithms because it combines several complex characteristics: multiple minima, symmetry, and nonlinearity. This makes it an excellent tool for evaluating the efficiency and robustness of optimization algorithms (Figure 6).
Properties:
  • The area of definition is usually considered on $x, y \in [-5, 5]$.
  • The function has four identical local minima, all with the value 0, which are also the global minima: $f(3.0, 2.0) = f(-2.805118, 3.131312) = f(-3.779310, -3.283186) = f(3.584428, -1.848126) = 0$.
  • The function is multimodal, which means it has several local minima.
  • It has one local maximum at the point (−0.270845, −0.923039) with a value of 181.617.
The measurement results show that the SGD method began to move away from the local and global minima and was therefore excluded from the measurement table (Table 6). The RMSprop and MAMGD methods moved steadily toward the minimum, Adagrad also showed weak results, and the Adam method also underwent correction.

4.2. Approximation Using Multilayer Neural Network

A multilayer neural network was chosen to approximate the function $f(x) = x^2 + x - 1$ (Figure 7).
As can be seen in the plot, MAMGD quickly converged to a minimum, which may have been influenced by a fortunate choice of the exponential decay coefficient ($k = 0.0001$) and by the discrete second derivative of the gradients (Table 7).

4.3. MNIST Dataset

The MNIST dataset [42] is one of the most well-known and widely used datasets in the field of computer vision and neural network training. It consists of images of handwritten digits from 0 to 9, derived from handwriting samples collected by the National Institute of Standards and Technology (NIST).
The MNIST dataset contains a total of 60,000 training images and 10,000 test images; each image is 28 × 28 pixels. The images have been pre-processed to unify their size and to center the digits within each image.
Each image in the dataset has a label associated with it, which indicates which digit is depicted in the image (from 0 to 9). These labels are used to train the neural network and to evaluate its performance. MNIST is a popular choice for training neural networks because it is relatively small and easily accessible. It is also a convenient dataset for testing and comparing different neural network algorithms and architectures in handwritten digit recognition (Figure 8).
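The text does not spell out the exact experimental setup, so the sketch below only illustrates a typical keras pipeline for this dataset: a small dense network trained for 10 epochs, with RMSprop as a stand-in for the optimizers compared in Table 8. The architecture and settings are assumptions, not the authors' exact configuration.

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28 * 28).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28 * 28).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28 * 28,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="rmsprop",  # stand-in for the compared optimizers
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=128,
          validation_data=(x_test, y_test))
```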
As can be seen, the MAMGD optimizer initially lags behind the Adam and RMSProp optimizers but soon catches up and converges to a minimum. This highlights the problem of initial approximations and the choice of hyperparameters, such as the exponential decay factor (Table 8).

4.4. IMDB Dataset

The IMDB (Internet Movie Database) dataset is one of the most widely used and popular datasets for training neural networks in the tasks of text tone analysis and classification of movie reviews.
The IMDB dataset consists of 50,000 movie reviews, which have been divided into two parts: 25,000 reviews are for training and the remaining 25,000 are for testing the classifier. Each review has its own class label indicating whether the review is positive or negative. Positive reviews are labeled 1, while negative reviews are labeled 0.
The IMDB dataset is a good choice for training neural networks in text classification and tone analysis tasks, as it contains balanced data with a large number of examples for each class and also displays real user reviews of movies (Figure 9).
The MAMGD optimization method shows a good and high convergence rate. A successful choice of hyperparameters allowed for fast convergence to the minimum while avoiding oscillations. This could also be influenced by the discrete derivative of the gradients, which accelerated the descent to the minimum of the target function (Table 9).

4.5. Reuters Dataset

The Reuters dataset is one of the most popular datasets for training neural networks in natural language processing. It consists of news articles that were collected by Reuters between 1987 and 1996.
The dataset contains a variety of news categories, such as politics, economics, sports, science, and others.
The overall structure of the Reuters dataset has several key features:
  1. Size: The Reuters dataset consists of more than 10,000 news articles. It includes enough variation in neural network training data to allow researchers and developers to conduct different experiments and testing.
  2. Multiple classes: There are 46 different classes of news in the Reuters dataset. Each article belongs to one of these classes, allowing for a multi-class classification task (Figure 10).
As in the MNIST dataset, the optimizer initially lagged behind the other algorithms but later overtook RMSProp, probably due to the choice of hyperparameters (Table 10).

4.6. Boston Housing Price Dataset

Boston Housing Price is a dataset that contains information about housing prices in different neighborhoods of Boston, USA. It consists of various attributes, such as the average number of rooms in a house, percentage of the population with low social status, crime rate, etc.
This dataset is often used to train neural networks, as it provides a good set of diverse attributes that can be used to predict real estate prices. Neural networks trained on this dataset can learn to identify the complex relationships between the attributes and house prices.
Using the Boston Housing Price dataset to train neural networks allows for the creation of a model that will be able to make real estate price predictions based on various factors. This can be useful for real estate agents, real estate market analysts, and other interested parties who want information about expected real estate prices in different Boston neighborhoods (Figure 11).
In this case, the MAMGD optimization method was very similar to the Adam algorithm, which was influenced by the small value of the exponential decay coefficient (Table 11).

5. Discussion

The proposed MAMGD algorithm is an interesting optimization approach that combines several new ideas in the field of gradient descent. However, its performance and applicability deserve further discussion.
First, the use of exponential decay for the coefficients $\beta_1$ and $\beta_2$ is a novel approach. It potentially allows the algorithm to better adapt to changing optimization landscapes, especially in problems with non-stationary objective functions. However, reducing these coefficients may lead to an unnecessary slowdown at the end of training. Consequently, the choice of the parameter $k$ that controls the decay rate becomes critical and may require careful tuning.
The inclusion of an estimate of the second derivative in the form of the term $\left( \frac{g_t - g_{t-1}}{\theta_t - \theta_{t-1} + \epsilon} \right)^2$ is another novel approach to adapting the learning step. It is reminiscent of the idea behind Newton’s method but implemented in a more computationally efficient form. However, such second-order approximations can be unstable in high-dimensional parameter spaces typical of deep neural networks. This approach, too, requires additional research and experimentation.
In conclusion, it is worth noting that the proposed methods require further empirical verification and discussion of the results.

6. Conclusions

Despite significant progress in neural network optimization methods, the existing algorithms still face challenges, such as slow convergence and instability when training complex architectures. Our research aimed to overcome these limitations, potentially leading to more efficient and robust methods for training neural networks.
In this paper, we have analyzed existing neural network optimization methods, discussed their strengths and weaknesses, and presented a new optimization method based on exponential fading and the adaptive learning rate using a discrete derivative of second-order gradients.
Our experiments demonstrate that the proposed method has the potential to improve the learning process of neural networks, especially in problems requiring fast convergence and robustness to oscillations. However, the results also indicate that in some scenarios, the benefits of the new method may be limited compared to state-of-the-art optimizers.
This study raises the important question of whether current optimizers have reached the limit in improving the idea of gradient descent without increasing computational and memory costs. We hypothesize that future research could focus on exploring methods based on evolutionary algorithms or other approaches that can significantly increase the convergence rate and improve the accuracy of neural network training results.

7. Extensions

AdaExp

Applying exponential fading to the Adam gradient-based optimization method yields the following algorithm (Algorithm 7).
Algorithm 7 AdaExp
$g_t \leftarrow \nabla_{\theta_{t-1}} f(\theta_{t-1})$
$\hat{\beta}_1 \leftarrow \beta_1 \exp(-kt)$
$\hat{\beta}_2 \leftarrow \beta_2 \exp(-kt)$
$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t$
$n_t \leftarrow \beta_2 n_{t-1} + (1 - \beta_2) g_t^2$
$\hat{m}_t \leftarrow \frac{m_t}{1 - \hat{\beta}_1^t}$
$\hat{n}_t \leftarrow \frac{n_t}{1 - \hat{\beta}_2^t}$
$\theta_t \leftarrow \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{n}_t} + \epsilon}$
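A NumPy sketch of one reading of Algorithm 7 follows; the placement of the fading coefficients in the bias-correction terms mirrors MAMGD and is our assumption, and the hyperparameter defaults are illustrative.

```python
import numpy as np

def adaexp(grad_f, theta0, lr=0.001, beta1=0.9, beta2=0.999, k=1e-4,
           eps=1e-7, steps=1000):
    """Adam with exponentially fading coefficients in the bias-correction terms."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    n = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_f(theta)
        b1_hat = beta1 * np.exp(-k * t)   # fading coefficients
        b2_hat = beta2 * np.exp(-k * t)
        m = beta1 * m + (1.0 - beta1) * g
        n = beta2 * n + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - b1_hat ** t)   # bias correction with fading coefficients
        n_hat = n / (1.0 - b2_hat ** t)
        theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)
    return theta
```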

Author Contributions

Data curation, Investigation, Software—N.S.; Conceptualization, Methodology, Validation—D.A. and N.S.; Formal analysis, Project administration, Writing—original draft—E.P.; Supervision, Writing—review and editing—S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study have been openly available since 17 July 2024 at https://github.com/NekkittAY/MAMGD_Optimizer (accessed on 27 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Seo, S.; Kim, J. Efficient weights quantization of convolutional neural networks using kernel density estimation based non-uniform quantizer. Appl. Sci. 2019, 9, 2559. [Google Scholar] [CrossRef]
  2. Pan, H.; Pang, Z.; Wang, Y.; Wang, Y.; Chen, L. A new image recognition and classification method combining transfer learning algorithm and mobilenet model for welding defects. IEEE Access 2020, 8, 119951–119960. [Google Scholar] [CrossRef]
  3. Wang, P.; Fan, E.; Wang, P. Comparative analysis of image classification algorithms based on traditional machine learning and deep learning. Pattern Recognit. Lett. 2021, 141, 61–67. [Google Scholar] [CrossRef]
  4. Paramonov, A.A.; Nguyen, V.M.; Nguyen, M.T. Multi-task neural network for solving the problem of recognizing the type of QAM and PSK modulation under parametric a priori uncertainty. Russ. Technol. J. 2023, 11, 49–58. [Google Scholar] [CrossRef]
  5. Hou, F.; Lei, W.; Li, S.; Xi, J. Deep learning-based subsurface target detection from GPR scans. IEEE Sens. J. 2021, 21, 8161–8171. [Google Scholar] [CrossRef]
  6. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A survey and performance evaluation of deep learning methods for small object detection. Expert Syst. Appl. 2021, 172, 114602. [Google Scholar] [CrossRef]
  7. Ghasemi, Y.; Jeong, H.; Choi, S.H.; Park, K.B.; Lee, J.Y. Deep learning-based object detection in augmented reality: A systematic review. Comput. Ind. 2022, 139, 103661. [Google Scholar] [CrossRef]
  8. Khalid, S.; Oqaibi, H.M.; Aqib, M.; Hafeez, Y. Small pests detection in field crops using deep learning object detection. Sustainability 2023, 15, 6815. [Google Scholar] [CrossRef]
  9. Yang, M.; Wu, C.; Guo, Y.; Jiang, R.; Zhou, F.; Zhang, J.; Yang, Z. Transformer-based deep learning model and video dataset for unsafe action identification in construction projects. Autom. Constr. 2023, 146, 104703. [Google Scholar] [CrossRef]
  10. Priyadarshini, I.; Sharma, R.; Bhatt, D.; Al-Numay, M. Human activity recognition in cyber-physical systems using optimized machine learning techniques. Clust. Comput. 2023, 26, 2199–2215. [Google Scholar] [CrossRef]
  11. Boutros, F.; Struc, V.; Fierrez, J.; Damer, N. Synthetic data for face recognition: Current state and future prospects. Image Vis. Comput. 2023, 135, 104688. [Google Scholar] [CrossRef]
  12. Hwang, R.H.; Lin, J.Y.; Hsieh, S.Y.; Lin, H.Y.; Lin, C.L. Adversarial patch attacks on deep-learning-based face recognition systems using generative adversarial networks. Sensors 2023, 23, 853. [Google Scholar] [CrossRef]
  13. Mercha, E.M.; Benbrahim, H. Machine learning and deep learning for sentiment analysis across languages: A survey. Neurocomputing 2023, 531, 195–216. [Google Scholar] [CrossRef]
  14. Khan, W.; Daud, A.; Khan, K.; Muhammad, S.; Haq, R. Exploring the frontiers of deep learning and natural language processing: A comprehensive overview of key challenges and emerging trends. Nat. Lang. Process. J. 2023, 4, 100026. [Google Scholar] [CrossRef]
  15. Mehrish, A.; Majumder, N.; Bharadwaj, R.; Mihalcea, R.; Poria, S. A review of deep learning techniques for speech processing. Inf. Fusion 2023, 99, 101869. [Google Scholar] [CrossRef]
  16. Andriyanov, N.; Khasanshin, I.; Utkin, D.; Gataullin, T.; Ignar, S.; Shumaev, V.; Soloviev, V. Intelligent System for Estimation of the Spatial Position of Apples Based on YOLOv3 and Real Sense Depth Camera D415. Symmetry 2022, 14, 148. [Google Scholar] [CrossRef]
  17. Osipov, A.V.; Pleshakova, E.S.; Gataullin, S.T. Production processes optimization through machine learning methods based on geophysical monitoring data. Comput. Opt. 2024, 48, 633–642. [Google Scholar] [CrossRef]
  18. Ivanyuk, V. Forecasting of digital financial crimes in Russia based on machine learning methods. J. Comput. Virol. Hacking Tech. 2023, 1–14. [Google Scholar] [CrossRef]
  19. Boltachev, E. Potential cyber threats of adversarial attacks on autonomous driving models. J. Comput. Virol. Hacking Tech. 2023, 1–11. [Google Scholar] [CrossRef]
  20. Efanov, D.; Aleksandrov, P.; Mironov, I. Comparison of the effectiveness of cepstral coefficients for Russian speech synthesis detection. J. Comput. Virol. Hacking Tech. 2023, 1–8. [Google Scholar] [CrossRef]
  21. Pleshakova, E.; Osipov, A.; Gataullin, S.; Gataullin, T.; Vasilakos, A. Next gen cybersecurity paradigm towards artificial general intelligence: Russian market challenges and future global technological trends. J. Comput. Virol. Hacking Tech. 2024, 1–12. [Google Scholar] [CrossRef]
  22. Dozat, T. Incorporating nesterov momentum into adam. In Proceedings of the 4th International Conference on Learning Representations, Workshop Track, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–4. [Google Scholar]
  23. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  24. Hinton, G.; Srivastava, N.; Swersky, K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. 2012, 14, 2. Available online: https://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf (accessed on 9 August 2024).
  25. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  26. Shazeer, N.; Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4596–4604. [Google Scholar]
  27. Massé, P.Y.; Ollivier, Y. Speed learning on the fly. arXiv 2015, arXiv:1511.02540. [Google Scholar]
  28. Smith, L.N. Cyclical learning rates for training neural networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472. [Google Scholar]
  29. Zhu, Z.; Zhu, X.; Tan, Z. An accelerated conjugate gradient method with adaptive two-parameter with applications in image restoration. Comput. Appl. Math. 2024, 43, 116. [Google Scholar] [CrossRef]
  30. Okamoto, K.; Hayashi, N.; Takai, S. Distributed Online Adaptive Gradient Descent With Event-Triggered Communication. IEEE Trans. Control Netw. Syst. 2024, 11, 610–622. [Google Scholar] [CrossRef]
  31. Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  32. Sun, H.; Shen, L.; Zhong, Q.; Ding, L.; Chen, S.; Sun, J.; Li, J.; Sun, G.; Tao, D. Adasam: Boosting sharpness-aware minimization with adaptive learning rate and momentum for training deep neural networks. Neural Netw. 2024, 169, 506–519. [Google Scholar] [CrossRef]
  33. Ganesha, T.; Prakash, S.B.; Rani, S.S.; Ajith, B.S.; Patel, G.M.; Samuel, O.D. Biodiesel yield optimization from ternary (animal fat-cotton seed and rice bran) oils using response surface methodology and grey wolf optimizer. Ind. Crops Prod. 2023, 206, 117569. [Google Scholar] [CrossRef]
  34. Kim, S.; Jang, M.G.; Kim, J.K. Process design and optimization of single mixed-refrigerant processes with the application of deep reinforcement learning. Appl. Therm. Eng. 2023, 223, 120038. [Google Scholar] [CrossRef]
  35. Sigue, S.; Abderafi, S.; Vaudreuil, S.; Bounahmidi, T. Design and steady-state simulation of a CSP-ORC power plant using an open-source co-simulation framework combining SAM and DWSIM. Therm. Sci. Eng. Prog. 2023, 37, 101580. [Google Scholar] [CrossRef]
  36. Sheng, Y.; Liu, Y.; Zhang, J.; Yin, W.; Oztireli, A.C.; Zhang, H.; Lin, Z.; Shechtman, E.; Benes, B. Controllable shadow generation using pixel height maps. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 240–256. [Google Scholar]
  37. Izuchukwu, C.; Shehu, Y. A new inertial projected reflected gradient method with application to optimal control problems. Optim. Methods Softw. 2024, 39, 197–226. [Google Scholar] [CrossRef]
  38. Kubentayeva, M.; Yarmoshik, D.; Persiianov, M.; Kroshnin, A.; Kotliarova, E.; Tupitsa, N.; Pasechnyuk, D.; Gasnikov, A.; Shvetsov, V.; Baryshev, L.; et al. Primal-dual gradient methods for searching network equilibria in combined models with nested choice structure and capacity constraints. Comput. Manag. Sci. 2024, 21, 15. [Google Scholar] [CrossRef]
  39. Zhou, Z.; Cai, C.; Tan, B.; Dong, Q. A modified generalized version of projected reflected gradient method in Hilbert spaces. Numer. Algorithms 2024, 95, 117–147. [Google Scholar] [CrossRef]
  40. Yu, Y.; Liu, F. Effective Neural Network Training with a New Weighting Mechanism-Based Optimization Algorithm. IEEE Access 2019, 7, 72403–72410. [Google Scholar] [CrossRef]
  41. Valjarević, A.; Djekić, T.; Stevanović, V.; Ivanović, R.; Jandziković, B. GIS numerical and remote sensing analyses of forest changes in the Toplica region for the period of 1953–2013. Appl. Geogr. 2018, 92, 131–139. [Google Scholar] [CrossRef]
  42. Cohen, G.; Afshar, S.; Tapson, J.; Schaik, A.V. Emnist: Extending mnist to handwritten letters. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2921–2926. [Google Scholar]
Figure 1. Quadratic function of one variable minimization.
Figure 2. Sphere function minimization.
Figure 3. Three-hump camel function minimization.
Figure 4. Matyas function minimization.
Figure 5. Booth function minimization.
Figure 6. Himmelblau’s function minimization.
Figure 7. Neural network approximation.
Figure 8. Training a neural network for digit classification on the MNIST dataset.
Figure 9. Training a neural network to classify reviews on the IMDB dataset.
Figure 10. Training a neural network to classify Reuters texts.
Figure 11. Training of the neural network for real estate price regression of the Boston Housing Price database.
Table 1. Quadratic function.
Epoch | MAMGD Loss | Adam Loss | Adagrad Loss | RMSprop Loss | SGD Loss
2 | 1.959703 | 1.690626 | 1.775573 | 0.963417 | 0.921600
4 | 1.439806 | 1.214850 | 1.517669 | 0.492081 | 0.377487
6 | 0.999839 | 0.823812 | 1.333152 | 0.254789 | 0.154619
8 | 0.639871 | 0.516539 | 1.188714 | 0.128099 | 0.063332
10 | 0.359903 | 0.289402 | 1.070271 | 0.061027 | 0.025941
12 | 0.159935 | 0.135676 | 0.970315 | 0.027027 | 0.010625
14 | 0.039968 | 0.045391 | 0.884294 | 0.010932 | 0.004352
16 | 0.000000 | 0.005672 | 0.809204 | 0.003968 | 0.001783
18 | 0.000000 | 0.001759 | 0.742943 | 0.001268 | 0.000730
20 | 0.000000 | 0.018631 | 0.683974 | 0.000350 | 0.000299
22 | 0.000000 | 0.042904 | 0.631133 | 0.000081 | 0.000123
24 | 0.000000 | 0.064453 | 0.583516 | 0.000015 | 0.000050
26 | 0.000000 | 0.077261 | 0.540403 | 0.000002 | 0.000021
28 | 0.000000 | 0.079330 | 0.501213 | 0.000000 | 0.000008
30 | 0.000000 | 0.071832 | 0.465465 | 0.000000 | 0.000003
Table 2. Spherical function.
Epoch | MAMGD Loss | Adam Loss | Adagrad Loss | RMSprop Loss | SGD Loss
2 | 1.959738 | 1.690627 | 1.775573 | 0.963417 | 0.921600
4 | 1.439837 | 1.214850 | 1.517669 | 0.492081 | 0.377487
6 | 0.999864 | 0.823812 | 1.333151 | 0.254789 | 0.154619
8 | 0.639892 | 0.516539 | 1.188714 | 0.128099 | 0.063332
10 | 0.359919 | 0.289402 | 1.070271 | 0.061027 | 0.025941
12 | 0.159946 | 0.135676 | 0.970315 | 0.027027 | 0.010625
14 | 0.039973 | 0.045391 | 0.884294 | 0.010932 | 0.004352
16 | 0.000000 | 0.005672 | 0.809204 | 0.003968 | 0.001783
18 | 0.000000 | 0.001759 | 0.742942 | 0.001268 | 0.000730
20 | 0.000000 | 0.018631 | 0.683973 | 0.000350 | 0.000299
22 | 0.000000 | 0.042904 | 0.631133 | 0.000081 | 0.000123
24 | 0.000000 | 0.064453 | 0.583516 | 0.000015 | 0.000050
26 | 0.000000 | 0.077261 | 0.540403 | 0.000002 | 0.000021
28 | 0.000000 | 0.079330 | 0.501213 | 0.000000 | 0.000008
30 | 0.000000 | 0.071832 | 0.465465 | 0.000000 | 0.000003
Table 3. Three-hump camel function.
Epoch | MAMGD Loss | Adam Loss | Adagrad Loss | RMSprop Loss | SGD Loss
2 | 0.919596 | 0.640227 | 0.689822 | 0.218130 | 0.304278
4 | 0.688145 | 0.361751 | 0.524967 | 0.048839 | 0.052731
6 | 0.428438 | 0.165042 | 0.413728 | 0.015363 | 0.007620
8 | 0.211503 | 0.048304 | 0.332594 | 0.006501 | 0.001813
10 | 0.070354 | 0.003223 | 0.270984 | 0.002865 | 0.000712
12 | 0.007512 | 0.010078 | 0.223012 | 0.001171 | 0.000338
14 | 0.001704 | 0.039383 | 0.185010 | 0.000428 | 0.000168
16 | 0.000314 | 0.064289 | 0.154523 | 0.000138 | 0.000084
18 | 0.000040 | 0.071570 | 0.129823 | 0.000038 | 0.000042
20 | 0.001036 | 0.061574 | 0.109653 | 0.000009 | 0.000021
22 | 0.000542 | 0.042171 | 0.093073 | 0.000002 | 0.000011
24 | 0.001124 | 0.022718 | 0.079367 | 0.000000 | 0.000005
26 | 0.000768 | 0.009898 | 0.067981 | 0.000000 | 0.000003
28 | 0.001377 | 0.005689 | 0.058478 | 0.000000 | 0.000001
30 | 0.000959 | 0.007676 | 0.050514 | 0.000000 | 0.000001
Table 4. Matyas function.
Epoch | MAMGD Loss | Adam Loss | Adagrad Loss | RMSprop Loss | SGD Loss
2 | 0.935771 | 0.640684 | 0.950924 | 0.248995 | 0.984096
4 | 0.738803 | 0.364773 | 0.904923 | 0.074502 | 0.968444
6 | 0.495462 | 0.171620 | 0.861725 | 0.020852 | 0.953042
8 | 0.270043 | 0.055840 | 0.821094 | 0.005086 | 0.937885
10 | 0.106671 | 0.005821 | 0.782821 | 0.001035 | 0.922968
12 | 0.020831 | 0.003469 | 0.746721 | 0.000169 | 0.908289
14 | 0.000015 | 0.026615 | 0.712628 | 0.000021 | 0.893844
16 | 0.006423 | 0.054194 | 0.680395 | 0.000002 | 0.879628
18 | 0.003157 | 0.071680 | 0.649889 | 0.000000 | 0.865638
20 | 0.000229 | 0.073522 | 0.620989 | 0.000000 | 0.851871
22 | 0.000168 | 0.061779 | 0.593588 | 0.000000 | 0.838322
24 | 0.000080 | 0.042687 | 0.567586 | 0.000000 | 0.824989
26 | 0.000106 | 0.023210 | 0.542893 | 0.000000 | 0.811868
28 | 0.000832 | 0.008589 | 0.519427 | 0.000000 | 0.798956
30 | 0.000208 | 0.001156 | 0.497113 | 0.000000 | 0.786249
Table 5. Booth function.
Epoch | MAMGD Loss | Adam Loss | Adagrad Loss | RMSprop Loss | SGD Loss
2 | 2.323125 | 1.940616 | 2.023544 | 1.212334 | 1.024000
4 | 1.855067 | 1.464750 | 1.765136 | 0.737108 | 0.419431
6 | 1.394106 | 1.073439 | 1.580268 | 0.493918 | 0.171799
8 | 1.008421 | 0.765555 | 1.435507 | 0.359272 | 0.070369
10 | 0.701519 | 0.537251 | 1.316735 | 0.282088 | 0.028823
12 | 0.473483 | 0.381518 | 1.216434 | 0.235819 | 0.011806
14 | 0.324607 | 0.288052 | 1.130048 | 0.205457 | 0.004836
16 | 0.254285 | 0.243642 | 1.054571 | 0.182560 | 0.001981
18 | 0.153560 | 0.233271 | 0.987901 | 0.162734 | 0.000811
20 | 0.037101 | 0.241822 | 0.928501 | 0.144036 | 0.000332
22 | 0.000004 | 0.256036 | 0.875207 | 0.125871 | 0.000136
24 | 0.000409 | 0.266110 | 0.827116 | 0.108274 | 0.000056
26 | 0.000939 | 0.266470 | 0.783509 | 0.091500 | 0.000023
28 | 0.001095 | 0.255578 | 0.743802 | 0.075836 | 0.000009
30 | 0.001186 | 0.235005 | 0.707518 | 0.061524 | 0.000004
32 | 0.001243 | 0.208157 | 0.674258 | 0.048746 | 0.000002
34 | 0.001279 | 0.179021 | 0.643686 | 0.037620 | 0.000001
36 | 0.001303 | 0.151169 | 0.615514 | 0.028194 | 0.000000
38 | 0.001318 | 0.127124 | 0.589497 | 0.020447 | 0.000000
40 | 0.001329 | 0.108111 | 0.565423 | 0.014292 | 0.000000
Table 6. Himmelblau’s function.
Epoch | MAMGD Loss | Adam Loss | Adagrad Loss | RMSprop Loss
2 | 0.901249 | 0.640186 | 0.689926 | 0.224643
4 | 0.633748 | 0.361631 | 0.529265 | 0.044212
6 | 0.366014 | 0.164804 | 0.420394 | 0.006078
8 | 0.165643 | 0.047671 | 0.339941 | 0.000583
10 | 0.045176 | 0.002165 | 0.277845 | 0.000066
12 | 0.000891 | 0.009526 | 0.228666 | 0.000017
14 | 0.000068 | 0.039635 | 0.189065 | 0.000005
16 | 0.002296 | 0.062283 | 0.156812 | 0.000001
18 | 0.000905 | 0.063224 | 0.130337 | 0.000000
20 | 0.001584 | 0.046405 | 0.108481 | 0.000000
22 | 0.001781 | 0.024033 | 0.090367 | 0.000000
24 | 0.001766 | 0.007052 | 0.075311 | 0.000000
26 | 0.001884 | 0.000463 | 0.062774 | 0.000000
28 | 0.001978 | 0.002933 | 0.052320 | 0.000000
30 | 0.002048 | 0.009349 | 0.043596 | 0.000000
Table 7. Approximation using multilayer neural network.
Epoch | MAMGD Loss | Adam Loss | Adagrad Loss | RMSprop Loss | SGD Loss
1 | 821.636841 | 652.494873 | 1762.779175 | 655.232788 | 160.947144
2 | 111.159012 | 115.097504 | 1577.810669 | 112.074127 | 85.439232
3 | 49.195004 | 60.046024 | 1445.563843 | 55.858692 | 72.121544
4 | 18.571014 | 27.468365 | 1329.008545 | 24.525091 | 52.972687
5 | 5.915591 | 15.148467 | 1219.975220 | 11.512796 | 46.084648
6 | 1.025126 | 8.855510 | 1117.438354 | 4.281589 | 42.493931
7 | 0.227894 | 4.974255 | 1021.298096 | 1.498476 | 37.489067
8 | 0.093988 | 2.775730 | 931.561707 | 0.742871 | 49.106506
9 | 0.060457 | 1.615357 | 848.579773 | 0.524694 | 32.434353
10 | 0.052484 | 0.944320 | 772.376770 | 0.473356 | 32.483410
11 | 0.048284 | 0.557275 | 702.738586 | 0.423530 | 28.475237
12 | 0.063352 | 0.338988 | 639.490479 | 0.383350 | 31.728319
13 | 0.056214 | 0.207358 | 582.402405 | 0.352275 | 22.955385
14 | 0.056104 | 0.138861 | 531.055969 | 0.332885 | 32.714817
15 | 0.058539 | 0.100819 | 485.154205 | 0.327141 | 24.102997
Table 8. MNIST dataset.
Epoch | MAMGD Loss | Adam Loss | RMSprop Loss
1 | 0.448672 | 0.266809 | 0.263463
2 | 0.124826 | 0.107094 | 0.105576
3 | 0.077700 | 0.069585 | 0.070089
4 | 0.052628 | 0.049328 | 0.050346
5 | 0.037809 | 0.037673 | 0.037978
6 | 0.026434 | 0.028731 | 0.028972
7 | 0.020605 | 0.020112 | 0.021667
8 | 0.015816 | 0.015615 | 0.016610
9 | 0.011611 | 0.012404 | 0.012298
10 | 0.008266 | 0.010504 | 0.008964
Table 9. IMDB dataset.
Epoch | MAMGD Loss | Adam Loss | Adagrad Loss | RMSprop Loss
1 | 0.637524 | 0.587573 | 0.691539 | 0.571567
2 | 0.373726 | 0.363699 | 0.689856 | 0.359916
3 | 0.232306 | 0.250259 | 0.688095 | 0.265192
4 | 0.167434 | 0.193303 | 0.686177 | 0.214027
5 | 0.127229 | 0.152671 | 0.684051 | 0.178158
6 | 0.098317 | 0.123702 | 0.681721 | 0.156204
7 | 0.075212 | 0.100980 | 0.679159 | 0.133551
8 | 0.057454 | 0.082415 | 0.676337 | 0.116825
9 | 0.043390 | 0.066716 | 0.673241 | 0.104766
10 | 0.032139 | 0.054332 | 0.669879 | 0.087954
11 | 0.023305 | 0.044377 | 0.666331 | 0.078029
12 | 0.016889 | 0.034820 | 0.662608 | 0.067212
13 | 0.012188 | 0.027816 | 0.658768 | 0.058585
14 | 0.008631 | 0.022097 | 0.654841 | 0.048537
15 | 0.005894 | 0.016978 | 0.650807 | 0.044989
16 | 0.004280 | 0.013270 | 0.646712 | 0.034727
17 | 0.002633 | 0.010459 | 0.642588 | 0.030603
18 | 0.001816 | 0.008271 | 0.638408 | 0.025305
19 | 0.001069 | 0.006590 | 0.634208 | 0.023882
20 | 0.000751 | 0.005273 | 0.630040 | 0.018804
Table 10. Reuters dataset.
Epoch | MAMGD Loss | Adam Loss | RMSprop Loss
1 | 3.785023 | 3.380734 | 2.784578
2 | 3.470449 | 2.057770 | 1.560803
3 | 2.726303 | 1.371544 | 1.194215
4 | 1.900168 | 1.046967 | 0.962105
5 | 1.369266 | 0.833554 | 0.794977
6 | 1.062885 | 0.662638 | 0.655288
7 | 0.845602 | 0.522731 | 0.545252
8 | 0.665268 | 0.413226 | 0.458509
9 | 0.516712 | 0.328357 | 0.386447
10 | 0.402084 | 0.264667 | 0.329770
11 | 0.316678 | 0.219969 | 0.282453
12 | 0.255255 | 0.186874 | 0.249277
13 | 0.209545 | 0.165941 | 0.222823
14 | 0.177048 | 0.149283 | 0.197969
15 | 0.154396 | 0.138011 | 0.187647
16 | 0.140863 | 0.125612 | 0.164271
17 | 0.124534 | 0.120747 | 0.157349
18 | 0.122545 | 0.109029 | 0.147332
19 | 0.109503 | 0.106501 | 0.139231
20 | 0.107755 | 0.104526 | 0.132018
Table 11. Boston Housing Price dataset.
Epoch | MAMGD Loss | Adam Loss | RMSprop Loss
1 | 350.594757 | 140.978012 | 157.523834
2 | 33.531769 | 19.073381 | 25.266592
3 | 20.512051 | 14.994828 | 18.173952
4 | 16.396505 | 12.534935 | 15.347956
5 | 12.359325 | 11.147352 | 13.817107
6 | 12.293142 | 10.832203 | 12.820888
7 | 12.098958 | 10.008436 | 11.574941
8 | 10.613210 | 10.020133 | 11.678750
9 | 10.721079 | 9.662991 | 11.273701
10 | 9.970358 | 8.856193 | 10.683825
11 | 9.463201 | 8.380616 | 10.177731
12 | 9.269162 | 8.364103 | 10.316528
13 | 8.898938 | 8.032142 | 9.776771
14 | 8.586391 | 8.830625 | 9.423545
15 | 8.228765 | 7.930937 | 9.180243
16 | 8.287912 | 7.887546 | 9.455106
17 | 8.110354 | 7.714012 | 8.865941
18 | 7.180910 | 8.458524 | 8.996699
19 | 7.370471 | 7.891291 | 8.734899
20 | 7.320936 | 7.440734 | 8.690137