What is the Adam Optimization Algorithm?
Adam is an optimization algorithm that updates neural network weights more efficiently by running repeated cycles of “adaptive moment estimation.” Adam extends stochastic gradient descent to solve non-convex problems faster while using fewer resources than many other optimization algorithms. It is most effective on extremely large data sets, where it keeps the gradient updates “tighter” (less erratic) over many learning iterations.
Adam combines the advantages of two other stochastic gradient techniques, Adaptive Gradients (AdaGrad) and Root Mean Square Propagation (RMSProp), into a single learning approach that works well across a variety of neural networks.
Adam vs Classical Stochastic Gradient Descent
With stochastic gradient descent (SGD), a single learning rate (called alpha) is used for all weight updates, and that learning rate stays the same for every network parameter (weight) throughout training.
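As a minimal sketch of that idea (not from the original article), the NumPy snippet below applies one SGD step; the helper name `sgd_update` and the toy numbers are illustrative assumptions:

```python
import numpy as np

def sgd_update(weights, gradients, alpha=0.01):
    """One classical SGD step: every weight moves by the same learning rate alpha."""
    return weights - alpha * gradients

# The step size is identical for every parameter, no matter how large
# or noisy that parameter's gradients have been in the past.
w = np.array([0.5, -1.2, 3.0])
g = np.array([0.1, -0.4, 0.02])
w = sgd_update(w, g, alpha=0.01)
```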
Adam, in contrast, computes individual adaptive learning rates for each weight from estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients.
It adapts these parameter learning rates during training by calculating an exponential moving average of the gradient and of the squared gradient, with the parameters beta1 and beta2 controlling the decay rates of the two moving averages. Bias-corrected estimates of both moments are then used to update the parameters.
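To make those steps concrete, here is a minimal sketch of a single Adam update (not taken from the article); the function name `adam_update`, the default hyperparameter values, and the toy gradient are assumptions chosen for illustration:

```python
import numpy as np

def adam_update(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step, following the moment-estimation scheme described above."""
    # Exponential moving averages of the gradient (first moment)
    # and the squared gradient (second moment).
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias correction: both averages start at zero, so early estimates
    # are scaled up to compensate.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size alpha / (sqrt(v_hat) + eps).
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: carry the moment estimates m, v and the step counter t across iterations.
w = np.array([0.5, -1.2, 3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):
    g = np.array([0.1, -0.4, 0.02])  # stand-in gradient for illustration
    w, m, v = adam_update(w, g, m, v, t)
```

Because the effective step size is divided by the square root of the second-moment estimate, parameters with consistently large gradients take smaller steps while rarely updated parameters take larger ones.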