AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

09/29/2018
by Zhiming Zhou, et al.

Adam has been shown to fail to converge to the optimal solution in certain cases. Several algorithms have recently been proposed to avoid the non-convergence issue of Adam, but their efficiency in practice turns out to be unsatisfactory. In this paper, we provide a new insight into the non-convergence issue of Adam and other adaptive learning rate methods. We argue that there is an inappropriate correlation between the gradient g_t and the second-moment term v_t in Adam (t is the timestep), which causes a large gradient to be likely to receive a small step size while a small gradient may receive a large step size. We demonstrate that such unbalanced step sizes are the fundamental cause of the non-convergence of Adam, and we further prove that decorrelating v_t and g_t yields an unbiased step size for each gradient, thus solving the non-convergence problem. Finally, we propose AdaShift, a novel adaptive learning rate method that decorrelates v_t and g_t by temporal shifting, i.e., using the temporally shifted gradient g_{t-n} to calculate v_t. The experimental results demonstrate that AdaShift addresses the non-convergence issue of Adam while remaining competitive with Adam in terms of both training speed and generalization.
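To make the temporal-shifting idea concrete, here is a minimal NumPy sketch (not the authors' reference implementation): a short queue of past gradients is kept, and the gradient from n steps earlier drives the second-moment estimate v_t, while the current gradient g_t determines the update direction. The full AdaShift method additionally applies a spatial function to the shifted gradient and uses a windowed first moment, both omitted here; names such as `adashift_sketch` and `grad_fn` are illustrative assumptions.

```python
import numpy as np
from collections import deque

def adashift_sketch(grad_fn, theta, lr=0.01, n=10, beta2=0.999,
                    eps=1e-8, steps=1000):
    """Sketch of temporal shifting: v_t is updated with the gradient from
    n steps earlier (g_{t-n}), so the current gradient g_t is approximately
    independent of the second-moment term that scales its step size."""
    buffer = deque()            # holds the most recent n gradients
    v = np.zeros_like(theta)    # second-moment estimate
    for t in range(steps):
        g = grad_fn(theta)      # current gradient g_t
        buffer.append(g)
        if len(buffer) <= n:
            continue            # warm-up: no shifted gradient available yet
        g_shifted = buffer.popleft()                  # g_{t-n}
        v = beta2 * v + (1 - beta2) * g_shifted ** 2  # v_t from g_{t-n}
        theta = theta - lr * g / (np.sqrt(v) + eps)   # step driven by g_t
    return theta
```

Because v_t no longer depends on g_t, a large current gradient is not systematically paired with a small step size (and vice versa), which is the decorrelation property the abstract describes.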

