FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizer by Strong Convexity

04/28/2021
by Yangfan Zhou, et al.

The AdaBelief algorithm demonstrates superior generalization ability over the Adam algorithm by viewing the exponential moving average of observed gradients as a prediction of the next gradient and adapting its step sizes to the deviation from that prediction. AdaBelief has been proved to attain a data-dependent O(√T) regret bound when the objective functions are convex, where T is the time horizon. However, it remains an open problem how to exploit strong convexity to further improve the convergence rate of AdaBelief. To tackle this problem, we present a novel optimization algorithm for strongly convex objectives, called FastAdaBelief. We prove that FastAdaBelief attains a data-dependent O(log T) regret bound, which is substantially lower than that of AdaBelief. In addition, the theoretical analysis is validated by extensive experiments on open datasets (CIFAR-10 and Penn Treebank) for image classification and language modeling.
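
To make the "belief" mechanism referred to in the abstract concrete, below is a minimal sketch of a single AdaBelief-style update step in plain NumPy. The function name and hyperparameter values are illustrative only; the abstract does not specify how FastAdaBelief modifies the step sizes to exploit strong convexity, so only the base belief-based update that the paper builds on is shown.

```python
import numpy as np

def adabelief_step(theta, grad, m, s, t,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief-style update (illustrative sketch, not the paper's FastAdaBelief).

    m tracks the EMA of gradients (a prediction of the next gradient);
    s tracks the EMA of the squared deviation of the observed gradient
    from that prediction (the "belief"). A small deviation yields a larger
    effective step, a large deviation a smaller one.
    """
    m = beta1 * m + (1 - beta1) * grad                   # EMA of gradients (prediction)
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2 + eps  # EMA of squared deviation ("belief")
    m_hat = m / (1 - beta1 ** t)                         # bias corrections, as in Adam/AdaBelief
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)  # belief-scaled step
    return theta, m, s
```

Strongly convex variants of adaptive methods typically pair such a belief or second-moment term with step sizes that decay roughly like α/t rather than α/√t, which is the standard route to logarithmic regret; the precise schedule FastAdaBelief uses is given in the paper itself rather than in this abstract.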
