Variance-reduced Clipping for Non-convex Optimization

by Amirhossein Reisizadeh et al.

Gradient clipping is a standard training technique used in deep learning applications, such as large-scale language modeling, to mitigate exploding gradients. Recent empirical studies have demonstrated a distinctive behavior of the smoothness of the training objective along its trajectory when trained with gradient clipping: the smoothness grows with the gradient norm. This is in clear contrast to the well-established assumption in classical non-convex optimization, a.k.a. L-smoothness, where the smoothness is assumed to be bounded globally by a constant L. The recently introduced (L_0, L_1)-smoothness is a more relaxed notion that captures such behavior in non-convex optimization. In particular, it has been shown that under this relaxed smoothness assumption, SGD with clipping requires O(ϵ^-4) stochastic gradient computations to find an ϵ-stationary solution. In this paper, we employ a variance-reduction technique, namely SPIDER, and demonstrate that for a carefully designed learning rate, this complexity improves to O(ϵ^-3), which is order-optimal. The corresponding learning rate incorporates the clipping technique to mitigate the growing smoothness. Moreover, when the objective function is the average of n components, we improve the existing O(nϵ^-2) bound on the stochastic gradient complexity to the order-optimal O(√(n)ϵ^-2 + n).
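To make the described method concrete, the following is a minimal sketch of a SPIDER-style estimator combined with a clipped (gradient-norm-dependent) step size, as the abstract describes. All names, the refresh period q, the batch size, and the toy least-squares objective are illustrative assumptions, not the paper's exact algorithm or constants; the paper's analysis targets (L_0, L_1)-smooth non-convex objectives, while the demo below uses a simple finite-sum problem for readability.

```python
import numpy as np

def spider_clipped(grad_full, grad_batch, x0, n, T=200, q=10,
                   eta=0.1, gamma=1.0, seed=0):
    """Sketch of SPIDER variance reduction with a clipped step size.

    grad_full(x):        gradient of the full finite-sum objective
    grad_batch(x, idx):  minibatch gradient over component indices idx
    The step size min(eta, gamma/||v||) mimics clipping: it shrinks
    automatically when the gradient estimate (and hence the local
    smoothness under (L0, L1)-smoothness) is large.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    x_prev = x0.copy()
    v = grad_full(x)
    for t in range(T):
        if t % q == 0:
            v = grad_full(x)  # periodic full-gradient refresh
        else:
            idx = rng.integers(0, n, size=q)
            # SPIDER recursion: v_t = ∇f_S(x_t) - ∇f_S(x_{t-1}) + v_{t-1}
            v = grad_batch(x, idx) - grad_batch(x_prev, idx) + v
        x_prev = x
        # clipped learning rate: η_t = min(η, γ / ||v_t||)
        step = min(eta, gamma / (np.linalg.norm(v) + 1e-12))
        x = x - step * v
    return x

# Toy finite-sum demo (least squares stands in for a non-convex objective).
n, d = 100, 5
rng = np.random.default_rng(0)
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_full(x):
    return 2 * A.T @ (A @ x - b) / n

def grad_batch(x, idx):
    Ai, bi = A[idx], b[idx]
    return 2 * Ai.T @ (Ai @ x - bi) / len(idx)

x_final = spider_clipped(grad_full, grad_batch, np.zeros(d), n)
```

The recursive estimator reuses the previous estimate v between full refreshes, which is what drives the variance reduction; clipping the step by the estimate's norm keeps the update stable when the local smoothness grows with the gradient.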


