Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits

10/10/2021
by Yan Li, et al.

Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains. To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, largely attributed to their token-dependent learning rates. However, the underlying mechanism behind the efficiency of token-dependent learning rates remains underexplored. We show that incorporating token frequency information into the embedding learning problem leads to provably efficient algorithms, and we demonstrate that common adaptive algorithms implicitly exploit this frequency information to a large extent. Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate to each token and exhibits a provable speed-up over SGD when the token distribution is imbalanced. Empirically, we show that the proposed algorithms improve on or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms. Our results are the first to show that a token-dependent learning rate provably improves convergence for non-convex embedding learning problems.
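The abstract does not spell out the update rule, but the counter-based idea admits a compact sketch. The following is a minimal illustration, assuming an AdaGrad-style per-token learning rate that decays as base_lr / sqrt(count); the exact schedule, as well as the class and method names (CounterFrequencyAwareSGD, step), are illustrative assumptions rather than the paper's definitions or API.

```python
import numpy as np

class CounterFrequencyAwareSGD:
    """Sketch of a counter-based, frequency-aware SGD for an embedding table.

    Each token keeps an occurrence counter; its learning rate shrinks with
    that counter. Rare tokens therefore keep a large step size while
    frequent tokens take small, stable steps, which is the intuition behind
    the claimed speed-up over plain SGD under imbalanced token distributions.
    """

    def __init__(self, num_tokens, dim, base_lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.emb = 0.01 * rng.standard_normal((num_tokens, dim))
        self.counts = np.zeros(num_tokens, dtype=np.int64)  # per-token counters
        self.base_lr = base_lr

    def step(self, token_ids, grads):
        """Apply one sparse update: token_ids has shape (B,), grads (B, dim)."""
        for tid, g in zip(token_ids, grads):
            self.counts[tid] += 1
            # Frequency-dependent learning rate (assumed 1/sqrt(count) decay).
            lr = self.base_lr / np.sqrt(self.counts[tid])
            self.emb[tid] -= lr * g

# Usage sketch: token 3 appears twice, so its second update uses a smaller step.
opt = CounterFrequencyAwareSGD(num_tokens=1000, dim=16)
opt.step(np.array([3, 3, 7]), np.ones((3, 16)))
```

Note the contrast with adaptive methods such as AdaGrad, which scale by accumulated squared gradient norms per coordinate; the paper's observation is that such token-dependent scaling implicitly tracks token frequency, which the counter makes explicit.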


Related research:

01/22/2019 · DTN: A Learning Rate Scheme with Convergence Rate of O(1/t) for SGD
We propose a novel diminishing learning rate scheme, coined Decreasing-T...

12/23/2014 · ADASECANT: Robust Adaptive Secant Method for Stochastic Gradient
Stochastic gradient algorithms have been the main focus of large-scale l...

04/15/2020 · On Learning Rates and Schrödinger Operators
The learning rate is perhaps the single most important parameter in the ...

10/21/2020 · Adaptive Gradient Method with Resilience and Momentum
Several variants of stochastic gradient descent (SGD) have been proposed...

06/30/2020 · AdaSGD: Bridging the gap between SGD and Adam
In the context of stochastic gradient descent (SGD) and adaptive moment e...

09/05/2018 · Anytime Hedge achieves optimal regret in the stochastic regime
This paper is about a surprising fact: we prove that the anytime Hedge a...

07/06/2020 · TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?
We investigate whether Jacobi preconditioning, accounting for the bootst...
