Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods

by Amirkeivan Mohtashami, et al.

While SGD, which samples from the data with replacement, is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge faster than SGD. When the order is chosen deterministically, a variant called incremental gradient descent (IG), the existing convergence bounds show improvement over SGD but are worse than those for RR. However, these bounds do not differentiate between a good and a bad ordering and hold for the worst choice of order. Meanwhile, in some cases, choosing the right order when using IG can lead to convergence faster than RR. In this work, we quantify the effect of order on convergence speed, obtaining convergence bounds based on the chosen sequence of permutations while also recovering previous results for RR. In addition, we show, in theory and in practice, the benefits of using structured shuffling when various levels of abstraction (e.g., tasks, classes, augmentations) exist in the dataset. Finally, relying on our measure, we develop a greedy algorithm for choosing good orders during training, achieving superior performance (by more than 14 percent in accuracy) over RR.
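
To make the three sampling schemes named in the abstract concrete, the following is a minimal sketch of how each one might produce the sequence of data indices that a sequential gradient method visits. The function names (`sgd_indices`, `rr_indices`, `ig_indices`, `run`) and the toy quadratic objective are illustrative assumptions and are not taken from the paper; the paper's ordering measure and greedy algorithm are not reproduced here.

```python
import random

def sgd_indices(n, epochs, rng):
    """SGD: draw every index independently, with replacement."""
    return [rng.randrange(n) for _ in range(n * epochs)]

def rr_indices(n, epochs, rng):
    """Random Reshuffling (RR): a fresh random permutation each epoch."""
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order

def ig_indices(n, epochs, perm=None):
    """Incremental gradient (IG): the same fixed permutation every epoch."""
    perm = list(range(n)) if perm is None else list(perm)
    return perm * epochs

def run(order, data, lr=0.1, x0=0.0):
    """Toy objective (an assumption): average of f_i(x) = 0.5 * (x - a_i)^2.

    Takes one gradient step on f_i for each index i in `order`.
    """
    x = x0
    for i in order:
        x -= lr * (x - data[i])  # gradient of f_i at x is (x - a_i)
    return x
```

Passing the output of any of the three generators to `run` gives the final iterate for that ordering, so the schemes can be compared on the same data by changing only the index sequence. In this sketch, a hand-chosen `perm` for `ig_indices` stands in for a "good" or "bad" deterministic order.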


Shuffle SGD is Always Better than SGD: Improved Analysis of SGD with Arbitrary Data Orders

Stochastic Gradient Descent (SGD) algorithms are widely used in optimizi...

Random Reshuffling: Simple Analysis with Vast Improvements

Random Reshuffling (RR) is an algorithm for minimizing finite-sum functi...

Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond

In distributed learning, local SGD (also known as federated averaging) a...

Fast Convergence of Random Reshuffling under Over-Parameterization and the Polyak-Łojasiewicz Condition

Modern machine learning models are often over-parameterized and as a res...

Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective and Improved Bounds

Stochastic gradient descent (SGD) is perhaps the most prevalent optimiza...

Understanding the Impact of Model Incoherence on Convergence of Incremental SGD with Random Reshuffle

Although SGD with random reshuffle has been widely-used in machine learn...

Toward Understanding Why Adam Converges Faster Than SGD for Transformers

While stochastic gradient descent (SGD) is still the most popular optimi...
