Lookaround Optimizer: k steps around, 1 step average

06/13/2023
by Jiangtao Zhang, et al.

Weight Averaging (WA) is an active research topic due to its simplicity in ensembling deep networks and its effectiveness in promoting generalization. Existing weight averaging approaches, however, are typically carried out along a single training trajectory in a post-hoc manner (i.e., the weights are averaged after the entire training process has finished), which significantly reduces the diversity between networks and thus impairs the effectiveness of the ensemble. In this paper, inspired by weight averaging, we propose Lookaround, a straightforward yet effective SGD-based optimizer that leads to flatter minima with better generalization. Specifically, Lookaround alternates between two steps throughout training: the around step and the average step. In each iteration, 1) the around step starts from a common point and trains multiple networks simultaneously, each on data transformed by a different data augmentation, and 2) the average step averages these trained networks to obtain a single averaged network, which serves as the starting point for the next iteration. The around step improves functional diversity, while the average step guarantees the weight locality of these networks throughout training, which is essential for WA to work. We explain the superiority of Lookaround theoretically through a convergence analysis, and conduct extensive experiments on popular benchmarks including CIFAR and ImageNet with both CNNs and ViTs, demonstrating clear superiority over the state of the art. Our code is available at https://github.com/Ardcy/Lookaround.
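The iteration described in the abstract can be summarized in a short sketch. The following is a minimal, PyTorch-style illustration under stated assumptions, not the authors' reference implementation: the function name lookaround_step, the augmentations list, the batches iterator, and the hyper-parameters (learning rate, momentum, number of around steps) are illustrative choices.

```python
# A minimal, PyTorch-style sketch of one Lookaround iteration, assuming a
# classification loss. Names and hyper-parameters here are illustrative,
# not taken from the authors' released code.
import copy
import torch
import torch.nn.functional as F


def lookaround_step(model, augmentations, batches, lr=0.1, around_steps=1):
    """Run the around step followed by the average step, updating `model` in place.

    model         -- the common starting network
    augmentations -- list of data-augmentation callables, one per trained copy
    batches       -- iterator yielding (inputs, targets) tensors
    """
    # Around step: clone the common point and train each copy on the same
    # batches, each transformed by its own augmentation.
    copies = [copy.deepcopy(model) for _ in augmentations]
    optimizers = [torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
                  for net in copies]
    for _ in range(around_steps):
        inputs, targets = next(batches)
        for net, aug, opt in zip(copies, augmentations, optimizers):
            loss = F.cross_entropy(net(aug(inputs)), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Average step: average the trained weights; the averaged network becomes
    # the starting point for the next iteration.
    avg_state = copy.deepcopy(copies[0].state_dict())
    for key in avg_state:
        stacked = torch.stack([net.state_dict()[key].float() for net in copies])
        avg_state[key] = stacked.mean(dim=0).to(avg_state[key].dtype)
    model.load_state_dict(avg_state)
    return model
```

In a full training loop, this step would simply be repeated until convergence, with the number and choice of augmentations and the learning-rate schedule left as hyper-parameters.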

