Improve SGD Training via Aligning Mini-batches

02/23/2020
by Xiangrui Li, et al.

Deep neural networks (DNNs) for supervised learning can be viewed as a pipeline of a feature extractor (i.e., the last hidden layer) and a linear classifier (i.e., the output layer) that are trained jointly with stochastic gradient descent (SGD). In each iteration of SGD, a mini-batch is sampled from the training data, and the true gradient of the loss function is estimated by the noisy gradient computed on this mini-batch. From the feature-learning perspective, the feature extractor should be updated to learn features that are meaningful with respect to the entire data, rather than accommodating the noise in any particular mini-batch. With this motivation, we propose In-Training Distribution Matching (ITDM) to improve DNN training and reduce overfitting. Specifically, along with the loss function, ITDM regularizes the feature extractor by matching the moments of the feature distributions of different mini-batches in each iteration of SGD, which is accomplished by minimizing the maximum mean discrepancy. As such, ITDM does not assume any explicit parametric form for the data distribution in the latent feature space. Extensive experiments demonstrate the effectiveness of the proposed strategy.
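As a concrete illustration of the idea above, the following minimal PyTorch sketch adds a maximum mean discrepancy (MMD) term between the features of two mini-batches to the usual cross-entropy loss in one SGD step. The function and parameter names (gaussian_mmd, itdm_step, lambda_mmd, the fixed kernel bandwidth) are illustrative assumptions and are not taken from the paper's implementation.

import torch
import torch.nn.functional as F

def gaussian_mmd(x, y, sigma=1.0):
    # Biased empirical MMD^2 between feature batches x and y with a Gaussian (RBF) kernel.
    def rbf(a, b):
        d2 = torch.cdist(a, b).pow(2)  # pairwise squared Euclidean distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()

def itdm_step(feature_extractor, classifier, batch1, batch2, optimizer, lambda_mmd=0.1):
    # One SGD step: task loss on batch1 plus an MMD penalty that pulls the
    # feature distributions of the two mini-batches toward each other.
    (x1, y1), (x2, _) = batch1, batch2
    f1, f2 = feature_extractor(x1), feature_extractor(x2)
    loss = F.cross_entropy(classifier(f1), y1) + lambda_mmd * gaussian_mmd(f1, f2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Here the optimizer is assumed to hold the parameters of both the feature extractor and the classifier; the MMD term depends only on the extracted features, so it regularizes the feature extractor while the classifier is still driven by the cross-entropy loss alone.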

