Do We Need Zero Training Loss After Achieving Zero Training Error?

by Takashi Ishida et al.

Overparameterized deep networks have the capacity to memorize training data with zero training error. Even after memorization, the training loss continues to approach zero, making the model overconfident and degrading test performance. Since existing regularizers do not directly aim to avoid zero training loss, they often fail to maintain a moderate level of training loss, ending up with a loss that is too small or too large. We propose a direct solution called flooding that intentionally prevents further reduction of the training loss once it reaches a reasonably small value, which we call the flooding level. Our approach makes the loss float around the flooding level by performing mini-batched gradient descent as usual, but gradient ascent if the training loss is below the flooding level. This can be implemented with one line of code and is compatible with any stochastic optimizer and other regularizers. With flooding, the model will continue to "random walk" with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization. We experimentally show that flooding improves performance and, as a byproduct, induces a double descent curve of the test loss.
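The one-line trick described in the abstract can be sketched as follows. This is a minimal illustration of the flooding transform, not the authors' released code: rewriting the loss as |loss − b| + b leaves it unchanged above the flooding level b, but flips the gradient sign below it, turning descent into ascent.

```python
def flooded_loss(loss, b):
    """Flooding transform: |loss - b| + b.

    When loss > b, this equals the original loss, so gradients are
    unchanged (ordinary descent). When loss < b, the sign of the
    gradient with respect to the model parameters flips, so the
    optimizer ascends back toward the flooding level b.
    In a PyTorch training loop this would be the one-liner
    loss = (loss - b).abs() + b applied before loss.backward().
    """
    return abs(loss - b) + b

# Above the flooding level, the loss value is unchanged.
print(flooded_loss(0.50, b=0.10))  # 0.5
# Below it, the value is reflected above b, flipping the gradient.
print(flooded_loss(0.02, b=0.10))  # 0.18
```

In practice, the flooding level b is a hyperparameter chosen to be a small positive value; the transform is compatible with any stochastic optimizer because it only rewrites the scalar loss.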




