Rip van Winkle's Razor: A Simple Estimate of Overfit to Test Data

by Sanjeev Arora, et al.

Traditional statistics forbids use of test data (a.k.a. holdout data) during training. Dwork et al. (2015) pointed out that current practices in machine learning, whereby researchers build upon each other's models, copying hyperparameters and even computer code, amount to implicitly training on the test set. Thus the error rate on test data may not reflect the true population error. This observation initiated adaptive data analysis, which provides evaluation mechanisms with guaranteed upper bounds on this difference. With statistical query (i.e., test accuracy) feedback, the best known upper bound is fairly pessimistic: the deviation can reach a practically vacuous value once the number of models tested is quadratic in the size of the test set. In this work, we present a simple new estimate, Rip van Winkle's Razor. It relies upon a new notion of the "information content" of a model: the amount of information that would have to be provided to an expert referee who is intimately familiar with the field and the relevant science/math, and who has just been woken up after falling asleep at the moment the test data were created (like "Rip van Winkle" of the famous folk tale). This notion of information content is used to estimate the above deviation, and the estimate is shown to be non-vacuous in many modern settings.
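A minimal sketch of the style of bound behind such an estimate (not the paper's exact statement; the constants and the specific Hoeffding-plus-union-bound form are assumptions here): if a model can be communicated to the referee in k bits, a union bound over the at most 2^k describable models combined with Hoeffding's inequality gives a deviation on the order of sqrt((k ln 2 + ln(1/delta)) / (2N)) for a test set of N examples.

```python
import math

def rvw_overfit_bound(description_bits: float, test_set_size: int,
                      delta: float = 0.05) -> float:
    """Illustrative deviation bound: union bound over the <= 2^k models
    describable in k bits, plus Hoeffding's inequality on each.
    With probability >= 1 - delta,
        |test error - population error| <= sqrt((k ln 2 + ln(1/delta)) / (2N)).
    Constants/form are assumptions for illustration, not the paper's theorem.
    """
    numerator = description_bits * math.log(2) + math.log(1.0 / delta)
    return math.sqrt(numerator / (2.0 * test_set_size))

# A model describable in 1000 bits, evaluated on a 10,000-example
# test set (e.g. CIFAR-10's), yields a bound well below 1 (non-vacuous):
print(rvw_overfit_bound(1000, 10_000))
```

Note how the bound grows only as the square root of the description length, which is why a short "message to the referee" can certify a small gap even after many models have been tried.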



