The Calibration Generalization Gap

10/05/2022
by A. Michael Carrell, et al.

Calibration is a fundamental property of a good predictive model: it requires that the model predict correctly in proportion to its confidence. Modern neural networks, however, provide no strong guarantees on their calibration, and can be either poorly or well calibrated depending on the setting. It is currently unclear which factors contribute to good calibration (architecture, data augmentation, overparameterization, etc.), though various claims exist in the literature. We propose a systematic way to study the calibration error: by decomposing it into (1) the calibration error on the train set, and (2) the calibration generalization gap. This mirrors the fundamental decomposition of generalization error into train error plus generalization gap. We then investigate each of these terms and give empirical evidence that (1) DNNs are almost always well-calibrated on their train set, and (2) the calibration generalization gap is upper-bounded by the standard generalization gap. Taken together, this implies that models with a small generalization gap (|Test Error - Train Error|) are well-calibrated. This perspective unifies many results in the literature, and suggests that interventions which reduce the generalization gap (such as adding data, using heavy augmentation, or reducing model size) also improve calibration. We hope this initial study lays the groundwork for a more systematic and comprehensive understanding of the relation between calibration, generalization, and optimization.
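To make the decomposition concrete, here is a minimal sketch (not the authors' code) that measures a standard binned expected calibration error (ECE) on train and test predictions and reports their difference as the calibration generalization gap. The synthetic data below is an illustrative assumption: a model that is well-calibrated on its train set but overconfident at test time.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - confidence| over confidence bins."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()   # empirical accuracy in this bin
            conf = confidences[mask].mean()  # mean predicted confidence in this bin
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece

# Toy data (an assumption for illustration): predictions are calibrated on the
# train set, but systematically ~10% overconfident on the test set.
rng = np.random.default_rng(0)
train_conf = rng.uniform(0.5, 1.0, 2000)
train_correct = (rng.uniform(size=2000) < train_conf).astype(float)
test_conf = rng.uniform(0.5, 1.0, 2000)
test_correct = (rng.uniform(size=2000) < test_conf - 0.1).astype(float)

train_ece = expected_calibration_error(train_conf, train_correct)
test_ece = expected_calibration_error(test_conf, test_correct)
calibration_generalization_gap = test_ece - train_ece
```

Under the paper's decomposition, the test calibration error is controlled by the train calibration error plus this gap; here the train ECE is close to zero while the gap accounts for most of the test ECE.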


