The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial

05/28/2019
by Benyamin Ghojogh, et al.

In this tutorial paper, we first define the mean squared error, variance, covariance, and bias of both random variables and classification/prediction models. Then, we formulate the true and generalization errors of the model for both training and validation/test instances, where we make use of Stein's Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and generalization using the obtained true and generalization errors. We introduce cross validation and two well-known examples, namely K-fold and leave-one-out cross validation. We briefly introduce generalized cross validation and then move on to regularization, where we use SURE again. We work on both ℓ_2 and ℓ_1 norm regularizations. Then, we show that bootstrap aggregating (bagging) reduces the variance of estimation. Boosting, specifically AdaBoost, is introduced and explained both as an additive model and as a maximum margin model, i.e., a Support Vector Machine (SVM). The upper bound on the generalization error of boosting is also provided to show why boosting prevents overfitting. As examples of regularization, the theory of ridge and lasso regressions, weight decay, noise injection to input/weights, and early stopping are explained. Random forest, dropout, histogram of oriented gradients, and the single shot multi-box detector are explained as examples of bagging in machine learning and computer vision. Finally, boosting tree and SVM models are mentioned as examples of boosting.
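As a minimal illustration of two of the ideas summarized above (cross validation and ℓ_2 regularization), and not code taken from the paper, the following NumPy sketch estimates the generalization error of ridge regression with K-fold cross validation and uses it to pick a regularization strength. The fold count, candidate values of lam, and the toy data are illustrative assumptions.

import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_cv_mse(X, y, lam, k=5, seed=0):
    # Estimate the generalization error (mean squared error on held-out folds)
    # of ridge regression by K-fold cross validation.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)

# Toy usage: choose the regularization strength with the lowest cross-validated MSE.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: kfold_cv_mse(X, y, lam))
print(best_lam)

Leave-one-out cross validation corresponds to setting k equal to the number of training instances, at the cost of fitting the model once per instance.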
