How Does Pseudo-Labeling Affect the Generalization Error of the Semi-Supervised Gibbs Algorithm?

by Haiyun He, et al.

This paper provides an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm. The characterization is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset, and it can be used to obtain distribution-free upper and lower bounds on the gen-error. Our findings show that the generalization performance of SSL with pseudo-labeling is affected not only by the information between the output hypothesis and the input training data but also by the information shared between the labeled and pseudo-labeled data samples. To deepen our understanding, we further explore two examples: mean estimation and logistic regression. In particular, we analyze how λ, the ratio of the number of unlabeled to labeled data samples, affects the gen-error in both scenarios. For mean estimation, as λ increases the gen-error decreases and then saturates at a value larger than the gen-error obtained when all the samples are labeled. Our analysis quantifies this gap exactly and shows that it depends on the cross-covariance between the labeled and pseudo-labeled data samples. In logistic regression, the gen-error and the variance component of the excess risk also decrease as λ increases.
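
As a rough illustration of the quantity the abstract refers to, the sketch below computes the symmetrized KL information between two discrete random variables, which is the sum of the mutual information and the lautum information, i.e. D(P_XY || P_X ⊗ P_Y) + D(P_X ⊗ P_Y || P_XY). This is only a toy example under stated assumptions: the variables are discrete with a strictly positive joint table, and the function name `symmetrized_kl_information` is hypothetical. It is not the paper's characterization, which involves the output hypothesis together with the labeled and pseudo-labeled datasets.

```python
import math


def symmetrized_kl_information(joint):
    """Symmetrized KL information I_SKL(X;Y) in nats for a discrete joint table.

    `joint[i][j]` is proportional to P(X=i, Y=j). Every entry is assumed to be
    strictly positive (an assumption of this sketch), so both KL terms are finite.
    """
    total = sum(sum(row) for row in joint)
    p = [[v / total for v in row] for row in joint]  # normalized joint P_XY
    px = [sum(row) for row in p]                     # marginal P_X
    py = [sum(col) for col in zip(*p)]               # marginal P_Y

    mutual = 0.0  # D(P_XY || P_X x P_Y), the mutual information
    lautum = 0.0  # D(P_X x P_Y || P_XY), the lautum information
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            q = px[i] * py[j]
            log_ratio = math.log(pij / q)
            mutual += pij * log_ratio
            lautum -= q * log_ratio
    return mutual + lautum


# Independent variables share no information; correlated ones share some.
independent = [[0.25, 0.25], [0.25, 0.25]]
correlated = [[0.4, 0.1], [0.1, 0.4]]
skl_independent = symmetrized_kl_information(independent)
skl_correlated = symmetrized_kl_information(correlated)
```

The value is zero exactly when the two variables are independent and grows as the joint distribution moves away from the product of its marginals, which is the sense in which it measures shared information.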


Why pseudo label based algorithm is effective? –from the perspective of pseudo labeled data

Recently, pseudo label based semi-supervised learning has achieved great...

Characterizing the Generalization Error of Gibbs Algorithm with Symmetrized KL information

Bounding the generalization error of a supervised learning algorithm is ...

Information-theoretic Characterizations of Generalization Error for the Gibbs Algorithm

Various approaches have been developed to upper bound the generalization...

Improved Generalization of Heading Direction Estimation for Aerial Filming Using Semi-supervised Regression

In the task of Autonomous aerial filming of a moving actor (e.g. a perso...

Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift

We develop and analyze a principled approach to kernel ridge regression ...

In all LikelihoodS: How to Reliably Select Pseudo-Labeled Data for Self-Training in Semi-Supervised Learning

Self-training is a simple yet effective method within semi-supervised le...

Positive-Unlabeled Learning with Uncertainty-aware Pseudo-label Selection

Pseudo-labeling solutions for positive-unlabeled (PU) learning have the ...
