A Link between Coding Theory and Cross-Validation with Applications
We study the combinatorics of cross-validation based AUC estimation under the null hypothesis that the binary class labels are exchangeable, that is, the data are randomly assigned into two classes given a fixed class proportion. In particular, we study how the estimators based on leave-pair-out cross-validation (LPOCV), in which every possible pair of data with different class labels is held out from the training set at a time, behave under the null without any prior assumptions of the learning algorithm or the data. It is shown that the maximal number of different fixed proportion label assignments on a sample of data, for which a learning algorithm can achieve zero LPOCV error, is the maximal size of a constant weight error correcting code, whose length is the sample size, weight is the number of data labeled with one, and the Hamming distance between code words is four. We then introduce the concept of a light constant weight code and show similar results for nonzero LPOCV errors. We also prove both upper and lower bounds on the maximal sizes of the light constant weight codes that are similar to the classical results for contant weight codes. These results pave the way towards the design of new LPOCV based statistical tests for the learning algorithms ability of distinguishing two classes from each other that are analogous to the classical Wilcoxon-Mann-Whitney U test for fixed functions. Behavior of some representative examples of learning algorithms and data are simulated in an experimental case study.
READ FULL TEXT