Homeostasis phenomenon in predictive inference when using a wrong learning model: a tale of random split of data into training and test sets

by Min-ge Xie, et al.

This note uses a conformal prediction procedure to provide further support for several points discussed by Professor Efron (Efron, 2020) concerning prediction, estimation, and the IID assumption. It aims to convey the following messages: (1) Under the IID assumption (e.g., a random split of the data into training and testing sets), prediction is indeed an easier task than estimation, since prediction has a 'homeostasis property' in this case: even if the model used for learning is completely wrong, the prediction results remain valid. (2) If the IID assumption is violated (e.g., a targeted prediction on specific individuals), the homeostasis property is often disrupted and the prediction results under a wrong model are usually invalid. (3) Better model estimation typically leads to more accurate prediction in both IID and non-IID cases. Good modeling and estimation practices are important and, in many cases, crucial for obtaining good prediction results. The discussion also provides one explanation of why deep learning methods work so well in academic exercises (with experiments set up by randomly splitting the entire data into training and testing sets), but fail to deliver many 'killer applications' in real-world settings.
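Messages (1) and (2) can be illustrated with a small simulation. The sketch below is not the authors' procedure; it is a minimal split conformal prediction example on synthetic data, with a deliberately misspecified model (a straight line fitted to a quadratic relationship). The function name `split_conformal_coverage`, the data-generating process, the split sizes, and the "targeted" subgroup (points with |x| > 2.5) are all assumptions made for illustration.

```python
import numpy as np


def split_conformal_coverage(seed=0, n=4000, alpha=0.1):
    """Return (overall coverage, targeted coverage) of split conformal
    intervals built on a deliberately wrong model. Illustrative only."""
    rng = np.random.default_rng(seed)

    # Synthetic data (assumption): the true relationship is quadratic.
    x = rng.uniform(-3.0, 3.0, size=n)
    y = x**2 + rng.normal(0.0, 1.0, size=n)

    # Random split into training, calibration, and test sets (the IID setting).
    idx = rng.permutation(n)
    train, calib, test = idx[:1000], idx[1000:2000], idx[2000:]

    # Wrong learning model: a straight line fitted by least squares.
    slope, intercept = np.polyfit(x[train], y[train], deg=1)

    def predict(x_new):
        return slope * x_new + intercept

    # Conformity scores on the calibration set: absolute residuals.
    scores = np.abs(y[calib] - predict(x[calib]))

    # Conformal quantile: the ceil((m + 1)(1 - alpha))-th smallest score.
    m = len(calib)
    k = int(np.ceil((m + 1) * (1 - alpha)))
    q = np.sort(scores)[k - 1]

    # The interval for a new point is predict(x) +/- q.
    covered = np.abs(y[test] - predict(x[test])) <= q

    # Targeted prediction (non-IID): only test points with |x| > 2.5.
    target_mask = np.abs(x[test]) > 2.5

    return covered.mean(), covered[target_mask].mean()


overall, targeted = split_conformal_coverage(seed=0)
print(f"overall coverage:  {overall:.3f}")
print(f"targeted coverage: {targeted:.3f}")
```

Under these assumptions, the overall coverage on the randomly split test set should sit near the nominal level 1 - alpha despite the wrong model, while the coverage on the targeted subgroup should fall well below it. The intervals are valid on average but wide, which is consistent with message (3): a better-fitting model would give the same marginal validity with shorter intervals.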

