Test Set Sizing Via Random Matrix Theory

12/11/2021
by Alexander Dubbs, et al.

This paper uses techniques from Random Matrix Theory to find the ideal training-testing data split for a simple linear regression with m data points, each an independent n-dimensional multivariate Gaussian. It defines "ideal" as satisfying the integrity metric: the empirical model error equals the actual measurement noise, and thus fairly reflects the model's value (or lack thereof). This paper is the first to solve for training and test set sizes, for any model, in a way that is truly optimal. The number of data points in the training set is the root of a quartic polynomial, derived in Theorem 1, that depends only on m and n; the covariance matrix of the multivariate Gaussian, the true model parameters, and the true measurement noise all drop out of the calculation. The critical mathematical difficulties were recognizing that the problems at hand had been discussed in the context of the Jacobi Ensemble, a probability distribution describing the eigenvalues of a known random matrix model, and evaluating a new integral in the style of Selberg and Aomoto. The mathematical results are supported with thorough computational evidence. This paper is a step towards automatic choices of training/test set sizes in machine learning.
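The setting the abstract describes (i.i.d. Gaussian features, a linear model fit by least squares on the first t points, empirical error measured on the remaining m - t points) can be illustrated with a small Monte Carlo sketch. This is an assumed simulation setup, not the paper's method: the paper obtains the optimal training size in closed form as a root of its quartic, whereas the function below (name and candidate sizes are hypothetical) merely tabulates how the empirical test MSE behaves at a few splits.

```python
import numpy as np

def test_error_profile(m=120, n=8, sigma=1.0, trials=300, seed=0):
    """Estimate mean and spread of the empirical test MSE of OLS on
    synthetic Gaussian data, for a few candidate training sizes t.

    Illustrative sketch only: the paper instead derives the optimal t
    in closed form as the root of a quartic polynomial in m and n.
    """
    rng = np.random.default_rng(seed)
    profile = {}
    for t in (n + 4, m // 2, m - 10):  # hypothetical candidate splits
        mses = []
        for _ in range(trials):
            X = rng.standard_normal((m, n))      # i.i.d. Gaussian features
            w = rng.standard_normal(n)           # true model parameters
            y = X @ w + sigma * rng.standard_normal(m)  # noisy responses
            # Fit on the first t points, evaluate on the remaining m - t.
            w_hat, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
            mses.append(float(np.mean((y[t:] - X[t:] @ w_hat) ** 2)))
        profile[t] = (float(np.mean(mses)), float(np.std(mses)))
    return profile
```

The tradeoff the paper optimizes is visible in the output: a small training set inflates the mean test error well above the true noise level sigma², while a small test set makes the empirical error a noisy estimate of it.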
