Optimal Sampling for Generalized Linear Models under Measurement Constraints
Suppose we use a generalized linear model to predict a scalar outcome Y from a covariate vector X. We consider two related problems and propose a methodology for both. In the first problem, every data point in a large dataset has both Y and X observed, but we wish to use only a subset of the data to limit computational cost. In the second problem, sometimes called "measurement constraints," Y is expensive to measure and is initially available only for a small portion of the data; the goal is to select an additional subset of data points at which Y will also be measured. We focus on the more challenging but less well-studied measurement-constraint problem. A popular approach to the first problem is sampling. However, most existing sampling algorithms require that Y be observed at every data point, so they cannot be used under measurement constraints. We propose an optimal sampling procedure for massive datasets under measurement constraints (OSUMC). We establish consistency and asymptotic normality of estimators obtained from a general class of sampling procedures, derive an optimal oracle sampling procedure, and propose a two-step algorithm to approximate the oracle procedure. Numerical results demonstrate the advantages of OSUMC over existing sampling methods.
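To make the two-step idea concrete, the following is a minimal sketch under illustrative assumptions: a logistic-regression GLM, Poisson sampling, and a variance-type, Y-free sampling score computed from a pilot fit. The function names (`two_step_subsample`, `measure_y`) and the specific score are assumptions for illustration, not the paper's exact OSUMC weights.

```python
# Minimal two-step subsampling sketch under measurement constraints
# (illustrative; not the paper's exact OSUMC procedure).
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=25):
    """Weighted logistic-regression MLE via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                       # weighted score
        H = -(X * (w * p * (1 - p))[:, None]).T @ X      # weighted Hessian
        beta -= np.linalg.solve(H, grad)                 # Newton step
    return beta

def two_step_subsample(X_all, pilot_idx, y_pilot, measure_y, n_sub, rng):
    """Step 1: fit a pilot estimate on the initially labeled points.
    Step 2: compute covariate-based sampling probabilities, draw a
    subsample, measure Y there, and refit with inverse-probability
    weights. `measure_y(idx)` is a hypothetical callback that returns
    the (expensive) outcomes for the selected indices."""
    beta0 = fit_weighted_logistic(
        X_all[pilot_idx], y_pilot, np.ones(len(y_pilot))
    )

    # Sampling scores that do not require Y: an assumed surrogate
    # sqrt(p(1-p)) * ||x|| based on the pilot fit.
    p_hat = 1.0 / (1.0 + np.exp(-X_all @ beta0))
    score = np.sqrt(p_hat * (1 - p_hat)) * np.linalg.norm(X_all, axis=1)
    pi = np.clip(n_sub * score / score.sum(), 1e-6, 1.0)

    # Poisson sampling with the computed inclusion probabilities.
    idx = np.where(rng.random(len(pi)) < pi)[0]
    y_new = measure_y(idx)          # expensive measurement step
    w = 1.0 / pi[idx]               # inverse-probability weights
    return fit_weighted_logistic(X_all[idx], y_new, w)
```

The key design point reflected here is that the sampling probabilities depend only on X and the pilot estimate, never on the unmeasured Y, which is what makes the scheme feasible under measurement constraints.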