On b-bit min-wise hashing for large-scale regression and classification with sparse data

by Rajen D. Shah, et al.

Large-scale regression problems where both the number of variables, p, and the number of observations, n, may be large and on the order of millions or more are becoming increasingly common. Typically the data are sparse: only a fraction of a percent of the entries in the design matrix are non-zero. Nevertheless, often the only computationally feasible approach is to perform dimension reduction to obtain a new design matrix with far fewer columns, and then work with this compressed data. b-bit min-wise hashing (Li and Konig, 2011) is a promising dimension reduction scheme for sparse matrices. In this work we study the prediction error of procedures which perform regression in the new lower-dimensional space after applying the method. For both linear and logistic models we show that the average prediction error vanishes asymptotically as long as q ‖β^*‖_2^2 / n → 0, where q is the average number of non-zero entries in each row of the design matrix and β^* is the coefficient vector of the linear predictor. We also show that ordinary least squares or ridge regression applied to the reduced data in a sense amounts to a non-parametric regression and can in fact allow us to fit more flexible models. We obtain non-asymptotic prediction error bounds for interaction models and for models where an unknown row normalisation must be applied before the signal is linear in the predictors.
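
For readers unfamiliar with the scheme, the following is a minimal sketch of the general b-bit min-wise hashing idea for a sparse binary design matrix. It is an illustration under stated assumptions, not necessarily the exact variant analysed in the paper: the function name `b_bit_minwise_hash`, its parameters (`num_perms`, `b`, `seed`), the one-hot expansion of each b-bit code, and the choice to leave empty rows all-zero are all hypothetical choices made for this example.

```python
import numpy as np

def b_bit_minwise_hash(rows, p, num_perms=4, b=2, seed=0):
    """Compress a sparse binary design matrix with b-bit min-wise hashing.

    rows      : list of lists; rows[i] holds the non-zero column indices of row i
    p         : number of columns in the original design matrix
    num_perms : number of random permutations (illustrative parameter name)
    b         : number of low-order bits kept from each minimum
    Returns a dense 0/1 matrix of shape (len(rows), num_perms * 2**b).
    """
    rng = np.random.default_rng(seed)
    # One random permutation of the column indices per hash.
    perms = [rng.permutation(p) for _ in range(num_perms)]
    width = 2 ** b
    out = np.zeros((len(rows), num_perms * width), dtype=np.int8)
    for i, nonzero_cols in enumerate(rows):
        if len(nonzero_cols) == 0:
            continue  # assumption: an empty row is left as all zeros
        idx = np.asarray(nonzero_cols)
        for l, perm in enumerate(perms):
            # Min-wise hash: smallest permuted index among the non-zero columns.
            m = int(perm[idx].min())
            # Keep only the lowest b bits, then one-hot encode within block l.
            code = m & (width - 1)
            out[i, l * width + code] = 1
    return out

# Toy data: 4 observations over p = 10 binary variables.
rows = [[0, 3, 7], [0, 3, 7], [1, 2], []]
Z = b_bit_minwise_hash(rows, p=10, num_perms=4, b=2)
# Z has num_perms * 2**b = 16 columns regardless of p.
```

The compressed matrix `Z` can then be passed to ordinary least squares, ridge or logistic regression in place of the original design matrix. Rows with identical non-zero patterns always receive identical codes, and rows with similar patterns tend to share more codes.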


Joint variable and rank selection for parsimonious estimation of high-dimensional matrices

We propose dimension reduction methods for sparse, high-dimensional mult...

Asymptotic results for nonparametric regression estimators after sufficient dimension reduction estimation

Prediction, in regression and classification, is one of the main aims in...

A Reliable Effective Terascale Linear Learning System

We present a system and a set of techniques for learning linear predicto...

Chi-square and normal inference in high-dimensional multi-task regression

The paper proposes chi-square and normal inference methodologies for the...

b-Bit Minwise Hashing for Large-Scale Linear SVM

In this paper, we propose to (seamlessly) integrate b-bit minwise hashin...

Supervised Discrete Hashing with Relaxation

Data-dependent hashing has recently attracted attention due to being abl...

Min-Max Kernels

The min-max kernel is a generalization of the popular resemblance kernel...
