A Scalable Approach to Estimating the Rank of High-Dimensional Data
A key challenge in analyzing high-dimensional data effectively is finding a signal-rich, low-dimensional representation. For linear subspaces, this is generally done by decomposing a design matrix (via eigenvalue or singular value decomposition) into orthogonal components and then retaining those components with sufficient variation. This is equivalent to estimating the rank of the matrix. Deciding which components to retain is generally carried out using heuristic or ad hoc approaches, such as plotting the decreasing sequence of eigenvalues and looking for the "elbow" in the plot. While these approaches have been shown to be effective, a poorly calibrated or misjudged elbow location can leave too much noise or too little signal in the low-dimensional representation, making subsequent modeling difficult. In this article, we propose a latent-space-construction procedure that estimates the rank of the detectable signal space of a matrix by retaining components whose variations are significantly greater than those of random matrices, whose eigenvalues follow a universal Marchenko-Pastur (MP) distribution.
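The idea of comparing eigenvalues against the MP distribution can be illustrated with a minimal sketch. This is not the procedure proposed in the article, only a simple bulk-edge threshold: it assumes i.i.d. noise with a known variance `sigma2` and counts the sample-covariance eigenvalues that exceed the upper edge of the MP support, sigma2 * (1 + sqrt(p/n))^2, for an n-by-p matrix. The function names `mp_upper_edge` and `estimate_rank` and the parameter `sigma2` are hypothetical and chosen for illustration.

```python
import numpy as np


def mp_upper_edge(n_samples, n_features, sigma2=1.0):
    """Upper edge of the Marchenko-Pastur support for the eigenvalues of
    X^T X / n when X is n x p pure noise with variance sigma2 (assumed known)."""
    gamma = n_features / n_samples  # aspect ratio p / n
    return sigma2 * (1.0 + np.sqrt(gamma)) ** 2


def estimate_rank(X, sigma2=1.0):
    """Count the eigenvalues of the sample covariance that exceed the MP edge.

    Illustrative only: a hard threshold at the asymptotic edge, with no
    significance test and no estimation of the noise variance.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                        # center each column
    s = np.linalg.svd(Xc, compute_uv=False)        # singular values of Xc
    eigvals = s ** 2 / n                           # eigenvalues of Xc^T Xc / n
    return int(np.sum(eigvals > mp_upper_edge(n, p, sigma2)))
```

In finite samples the largest pure-noise eigenvalue fluctuates around this edge, so a hard threshold can occasionally retain a noise component. A significance-based rule of the kind the abstract describes would address that, and in practice the noise variance would usually have to be estimated rather than assumed.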