Sparse Coresets for SVD on Infinite Streams
In streaming Singular Value Decomposition (SVD), d-dimensional rows of a possibly infinite matrix arrive sequentially as points in R^d. An ϵ-coreset is a (much smaller) matrix whose sum of squared distances from its rows to any hyperplane approximates that of the original matrix to within a 1 ± ϵ factor. Our main result is that we can maintain an ϵ-coreset while storing only O(d log^2 d / ϵ^2) rows. Known lower bounds of Ω(d / ϵ^2) rows show that this is nearly optimal. Moreover, the rows of our coreset are a weighted subset of the input rows. This is highly desirable since it: (1) preserves sparsity; (2) is easily interpretable; (3) avoids precision errors; (4) applies to problems with constraints on the input. Previous streaming results for SVD that return a subset of the input required storing Ω(d log^3 n / ϵ^2) rows, where n is the number of rows seen so far. Our algorithm, with storage independent of n, is the first result that uses finite memory on infinite streams. We support our findings with experiments on the Wikipedia dataset, benchmarked against state-of-the-art algorithms.
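To make the ϵ-coreset definition concrete, the following Python sketch measures how well a candidate matrix C approximates a matrix A. It is an illustration of the definition only, not the streaming algorithm of the paper. It assumes hyperplanes through the origin, so that the sum of squared distances of the rows of A to the hyperplane with unit normal x equals ||Ax||^2, and it assumes A has full column rank. The function name `coreset_error` and the toy data are hypothetical.

```python
import numpy as np

def coreset_error(A, C):
    """Smallest eps such that, for every x,
    (1 - eps) * ||A x||^2 <= ||C x||^2 <= (1 + eps) * ||A x||^2.

    Assumption: A has full column rank, so A^T A is positive definite.
    """
    G_A = A.T @ A                      # Gram matrix of the full data
    G_C = C.T @ C                      # Gram matrix of the candidate coreset
    L = np.linalg.cholesky(G_A)        # G_A = L L^T
    # Whitened matrix M = L^{-1} G_C L^{-T}; the ratio ||Cx||^2 / ||Ax||^2
    # ranges exactly over the eigenvalues of M.
    X = np.linalg.solve(L, G_C)        # L^{-1} G_C
    M = np.linalg.solve(L, X.T)        # L^{-1} G_C L^{-T} (G_C is symmetric)
    M = (M + M.T) / 2                  # remove round-off asymmetry
    eigenvalues = np.linalg.eigvalsh(M)
    return float(np.max(np.abs(eigenvalues - 1.0)))

# Toy example: every row of A appears twice, so keeping one copy of each
# row with weight 2 (i.e. scaling the row by sqrt(2)) is an exact coreset
# that is also a weighted subset of the input rows.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 5))
A = np.vstack([B, B])
C = np.sqrt(2.0) * B

exact_error = coreset_error(A, C)                    # close to 0
scaled_error = coreset_error(A, np.sqrt(1.1) * A)    # close to 0.1
```

A matrix C is an ϵ-coreset in this sense exactly when `coreset_error(A, C)` is at most ϵ. Because the check needs the whole of A, it is useful for offline verification on small inputs rather than in a streaming setting.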