Fast computation of the principal components of genotype matrices in Julia

by Jiahao Chen, et al.
Capital One

Finding the largest few principal components of a matrix of genetic data is a common task in genome-wide association studies (GWASs), both for dimensionality reduction and for identifying unwanted factors of variation. We describe a simple random matrix model for matrices that arise in GWASs, showing that the bulk of the singular values follows a Marchenko-Pastur distribution, with a handful of large outliers. We also implement Golub-Kahan-Lanczos (GKL) bidiagonalization in the Julia programming language, providing thick restarting and a choice between full and partial reorthogonalization strategies to control numerical roundoff. Our implementation of GKL bidiagonalization is up to 36 times faster than software tools commonly used in genomics data analysis for computing principal components, such as EIGENSOFT and FlashPCA, which use dense LAPACK routines and randomized subspace iteration, respectively.
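For readers unfamiliar with GKL bidiagonalization, the following is a minimal sketch of the basic recurrence with full reorthogonalization. It is written in plain Python for illustration only and is not the authors' Julia implementation: it omits thick restarting and partial reorthogonalization, and the function name `gkl_bidiagonalize` and all helper names are hypothetical. After k steps it produces orthonormal vectors u_1..u_k and v_1..v_k and an upper bidiagonal matrix B (diagonal `alphas`, superdiagonal `betas`) such that A V = U B up to roundoff.

```python
import math
import random

def matvec(A, x):
    """Compute A x for a dense matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Compute A^T y without forming the transpose."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def norm(x):
    return math.sqrt(sum(t * t for t in x))

def orthogonalize(w, basis):
    """Full reorthogonalization: Gram-Schmidt sweep against every stored vector."""
    for q in basis:
        c = sum(wi * qi for wi, qi in zip(w, q))
        w = [wi - c * qi for wi, qi in zip(w, q)]
    return w

def gkl_bidiagonalize(A, k, seed=0):
    """Run k steps of Golub-Kahan-Lanczos bidiagonalization (illustrative sketch).

    Returns (alphas, betas, U, V), where U and V are lists of basis vectors.
    """
    n = len(A[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random starting vector
    nv = norm(v)
    U, V, alphas, betas = [], [[t / nv for t in v]], [], []
    for j in range(k):
        # u_j = A v_j - beta_{j-1} u_{j-1}, then normalize
        u = matvec(A, V[j])
        if j > 0:
            u = [ui - betas[j - 1] * pi for ui, pi in zip(u, U[j - 1])]
        u = orthogonalize(u, U)
        alpha = norm(u)
        if alpha == 0.0:  # breakdown: an invariant subspace was found
            break
        U.append([t / alpha for t in u])
        alphas.append(alpha)
        if j + 1 < k:
            # v_{j+1} = A^T u_j - alpha_j v_j, then normalize
            w = rmatvec(A, U[j])
            w = [wi - alpha * vi for wi, vi in zip(w, V[j])]
            w = orthogonalize(w, V)
            beta = norm(w)
            if beta == 0.0:  # breakdown
                break
            V.append([t / beta for t in w])
            betas.append(beta)
    return alphas, betas, U, V
```

In practice, the singular values of the small bidiagonal matrix B would be passed to a dense SVD routine, and its largest singular values approximate the largest singular values of A. Only matrix-vector products with A and its transpose are needed, which is what makes the approach attractive for large genotype matrices.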




